
DRA: competition between schedulers + allocators #128980

Open
@pohly

Description

What would you like to be added?

At the moment, DRA assumes that one scheduler instance "owns" all resources on a node or available to a node (in the case of network-attached devices). Other resources are handled the same way. This enables faster scheduling because allocation can happen without coordination with other entities.

This approach breaks down when there are multiple schedulers in the cluster such that each scheduler instance is responsible for its own subset of the nodes ("sharding") and there are network-attached devices that are available for more than one set of nodes.

Also, sometimes users run additional schedulers for the same nodes as the system scheduler. While that is already problematic regarding CPU and memory, with devices it might be even worse.

/sig scheduling
/wg device-management

Why is this needed?

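It is needed for more advanced cluster setups, for example clusters with sharded schedulers or with additional schedulers running alongside the system scheduler.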

Metadata

Assignees

No one assigned

    Labels

    • kind/feature: Categorizes issue or PR as related to a new feature.
    • lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
    • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
    • sig/scheduling: Categorizes an issue or PR as relevant to SIG Scheduling.
    • wg/device-management: Categorizes an issue or PR as relevant to WG Device Management.

    Type

    No type

    Projects

    • Status

      📋 Backlog

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
