Description
What would you like to be added?
At the moment, DRA uses the approach that one scheduler instance "owns" all resources on a node or available for a node (in the case of network-attached devices). This is the same approach that is used for other resources. It enables faster scheduling because allocation can happen without coordination with other entities.
This approach breaks down when there are multiple schedulers in the cluster such that each scheduler instance is responsible for its own subset of the nodes ("sharding") and there are network-attached devices that are available for more than one set of nodes.
Also, sometimes users run additional schedulers for the same nodes as the system scheduler. While that is already problematic regarding CPU and memory, with devices it might be even worse.
/sig scheduling
/wg device-management
Why is this needed?
For more advanced cluster setups.
Metadata
Metadata
Assignees
Labels
Type
Projects
Status
📋 Backlog
Activity