Topology Aware Scheduling (Alpha) #2724

Closed
3 tasks done
Tracked by #3192
mimowo opened this issue Jul 30, 2024 · 14 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@mimowo
Contributor

mimowo commented Jul 30, 2024

What would you like to be added:

Ability to control how closely the pods are packed on nodes in a data center.

Currently, a Kueue user, such as an AI/ML researcher, has no way of saying "run this workload so that all pods are on nodes within a rack (or block)". Running a workload with Pods scattered across a data center results in longer runtimes, and thus higher costs.

Why is this needed:

To reduce the cost of running AI/ML workloads, which require exchanging huge amounts of data over the network.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

@mimowo mimowo added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 30, 2024
@mimowo
Contributor Author

mimowo commented Jul 30, 2024

/assign

@mimowo
Contributor Author

mimowo commented Jul 30, 2024

/cc @mwielgus

@tenzen-y
Member

tenzen-y commented Jul 30, 2024

@mimowo What is the reason that you do not prefer ResourceFlavor taints instead of dedicated fields?
If I am missing any context, please let me know.

@mimowo
Contributor Author

mimowo commented Jul 30, 2024

> @mimowo What is the reason that you do not prefer ResourceFlavor taints instead of dedicated fields?
> If I am missing any context, please let me know.

Sure, I will be happy to explain, but I'm not sure I understand: which fields do you mean?

Maybe this is related to your question (I'm not 100% sure), but a ResourceFlavor (RF) can have a set of labels which have nothing to do with topology. For example, they can be used to select a GPU family.

@tenzen-y
Member

> > @mimowo What is the reason that you do not prefer ResourceFlavor taints instead of dedicated fields?
> > If I am missing any context, please let me know.
>
> Sure, I will be happy to explain, but I'm not sure I understand: which fields do you mean?
>
> Maybe this is related to your question (I'm not 100% sure), but a ResourceFlavor (RF) can have a set of labels which have nothing to do with topology. For example, they can be used to select a GPU family.

Let me check what "GPU family" means. Which K8s feature would represent the GPU family: node labels, node taints, or something else?

@mimowo
Contributor Author

mimowo commented Jul 31, 2024

> Let me check what "GPU family" means. Which K8s feature would represent the GPU family: node labels, node taints, or something else?

This was just an example; what I meant is that nodes have labels. Some labels correspond to topology (the new ones, for example cloud-provider.com/topology-block or cloud-provider.com/topology-rack), and some don't (like cloud.google.com/machine-family).

Maybe it can be clearer when looking at the example table in: https://github.com/kubernetes-sigs/kueue/blob/5d7847bed87ffa353732164de229b0f94aeab8bd/keps/2724-topology-aware-schedling/README.md#hierarchy-representation.
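
To make the distinction concrete, here is a rough sketch (not Kueue's API or implementation) of how topology labels can be read as a block/rack hierarchy while unrelated labels are ignored. The label keys come from this comment; the node names, label values, and the `build_hierarchy` helper are hypothetical and exist only for illustration.

```python
# Label keys are the ones mentioned in the comment above.
BLOCK = "cloud-provider.com/topology-block"
RACK = "cloud-provider.com/topology-rack"
TOPOLOGY_LEVELS = [BLOCK, RACK]  # ordered from widest to narrowest domain

# Hypothetical nodes: names and label values are made up for illustration.
nodes = {
    "node-a": {BLOCK: "block1", RACK: "rack1", "cloud.google.com/machine-family": "n2"},
    "node-b": {BLOCK: "block1", RACK: "rack1", "cloud.google.com/machine-family": "n2"},
    "node-c": {BLOCK: "block1", RACK: "rack2", "cloud.google.com/machine-family": "n2"},
    "node-d": {BLOCK: "block2", RACK: "rack3", "cloud.google.com/machine-family": "n2"},
}


def build_hierarchy(nodes, levels):
    """Group node names into nested dicts keyed by each topology level.

    Only the labels listed in `levels` are consulted; any other label
    (such as the machine family) plays no part in the hierarchy.
    """
    tree = {}
    for name, labels in sorted(nodes.items()):
        cursor = tree
        # Descend through every level except the last one.
        for level in levels[:-1]:
            cursor = cursor.setdefault(labels[level], {})
        # The narrowest level holds the list of node names.
        cursor.setdefault(labels[levels[-1]], []).append(name)
    return tree


hierarchy = build_hierarchy(nodes, TOPOLOGY_LEVELS)
```

With these made-up nodes, `hierarchy` groups `node-a` and `node-b` under `block1`/`rack1`, `node-c` under `block1`/`rack2`, and `node-d` under `block2`/`rack3`.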

I think two things are important for design choice:

  • it is not feasible for an admin to create one RF per rack and match racks using the existing API if there are thousands of racks in a cluster
  • some workloads may not fit within a single rack. Still, we want Kueue to pack the pods compactly so that the number of racks used is minimal. So, some pods will have the label cloud-provider.com/topology-rack: rack1 while others will have cloud-provider.com/topology-rack: rack2. This is not expressible with the current API.
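
The second point can be sketched as a small packing problem. The following is only an illustration under simplifying assumptions, not the algorithm from the KEP: pods are treated as identical, each rack is reduced to a count of free pod slots, and the `pack_pods` helper and rack names are hypothetical.

```python
def pack_pods(pod_count, free_slots_by_rack):
    """Spread `pod_count` identical pods over as few racks as possible.

    Returns a dict {rack: pods_assigned}, or None if the pods do not fit.
    """
    if pod_count <= 0:
        return {}

    # Best case: the whole workload fits in one rack. Pick the tightest fit
    # so that larger racks stay free for larger workloads.
    fitting = [(free, rack) for rack, free in free_slots_by_rack.items()
               if free >= pod_count]
    if fitting:
        _, rack = min(fitting)
        return {rack: pod_count}

    # Otherwise fill the racks with the most free slots first. For identical
    # pods this uses the minimum possible number of racks.
    assignment = {}
    remaining = pod_count
    by_free_desc = sorted(free_slots_by_rack.items(),
                          key=lambda item: (-item[1], item[0]))
    for rack, free in by_free_desc:
        if remaining == 0:
            break
        if free <= 0:
            continue
        take = min(free, remaining)
        assignment[rack] = take
        remaining -= take

    return assignment if remaining == 0 else None


free = {"rack1": 4, "rack2": 3, "rack3": 2}
single_rack = pack_pods(3, free)   # fits entirely in one rack
two_racks = pack_pods(6, free)     # must span racks, so pods get different rack labels
too_large = pack_pods(10, free)    # more pods than free slots
```

In the `two_racks` case the pods end up with two different values of the rack label, which is exactly the situation that a single ResourceFlavor with fixed node labels cannot describe.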

I think we can discuss specific details of the API or alternatives in the KEP.

@KPostOffice
Contributor

@tenzen-y, how quickly will this slam the queuing algorithm if each rack needs to be treated as a different flavor? I know there is currently a limit on the number of flavors that a ClusterQueue can define, at around 8 or so. @mimowo mentioned thousands of racks. I get the feeling that this should be handled at the scheduler level, not at the queuing level.

@tenzen-y
Member

> @tenzen-y, how quickly will this slam the queuing algorithm if each rack needs to be treated as a different flavor? I know there is currently a limit on the number of flavors that a ClusterQueue can define, at around 8 or so. @mimowo mentioned thousands of racks. I get the feeling that this should be handled at the scheduler level, not at the queuing level.

@KPostOffice Thank you for catching up and giving me your feedback. I added a similar concern here: #2725 (comment)

Let's discuss that in the KEP PR.

@mimowo
Contributor Author

mimowo commented Oct 22, 2024

FYI @tenzen-y @gabesaba @PBundyra @mwielgus
I have opened a spreadsheet to keep track of the remaining work (planned in KEP and follow ups): spreadsheet

It is shared with wg-batch@kubernetes.io and a couple of folks who are involved in reviews, and can be shared with others on demand.

@mimowo
Contributor Author

mimowo commented Nov 4, 2024

@tenzen-y when the alpha phase is ready, do you think we should split the issue into "Topology Aware Scheduling (Alpha)" and "Topology Aware Scheduling (Beta)" and close the one for Alpha, or should we reuse this issue for Beta graduation?

@tenzen-y
Member

tenzen-y commented Nov 4, 2024

> @tenzen-y when the alpha phase is ready, do you think we should split the issue into "Topology Aware Scheduling (Alpha)" and "Topology Aware Scheduling (Beta)" and close the one for Alpha, or should we reuse this issue for Beta graduation?

I'm ok with either way.

@mimowo mimowo changed the title from "Topology Aware Scheduling" to "Topology Aware Scheduling (Alpha)" Nov 5, 2024
@mimowo
Contributor Author

mimowo commented Nov 5, 2024

I decided to split the issue so that Alpha is visible as closed on the list here: #3192. (I will close it shortly before the release, as we still have some small improvements pending, like #3445.)

@mimowo
Contributor Author

mimowo commented Nov 5, 2024

/close

@k8s-ci-robot
Contributor

@mimowo: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

4 participants