Topology Aware Scheduling (Alpha) #2724
/assign
/cc @mwielgus
@mimowo Why do you prefer dedicated fields over ResourceFlavor taints?
Sure, I'd be happy to explain, but I'm not sure I understand: which fields do you mean? Maybe this is related to your question (I'm not 100% sure), but an RF can have a set of labels that have nothing to do with topology. For example, they can be used to choose a GPU family.
Let me check what "GPU family" means. Which K8s features can represent the GPU family: node labels, node taints, or something else?
This was just an example; what I meant is that nodes have labels. Some labels correspond to topology (the new ones, for example ...). Maybe it will be clearer from the example table in: https://github.com/kubernetes-sigs/kueue/blob/5d7847bed87ffa353732164de229b0f94aeab8bd/keps/2724-topology-aware-schedling/README.md#hierarchy-representation. I think two things are important for the design choice:
I think we can discuss specific details of the API or alternatives in the KEP.
@tenzen-y, how quickly will this slam the queuing algorithm if each
@KPostOffice Thank you for catching up and giving me your feedback. I raised a similar concern here: #2725 (comment). Let's discuss it in the KEP PR.
FYI @tenzen-y @gabesaba @PBundyra @mwielgus It is shared with wg-batch@kubernetes.io, a couple of folks involved in reviews, and others on request.
@tenzen-y when the alpha phase is ready, do you think we should split the issue into "Topology Aware Scheduling (Alpha)" and "Topology Aware Scheduling (Beta)" and close the one for Alpha, or should we reuse this issue for Beta graduation?
I'm OK with either way.
/close
@mimowo: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What would you like to be added:
Ability to control how closely the pods are packed on nodes in a data center.
Currently, a Kueue user, such as an AI/ML researcher, has no way of saying "run this workload so that all pods are on nodes within a rack (or block)". Running a workload with pods scattered across a data center results in longer runtimes, and thus higher costs.
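To make the request concrete, here is a minimal sketch of the placement decision it asks for. This is not Kueue's implementation: the label keys, the `pick_domain` helper, and the model of each node as a count of free pod slots are all hypothetical. The sketch sums free slots per rack and per block, prefers the narrowest domain (a rack) that can hold every pod, and falls back to a block.

```python
from collections import defaultdict

# Hypothetical topology label keys, for illustration only.
BLOCK_LABEL = "example.com/topology-block"
RACK_LABEL = "example.com/topology-rack"


def pick_domain(nodes, pod_count):
    """Return ("rack", "<block>/<rack>") or ("block", "<block>") for the
    narrowest domain whose summed free slots can hold all pods, else None.

    nodes: list of (labels: dict, free_slots: int). Free slots are an
    assumed simplification; real scheduling checks per-node resources.
    """
    rack_capacity = defaultdict(int)
    block_capacity = defaultdict(int)
    for labels, free_slots in nodes:
        block = labels[BLOCK_LABEL]
        # Rack names may repeat across blocks, so key racks by block too.
        rack_capacity[f"{block}/{labels[RACK_LABEL]}"] += free_slots
        block_capacity[block] += free_slots

    # Try the tightest level first (rack), then fall back to block.
    for level, capacity in (("rack", rack_capacity), ("block", block_capacity)):
        fitting = [name for name, cap in capacity.items() if cap >= pod_count]
        if fitting:
            # Best fit: the domain with the least capacity that still fits.
            return level, min(fitting, key=lambda name: (capacity[name], name))
    return None


nodes = [
    ({BLOCK_LABEL: "b1", RACK_LABEL: "r1"}, 2),
    ({BLOCK_LABEL: "b1", RACK_LABEL: "r1"}, 2),
    ({BLOCK_LABEL: "b1", RACK_LABEL: "r2"}, 3),
    ({BLOCK_LABEL: "b2", RACK_LABEL: "r1"}, 5),
]
print(pick_domain(nodes, 4))  # ('rack', 'b1/r1')
print(pick_domain(nodes, 6))  # ('block', 'b1')
```

With these nodes, a 6-pod workload fits no single rack (the largest has 5 slots), so the sketch falls back to block `b1` (7 slots). Summing slots ignores fragmentation across nodes, which a real implementation would have to account for.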
Why is this needed:
To reduce the costs of running AI/ML workloads that require exchanging huge amounts of data over the network.
Completion requirements:
This enhancement requires the following artifacts:
The artifacts should be linked in subsequent comments.