Inter-pod topological affinity/anti-affinity proposal #18265
Conversation
Labelling this PR as size/L
GCE e2e build/test failed for commit 43ef32055bd8420e0517f74f47dd05c9f8ae4311.
GCE e2e build/test failed for commit d82a185df71c718d47a26a2a8f4f5908a444c3fe.
HardAffinityExpressions: {{PodSelector: P1, TopologyKey: "node"}},
HardAntiAffinityExpressions: {{PodSelector: P2, TopologyKey: "rack"}},
SoftAffinityExpressions: {{PodSelector: P3, TopologyKey: "zone"}},
SoftAntiAffinityExpressions: {{PodSelector: P4, TopologyKey: "power"}}
Do you think multiple TopologyKeys would also make sense here?
I don't think it's necessary; you should just be able to write two separate expressions with the same PodSelector to get the same result. See, for example, the example "Spread the pods of this service S across nodes and zones" later in the doc.
@bgrant0607 suggested that as a special case, we can make empty TopologyKey mean "all topologies". But if you want multiple specific topologies, you just use multiple rules.
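To make that concrete, here is a rough sketch of the "spread across nodes and zones" case written as two rules with the same PodSelector. The types and the string selector are placeholders invented for illustration, not the proposal's actual API:

```go
package main

import "fmt"

// Placeholder types mirroring the illustrative shorthand in the excerpt
// above; the real API types in the proposal may differ.
type AffinityExpression struct {
	PodSelector string // stand-in for a real label selector
	TopologyKey string
}

type PodAffinity struct {
	SoftAntiAffinityExpressions []AffinityExpression
}

func main() {
	// "Spread the pods of service S across nodes and zones": two soft
	// anti-affinity rules with the same PodSelector but different
	// TopologyKeys, rather than one rule carrying multiple TopologyKeys.
	spread := PodAffinity{
		SoftAntiAffinityExpressions: []AffinityExpression{
			{PodSelector: "service=S", TopologyKey: "node"},
			{PodSelector: "service=S", TopologyKey: "zone"},
		},
	}
	fmt.Printf("%+v\n", spread)
}
```

The same pattern extends to any number of topologies, which is why a single rule with multiple TopologyKeys doesn't buy much.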
GCE e2e test build/test passed for commit 73cdb41ed54d08f75316dc12c155c988d91296cc.
@quinton-hoole this is relevant to both flavors of Ubernetes
When `HardAffinityExpressions` has multiple elements, the requirements are ANDed. Likewise for `HardAntiAffinityExpressions`.
For `SoftAffinityExpressions` and `SoftAntiAffinityExpressions`, nodes that satisfy more of the rules are preferred over those that satisfy fewer.

## A comment on symmetry
In order to craft expressions that are sane, we may want to call for the creation of an analysis tool that can run as if it were the scheduler, e.g. to tell you that you have 0 possible matches.
xref: https://research.cs.wisc.edu/htcondor/HTCondorWeek2013/presentations/KnoellerJ_QAnalyze.pdf
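For illustration, here is a minimal sketch of the kind of dry-run check that could flag the "0 possible matches" case, assuming a drastically simplified node/pod model. The types and the satisfies helper are invented here; a real analyzer would have to consider every node sharing a topology value, not just the node being examined:

```go
package main

import "fmt"

// Placeholder types; not the proposal's API.
type Rule struct {
	PodSelector string // stand-in for a real label selector
	TopologyKey string
}

type Node struct {
	Name string
	Pods []map[string]string // labels of pods already running on this node
}

// satisfies is a deliberately crude check: does some pod on this node match
// the rule's selector?
func satisfies(n Node, r Rule) bool {
	for _, podLabels := range n.Pods {
		if podLabels["service"] == r.PodSelector {
			return true
		}
	}
	return false
}

// feasibleNodes returns the nodes that satisfy every hard rule (hard rules
// are ANDed, as described in the excerpt above).
func feasibleNodes(nodes []Node, hardRules []Rule) []string {
	var out []string
	for _, n := range nodes {
		ok := true
		for _, r := range hardRules {
			if !satisfies(n, r) {
				ok = false
				break
			}
		}
		if ok {
			out = append(out, n.Name)
		}
	}
	return out
}

func main() {
	nodes := []Node{
		{Name: "node-1", Pods: []map[string]string{{"service": "P1"}}},
		{Name: "node-2"},
	}
	rules := []Rule{{PodSelector: "P1", TopologyKey: "node"}}
	if matches := feasibleNodes(nodes, rules); len(matches) == 0 {
		fmt.Println("warning: 0 possible matches for these hard affinity rules")
	} else {
		fmt.Println("feasible nodes:", matches)
	}
}
```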
Comments from meeting with @bgrant0607 today follow. I will update the proposal to reflect them shortly:
BTW one thing that occurred to me is that in theory, Kubelet could enforce node-level hard affinity and node-level hard anti-affinity rules, but I think this would be confusing, since it can't enforce rules at any other level of the topology. Better to just leave this feature to the scheduler.
in this doc later, in the section "Examples."
* Affinity
  * Co-locate the pods from a particular Service or Job in the same availability zone
  * When scheduling a pod that uses a license for a particular piece of software, try
I assume that here you're trying to express a generalization of something like the classic IaaS use case where, e.g., if a Windows VM lands on a physical machine, that physical machine incurs a monthly Windows license fee even if the VM runs for less than a month (e.g. an hour), so subsequent Windows VMs landing on that physical machine within the license period/month can inherit their predecessor's license? If that's the use case, I think that node labelling is more appropriate (specifically because the predecessor pod/VM may no longer be on the machine/node, so pod affinity won't actually work). If you mean some other use case, I'm struggling to imagine a good example.
IMHO you would need dynamic labeling (a.k.a. attributes, or machine_ad) to persist the Windows license. But you may also want it for hot-tracking of images or data, etc.
Yeah my example was wrong. As @quinton-hoole points out, you want the node property to persist even after the pod goes away, so the affinity mechanism I've described in this doc doesn't solve this case. What you need for this use case is a special version of "node preferences" (#18261) -- pod prefers nodes with a "windows" label; node gets a windows label when it starts a pod that has a windows preference; node loses the label 30 days later if there are no pods with windows preference running on the node at that time. I know how you could implement this but I don't think we should do it now.
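Purely for illustration, here is a hypothetical sketch of that label-expiry idea. Nothing like this exists in Kubernetes today, and the names, types, and 30-day constant are made up to match the description above:

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical license period after which the label may be dropped.
const licensePeriod = 30 * 24 * time.Hour

type nodeState struct {
	labels        map[string]bool
	labeledSince  time.Time
	hasWindowsPod bool // a pod with a "windows" preference is running here
}

// reconcile applies the sketched rules to one node: add the label when a
// windows-preferring pod starts, drop it once the license period has
// elapsed with no such pod running.
func reconcile(n *nodeState, now time.Time) {
	switch {
	case n.hasWindowsPod && !n.labels["windows"]:
		n.labels["windows"] = true
		n.labeledSince = now
	case n.labels["windows"] && !n.hasWindowsPod && now.Sub(n.labeledSince) >= licensePeriod:
		delete(n.labels, "windows")
	}
}

func main() {
	n := &nodeState{labels: map[string]bool{}, hasWindowsPod: true}
	start := time.Now()
	reconcile(n, start) // pod starts -> label added
	n.hasWindowsPod = false
	reconcile(n, start.Add(31*24*time.Hour)) // 31 days later, no pod -> label dropped
	fmt.Println(n.labels)
}
```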
I think that if we do node affinity as an annotation in 1.2, and things go well with that, it would be OK to do pod affinity as a field in 1.3. I agree if we do pod affinity in 1.2 we should do it as an annotation.
Yeah, I wasn't planning to do the cleanup stuff like squashing commits and running update-generated-docs.sh until the content was LGTMd.
PR changed after LGTM, removing LGTM.
@bgrant0607: Fixed to reflect the new API syntax, and added substantial new sections discussing the corner cases/dangers that were brought up in the review comments (and updated the implementation plan to reflect mitigation strategies for them). PTAL.
GCE e2e test build/test passed for commit e89a094dc7a757ecb59579441a88bc10ec449a9f.
GCE e2e test build/test passed for commit 265e101034b19247a57c1578c95c540c38239454.
GCE e2e test build/test passed for commit 10fd20e2036448707f00437a7d6485aa0df91609.
GCE e2e test build/test passed for commit f5a1fc59a8f7ff48c63eeb86a0316199f860ac0b.
GCE e2e test build/test passed for commit cea5cf4.
// nil list means "this pod's namespace," empty list means "all namespaces"
// The json tag here is not "omitempty" since we need to distinguish nil and empty.
// See https://golang.org/pkg/encoding/json/#Marshal for more details.
Namespaces []api.Namespace `json:"namespaces"`
Should have omitempty. Nil is the default.
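For context, this is how encoding/json treats the two cases; plain string slices stand in for []api.Namespace:

```go
package main

import (
	"encoding/json"
	"fmt"
)

type withoutOmitEmpty struct {
	Namespaces []string `json:"namespaces"`
}

type withOmitEmpty struct {
	Namespaces []string `json:"namespaces,omitempty"`
}

func main() {
	for _, v := range []interface{}{
		withoutOmitEmpty{},                       // nil   -> {"namespaces":null}
		withoutOmitEmpty{Namespaces: []string{}}, // empty -> {"namespaces":[]}
		withOmitEmpty{},                          // nil   -> {}
		withOmitEmpty{Namespaces: []string{}},    // empty -> {}
	} {
		b, _ := json.Marshal(v)
		fmt.Printf("%T: %s\n", v, b)
	}
}
```

Without omitempty the wire format keeps null and [] distinct, which is what the code comment relies on; with omitempty both collapse to an absent field.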
LGTM. In the future, please don't squash until the review is done.
Automatic merge from submit-queue
Auto commit by PR queue bot
Auto commit by PR queue bot
Fixes #14816.
It's helpful to read #18261 and #18263 before this one.
ref #367
@bgrant0607 @timothysc @resouer @ravigadde @kubernetes/rh-cluster-infra @kevin-wangzefeng @alfred-huangjian