
Inter-pod topological affinity/anti-affinity proposal #18265

Merged 1 commit into kubernetes:master on Jan 25, 2016

Conversation

@davidopp (Member) commented Dec 6, 2015

Fixes #14816.

It's helpful to read #18261 and #18263 before this one.

Ref: #367

@bgrant0607 @timothysc @resouer @ravigadde @kubernetes/rh-cluster-infra @kevin-wangzefeng @alfred-huangjian

@k8s-github-robot commented Dec 6, 2015

Labelling this PR as size/L

@k8s-github-robot k8s-github-robot added kind/design Categorizes issue or PR as related to design. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 6, 2015
@k8s-bot commented Dec 6, 2015

GCE e2e build/test failed for commit 43ef32055bd8420e0517f74f47dd05c9f8ae4311.

@k8s-bot commented Dec 7, 2015

GCE e2e build/test failed for commit d82a185df71c718d47a26a2a8f4f5908a444c3fe.

@davidopp changed the title from "Inter-pod topological affinity/anti-affinity proposal initial draft" to "Inter-pod topological affinity/anti-affinity proposal" Dec 7, 2015
```go
HardAffinityExpressions:     {{PodSelector: P1, TopologyKey: "node"}},
HardAntiAffinityExpressions: {{PodSelector: P2, TopologyKey: "rack"}},
SoftAffinityExpressions:     {{PodSelector: P3, TopologyKey: "zone"}},
SoftAntiAffinityExpressions: {{PodSelector: P4, TopologyKey: "power"}}
```
Contributor commented:

Do you think multiple TopologyKeys would also make sense here?

@davidopp (Member, Author) replied:

I don't think it's necessary; you should just be able to write two separate expressions with the same PodSelector to get the same result. See, for example, the example "Spread the pods of this service S across nodes and zones" later in the doc.

@davidopp (Member, Author) replied:

@bgrant0607 suggested that, as a special case, we can make an empty TopologyKey mean "all topologies". But if you want multiple specific topologies, you just use multiple rules.
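A minimal sketch of the multiple-rules approach, using the draft field names quoted above (the selector representation is simplified to a label map purely for illustration):

```go
// AffinityTerm mirrors the draft syntax quoted above; the pod selector is
// simplified to a label map for this sketch only.
type AffinityTerm struct {
	PodSelector map[string]string
	TopologyKey string // per the suggestion above, "" would mean "all topologies"
}

// Spread the pods of service S across nodes AND zones by repeating the
// same PodSelector with a different TopologyKey in each rule.
var spreadS = []AffinityTerm{
	{PodSelector: map[string]string{"service": "S"}, TopologyKey: "node"},
	{PodSelector: map[string]string{"service": "S"}, TopologyKey: "zone"},
}
```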

@k8s-bot commented Dec 7, 2015

GCE e2e test build/test passed for commit 73cdb41ed54d08f75316dc12c155c988d91296cc.

@davidopp (Member, Author) commented Dec 7, 2015

@quinton-hoole this is relevant to both flavors of Ubernetes

When `HardAffinityExpressions` has multiple elements, the requirements are ANDed. Likewise for `HardAntiAffinityExpressions`.
For `SoftAffinityExpressions` and `SoftAntiAffinityExpressions`, nodes that satisfy more of the rules are preferred over those that satisfy fewer.
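A minimal sketch of these semantics, reusing the AffinityTerm type from the earlier sketch (not the scheduler's actual implementation; `satisfied` is a stand-in for the real per-term check against pods already running in the same topology domain):

```go
// Node is a stand-in type for this sketch.
type Node struct{ Name string }

// satisfied reports whether a term holds for a node; the real check looks
// at pods running in the same topology domain and is elided here.
func satisfied(n Node, t AffinityTerm) bool { return false /* elided */ }

// Hard terms are ANDed: a node is feasible only if every term holds.
func feasible(n Node, hard []AffinityTerm) bool {
	for _, t := range hard {
		if !satisfied(n, t) {
			return false
		}
	}
	return true
}

// Soft terms act as a score: nodes that satisfy more terms are preferred.
func score(n Node, soft []AffinityTerm) int {
	s := 0
	for _, t := range soft {
		if satisfied(n, t) {
			s++
		}
	}
	return s
}
```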

## A comment on symmetry
Member commented:

To help users craft expressions that are sane, we may want to call for the creation of an analysis tool that can run as if it were the scheduler and report, e.g., "you have 0 possible matches."

xref: https://research.cs.wisc.edu/htcondor/HTCondorWeek2013/presentations/KnoellerJ_QAnalyze.pdf
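A hedged sketch of what such a dry-run analyzer might do, continuing the helpers sketched above (all names here are hypothetical, not an existing tool):

```go
import "fmt"

// analyze counts feasible nodes for a set of hard terms and warns when the
// rules can match nothing, in the spirit of the HTCondor QAnalyze tool
// referenced above.
func analyze(nodes []Node, hard []AffinityTerm) {
	matches := 0
	for _, n := range nodes {
		if feasible(n, hard) {
			matches++
		}
	}
	if matches == 0 {
		fmt.Println("warning: you have 0 possible matches")
	} else {
		fmt.Printf("%d of %d nodes match\n", matches, len(nodes))
	}
}
```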

@davidopp (Member, Author) commented Dec 8, 2015

Comments from my meeting with @bgrant0607 today follow. I will update the proposal to reflect them shortly:

  • It's useful to allow the user to specify "all topologies" without having to know all of their names, so interpret an empty TopologyKey as meaning "this rule applies to all topology domains."
  • The proposal suggests specially interpreting PodSelector in AffinityTerm, so that an empty PodSelector means "all pods globally" rather than the normal interpretation, "all pods in the namespace of the pod." Brian suggested instead adding a field to AffinityTerm that specifies which Namespace(s) the PodSelector should apply to. The field would be a list of Namespaces to match against; a nil list would mean the default behavior (same as the pod's Namespace) and an empty list would mean "all Namespaces."
  • We should tell users not to use these features until they have been in the Kubelet and master for enough binary versions that we feel comfortable we will not need to roll back either to a version that does not support them; longer-term it would be good to have a way to enforce this (Cluster versioning #4855).

BTW one thing that occurred to me is that in theory, Kubelet could enforce node-level hard affinity and node-level hard anti-affinity rules, but I think this would be confusing, since it can't enforce rules at any other level of the topology. Better to just leave this feature to the scheduler.

in this doc later, in the section "Examples."
* Affinity
* Co-locate the pods from a particular Service or Job in the same availability zone
* When scheduling a pod that uses a license for a particular piece of software, try
@quinton-hoole commented:

I assume that here you're trying to express a generalization of the classic IaaS use case: if a Windows VM lands on a physical machine, that machine incurs a monthly Windows license fee even if the VM runs for less than a month (e.g. an hour), so subsequent Windows VMs landing on that machine within the license period/month can inherit their predecessor's license. If that's the use case, I think node labelling is more appropriate (specifically because the predecessor pod/VM may no longer be on the machine/node, so pod affinity won't actually work). If you mean some other use case, I'm struggling to imagine a good example.

Member commented:

IMHO you would need dynamic labeling (a.k.a. attributes, or machine_ad) for persistence of the Windows license. But you may also want it for hot-tracking of images or data, etc.

@davidopp (Member, Author) replied:

Yeah, my example was wrong. As @quinton-hoole points out, you want the node property to persist even after the pod goes away, so the affinity mechanism described in this doc doesn't solve this case. What you need for this use case is a special version of "node preferences" (#18261): the pod prefers nodes with a "windows" label; a node gets the windows label when it starts a pod that has the windows preference; and the node loses the label 30 days later if no pods with the windows preference are running on it at that time. I know how this could be implemented, but I don't think we should do it now.
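Purely to illustrate the lifecycle davidopp describes (he explicitly does not propose building it now; every name here is hypothetical):

```go
import "time"

// node is a minimal stand-in for this sketch only; labels is assumed non-nil.
type node struct{ labels map[string]string }

// licenseWindow is the 30-day window from the example above.
const licenseWindow = 30 * 24 * time.Hour

// reconcileWindowsLabel sketches the hypothetical controller loop: keep the
// "windows" label while a pod with the windows preference runs on the node,
// and drop the label once the window elapses with no such pod.
func reconcileWindowsLabel(n *node, hasWindowsPod bool, lastWindowsPod, now time.Time) {
	switch {
	case hasWindowsPod:
		n.labels["windows"] = "true"
	case now.Sub(lastWindowsPod) > licenseWindow:
		delete(n.labels, "windows")
	}
}
```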

@davidopp (Member, Author) commented:

> Our current plan is to implement this as an alpha annotation. For that level of maturity, LGTM.

I think that if we do node affinity as an annotation in 1.2, and things go well with that, it would be OK to do pod affinity as a field in 1.3. I agree that if we do pod affinity in 1.2 we should do it as an annotation.

> I think all of these proposals are failing the generated doc check. Please rebase them all and run hack/update-generated-docs.sh.

Yeah, I wasn't planning to do the cleanup (squashing commits and running hack/update-generated-docs.sh) until the content was LGTMd.

@k8s-github-robot commented:

PR changed after LGTM, removing LGTM.

@k8s-github-robot k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 24, 2016
@davidopp (Member, Author) commented:

@bgrant0607: Fixed to reflect the new API syntax, and added substantial new sections discussing the corner cases and dangers brought up in the review comments (and updated the implementation plan with mitigation strategies for them).

PTAL.

@k8s-bot commented Jan 24, 2016

GCE e2e test build/test passed for commit e89a094dc7a757ecb59579441a88bc10ec449a9f.

@k8s-github-robot k8s-github-robot removed the kind/design Categorizes issue or PR as related to design. label Jan 24, 2016
@k8s-bot commented Jan 24, 2016

GCE e2e test build/test passed for commit 265e101034b19247a57c1578c95c540c38239454.

@k8s-bot commented Jan 24, 2016

GCE e2e test build/test passed for commit 10fd20e2036448707f00437a7d6485aa0df91609.

@k8s-bot commented Jan 24, 2016

GCE e2e test build/test passed for commit f5a1fc59a8f7ff48c63eeb86a0316199f860ac0b.

@k8s-bot commented Jan 24, 2016

GCE e2e test build/test passed for commit cea5cf4.

@davidopp davidopp added the kind/design Categorizes issue or PR as related to design. label Jan 24, 2016
```go
// nil list means "this pod's namespace," empty list means "all namespaces"
// The json tag here is not "omitempty" since we need to distinguish nil and empty.
// See https://golang.org/pkg/encoding/json/#Marshal for more details.
Namespaces []api.Namespace `json:"namespaces"`
```
Member commented:

Should have omitempty. Nil is the default.
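For context on the tradeoff: with encoding/json, `omitempty` omits a slice whenever its length is zero, so a nil list ("this pod's namespace") and an empty list ("all namespaces") would serialize identically and the distinction would be lost. A small self-contained demonstration, with the type simplified for illustration:

```go
package main

import (
	"encoding/json"
	"fmt"
)

type withTag struct {
	Namespaces []string `json:"namespaces,omitempty"`
}

type withoutTag struct {
	Namespaces []string `json:"namespaces"`
}

func main() {
	// With omitempty, nil and empty both marshal to {}: the distinction is lost.
	a, _ := json.Marshal(withTag{Namespaces: nil})
	b, _ := json.Marshal(withTag{Namespaces: []string{}})
	fmt.Println(string(a), string(b)) // {} {}

	// Without omitempty, nil marshals to null and empty to [], preserving it.
	c, _ := json.Marshal(withoutTag{Namespaces: nil})
	d, _ := json.Marshal(withoutTag{Namespaces: []string{}})
	fmt.Println(string(c), string(d)) // {"namespaces":null} {"namespaces":[]}
}
```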

@bgrant0607 (Member) commented:
LGTM. In the future, please don't squash until the review is done.

@bgrant0607 bgrant0607 added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 24, 2016
@k8s-github-robot commented:
Automatic merge from submit-queue

k8s-github-robot pushed a commit that referenced this pull request Jan 25, 2016
Auto commit by PR queue bot
@k8s-github-robot k8s-github-robot merged commit 33d1f6c into kubernetes:master Jan 25, 2016
xingzhou pushed a commit to xingzhou/kubernetes that referenced this pull request Dec 15, 2016
openshift-publish-robot pushed a commit to openshift/kubernetes that referenced this pull request Jan 29, 2018
Automatic merge from submit-queue (batch tested with PRs 18295, 18265).

UPSTREAM: 58408: switch hyperkube to cobra

This gives us a hyperkube that works.  Since we only ship hyperkube in our images (not the individual commands), we really need this.

Origin-commit: 4762d6a6e11c9f3c7d7a7d3ac8bdbbaa7c603548
openshift-publish-robot pushed a commit to openshift/kubernetes that referenced this pull request Feb 27, 2018