When scheduling, treat zero-limit pods as having a nonzero limit when calculating priorities #10661
Conversation
GCE e2e build/test passed for commit 3417cdc79ba09031a177b2db106be0be601c81b3.
GCE e2e build/test passed for commit 415ce247a8c9a949bc162173360bc851156be871.
// (dumbSpreadingDenominator - number of pods already on the node + 1) / dumbSpreadingDenominator.
// dumbSpreadingDenominator serves like the machine capacity in LeastRequestedPriority but is chosen
// so that we equate one pod with a reasonable amount of resources when we combine all the scores together.
func DumbSpreadingPriority(pod *api.Pod, podLister algorithm.PodLister, minionLister algorithm.MinionLister) (algorithm.HostPriorityList, error) {
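To make the quoted comment concrete, here is a minimal sketch of the per-node calculation it describes; the constant and the helper name are illustrative, not the exact code in this PR:

```go
// Sketch only: illustrates the score described in the comment above.
// dumbSpreadingDenominator plays the role of "machine capacity" in pods.
const dumbSpreadingDenominator = 10

func dumbSpreadScore(podsOnNode int) int {
	// Count the incoming pod too, and clamp so the score bottoms out at 0
	// once the node already holds close to dumbSpreadingDenominator pods.
	used := podsOnNode + 1
	if used > dumbSpreadingDenominator {
		used = dumbSpreadingDenominator
	}
	// Same shape as calculateScore: 10 * (free / capacity).
	return ((dumbSpreadingDenominator - used) * 10) / dumbSpreadingDenominator
}
```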
I think we should add a test for spreading (at least a unit test) so that no one will silently break it in the future.
I don't want to do anything too detailed (like "given N pods and M nodes, exactly N/M pods should land on each node") because that makes it hard to make changes later. But I'll add a test that checks for the pathological behavior that motivated this PR (all the pods going to the same node).
I agree, some test would be nice (unit, integration or e2e). I think spreading is important enough that we should eventually write an e2e for it, because node overcommit issues are notoriously hard to debug.
@davidopp - I completely agree that having a very strict test is not the best option. However, having a test that e.g. has 5 nodes, 2 of them with something simulating addons on them, and that schedules 50 pods to check that e.g. at least 5 pods were scheduled on each node seems reasonable. What do you think?
Added a very simple test. Let me know what you think. I decided to do this in the style of priorities_test.go rather than generic_scheduler_test.go because it seemed slightly easier.
I'm fine with this test (although I think it's pretty implementation-driven); it's good enough for now.
// it uses 10 * percentage of machine free by pod, with "percentage of machine free by pod" calculated as
// (dumbSpreadingDenominator - number of pods already on the node + 1) / dumbSpreadingDenominator.
// dumbSpreadingDenominator serves like the machine capacity in LeastRequestedPriority but is chosen
// so that we equate one pod with a reasonable amount of resources when we combine all the scores together.
It isn't clear why this is better than leastrequester with:
if pod.resource == 0 {
pod.resource = 10
}
One reason is that with the suggested pod.resource defaulting you can never have more than 10 unspecified-resource pods per node, while your algorithm just stops contributing to the node score in the >10 pods situation.
However, I get the feeling we can still end up with an imbalanced set of pods for exactly this reason. For example, if I start off with an imbalanced distribution of cluster addons and I create a 100-pod rc in a limitless namespace, DumbSpreading will do the right thing until there are roughly 8-10 pods on each node, at which point whatever contribution it makes will not outweigh the other two priorities.
This is why I initially considered dynamic weights: tell this algorithm to outweigh everything else in the no-resources-specified case and keep completely quiet otherwise. If I've misunderstood something and the situation described can never happen (or if we don't care about it), please elaborate in the comment.
I don't think I agree with everything you said, but you said many things, so there are many things to agree or disagree with. :)
(1) I think one thing you said is that I should be using LeastRequestedPriority ("It isn't clear why this is better than leastrequester"). If so, I don't understand -- I basically am using LeastRequestedPriority, just splitting it into a separate function and using pods instead of resource values (same as your PR did). I might be misinterpreting this comment though.
(2) You said "tell this algorithm to outweigh everything else in the no resource specified case and keep completely quiet otherwise." I don't agree with that. If one machine has one pod that we know takes up 99% of the machine (by limit) and another machine has one pod with 1% limit, I don't think we want to evenly distribute the next N arriving 0-limit pods between those two machines. Conversely, as @bgrant0607 mentioned earlier, we probably want to consider dumb spreading even when scheduling non-0-limit pods. So I don't think we ever want to "turn off" either the resource-based priority or the pod-spreading-based priority. Both resources and spreading should always contribute something when scheduling any kind of pod.
(3) You pointed out that the approach I'm using "maxes out" at contributing 50% towards the final score when there are 10 pods. So even if a node has 1000 pods that each request 0 or some tiny limit, we might continue to schedule more pods onto it. This is true but I think this only happens in extreme corner cases. For example in a case like
node1: total limit 95% and 1 pod
node2: total limit 20% and 10 pods
we will still put the next pod on node1. So I think that within reasonable ranges the simple weighting I'm using works OK. I'm really reluctant to go to a dynamic weighting mechanism so close to the code deadline. As long as the case where the user starts out with the standard cluster addon pods and then schedules a bunch of 0-limit pods ends with those pods reasonably evenly spread, I think it's good enough for now. I do need to verify that's the case (I need to see what limits we're giving the addon pods and calculate), but I suspect it is.
I mostly agree with @davidopp (especially with (2)). Also, it's really close to 1.0 and we shouldn't make dramatic changes now. After 1.0 we should revisit the whole scheduling algorithm anyway, and then we can come up with something more sophisticated. But I think it's fine for now.
I'm not asking for dynamic weights, just for clarification around the following scenario.
Your score function:
(dumbSpreadingDenominator - number of pods already on the node + 1) / dumbSpreadingDenominator
When used with:
calculateScore(min(npods+1, dumbSpreadingDenominator), dumbSpreadingDenominator, node.Name)
Will result in the score:
int(((capacity - requested) * 10) / capacity) = 0
for npods >=10.
So if I start off with 4 nodes that have 2, 2, 2, 1 addons (we currently have 7 addons, just to give you a realistic scenario) and create a 100-pod rc in a namespace without limits, will I end up with ~70 pods on the last node?
If the answer is yes, maybe we should acknowledge that in the docs. If the answer is no, please correct me :)
I agree with your observation that in the current version, once you reach 10 (or more) pods, DumbSpreadingPriority == 0 and we schedule only based on resources. We could make the denominator larger, which would cause DumbSpreadingPriority to decline more slowly as the number of pods increases, but that can also lead to giving too much weight to the resources (limit).
For example, today a machine that's 80% full with 1 pod has the same score (call it 2 + 8 = 10) as a machine that's 20% full with 7 pods (8 + 2). This feels vaguely in the right ballpark, I guess. But say we make the denominator 30. Now the first machine has a score of 2 + 9.33 = 11.33 and the second machine has a score of 8 + 7.33 = 15.33 so we're strongly favoring the second machine--we're giving too much weight to limit, and not enough to number of pods. In fact they won't have an equal score until there are 20 pods on the second machine. (8 + 3.33). That seems bad.
I'm open to suggestions on how to address the problem you brought up. Is there a different calculateScore() function we could use? It's too late at night for me to think of one right now.
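A small sketch that reproduces the arithmetic above (the function names are made up; the real scheduler computes separate CPU and memory scores and combines them with weights):

```go
// Sketch only: reproduces the back-of-the-envelope numbers above.
func resourceScore(usedFraction float64) float64 {
	// 10 * fraction of the machine still free, by limit.
	return (1 - usedFraction) * 10
}

func podCountScore(npods, denominator int) float64 {
	// Count the incoming pod and clamp at the denominator.
	used := npods + 1
	if used > denominator {
		used = denominator
	}
	return float64((denominator-used)*10) / float64(denominator)
}

// Denominator 10: 80% full, 1 pod  -> 2 + 8    = 10
//                 20% full, 7 pods -> 8 + 2    = 10
// Denominator 30: 80% full, 1 pod  -> 2 + 9.33 = 11.33
//                 20% full, 7 pods -> 8 + 7.33 = 15.33
```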
I hate discrete functions... I also mostly agree with @davidopp, but I also believe that @bprashanth's observation about the score maxing out pretty quickly is justified. IIUC you just zero the score at the 10th running pod, and the scheduler won't distinguish a node with 10 running pods from one with 99 running ones (assuming a number of 0-limit ones and the same resource requests for the rest). In a continuous world we could just apply a log or some similar function to it and be happy; for discrete ones it's harder, but I'd still consider doing something similar. We'll lose some resolution at the bottom, but we'll gain some differentiation between loaded nodes.
I guess we should flatten at 30 pods, as this is the officially supported number, but I believe we should distinguish 10 pods/node from 25. E.g. something like 10 - [2*log_2(x+1)], which has some "bad" properties (a big drop at the beginning: 1->9, 2->7, 3->6), but in general I believe the values it gives are sane for < 63 pods/node.
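A minimal sketch of a log-shaped score along these lines, assuming the 10 - 2*log_2(x+1) form mentioned above, clamped to stay non-negative (illustrative only):

```go
package main

import "math"

// logPodScore sketches the suggested log-shaped score, clamped to [0, 10]:
// it keeps declining slowly as the pod count grows instead of flattening out
// at 10 pods the way the current linear score does.
func logPodScore(npods int) int {
	s := 10 - int(2*math.Log2(float64(npods+1)))
	if s < 0 {
		return 0
	}
	return s
}
```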
Another option is to calculate the denominator as a function of pods on the node (start off assuming we can only fit 10 pods on a node, refine to 20, and so on till max-pods -- maybe we should also lower max-pods?). This simulates an actual feedback loop from the kubelet, and feels like we're saying: if we have enough space we'll give unspecified pods 10% room to grow, but as we get squeezed for room we'll give less and less to a single pod.
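One way to read that suggestion as code; the step size of 10 and the maxPods parameter are assumptions for illustration, not anything in this PR:

```go
// dynamicDenominator sketches the feedback-loop idea: start off assuming room
// for 10 pods, and as the node approaches that, assume 20, then 30, and so on
// up to maxPods.
func dynamicDenominator(podsOnNode, maxPods int) int {
	d := 10
	for podsOnNode+1 >= d && d < maxPods {
		d += 10
	}
	return d
}
```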
I'm happy to stick any function into calculateScore as long as its inputs and outputs are the same as the current calculateScore. So having the denominator vary with the number of pods on the node is fine with me, if that can actually give the behavior we want. I haven't thought through @bprashanth's suggestion above enough yet to know whether it does.
@bprashanth I re-read your initial comment and now I understand what you were saying -- instead of adding a spreading algorithm, just treat limit-0 pods as if they had some non-zero limit (maybe 5% or 10% of the capacity of the machine or something). I remember we did talk about that on the issue, but I think it could still have drawbacks (for example if the limit difference between two machines is 50%, it will put 5 or 10 pods on one machine before moving to the other machine).
An alternative would be something you suggested earlier: to use dumb spreading for limit-0 pods and the resource-based priority algorithms for the pods with limits. I'm starting to think that might be the way to go.
I've added some comments, but in general I think it's a good approach for now (just before 1.0).
GCE e2e build/test passed for commit 53518e3.
GCE e2e build/test passed for commit 44ed229.
@@ -62,7 +62,8 @@ func calculateOccupancy(pod *api.Pod, node api.Node, pods []*api.Pod) algorithm.
cpuScore := calculateScore(totalMilliCPU, capacityMilliCPU, node.Name)
memoryScore := calculateScore(totalMemory, capacityMemory, node.Name)
glog.V(10).Infof(
// glog.V(10).Infof(
Please revert and probably change V(10) to V(4) or something more reasonable than 10.
Please ignore the logging lines I modified. I will change them back to V(10) before this PR is done.
*/
pod: &api.Pod{Spec: noResources},
nodes: []api.Node{makeMinion("machine1", 1000, 10000), makeMinion("machine2", 1000, 10000)},
expectedList: []algorithm.HostPriority{{"machine1", 39}, {"machine2", 40}},
It's very implementation-driven, but I think it's fine for now. Maybe add a TODO to write something more e2e-like after 1.0.
I've decided to dump all of this and just count pods with limit=0 as pods with limit=10% of the node. We can debate the 10% number; I'll just make it a constant. But after many hours of thinking about this I think that's better than any of the approaches that involve an explicit spreading-based-on-number-of-pods priority function. I'll update the PR later today.
Treat zero-limit pods as having a constant non-zero memory and CPU limit.
OK, I've updated the PR to get rid of DumbSpreadingPriority and just treat pods with zero limit as having a small constant limit for the purposes of the priority functions. PTAL. BTW the logging was using V(10) and I left it that way. I can change it to V(4) if people want.
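For reference, a minimal sketch of what that approach looks like; the constants and names here are illustrative and may not match the exact values committed in this PR:

```go
// Sketch of the final approach: for priority calculation only, a pod that
// specifies no limit is charged a small constant amount of CPU and memory.
// The constants below are placeholders, not the PR's exact values.
const (
	assumedMilliCPULimit int64 = 100               // 0.1 CPU
	assumedMemoryLimit   int64 = 200 * 1024 * 1024 // 200 MiB
)

// effectiveLimits returns the values to charge a pod in the priority functions.
func effectiveLimits(podMilliCPU, podMemoryBytes int64) (int64, int64) {
	if podMilliCPU == 0 {
		podMilliCPU = assumedMilliCPULimit
	}
	if podMemoryBytes == 0 {
		podMemoryBytes = assumedMemoryLimit
	}
	return podMilliCPU, podMemoryBytes
}
```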
GCE e2e build/test failed for commit 4ea8b8a.
if capacity == 0 {
return 0
}
if requested > capacity {
glog.Infof("Combined requested resources from existing pods exceeds capacity on minion: %s", node)
glog.Infof("Combined requested resources %d from existing pods exceeds capacity %d on minion: %s",
Nit: please reword minion->node
Done.
@k8s-bot ok to test
GCE e2e build/test passed for commit 2e3f2ea.
GCE e2e build/test passed for commit 9fbccb4.
@bprashanth @dchen1107 @quinton-hoole could I get an LGTM unless you see something remaining that needs to be done?
@davidopp - we should change one thing before merging this one: our scalability tests are not setting limits at all, which means that after this change we won't be able to start 30 of those pods on the same node. We should change that test to set limits before this change goes in.
I will send an appropriate PR later today.
@wojtek-t IIUC it won't cause any node to be unschedulable. This PR impacts only computation of priorities, not predicates.
That's a good point - I agree. So this isn't a blocker, but I think we should change it anyway.
if capacity == 0 {
return 0
}
if requested > capacity {
glog.Infof("Combined requested resources from existing pods exceeds capacity on minion: %s", node)
glog.Infof("Combined requested resources %d from existing pods exceeds capacity %d on node %s",
I'm not really sure how helpful this additional info will be, as one will need to guess which resource it's about. In addition, I'm not sure whether plain Info won't spam normal logs - we'll probably go over capacity pretty often in the case of 0-limit Pods.
If this goes in, we need to update https://github.com/erictune/kubernetes/blob/resources/docs/compute_resources.md#how-pods-with-resource-limits-are-scheduled
LGTM, still need 2 leads I guess. Also not sure why shippable hasn't run yet. I'd say this is medium risk, relatively high value, since it "fixes" spreading by applying defaults at priority computation time. We could get the doc fix in as a follow-up since it only needs a single lgtm and won't race for a deadline.
@dchen1107 has said that this is critical for 1.0. I'll re-trigger shippable now.
OK, shippable passed. Could I get some LGTMs and an ok-to-merge please? I'd very much like to get this into 0.21.
LGTM
ok to merge
Merging so it can be tested for 0.21. (Heads up @yujuhong.)
When scheduling, treat zero-limit pods as having a nonzero limit when calculating priorities
Fixes #10242
@gmarek @bprashanth