When scheduling, treat zero-limit pods as having a nonzero limit when calculating priorities #10661
Conversation
GCE e2e build/test passed for commit 3417cdc79ba09031a177b2db106be0be601c81b3.
GCE e2e build/test passed for commit 415ce247a8c9a949bc162173360bc851156be871.
// (dumbSpreadingDenominator - number of pods already on the node + 1) / dumbSpreadingDenominator.
// dumbSpreadingDenominator serves like the machine capacity in LeastRequestedPriority but is chosen
// so that we equate one pod with a reasonable amount of resources when we combine all the scores together.
func DumbSpreadingPriority(pod *api.Pod, podLister algorithm.PodLister, minionLister algorithm.MinionLister) (algorithm.HostPriorityList, error) {
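To make the quoted comment concrete, here is a minimal sketch of the per-node calculation it describes; the constant and the helper name are illustrative, not the exact code in this PR:

```go
// Sketch only: illustrates the score described in the comment above.
// dumbSpreadingDenominator plays the role of "machine capacity" in pods.
const dumbSpreadingDenominator = 10

func dumbSpreadScore(podsOnNode int) int {
	// Count the incoming pod too, and clamp so the score bottoms out at 0
	// once the node already holds close to dumbSpreadingDenominator pods.
	used := podsOnNode + 1
	if used > dumbSpreadingDenominator {
		used = dumbSpreadingDenominator
	}
	// Same shape as calculateScore: 10 * (free / capacity).
	return ((dumbSpreadingDenominator - used) * 10) / dumbSpreadingDenominator
}
```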
I think we should add a test for spreading (at least a unit test) so that no one will silently break it in the future.
I don't want to do anything too detailed (like "given N pods and M nodes, exactly N/M pods should land on each node") because that makes it hard to make changes later. But I'll add a test that checks for the pathological behavior that motivated this PR (all the pods going to the same node).
I agree, some test would be nice (unit, integration or e2e). I think spreading is important enough that we should eventually write an e2e for it, because node overcommit issues are notoriously hard to debug.
@davidopp - I completely agree that having a very strict test is not the best option. However, having a test that e.g. has 5 nodes, 2 of them with something simulating addons on them, and that schedules 50 pods to check that e.g. at least 5 pods were scheduled on each node seems reasonable. What do you think?
Added a very simple test. Let me know what you think. I decided to do this in the style of priorities_test.go rather than generic_scheduler_test.go because it seemed slightly easier.
I'm fine with this test (although I think it's pretty implementation-driven); it's good enough for now.
// it uses 10 * percentage of machine free by pod, with "percentage of machine free by pod" calculated as
// (dumbSpreadingDenominator - number of pods already on the node + 1) / dumbSpreadingDenominator.
// dumbSpreadingDenominator serves like the machine capacity in LeastRequestedPriority but is chosen
// so that we equate one pod with a reasonable amount of resources when we combine all the scores together.
It isn't clear why this is better than leastrequester with:
if pod.resource == 0 {
pod.resource = 10
}
One reason is that with the suggested pod.resource defaulting you can never have more than 10 unspecified-resource pods per node, while your algorithm just stops contributing to the node score in the >10 pods situation.
However, I get the feeling we can still end up with an imbalanced set of pods for exactly this reason. For example, if I start off with an imbalanced distribution of cluster addons and I create a 100-pod rc in a limitless namespace, DumbSpreading will do the right thing until there are roughly 8-10 pods on each node, at which point whatever contribution it makes will not outweigh the other two priorities.
This is why I initially considered dynamic weights: tell this algorithm to outweigh everything else in the no-resources-specified case and keep completely quiet otherwise. If I've misunderstood something and the situation described can never happen (or if we don't care about it), please elaborate in the comment.
I don't think I agree with everything you said, but you said many things, so there are many things to agree or disagree with. :)
(1) I think one thing you said is that I should be using LeastRequestedPriority ("It isn't clear why this is better than leastrequester"). If so, I don't understand -- I basically am using LeastRequestedPriority, just splitting it into a separate function and using pods instead of resource values (same as your PR did). I might be misinterpreting this comment though.
(2) You said "tell this algorithm to outweigh everything else in the no resource specified case and keep completely quiet otherwise." I don't agree with that. If one machine has one pod that we know takes up 99% of the machine (by limit) and another machine has one pod with 1% limit, I don't think we want to evenly distribute the next N arriving 0-limit pods between those two machines. Conversely, as @bgrant0607 mentioned earlier, we probably want to consider dumb spreading even when scheduling non-0-limit pods. So I don't think we ever want to "turn off" either the resource-based priority or the pod-spreading-based priority. Both resources and spreading should always contribute something when scheduling any kind of pod.
(3) You pointed out that the approach I'm using "maxes out" at contributing 50% towards the final score when there are 10 pods. So even if a node has 1000 pods that each request 0 or some tiny limit, we might continue to schedule more pods onto it. This is true but I think this only happens in extreme corner cases. For example in a case like
node1: total limit 95% and 1 pod
node2: total limit 20% and 10 pods
we will still put the next pod on node1. So I think that within reasonable ranges the simple weighting I'm using works OK. I'm really reluctant to go to a dynamic weighting mechanism so close to the code deadline. As long as the case where the user starts out with the standard cluster addon pods and then schedules a bunch of 0-limit pods ends with those pods reasonably evenly spread, I think it's good enough for now. I do need to verify that's the case (I need to see what limits we're giving the addon pods and calculate), but I suspect it is.
I mostly agree with @davidopp (especially with (2)). Also, it's really close to 1.0 and we shouldn't make dramatic changes now. After 1.0 we should revisit the whole scheduling algorithm anyway, and then we can come up with something more sophisticated. But I think it's fine for now.
I'm not asking for dynamic weights, just for clarification around the following scenario.
Your score function:
(dumbSpreadingDenominator - number of pods already on the node + 1) / dumbSpreadingDenominator
When used with:
calculateScore(min(npods+1, dumbSpreadingDenominator), dumbSpreadingDenominator, node.Name)
Will result in the score:
int(((capacity - requested) * 10) / capacity) = 0
for npods >=10.
So if I start off with 4 nodes that have 2, 2, 2, 1 addons (we currently have 7 addons, just to give you a realistic scenario) and create a 100-pod rc in a namespace without limits, will I end up with ~70 pods on the last node?
If the answer is yes, maybe we should acknowledge that in the docs. If the answer is no, please correct me :)
I agree with your observation that in the current version, once you reach 10 (or more) pods, DumbSpreadingPriority == 0 and we schedule only based on resources. We could make the denominator larger, which would cause DumbSpreadingPriority to decline more slowly as the number of pods increases, but that can also lead to giving too much weight to the resources (limit).
For example, today a machine that's 80% full with 1 pod has the same score (call it 2 + 8 = 10) as a machine that's 20% full with 7 pods (8 + 2). This feels vaguely in the right ballpark, I guess. But say we make the denominator 30. Now the first machine has a score of 2 + 9.33 = 11.33 and the second machine has a score of 8 + 7.33 = 15.33 so we're strongly favoring the second machine--we're giving too much weight to limit, and not enough to number of pods. In fact they won't have an equal score until there are 20 pods on the second machine. (8 + 3.33). That seems bad.
I'm open to suggestions on how to address the problem you brought up. Is there a different calculateScore() function we could use? It's too late at night for me to think of one right now.
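A small sketch that reproduces the arithmetic above (the function names are made up; the real scheduler computes separate CPU and memory scores and combines them with weights):

```go
// Sketch only: reproduces the back-of-the-envelope numbers above.
func resourceScore(usedFraction float64) float64 {
	// 10 * fraction of the machine still free, by limit.
	return (1 - usedFraction) * 10
}

func podCountScore(npods, denominator int) float64 {
	// Count the incoming pod and clamp at the denominator.
	used := npods + 1
	if used > denominator {
		used = denominator
	}
	return float64((denominator-used)*10) / float64(denominator)
}

// Denominator 10: 80% full, 1 pod  -> 2 + 8    = 10
//                 20% full, 7 pods -> 8 + 2    = 10
// Denominator 30: 80% full, 1 pod  -> 2 + 9.33 = 11.33
//                 20% full, 7 pods -> 8 + 7.33 = 15.33
```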
I hate discrete functions... I also mostly agree with @davidopp, but I also believe that @bprashanth's observation about the score maxing out pretty quickly is justified. IIUC you just zero the score at the 10th running pod, and the scheduler won't distinguish a node with 10 running pods from one with 99 running ones (assuming a number of 0-limit ones and the same resource requests for the rest). In a continuous world we could just apply a log or some similar function to it and be happy; for discrete ones it's harder, but I'd still consider doing something similar. We'll lose some resolution at the bottom, but we'll gain some differentiation between loaded nodes.
I guess we should flatten at 30 pods, as this is the officially supported number, but I believe we should distinguish 10 pods/node from 25. E.g. something like 10 - [2*log_2(x+1)], which has some "bad" properties (a big drop at the beginning: 1->9, 2->7, 3->6), but in general I believe the values it gives are sane for < 63 pods/node.
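A minimal sketch of a log-shaped score along these lines, assuming the 10 - 2*log_2(x+1) form mentioned above, clamped to stay non-negative (illustrative only):

```go
package main

import "math"

// logPodScore sketches the suggested log-shaped score, clamped to [0, 10]:
// it keeps declining slowly as the pod count grows instead of flattening out
// at 10 pods the way the current linear score does.
func logPodScore(npods int) int {
	s := 10 - int(2*math.Log2(float64(npods+1)))
	if s < 0 {
		return 0
	}
	return s
}
```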
Another option is to calculate the denominator as a function of pods on the node (start off assuming we can only fit 10 pods on a node, refine to 20, and so on till max-pods -- maybe we should also lower max-pods?). This simulates an actual feedback loop from the kubelet, and feels like we're saying: if we have enough space we'll give unspecified pods 10% room to grow, but as we get squeezed for room we'll give less and less to a single pod.
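One way to read that suggestion as code; the step size of 10 and the maxPods parameter are assumptions for illustration, not anything in this PR:

```go
// dynamicDenominator sketches the feedback-loop idea: start off assuming room
// for 10 pods, and as the node approaches that, assume 20, then 30, and so on
// up to maxPods.
func dynamicDenominator(podsOnNode, maxPods int) int {
	d := 10
	for podsOnNode+1 >= d && d < maxPods {
		d += 10
	}
	return d
}
```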
I'm happy to stick any function into calculateScore as long as its inputs and outputs are the same as the current calculateScore. So having the denominator vary with the number of pods on the node is fine with me, if that can actually give the behavior we want. I haven't thought through @bprashanth's suggestion above enough yet to know whether it does.
@bprashanth I re-read your initial comment and now I understand what you were saying -- instead of adding a spreading algorithm, just treat limit-0 pods as if they had some non-zero limit (maybe 5% or 10% of the capacity of the machine or something). I remember we did talk about that on the issue, but I think it could still have drawbacks (for example if the limit difference between two machines is 50%, it will put 5 or 10 pods on one machine before moving to the other machine).
An alternative would be something you suggested earlier: to use dumb spreading for limit-0 pods and the resource-based priority algorithms for the pods with limits. I'm starting to think that might be the way to go.
I've added some comments, but in general I think it's a good approach for now (just before 1.0).
GCE e2e build/test passed for commit 53518e3.
GCE e2e build/test passed for commit 44ed229.
@@ -62,7 +62,8 @@ func calculateOccupancy(pod *api.Pod, node api.Node, pods []*api.Pod) algorithm.
cpuScore := calculateScore(totalMilliCPU, capacityMilliCPU, node.Name)
memoryScore := calculateScore(totalMemory, capacityMemory, node.Name)
glog.V(10).Infof(
// glog.V(10).Infof(
Please revert and probably change V(10) to V(4) or something more reasonable than 10.
Please ignore the logging lines I modified. I will change them back to V(10) before this PR is done.
*/
pod: &api.Pod{Spec: noResources},
nodes: []api.Node{makeMinion("machine1", 1000, 10000), makeMinion("machine2", 1000, 10000)},
expectedList: []algorithm.HostPriority{{"machine1", 39}, {"machine2", 40}},
It's very implementation-driven, but I think it's fine for now. Maybe add a TODO to write something more e2e-like after 1.0.
I've decided to dump all of this and just count pods with limit=0 as pods with limit=10% of the node. We can debate the 10% number; I'll just make it a constant. But after many hours of thinking about this I think that's better than any of the approaches that involve an explicit spreading-based-on-number-of-pods priority function. I'll update the PR later today.
Treat zero-limit pods as having a constant non-zero memory and CPU limit.
OK, I've updated the PR to get rid of DumbSpreadingPriority and just treat pods with zero limit as having a small constant limit for the purposes of the priority functions. PTAL. BTW the logging was using V(10) and I left it that way. I can change it to V(4) if people want.
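For reference, a minimal sketch of what that approach looks like; the constants and names here are illustrative and may not match the exact values committed in this PR:

```go
// Sketch of the final approach: for priority calculation only, a pod that
// specifies no limit is charged a small constant amount of CPU and memory.
// The constants below are placeholders, not the PR's exact values.
const (
	assumedMilliCPULimit int64 = 100               // 0.1 CPU
	assumedMemoryLimit   int64 = 200 * 1024 * 1024 // 200 MiB
)

// effectiveLimits returns the values to charge a pod in the priority functions.
func effectiveLimits(podMilliCPU, podMemoryBytes int64) (int64, int64) {
	if podMilliCPU == 0 {
		podMilliCPU = assumedMilliCPULimit
	}
	if podMemoryBytes == 0 {
		podMemoryBytes = assumedMemoryLimit
	}
	return podMilliCPU, podMemoryBytes
}
```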
GCE e2e build/test failed for commit 4ea8b8a.
if capacity == 0 {
return 0
}
if requested > capacity {
glog.Infof("Combined requested resources from existing pods exceeds capacity on minion: %s", node)
glog.Infof("Combined requested resources %d from existing pods exceeds capacity %d on minion: %s",
Nit: please reword minion->node
Done.
@k8s-bot ok to test
GCE e2e build/test passed for commit 2e3f2ea.
GCE e2e build/test passed for commit 9fbccb4.
@bprashanth @dchen1107 @quinton-hoole could I get an LGTM unless you see something remaining that needs to be done?
@davidopp - we should change one thing before merging this one: our scalability tests are not setting limits at all, which means that after this change we won't be able to start 30 of those pods on the same node. We should change that test to set limits before this change goes in.
I will send an appropriate PR later today.
@wojtek-t IIUC it won't cause any node to be unschedulable. This PR impacts only computation of priorities, not predicates.
That's a good point - I agree. So this isn't a blocker, but I think we should change it anyway.
if capacity == 0 {
return 0
}
if requested > capacity {
glog.Infof("Combined requested resources from existing pods exceeds capacity on minion: %s", node)
glog.Infof("Combined requested resources %d from existing pods exceeds capacity %d on node %s",
I'm not really sure how helpful this additional info will be, as one will need to guess which resource it's about. In addition, I'm not sure whether plain Info won't spam normal logs - we'll probably go over capacity pretty often in the case of 0-limit Pods.
If this goes in, we need to update https://github.com/erictune/kubernetes/blob/resources/docs/compute_resources.md#how-pods-with-resource-limits-are-scheduled
LGTM, still need 2 leads I guess. Also not sure why shippable hasn't run yet. I'd say this is medium risk, relatively high value, since it "fixes" spreading by applying defaults at priority computation time. We could get the doc fix in as a follow-up since it only needs a single lgtm and won't race for a deadline.
@dchen1107 has said that this is critical for 1.0. I'll re-trigger shippable now.
OK, shippable passed. Could I get some LGTMs and an ok-to-merge please? I'd very much like to get this into 0.21.
LGTM
ok to merge
Merging so it can be tested for 0.21. (Heads up @yujuhong.)
When scheduling, treat zero-limit pods as having a nonzero limit when calculating priorities
Fixes #10242
@gmarek @bprashanth