
DaemonSet pods should be scheduled by default scheduler, not DaemonSet controller #42002

Closed
davidopp opened this issue Feb 23, 2017 · 73 comments
Labels: area/workload-api/daemonset, sig/apps, sig/cluster-lifecycle, sig/scheduling

Comments

@davidopp
Member

DaemonSet controller being its own scheduler has caused lots of confusion. Once we have tolerations and preemption, I think we won't need DaemonSet-specific scheduling logic. See also #42001.

cc/ @kubernetes/sig-scheduling-misc @kubernetes/sig-cluster-lifecycle-misc

@davidopp added the sig/cluster-lifecycle, sig/scheduling, and sig/apps labels on Feb 23, 2017
@davidopp
Member Author

cc/ @kubernetes/sig-apps-misc

@smarterclayton
Contributor

smarterclayton commented Feb 23, 2017 via email

@timothysc
Member

The work @jayunit100 was doing to refactor portions of the scheduler so they can be linked externally, without causing a dependency train wreck, was meant to support issues like this.

@davidopp
Member Author

It's much better for DaemonSet to just produce pods like every other controller, and let them be scheduled by the regular scheduler, than to be its own scheduler. There are definitely use cases for linking in scheduler code to a component, but I don't think this is one of them. Even if it were easy to do that, DaemonSet shouldn't be a scheduler.

@0xmichalis
Contributor

0xmichalis commented Feb 24, 2017

+1

Trying to make the DS controller sync with the scheduler feels like a nightmare so far.

@timothysc
Member

@davidopp Totally agree...
But then DaemonSet just becomes a special node-based anti-affinity... more of a macro than a primitive, which was my original position in the first place.

@k82cn
Member

k82cn commented Feb 25, 2017

+1

@resouer
Contributor

resouer commented Feb 25, 2017

A naive question: what happens if there's no scheduler running (for example, when this DS is used to deploy master components)?

@0xmichalis
Contributor

Good question; cluster admins can always run one-off pods and bypass the scheduler manually.
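
For illustration, a minimal sketch of that manual bypass: a pod created with spec.nodeName already set never passes through a scheduler and is admitted directly by the kubelet on that node. The kubeconfig path, node name, and image below are assumptions, not anything prescribed in this thread.

```go
// Sketch: a pod with spec.nodeName pre-set skips the scheduler entirely;
// the kubelet on the named node admits and runs it directly.
// Kubeconfig path, node name, and image are illustrative assumptions.
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/etc/kubernetes/admin.conf")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "bootstrap-scheduler", Namespace: "kube-system"},
		Spec: corev1.PodSpec{
			NodeName: "master-0", // placement decided by the admin, not by a scheduler
			Containers: []corev1.Container{{
				Name:  "kube-scheduler",
				Image: "registry.k8s.io/kube-scheduler:v1.28.0", // placeholder image
			}},
		},
	}
	if _, err := client.CoreV1().Pods("kube-system").Create(context.TODO(), pod, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```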

@davidopp
Member Author

This would presumably also fix #35228.

@timothysc
Member

A naive question: what happens if there's no scheduler running (for example, when this DS is used to deploy master components)?

@mikedanese proposed one-shot scheduling for bootstrapping about a year ago.

@0xmichalis
Contributor

@jbeda @luxas @roberthbailey you guys will be interested in this one

@luxas
Member

luxas commented Apr 4, 2017

Yeah, thanks @Kargakis. This was exactly the "fear" @jbeda mentioned.
If we do this, I'd rather have a "bootstrapping mode" that bypasses it (in case this breaks people; we don't know that yet).

@roberthbailey
Contributor

I'm not sure I fully understand the bootstrapping concerns. You can always bypass the scheduler if you want (by specifying a node explicitly), so why not start with that for bootstrapping and then pivot to using a DS once the scheduler is running?

@jayunit100
Member

jayunit100 commented Apr 5, 2017

This brings up the question of whether it's possible, in general, to prevent certain things from being written to the API server by unauthorized components. I don't know the answer to this question.

@smarterclayton
Contributor

We can also just add a new core controller that handles long-running scheduler failures. However, that's not enough, since you usually need a node, replication controller, and deployment controller in order to run. Failsafe self-hosted components are really a discussion in themselves and deserve a longer design. I don't think DaemonSet should be doing its own scheduling either, especially with sophisticated taints in play.

@k82cn
Member

k82cn commented Apr 5, 2017

I'm not sure why we support bypassing the scheduler; the scheduler should be the only component that places pods on nodes (static pods are considered outside of scheduling; can we update Node.Allocatable so it does not include static pods' resources?).

  1. If an admin places a pod directly, I doubt they know the cluster in enough detail to make the "right" decision.
  2. If a pod is placed by an external scheduler, that logic should move into the scheduler or into a scheduler extender/plugin.
  3. If there is any requirement the scheduler doesn't meet, I suggest enhancing the scheduler.

@jayunit100
Member

jayunit100 commented Apr 5, 2017

Are we all in agreement here that the scheduler should be the one doing the scheduling in all cases?

@k82cn
Member

k82cn commented Apr 5, 2017

the scheduler should be the one doing the scheduling in all cases

IMO +1, and it should be part of the "Kubernetes core". I'd like to see others' comments :).

@smarterclayton
Contributor

Having the scheduler do the scheduling means we can take away bind permission from components other than the scheduler, and then introduce OpenShift's admission controller that requires that permission in order to modify the node name on pods. That prevents "normal" tenants from bypassing scheduling and makes tolerations a viable security mechanism.

@luxas
Member

luxas commented Apr 6, 2017

The primary issue we discussed in sig-cluster-lifecycle was that currently a pod-network DS is deployed regardless of whether a node is Ready or NotReady. When we switch to the scheduler, this won't happen.

But I think switching to tolerations for the Ready/NotReady node conditions will fix that issue.
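
For concreteness, the kind of tolerations such a pod-network DaemonSet pod would carry so the default scheduler can still place it on NotReady or unreachable nodes. The taint keys shown are today's node-condition taints, which postdate this 2017 comment; treat the exact keys as an assumption.

```go
// Sketch: tolerations a network-addon DaemonSet pod would carry so the
// default scheduler can place it on NotReady/unreachable nodes.
// Uses the current node-condition taint keys.
package example

import corev1 "k8s.io/api/core/v1"

var networkAddonTolerations = []corev1.Toleration{
	{
		Key:      "node.kubernetes.io/not-ready",
		Operator: corev1.TolerationOpExists,
		Effect:   corev1.TaintEffectNoExecute,
	},
	{
		Key:      "node.kubernetes.io/unreachable",
		Operator: corev1.TolerationOpExists,
		Effect:   corev1.TaintEffectNoExecute,
	},
	{
		Key:      "node.kubernetes.io/network-unavailable",
		Operator: corev1.TolerationOpExists,
		Effect:   corev1.TaintEffectNoSchedule,
	},
}
```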

@timothysc
Member

@davidopp please enqueue this for the next SIG meeting.

@davidopp
Member Author

davidopp commented Apr 7, 2017

I think it's reasonable to ship by default with "only default scheduler can set NodeName" (assuming there aren't any bootstrapping issues) but I think we need to make it possible for cluster admins to optionally add to the set of components that can set NodeName, so that people can write custom schedulers. (Not all user customizations to the scheduler can be cast into the extender model.)
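
For context, the existing per-pod hook for custom schedulers: spec.schedulerName selects which scheduler is responsible for a pod, and whichever component performs the binding needs access to the pods/binding subresource. A minimal sketch with an assumed scheduler name:

```go
// Sketch: a pod that opts out of the default scheduler and asks a custom
// scheduler (name assumed) to place it. Whichever component binds it needs
// permission on the pods/binding subresource.
package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

var customScheduledPod = corev1.Pod{
	ObjectMeta: metav1.ObjectMeta{Name: "batch-worker"},
	Spec: corev1.PodSpec{
		SchedulerName: "example-custom-scheduler", // default is "default-scheduler"
		Containers: []corev1.Container{{
			Name:  "worker",
			Image: "example.com/batch-worker:latest", // placeholder image
		}},
	},
}
```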

@wojtek-t
Member

wojtek-t commented Jul 5, 2017

For the others, my point is just that maybe continuing to do sequential scheduling would be OK (mixing dry runs and real scheduling, but never both at the same time, in sequence) if we limit the rate of the dry-run requests.

Maybe. We just need to list the use cases and see whether rate limiting would make it unusable because too few requests would be allowed.

@davidopp
Member Author

davidopp commented Jul 8, 2017

I'm not sure where we should go with this. If we rely on DaemonSet controller to manage critical system pods (e.g. run kube-proxy on every node), then the system may break if the cluster admin substitutes a scheduler that doesn't schedule the DaemonSet pods as expected.

I discussed this a little with @timothysc and his perspective (I'll let him elaborate more on it) was that there are many things a cluster admin can do to break things, and they should just make sure that if they use a custom scheduler, it should handle DaemonSet pods correctly.

I think the "simulation" question (how components that need to "simulate" where pods will schedule should work in a world of replaceable and multiple schedulers) is a good one but should be discussed in a separate issue (#48657) since it applies to more than just DaemonSet. I think that even if we had a perfect solution for "simulation" of scheduler behavior, we might still want DaemonSet to schedule its own pods, so that critical system pods can always rely on being scheduled in a predictable way regardless of what replacement scheduler might be used.

But I'll let @timothysc chime in with his perspective.

(Also, I came across #29276, which is related to this one.)

@k82cn
Member

k82cn commented Jul 10, 2017

I think that even if we had a perfect solution for "simulation" of scheduler behavior, we might still want DaemonSet to schedule its own pods, so that critical system pods can always rely on being scheduled in a predictable way regardless of what replacement scheduler might be used.

So are we still going to let DaemonSet schedule its own pods?

I think the replacement scheduler should handle pods' attributes well, e.g. mirror pods and critical pods.

@bsalamat
Member

This is a tough one, and it is hard to say which approach is better. If I had to decide, I would still let the main scheduler schedule DaemonSet pods. I agree with @timothysc that an admin can do many things to break a system. Even if we let the DaemonSet controller schedule its pods, a substitute scheduler can still break the system badly by, for example, not scheduling ReplicaSet pods. I know that many critical system pods are DaemonSet pods and that the cluster will not function if they are not scheduled properly, but the net outcome for a cluster with a broken scheduler that is supposed to run, for example, a replicated web service is the same whether DaemonSet pods are scheduled by the DaemonSet controller or not. In both cases, the cluster will not do what it is supposed to do. So it is imperative that the substitute scheduler perform all of its expected work correctly, and part of that "expectation" can be scheduling DaemonSet pods.

@davidopp
Member Author

One thing I forgot that @timothysc had suggested was that we could have scheduler conformance tests that verify that a replacement scheduler meets whatever minimum criteria we have. (This is kind of an automated version of #17208.) This is an excellent idea regardless of what we do with DaemonSet. For example, it could verify that the scheduler implements the predicates that are built into kubelet admission.

I'm still not sure about DaemonSet. If we decide we want it to use the default scheduler, we may have to implement #48657.
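
A rough sketch of what one such conformance check could look like: submit a plain pod to a cluster running the replacement scheduler and assert it gets bound within a deadline. The helper below is hypothetical, not an existing test in the Kubernetes repo; the timeout and polling interval are assumptions.

```go
// Hypothetical conformance-style check: any replacement scheduler must bind
// a plain pending pod within a deadline.
package conformance

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func waitForPodScheduled(ctx context.Context, client kubernetes.Interface, ns, name string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		pod, err := client.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if pod.Spec.NodeName != "" { // some scheduler has bound the pod
			return nil
		}
		time.Sleep(2 * time.Second)
	}
	return fmt.Errorf("pod %s/%s not scheduled within %v", ns, name, timeout)
}
```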

@k82cn
Member

k82cn commented Jul 11, 2017

+1 to conformance tests for replacement scheduler

After going through the discussion again, I'm also not sure about DaemonSet :). I think it depends on our expectations of the replacement scheduler, e.g. will the replacement scheduler support NodeAffinity?

Regarding dry-run, I'd prefer to add some annotations to the Pod instead of exposing a REST endpoint from the default scheduler; that introduces an extra dependency between components (we introduced HTTPExtender because Golang does not support dynamic libraries).

@erictune
Member

To allow bootstrapping a scheduler pod onto a cluster, there needs to be some type of controller that does not depend on the scheduler itself. Given that there has to be something that does it, I don't see why DaemonSet can't be that thing.

@davidopp
Member Author

Bootstrapping was discussed earlier in the thread (though perhaps not thoroughly), see #42002 (comment)

@k82cn
Member

k82cn commented Jul 28, 2017

To allow bootstrapping a scheduler pod onto a cluster, there needs to be some type of controller that does not depend on the scheduler itself. Given that there has to be something that does it, I don't see why DaemonSet can't be that thing.

If we use DaemonSet for that, we need to enhance DaemonSet to support starting only one daemon (the scheduler) on a set of nodes (the master nodes).

@luxas
Member

luxas commented Jul 28, 2017 via email

@k82cn
Member

k82cn commented Jul 29, 2017

I think it's usually a subset of the master nodes, e.g. 3 schedulers on 5 master nodes, alongside other add-ons such as the dashboard.

@aveshagarwal
Member

@k82cn I am not sure if this has been discussed: how will the scenario where a DS pod's node selector is updated by an admission plugin like PodNodeSelector be handled? See issue #51788.

@k82cn
Member

k82cn commented Sep 26, 2017

@aveshagarwal, nope, I think there's no final decision for now; maybe we can continue the discussion in 1.9. According to the sig-apps backlog, maybe @erictune has some suggestions on how to move forward :).

@erictune
Member

It is not going to be a good user experience if the DaemonSet controller makes one pod for every node and lets the scheduler decide which ones fit. You end up with a bunch of pending pods, which is confusing to users, perhaps extra load on the scheduler, and perhaps confusing for cluster autoscalers.

So, somehow, the DaemonSet controller needs to know (modulo race conditions, taints, etc.) how many pods to create.

Is there some way the DaemonSet controller can know which nodes likely match, make a pod for each of those nodes, and then still let the scheduler do a final fit check?

Also, isn't scheduler performance going to be poor if it has to deal with a large number of pods with anti-affinity?

What can we do?

(BTW, I no longer think the bootstrapping use case matters, so ignore this comment: #42002 (comment))
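
One hedged sketch of how that could work: the controller creates pods only for the nodes it believes match, pins each pod to its node with a required node affinity on the metadata.name field, and the scheduler still performs the final fit check and the bind. The helper below is illustrative, not existing DaemonSet controller code.

```go
// Sketch: pin a DaemonSet-created pod to one specific node via required
// node affinity on the metadata.name field, while still letting the
// scheduler do the final fit check and the binding.
package example

import corev1 "k8s.io/api/core/v1"

func nodeAffinityForNode(nodeName string) *corev1.Affinity {
	return &corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
				NodeSelectorTerms: []corev1.NodeSelectorTerm{{
					MatchFields: []corev1.NodeSelectorRequirement{{
						Key:      "metadata.name",
						Operator: corev1.NodeSelectorOpIn,
						Values:   []string{nodeName},
					}},
				}},
			},
		},
	}
}
```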

@bsalamat
Member

Given all the discussions around this topic, I am fine with letting DaemonSet controller keep its current behavior and schedule DaemonSet pods. As @erictune said, this has the benefit of reducing scheduler and auto-scaler load. It will also provide better guarantees that critical DaemonSet pods will be scheduled in a more predictable manner even in clusters where the default scheduler is replaced by a custom scheduler.
The obvious drawbacks are: 1) the DaemonSet controller needs to keep some of the cluster state in memory, as the scheduler does, to check certain scheduling rules with reasonable throughput; 2) even though scheduler predicates are like a library that the DaemonSet controller can call, there will still be extra work maintaining some of the scheduling logic in both the scheduler and the DaemonSet controller. We especially need to be careful when adding new predicate functions and make sure they are added to the DaemonSet controller if needed.
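
To make the second drawback concrete, here is a purely hypothetical sketch of the kind of predicate logic that would end up living in both places (a rough resource-fit plus node-selector check; the function name and the simplifications are illustrative, not the actual shared predicate library):

```go
// Hypothetical sketch of predicate logic duplicated between the scheduler
// and the DaemonSet controller: does this pod fit on this node at all?
// Real predicate sharing in Kubernetes is more involved.
package example

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
)

func nodeFits(pod *corev1.Pod, node *corev1.Node) bool {
	// Node selector / label check.
	if len(pod.Spec.NodeSelector) > 0 &&
		!labels.SelectorFromSet(labels.Set(pod.Spec.NodeSelector)).Matches(labels.Set(node.Labels)) {
		return false
	}
	// Very rough resource check against allocatable (ignores pods already on the node).
	var cpu, mem int64
	for _, c := range pod.Spec.Containers {
		cpu += c.Resources.Requests.Cpu().MilliValue()
		mem += c.Resources.Requests.Memory().Value()
	}
	return cpu <= node.Status.Allocatable.Cpu().MilliValue() &&
		mem <= node.Status.Allocatable.Memory().Value()
}
```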

@davidopp
Member Author

Just an observation: if we keep DaemonSet scheduling pods, then we will need to add some of the preemption logic to it -- getting rid of the "critical pod rescheduler" is one of the motivations for having a general preemption mechanism. I say "some" because it doesn't need to decide which node to preempt on (since it wants to run on every node that meets the other constraints), but it does need to decide the correct minimal set of victims. Also, it is one more component that can delete pods.
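
A hedged sketch of what the "minimal set of victims" part might look like on a single node: consider only lower-priority pods and greedily evict the lowest-priority ones until the DaemonSet pod's requests fit. Real preemption also has to handle memory and other resources, PodDisruptionBudgets, affinity, and graceful deletion; all names below are illustrative.

```go
// Hypothetical sketch: pick a small set of lower-priority victims on one
// node whose CPU requests would free enough room for the DaemonSet pod.
// Greedy by ascending priority; CPU only, for brevity.
package example

import (
	"sort"

	corev1 "k8s.io/api/core/v1"
)

func pickVictims(dsPod *corev1.Pod, running []*corev1.Pod, freeCPUMilli int64) []*corev1.Pod {
	need := requestedCPUMilli(dsPod) - freeCPUMilli
	if need <= 0 {
		return nil // it already fits, nothing to preempt
	}
	dsPrio := int32(0)
	if dsPod.Spec.Priority != nil {
		dsPrio = *dsPod.Spec.Priority
	}
	// Only pods with strictly lower priority are candidates.
	var candidates []*corev1.Pod
	for _, p := range running {
		if p.Spec.Priority == nil || *p.Spec.Priority < dsPrio {
			candidates = append(candidates, p)
		}
	}
	// Evict lowest-priority pods first until enough CPU is freed.
	sort.Slice(candidates, func(i, j int) bool {
		pi, pj := int32(0), int32(0)
		if candidates[i].Spec.Priority != nil {
			pi = *candidates[i].Spec.Priority
		}
		if candidates[j].Spec.Priority != nil {
			pj = *candidates[j].Spec.Priority
		}
		return pi < pj
	})
	var victims []*corev1.Pod
	for _, p := range candidates {
		if need <= 0 {
			break
		}
		victims = append(victims, p)
		need -= requestedCPUMilli(p)
	}
	if need > 0 {
		return nil // even evicting all candidates would not make room
	}
	return victims
}

func requestedCPUMilli(pod *corev1.Pod) int64 {
	var total int64
	for _, c := range pod.Spec.Containers {
		total += c.Resources.Requests.Cpu().MilliValue()
	}
	return total
}
```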

@k82cn
Member

k82cn commented Oct 12, 2017

It is not going to be a good user experience if the DaemonSet controller makes one pod for every node and lets the scheduler decide which ones fit. You end up with a bunch of pending pods, which is confusing to users, perhaps extra load on the scheduler, and perhaps confusing for cluster autoscalers.

IMO, as an admin I plan a DaemonSet based on a "static" setting. For example, I plan to start the DaemonSet on 5 nodes (by nodeSelector) with maxUnavailable set to 3; if only 2 pods start because of resource constraints, the rolling upgrade should stay pending (terminating the only 2 available pods would break my QoS/SLA).

Also, isn't scheduler performance going to be poor if it has to deal with a large number of pods with anti-affinity?

Maybe NodeAffinity :). Regarding scheduler performance, I think there's some work we can do to improve it, e.g. #42002 (comment) vs #42002 (comment)

Our docs say Kubernetes supports multiple and user-provided schedulers, but we don't give any guidance on how those schedulers should work together; there may be several conflicts if we depend on the kubelet.
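
For reference, the rolling-update knob mentioned above, expressed with today's apps/v1 types (a sketch; the numbers mirror the 5-node/maxUnavailable-3 example, and the API group has moved since this 2017 comment):

```go
// Sketch of the DaemonSet rolling-update setting referenced above,
// using apps/v1 types.
package example

import (
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

var dsUpdateStrategy = appsv1.DaemonSetUpdateStrategy{
	Type: appsv1.RollingUpdateDaemonSetStrategyType,
	RollingUpdate: &appsv1.RollingUpdateDaemonSet{
		// At most 3 of the planned 5 daemon pods may be unavailable during an update.
		MaxUnavailable: &intstr.IntOrString{Type: intstr.Int, IntVal: 3},
	},
}
```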


@bsalamat
Member

As discussed in the mailing lists of sig-scheduling and sig-architecture, the final decision is to keep the current behavior of DaemonSet controller and let it schedule its own pods.

@davidopp
Member Author

In that case I guess we should close this.

BTW the thread @bsalamat refers to is here:
https://groups.google.com/d/msg/kubernetes-sig-scheduling/kMG7yfONwY4/Nx3abXuNAAAJ
