
DaemonSet pods should be scheduled by default scheduler, not DaemonSet controller #42002

Closed
davidopp opened this issue Feb 23, 2017 · 73 comments
Labels: area/workload-api/daemonset, sig/apps, sig/cluster-lifecycle, sig/scheduling

Comments

@davidopp
Member

DaemonSet controller being its own scheduler has caused lots of confusion. Once we have tolerations and preemption, I think we won't need DaemonSet-specific scheduling logic. See also #42001.

cc/ @kubernetes/sig-scheduling-misc @kubernetes/sig-cluster-lifecycle-misc

@davidopp added the sig/cluster-lifecycle, sig/scheduling, and sig/apps labels on Feb 23, 2017
@davidopp
Member Author

cc/ @kubernetes/sig-apps-misc

@smarterclayton
Contributor

smarterclayton commented Feb 23, 2017 via email

@timothysc
Member

The work @jayunit100 was doing to refactor portions of the scheduler so they can be linked externally, without causing a dependency train wreck, was meant to support issues like this.

@davidopp
Member Author

It's much better for DaemonSet to just produce pods like every other controller, and let them be scheduled by the regular scheduler, than to be its own scheduler. There are definitely use cases for linking in scheduler code to a component, but I don't think this is one of them. Even if it were easy to do that, DaemonSet shouldn't be a scheduler.

@0xmichalis
Contributor

0xmichalis commented Feb 24, 2017

+1

Trying to make the DS controller sync with the scheduler feels like a nightmare so far.

@timothysc
Member

@davidopp Totally agree...
But then DaemonSet just becomes a special node-based anti-affinity... more of a macro than a primitive, which was my original position in the first place.

@k82cn
Member

k82cn commented Feb 25, 2017

+1

@resouer
Contributor

resouer commented Feb 25, 2017

A naive question: what happens if there's no scheduler running (for example, when this DS is used to deploy master components)?

@0xmichalis
Contributor

Good question; cluster admins can always run one-off pods and bypass the scheduler manually.
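
For illustration, a minimal sketch of that manual bypass: a pod created with spec.nodeName already set never passes through a scheduler and is admitted directly by the kubelet on that node. The kubeconfig path, node name, and image below are assumptions, not anything prescribed in this thread.

```go
// Sketch: a pod with spec.nodeName pre-set skips the scheduler entirely;
// the kubelet on the named node admits and runs it directly.
// Kubeconfig path, node name, and image are illustrative assumptions.
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/etc/kubernetes/admin.conf")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "bootstrap-scheduler", Namespace: "kube-system"},
		Spec: corev1.PodSpec{
			NodeName: "master-0", // placement decided by the admin, not by a scheduler
			Containers: []corev1.Container{{
				Name:  "kube-scheduler",
				Image: "registry.k8s.io/kube-scheduler:v1.28.0", // placeholder image
			}},
		},
	}
	if _, err := client.CoreV1().Pods("kube-system").Create(context.TODO(), pod, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```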

@davidopp
Member Author

This would presumably also fix #35228.

@timothysc
Member

A naive question: what happens if there's no scheduler running (for example, when this DS is used to deploy master components)?

@mikedanese proposed one-shot scheduling for bootstrapping about a year ago.

@0xmichalis
Contributor

@jbeda @luxas @roberthbailey you guys will be interested in this one

@luxas
Member

luxas commented Apr 4, 2017

Yeah, thanks @Kargakis. This was exactly the "fear" @jbeda mentioned.
If we do this, I'd rather have a "bootstrapping mode" that bypasses it (in case this breaks people; we don't know that yet).

@roberthbailey
Contributor

I'm not sure I fully understand the bootstrapping concerns. You can always bypass the scheduler if you want (by specifying a node explicitly), so why not start with that for bootstrapping and then pivot to using a DS once the scheduler is running?

@jayunit100
Member

jayunit100 commented Apr 5, 2017

This brings up the question of whether it's possible, in general, to prevent certain things from being written to the API server by unauthorized components. I don't know the answer to this question.

@smarterclayton
Contributor

We can also just add a new core controller that handles long-running scheduler failures. However, that's not enough, since you usually need a node, replication controller, and deployment controller in order to run. Failsafe self-hosted components are really a discussion in themselves and deserve a longer design. I don't think DaemonSet should be doing its own scheduling either, especially with sophisticated taints in play.

@k82cn
Member

k82cn commented Apr 5, 2017

I'm not sure why we support bypassing the scheduler; the scheduler should be the only component that places pods on nodes (static pods are considered outside of scheduling; can we update Node.Allocatable so it does not include static pods' resources?).

  1. If an admin places a pod directly, I doubt they know the cluster in enough detail to make the "right" decision.
  2. If a pod is placed by an external scheduler, that logic should move into the scheduler or into a scheduler extender/plugin.
  3. If there is any requirement the scheduler doesn't meet, I suggest enhancing the scheduler.

@jayunit100
Member

jayunit100 commented Apr 5, 2017

Are we all in agreement here that the scheduler should be the one doing the scheduling in all cases?

@k82cn
Member

k82cn commented Apr 5, 2017

the scheduler should be the one doing the scheduling in all cases

IMO +1, and it should be part of the "Kubernetes core". I'd like to see others' comments :).

@smarterclayton
Contributor

Having the scheduler do the scheduling means we can take away bind permission from components other than the scheduler, and then introduce OpenShift's admission controller that requires that permission in order to modify the node name on pods. That prevents "normal" tenants from bypassing scheduling and makes tolerations a viable security mechanism.

@luxas
Member

luxas commented Apr 6, 2017

The primary issue we discussed in sig-cluster-lifecycle was that currently a pod-network DS is deployed regardless of whether a node is Ready or NotReady. When we switch to the scheduler, this won't happen.

But I think switching to tolerations for the Ready/NotReady node conditions will fix that issue.
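
For concreteness, the kind of tolerations such a pod-network DaemonSet pod would carry so the default scheduler can still place it on NotReady or unreachable nodes. The taint keys shown are today's node-condition taints, which postdate this 2017 comment; treat the exact keys as an assumption.

```go
// Sketch: tolerations a network-addon DaemonSet pod would carry so the
// default scheduler can place it on NotReady/unreachable nodes.
// Uses the current node-condition taint keys.
package example

import corev1 "k8s.io/api/core/v1"

var networkAddonTolerations = []corev1.Toleration{
	{
		Key:      "node.kubernetes.io/not-ready",
		Operator: corev1.TolerationOpExists,
		Effect:   corev1.TaintEffectNoExecute,
	},
	{
		Key:      "node.kubernetes.io/unreachable",
		Operator: corev1.TolerationOpExists,
		Effect:   corev1.TaintEffectNoExecute,
	},
	{
		Key:      "node.kubernetes.io/network-unavailable",
		Operator: corev1.TolerationOpExists,
		Effect:   corev1.TaintEffectNoSchedule,
	},
}
```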

@timothysc
Member

@davidopp please enqueue this for the next SIG meeting.

@davidopp
Member Author

davidopp commented Apr 7, 2017

I think it's reasonable to ship by default with "only default scheduler can set NodeName" (assuming there aren't any bootstrapping issues) but I think we need to make it possible for cluster admins to optionally add to the set of components that can set NodeName, so that people can write custom schedulers. (Not all user customizations to the scheduler can be cast into the extender model.)
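
For context, the existing per-pod hook for custom schedulers: spec.schedulerName selects which scheduler is responsible for a pod, and whichever component performs the binding needs access to the pods/binding subresource. A minimal sketch with an assumed scheduler name:

```go
// Sketch: a pod that opts out of the default scheduler and asks a custom
// scheduler (name assumed) to place it. Whichever component binds it needs
// permission on the pods/binding subresource.
package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

var customScheduledPod = corev1.Pod{
	ObjectMeta: metav1.ObjectMeta{Name: "batch-worker"},
	Spec: corev1.PodSpec{
		SchedulerName: "example-custom-scheduler", // default is "default-scheduler"
		Containers: []corev1.Container{{
			Name:  "worker",
			Image: "example.com/batch-worker:latest", // placeholder image
		}},
	},
}
```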

@wojtek-t
Member

wojtek-t commented Jul 5, 2017

For the others, my point is just that maybe continuing to do sequential scheduling would be OK (mixing dry runs and real scheduling, but never both at the same time, in sequence) if we limit the rate of the dry-run requests.

Maybe. We just need to list the use cases and see whether rate limiting would make it unusable because too few requests would be allowed.

@davidopp
Member Author

davidopp commented Jul 8, 2017

I'm not sure where we should go with this. If we rely on DaemonSet controller to manage critical system pods (e.g. run kube-proxy on every node), then the system may break if the cluster admin substitutes a scheduler that doesn't schedule the DaemonSet pods as expected.

I discussed this a little with @timothysc and his perspective (I'll let him elaborate more on it) was that there are many things a cluster admin can do to break things, and they should just make sure that if they use a custom scheduler, it should handle DaemonSet pods correctly.

I think the "simulation" question (how components that need to "simulate" where pods will schedule should work in a world of replaceable and multiple schedulers) is a good one but should be discussed in a separate issue (#48657) since it applies to more than just DaemonSet. I think that even if we had a perfect solution for "simulation" of scheduler behavior, we might still want DaemonSet to schedule its own pods, so that critical system pods can always rely on being scheduled in a predictable way regardless of what replacement scheduler might be used.

But I'll let @timothysc chime in with his perspective.

(Also, I came across #29276, which is related to this one.)

@k82cn
Member

k82cn commented Jul 10, 2017

I think that even if we had a perfect solution for "simulation" of scheduler behavior, we might still want DaemonSet to schedule its own pods, so that critical system pods can always rely on being scheduled in a predictable way regardless of what replacement scheduler might be used.

So are we still going to let DaemonSet schedule its own pods?

I think the replacement scheduler should handle pods' attributes well, e.g. mirror pods and critical pods.

@bsalamat
Member

This is a tough one, and it is hard to say which approach is better. If I had to decide, I would still let the main scheduler schedule DaemonSet pods. I agree with @timothysc that an admin can do many things to break a system. Even if we let the DaemonSet controller schedule its pods, a substitute scheduler can still break the system badly by, for example, not scheduling ReplicaSet pods. I know that many critical system pods are DaemonSet pods and that the cluster will not function if they are not scheduled properly, but the net outcome for a cluster with a broken scheduler that is supposed to run, for example, a replicated web service is the same whether DaemonSet pods are scheduled by the DaemonSet controller or not. In both cases, the cluster will not do what it is supposed to do. So it is imperative that the substitute scheduler perform all of its expected work correctly, and part of that "expectation" can be scheduling DaemonSet pods.

@davidopp
Member Author

One thing I forgot that @timothysc had suggested was that we could have scheduler conformance tests that verify that a replacement scheduler meets whatever minimum criteria we have. (This is kind of an automated version of #17208.) This is an excellent idea regardless of what we do with DaemonSet. For example, it could verify that the scheduler implements the predicates that are built into kubelet admission.

I'm still not sure about DaemonSet. If we decide we want it to use the default scheduler, we may have to implement #48657.
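
A rough sketch of what one such conformance check could look like: submit a plain pod to a cluster running the replacement scheduler and assert it gets bound within a deadline. The helper below is hypothetical, not an existing test in the Kubernetes repo; the timeout and polling interval are assumptions.

```go
// Hypothetical conformance-style check: any replacement scheduler must bind
// a plain pending pod within a deadline.
package conformance

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func waitForPodScheduled(ctx context.Context, client kubernetes.Interface, ns, name string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		pod, err := client.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if pod.Spec.NodeName != "" { // some scheduler has bound the pod
			return nil
		}
		time.Sleep(2 * time.Second)
	}
	return fmt.Errorf("pod %s/%s not scheduled within %v", ns, name, timeout)
}
```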

@k82cn
Member

k82cn commented Jul 11, 2017

+1 to conformance tests for replacement scheduler

After going through the discussion again, I'm also not sure about DaemonSet :). I think it depends on our expectations of the replacement scheduler, e.g. will the replacement scheduler support NodeAffinity?

Regarding dry-run, I'd prefer to add some annotations to the Pod instead of exposing a REST endpoint from the default scheduler; that introduces an extra dependency between components (we introduced HTTPExtender because Golang does not support dynamic libraries).

@erictune
Member

To allow bootstrapping a scheduler pod onto a cluster, there needs to be some type of controller that does not depend on the scheduler itself. Given that there has to be something that does it, I don't see why DaemonSet can't be that thing.

@davidopp
Member Author

Bootstrapping was discussed earlier in the thread (though perhaps not thoroughly), see #42002 (comment)

@k82cn
Member

k82cn commented Jul 28, 2017

To allow bootstrapping a scheduler pod onto a cluster, there needs to be some type of controller that does not depend on the scheduler itself. Given that there has to be something that does it, I don't see why DaemonSet can't be that thing.

If we use DaemonSet for that, we need to enhance DaemonSet to support starting only one daemon (the scheduler) on a set of nodes (the master nodes).

@luxas
Member

luxas commented Jul 28, 2017 via email

@k82cn
Member

k82cn commented Jul 29, 2017

I think it's usually a subset of the master nodes, e.g. 3 schedulers on 5 master nodes, alongside other add-ons such as the dashboard.

@aveshagarwal
Member

@k82cn I am not sure if this has been discussed: how will the scenario where a DS pod's node selector is updated by an admission plugin like PodNodeSelector be handled? See issue #51788.

@k82cn
Member

k82cn commented Sep 26, 2017

@aveshagarwal, nope, I think there's no final decision for now; maybe we can continue the discussion in 1.9. According to the sig-apps backlog, maybe @erictune has some suggestions on how to move forward :).

@erictune
Member

It is not going to be a good user experience if the DaemonSet controller makes one pod for every node and lets the scheduler decide which ones fit. You end up with a bunch of pending pods, which is confusing to users, perhaps extra load on the scheduler, and perhaps confusing for cluster autoscalers.

So, somehow, the DaemonSet controller needs to know (modulo race conditions, taints, etc.) how many pods to create.

Is there some way the DaemonSet controller can know which nodes likely match, make a pod for each of those nodes, and then still let the scheduler do a final fit check?

Also, isn't scheduler performance going to be poor if it has to deal with a large number of pods with anti-affinity?

What can we do?

(BTW, I no longer think the bootstrapping use case matters, so ignore this comment: #42002 (comment))
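
One hedged sketch of how that could work: the controller creates pods only for the nodes it believes match, pins each pod to its node with a required node affinity on the metadata.name field, and the scheduler still performs the final fit check and the bind. The helper below is illustrative, not existing DaemonSet controller code.

```go
// Sketch: pin a DaemonSet-created pod to one specific node via required
// node affinity on the metadata.name field, while still letting the
// scheduler do the final fit check and the binding.
package example

import corev1 "k8s.io/api/core/v1"

func nodeAffinityForNode(nodeName string) *corev1.Affinity {
	return &corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
				NodeSelectorTerms: []corev1.NodeSelectorTerm{{
					MatchFields: []corev1.NodeSelectorRequirement{{
						Key:      "metadata.name",
						Operator: corev1.NodeSelectorOpIn,
						Values:   []string{nodeName},
					}},
				}},
			},
		},
	}
}
```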

@bsalamat
Member

Given all the discussions around this topic, I am fine with letting DaemonSet controller keep its current behavior and schedule DaemonSet pods. As @erictune said, this has the benefit of reducing scheduler and auto-scaler load. It will also provide better guarantees that critical DaemonSet pods will be scheduled in a more predictable manner even in clusters where the default scheduler is replaced by a custom scheduler.
The obvious drawbacks are: 1) the DaemonSet controller needs to keep some of the cluster state in memory, as the scheduler does, to check certain scheduling rules with reasonable throughput; 2) even though scheduler predicates are like a library that the DaemonSet controller can call, there will still be extra work maintaining some of the scheduling logic in both the scheduler and the DaemonSet controller. We especially need to be careful when adding new predicate functions and make sure they are added to the DaemonSet controller if needed.
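
To make the second drawback concrete, here is a purely hypothetical sketch of the kind of predicate logic that would end up living in both places (a rough resource-fit plus node-selector check; the function name and the simplifications are illustrative, not the actual shared predicate library):

```go
// Hypothetical sketch of predicate logic duplicated between the scheduler
// and the DaemonSet controller: does this pod fit on this node at all?
// Real predicate sharing in Kubernetes is more involved.
package example

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
)

func nodeFits(pod *corev1.Pod, node *corev1.Node) bool {
	// Node selector / label check.
	if len(pod.Spec.NodeSelector) > 0 &&
		!labels.SelectorFromSet(labels.Set(pod.Spec.NodeSelector)).Matches(labels.Set(node.Labels)) {
		return false
	}
	// Very rough resource check against allocatable (ignores pods already on the node).
	var cpu, mem int64
	for _, c := range pod.Spec.Containers {
		cpu += c.Resources.Requests.Cpu().MilliValue()
		mem += c.Resources.Requests.Memory().Value()
	}
	return cpu <= node.Status.Allocatable.Cpu().MilliValue() &&
		mem <= node.Status.Allocatable.Memory().Value()
}
```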

@davidopp
Member Author

Just an observation: if we keep DaemonSet scheduling pods, then we will need to add some of the preemption logic to it -- getting rid of the "critical pod rescheduler" is one of the motivations for having a general preemption mechanism. I say "some" because it doesn't need to decide which node to preempt on (since it wants to run on every node that meets the other constraints), but it does need to decide the correct minimal set of victims. Also, it is one more component that can delete pods.
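
A hedged sketch of what the "minimal set of victims" part might look like on a single node: consider only lower-priority pods and greedily evict the lowest-priority ones until the DaemonSet pod's requests fit. Real preemption also has to handle memory and other resources, PodDisruptionBudgets, affinity, and graceful deletion; all names below are illustrative.

```go
// Hypothetical sketch: pick a small set of lower-priority victims on one
// node whose CPU requests would free enough room for the DaemonSet pod.
// Greedy by ascending priority; CPU only, for brevity.
package example

import (
	"sort"

	corev1 "k8s.io/api/core/v1"
)

func pickVictims(dsPod *corev1.Pod, running []*corev1.Pod, freeCPUMilli int64) []*corev1.Pod {
	need := requestedCPUMilli(dsPod) - freeCPUMilli
	if need <= 0 {
		return nil // it already fits, nothing to preempt
	}
	dsPrio := int32(0)
	if dsPod.Spec.Priority != nil {
		dsPrio = *dsPod.Spec.Priority
	}
	// Only pods with strictly lower priority are candidates.
	var candidates []*corev1.Pod
	for _, p := range running {
		if p.Spec.Priority == nil || *p.Spec.Priority < dsPrio {
			candidates = append(candidates, p)
		}
	}
	// Evict lowest-priority pods first until enough CPU is freed.
	sort.Slice(candidates, func(i, j int) bool {
		pi, pj := int32(0), int32(0)
		if candidates[i].Spec.Priority != nil {
			pi = *candidates[i].Spec.Priority
		}
		if candidates[j].Spec.Priority != nil {
			pj = *candidates[j].Spec.Priority
		}
		return pi < pj
	})
	var victims []*corev1.Pod
	for _, p := range candidates {
		if need <= 0 {
			break
		}
		victims = append(victims, p)
		need -= requestedCPUMilli(p)
	}
	if need > 0 {
		return nil // even evicting all candidates would not make room
	}
	return victims
}

func requestedCPUMilli(pod *corev1.Pod) int64 {
	var total int64
	for _, c := range pod.Spec.Containers {
		total += c.Resources.Requests.Cpu().MilliValue()
	}
	return total
}
```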

@k82cn
Member

k82cn commented Oct 12, 2017

It is not going to be a good user experience if the DaemonSet controller makes one pod for every node and lets the scheduler decide which ones fit. You end up with a bunch of pending pods, which is confusing to users, perhaps extra load on the scheduler, and perhaps confusing for cluster autoscalers.

IMO, as an admin I plan a DaemonSet based on a "static" setting. For example, I plan to start the DaemonSet on 5 nodes (by nodeSelector) with maxUnavailable set to 3; if only 2 pods start because of resource constraints, the rolling upgrade should stay pending (terminating the only 2 available pods would break my QoS/SLA).

Also, isn't scheduler performance going to be poor if it has to deal with a large number of pods with anti-affinity?

Maybe NodeAffinity :). Regarding scheduler performance, I think there's some work we can do to improve it, e.g. #42002 (comment) vs #42002 (comment)

Our docs say Kubernetes supports multiple and user-provided schedulers, but we don't give any guidance on how those schedulers should work together; there may be several conflicts if we depend on the kubelet.
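
For reference, the rolling-update knob mentioned above, expressed with today's apps/v1 types (a sketch; the numbers mirror the 5-node/maxUnavailable-3 example, and the API group has moved since this 2017 comment):

```go
// Sketch of the DaemonSet rolling-update setting referenced above,
// using apps/v1 types.
package example

import (
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

var dsUpdateStrategy = appsv1.DaemonSetUpdateStrategy{
	Type: appsv1.RollingUpdateDaemonSetStrategyType,
	RollingUpdate: &appsv1.RollingUpdateDaemonSet{
		// At most 3 of the planned 5 daemon pods may be unavailable during an update.
		MaxUnavailable: &intstr.IntOrString{Type: intstr.Int, IntVal: 3},
	},
}
```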


@bsalamat
Member

As discussed in the mailing lists of sig-scheduling and sig-architecture, the final decision is to keep the current behavior of DaemonSet controller and let it schedule its own pods.

@davidopp
Member Author

In that case I guess we should close this.

BTW the thread @bsalamat refers to is here:
https://groups.google.com/d/msg/kubernetes-sig-scheduling/kMG7yfONwY4/Nx3abXuNAAAJ
