DaemonSet pods should be scheduled by default scheduler, not DaemonSet controller #42002
cc/ @kubernetes/sig-apps-misc
Agree
The work @jayunit100 was doing to refactor portions of the scheduler so they could be linked externally, without causing a dependency train wreck, was meant to support issues like this.
It's much better for DaemonSet to just produce pods like every other controller, and let them be scheduled by the regular scheduler, than to be its own scheduler. There are definitely use cases for linking in scheduler code to a component, but I don't think this is one of them. Even if it were easy to do that, DaemonSet shouldn't be a scheduler.
+1 Trying to make the DS controller sync with the scheduler feels like a nightmare so far.
@davidopp Totally agree...
+1
A naive question: what happens if there's no scheduler running? (For example, if this DS is used to deploy master components.)
Good question; cluster admins can always run one-off pods and bypass the scheduler manually.
This would presumably also fix #35228.
@mikedanese proposed one-shot scheduling for bootstrapping about a year ago.
@jbeda @luxas @roberthbailey you guys will be interested in this one
I'm not sure I fully understand the bootstrapping concerns. You can always bypass the scheduler if you want (by specifying a node explicitly), so why not start with that for bootstrapping and then pivot to using a DS once the scheduler is running?
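For concreteness, here is a minimal sketch of that bypass: a pod created with spec.nodeName already set never passes through the scheduler, and the kubelet on that node admits it directly. The kubeconfig path, node name, and image below are placeholders, and the calls assume a recent client-go release.

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumed kubeconfig location for a cluster admin.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/etc/kubernetes/admin.conf")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "bootstrap-scheduler", Namespace: "kube-system"},
		Spec: corev1.PodSpec{
			// Setting NodeName up front places the pod without any scheduler involvement.
			NodeName: "master-0", // placeholder node name
			Containers: []corev1.Container{{
				Name:  "kube-scheduler",
				Image: "registry.k8s.io/kube-scheduler:v1.28.0", // placeholder image
			}},
		},
	}
	if _, err := client.CoreV1().Pods("kube-system").Create(context.TODO(), pod, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```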
This brings up the question of whether it's possible, in general, to protect certain things from being written to the API server by unauthorized components. I don't know the answer to this question.
We could also just add a new core controller that handles prolonged scheduler failures. However, that's not enough, since you usually need a node, replication controller, and deployment controller in order to run. Failsafe self-hosted components is really a discussion in itself and deserves a longer design. I don't think DaemonSet should be doing its own scheduling either, especially with sophisticated taints in play.
Not sure why we support bypassing the scheduler; the scheduler should be the only component that places pods on a node (static pods are considered outside of scheduling). Can we update the
Are we all in agreement here that the scheduler should be the one doing the scheduling in all cases?
IMO: +1, and it should be part of the Kubernetes core. I'd like to see others' comments :).
Having the scheduler do the scheduling means we can take away the bind permission from components besides the scheduler, and then introduce OpenShift's admission controller that requires that permission in order to modify the node name on pods. That prevents bypassing scheduling for "normal" tenants and makes tolerations a viable security mechanism.
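A rough sketch of what that restriction could look like, assuming RBAC is in use: only the default scheduler's identity is granted create on the pods/binding subresource, and an admission plugin (not shown) would require the same permission to set spec.nodeName directly. The object names here are illustrative.

```go
package example

import (
	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// schedulerBindRole returns a ClusterRole granting the "bind" permission and a
// binding that hands it only to the default scheduler's user.
func schedulerBindRole() (*rbacv1.ClusterRole, *rbacv1.ClusterRoleBinding) {
	role := &rbacv1.ClusterRole{
		ObjectMeta: metav1.ObjectMeta{Name: "scheduler-bind"}, // illustrative name
		Rules: []rbacv1.PolicyRule{{
			APIGroups: []string{""},
			Resources: []string{"pods/binding"},
			Verbs:     []string{"create"},
		}},
	}
	binding := &rbacv1.ClusterRoleBinding{
		ObjectMeta: metav1.ObjectMeta{Name: "scheduler-bind"},
		RoleRef: rbacv1.RoleRef{
			APIGroup: "rbac.authorization.k8s.io",
			Kind:     "ClusterRole",
			Name:     "scheduler-bind",
		},
		Subjects: []rbacv1.Subject{{
			Kind:     "User",
			APIGroup: "rbac.authorization.k8s.io",
			Name:     "system:kube-scheduler",
		}},
	}
	return role, binding
}
```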
The primary issue we discussed in sig-cluster-lifecycle was that currently a pod-network DS is deployed regardless of whether a node is Ready or NotReady. When we switch to the scheduler this won't happen. But I think switching to tolerations for the Ready/NotReady status will fix that issue.
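As an illustration of that tolerations idea, a pod-network DaemonSet pod could carry tolerations like the following so the scheduler still places it on NotReady nodes. The taint keys are the standard node-condition taints; the exact operator and effect choices here are assumptions.

```go
package example

import corev1 "k8s.io/api/core/v1"

// networkDaemonTolerations lets a pod-network daemon pod land on (and stay on)
// nodes that are NotReady or unreachable, which is exactly the window in which
// the network daemon has to run before the node can become Ready.
var networkDaemonTolerations = []corev1.Toleration{
	{Key: "node.kubernetes.io/not-ready", Operator: corev1.TolerationOpExists, Effect: corev1.TaintEffectNoExecute},
	{Key: "node.kubernetes.io/unreachable", Operator: corev1.TolerationOpExists, Effect: corev1.TaintEffectNoExecute},
	{Key: "node.kubernetes.io/network-unavailable", Operator: corev1.TolerationOpExists, Effect: corev1.TaintEffectNoSchedule},
}
```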
@davidopp enqueue for next sig meeting.
I think it's reasonable to ship by default with "only default scheduler can set NodeName" (assuming there aren't any bootstrapping issues) but I think we need to make it possible for cluster admins to optionally add to the set of components that can set NodeName, so that people can write custom schedulers. (Not all user customizations to the scheduler can be cast into the extender model.)
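One existing hook for custom schedulers, shown here as a hedged sketch, is the spec.schedulerName field: a pod names the scheduler that should bind it, and the admin would additionally grant that scheduler's identity the bind permission from the earlier sketch. The scheduler name below is a placeholder.

```go
package example

import corev1 "k8s.io/api/core/v1"

// withCustomScheduler marks a pod so the default scheduler ignores it and a
// custom scheduler (which must hold the bind permission) places it instead.
func withCustomScheduler(pod *corev1.Pod) *corev1.Pod {
	pod.Spec.SchedulerName = "my-custom-scheduler" // placeholder scheduler name
	return pod
}
```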
Maybe; we just need to list the use cases and see whether rate-limiting would make it unusable due to too small a number of possible requests.
I'm not sure where we should go with this. If we rely on DaemonSet controller to manage critical system pods (e.g. run kube-proxy on every node), then the system may break if the cluster admin substitutes a scheduler that doesn't schedule the DaemonSet pods as expected. I discussed this a little with @timothysc and his perspective (I'll let him elaborate more on it) was that there are many things a cluster admin can do to break things, and they should just make sure that if they use a custom scheduler, it should handle DaemonSet pods correctly.

I think the "simulation" question (how components that need to "simulate" where pods will schedule should work in a world of replaceable and multiple schedulers) is a good one but should be discussed in a separate issue (#48657) since it applies to more than just DaemonSet.

I think that even if we had a perfect solution for "simulation" of scheduler behavior, we might still want DaemonSet to schedule its own pods, so that critical system pods can always rely on being scheduled in a predictable way regardless of what replacement scheduler might be used. But I'll let @timothysc chime in with his perspective.

(Also, I came across #29276, which is related to this one.)
So we're going to still let DaemonSet schedule its pods? I think the replacement scheduler should handle pods' attributes well, e.g. mirror pods and critical pods.
This is a tough one, and it is hard to say which approach is better. If I had to decide, I would still let the main scheduler schedule DaemonSet pods. I agree with @timothysc that an admin can do many things to break a system. Even if we let the DaemonSet controller schedule its pods, a substitute scheduler can still break the system badly by, for example, not scheduling ReplicaSet pods. I know that many critical system pods are DaemonSet pods and if they are not scheduled properly the cluster will not function, but the net outcome of a cluster with a broken scheduler that is supposed to run, for example, a replicated web service is the same whether DaemonSet pods are scheduled by the DaemonSet controller or not. In both cases, the cluster will not perform what it is supposed to do. So, it is imperative that the substitute scheduler perform all of its expected work correctly, and part of that "expectation" can be scheduling DaemonSet pods.
One thing I forgot that @timothysc had suggested was that we could have scheduler conformance tests that verify that a replacement scheduler meets whatever minimum criteria we have. (This is kind of an automated version of #17208.) This is an excellent idea regardless of what we do with DaemonSet. For example, it could verify that the scheduler implements the predicates that are built into kubelet admission. I'm still not sure about DaemonSet. If we decide we want it to use the default scheduler, we may have to implement #48657.
+1 to conformance tests for a replacement scheduler. After going through the discussion again, I'm also not sure about DaemonSet :). I think it depends on our expectations of the replacement scheduler, e.g. will a replacement scheduler support NodeAffinity? Regarding dry-run, I'd prefer to add some annotations to the Pod instead of exposing a REST endpoint from the default scheduler; that introduces an extra dependency between components (we introduced HTTPExtender because Golang does not support dynamic libraries).
To allow bootstrapping a scheduler pod onto a cluster, there needs to be some type of controller that does not depend on the scheduler itself. Given that there has to be something that does it, I don't see why DaemonSet can't be that thing.
Bootstrapping was discussed earlier in the thread (though perhaps not thoroughly), see #42002 (comment)
If using DaemonSet, we need to enhance DaemonSet to support starting only one daemon (the scheduler) on a set of nodes (the master nodes).
That's not necessary; the scheduler is active/passive, so it can be deployed as a DS on all master nodes.
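A minimal sketch of that deployment, assuming the conventional master node label and the scheduler's --leader-elect flag (names and image tag are placeholders):

```go
package example

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// schedulerDaemonSet runs one kube-scheduler per master node; leader election
// keeps all but one instance passive, so no new DaemonSet feature is needed.
func schedulerDaemonSet() *appsv1.DaemonSet {
	labels := map[string]string{"component": "kube-scheduler"}
	return &appsv1.DaemonSet{
		ObjectMeta: metav1.ObjectMeta{Name: "kube-scheduler", Namespace: "kube-system"},
		Spec: appsv1.DaemonSetSpec{
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					NodeSelector: map[string]string{"node-role.kubernetes.io/master": ""}, // assumed master label
					HostNetwork:  true,
					Containers: []corev1.Container{{
						Name:    "kube-scheduler",
						Image:   "registry.k8s.io/kube-scheduler:v1.28.0", // placeholder image
						Command: []string{"kube-scheduler", "--leader-elect=true"},
					}},
				},
			},
		},
	}
}
```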
I think it's usually part of the master nodes, e.g. 3 schedulers on 5 master nodes, along with other add-ons, e.g. the dashboard.
@aveshagarwal, nope, I think there's no final decision for now; maybe we can continue the discussion in 1.9. According to the sig-apps backlog, maybe @erictune has some suggestions on how to move forward :).
It is not going to be a good user experience if the DaemonSet controller makes one pod for every node and lets the scheduler decide which ones fit. Then you end up with a bunch of pending pods, which is confusing to users, and perhaps extra load on the scheduler, and perhaps confusing for cluster autoscalers. So, somehow, the DaemonSet needs to know (modulo race conditions, taints, etc.) how many pods to create. Is there some way that the DaemonSet can know which nodes likely match, make pods for each such node, and then still let the scheduler do a final fit check? Also, isn't scheduler performance going to be poor if it has to deal with a large number of pods with anti-affinity? What can we do? (BTW, I no longer think the bootstrapping use case matters, so ignore this comment: #42002 (comment))
Given all the discussions around this topic, I am fine with letting DaemonSet controller keep its current behavior and schedule DaemonSet pods. As @erictune said, this has the benefit of reducing scheduler and auto-scaler load. It will also provide better guarantees that critical DaemonSet pods will be scheduled in a more predictable manner even in clusters where the default scheduler is replaced by a custom scheduler.
Just an observation: if we keep DaemonSet scheduling pods, then we will need to add some of the preemption logic to it -- getting rid of the "critical pod rescheduler" is one of the motivations for having a general preemption mechanism. I say "some" because it doesn't need to decide which node to preempt on (since it wants to run on every node that meets the other constraints), but it does need to decide the correct minimal set of victims. Also, it is one more component that can delete pods.
IMO, as an admin, I'll plan a DaemonSet based on a "static" setting. For example, I plan to start the DaemonSet on 5 nodes (by nodeSelector), with maxUnavailable set to 3; if it only starts 2 pods because of resources, the rolling upgrade should stay pending (it would break my QoS/SLA to terminate the only 2 available pods).
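For reference, that "static" plan maps onto the existing DaemonSet update API roughly like this (a sketch; the value of 3 mirrors the example above, and the nodeSelector that picks the 5 nodes lives on the pod template):

```go
package example

import (
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// rollingUpdateStrategy caps how many daemon pods may be unavailable at once
// during an upgrade of the DaemonSet.
func rollingUpdateStrategy() appsv1.DaemonSetUpdateStrategy {
	maxUnavailable := intstr.FromInt(3)
	return appsv1.DaemonSetUpdateStrategy{
		Type: appsv1.RollingUpdateDaemonSetStrategyType,
		RollingUpdate: &appsv1.RollingUpdateDaemonSet{
			MaxUnavailable: &maxUnavailable,
		},
	}
}
```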
Maybe NodeAffinity :). Regarding scheduler performance, I think there is some work we can do to improve it, e.g. #42002 (comment) vs #42002 (comment). Our docs say k8s supports multiple and user-provided schedulers, but we did not give a suggestion on how those schedulers work together; there may be several conflicts if this depends on the kubelet.
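To make the NodeAffinity suggestion concrete, a hedged sketch: the DaemonSet controller would create one pod per matching node and pin each pod to its node with a required node-affinity term on the metadata.name field, leaving the final fit check to the default scheduler. Whether the controller should actually work this way is exactly what is being debated here; the API fields themselves are from core/v1.

```go
package example

import corev1 "k8s.io/api/core/v1"

// nodeNameAffinity pins a pod to a single node by name while still letting the
// scheduler run its predicates before binding.
func nodeNameAffinity(nodeName string) *corev1.Affinity {
	return &corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
				NodeSelectorTerms: []corev1.NodeSelectorTerm{{
					MatchFields: []corev1.NodeSelectorRequirement{{
						Key:      "metadata.name",
						Operator: corev1.NodeSelectorOpIn,
						Values:   []string{nodeName},
					}},
				}},
			},
		},
	}
}
```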
As discussed in the mailing lists of sig-scheduling and sig-architecture, the final decision is to keep the current behavior of DaemonSet controller and let it schedule its own pods.
In that case I guess we should close this. BTW the thread @bsalamat refers to is here: |
DaemonSet controller being its own scheduler has caused lots of confusion. Once we have tolerations and preemption, I think we won't need DaemonSet-specific scheduling logic. See also #42001.
cc/ @kubernetes/sig-scheduling-misc @kubernetes/sig-cluster-lifecycle-misc