DaemonSet scheduling is broken in combination with admission plugins #61886
Comments
cc @k82cn |
@tnozicka As you already know, we are moving away from the model where the DS controller creates DS Pods at the time it thinks they can be scheduled. Instead, we plan to have the DS controller create DS Pods, set their node affinity, and leave them for the scheduler to schedule. That addresses part of the problem, but as you also noted, one problem will still remain: the DS controller may create excessive Pods that stay Pending forever. Or worse, if the admission controller loosens the node selection restrictions, too few Pods will be created. Generally speaking, the DS controller is not designed to run Pods whose scheduling requirements are changed during admission. As the name implies, the main goal of the DS controller is to run daemons that are needed on all nodes, or on a group of nodes that is statically determined (at least in theory). In other words, the group of nodes is identifiable statically and, as a result, the DS template can specify it. Could you please give us more information about your use-case? Basically, I would like to know why these Pods should be created as DS Pods, as opposed to a ReplicaSet, StatefulSet, etc. |
It seems the DS user is trying to start more Pods than they are allowed to, as restricted by the cluster admin :). Should k8s handle this for the user, or let the user 'correct' their configuration? Anyway, we should avoid the ping-pong case where the DS controller creates pods and the kubelet rejects them; that part is handled by scheduling DS pods with the default scheduler. |
Given that nodes can change or be relabeled, I don't think "statically" is the important word here. The admission change is static in a way, since admission runs before the pod is created and the pod template is immutable. The real problem is that the DS implementation is a scheduler that schedules (assigns nodes to) imaginary pods - it does the scheduling before those Pods are created, and that is what causes these issues. Moving it to the default scheduler doesn't really fix this, as the DS controller still determines the node, now via an affinity, for an imaginary pod.
OpenShift uses a similar kind of admission, setting a default project node selector to forbid scheduling to masters, and to infra nodes as well. So you may have a DS that needs to run only on infra nodes, like fluentd or an infra database, or only on masters, like a DS that collects logs from the master controllers and sends them for aggregation.
I am open to other approaches on how to fix that. b) requires significant changes as well.
Kubernetes should handle this: the DaemonSet controller shouldn't create pods for nodes that are not matched by the nodeSelector, including admission modifications. (Yes, it is the result of a misconfiguration, but with different people involved.) AFAIK even a regular user can create a DS, although I think they shouldn't be allowed to. Say you have 5 infra nodes in a 1000-node cluster. If your namespace has a node selector restricting it to only those 5 nodes and the DS doesn't see that, it will create 5 valid pods and 995 invalid ones stuck in Pending. After the change to scheduler-based scheduling it at least won't fail in a loop, but it is still bad and we need to fix this. |
If that's the case, I'd prefer to let the user correct their configuration instead of having k8s handle it; the pending pods provide monitoring info to correct it. If we just 'ignore' part of the target nodes (by nodeSelector), it may be hard to check the health of the system; for example, the DS's status looks correct, but another component is unhealthy (because it mismatches the 'design').
The other option in my mind is to check the Pod's condition: if it's unschedulable, the DS controller checks the Pod's nodeSelector against the node, and if they don't match, moves the node to the DS's blacklist (refreshed when the DS or node is updated). |
One solution I have been thinking about to address this issue is as follows:
I am just offering the above solution as food for thought; if it works, good, and if it does not, maybe some modification of it could be used. I still need to think through what (if anything) needs to be done if node labels change. |
Creating potentially hundreds or thousands of pods, or not creating half of them at all, just to "let" the user know they need to update their configuration seems like a naive approach, and not really intuitive either :/ I wonder how many users would even figure it out rather than just saying DSs are broken. Creating a warning event is the preferred communication channel to a user, possibly along with reporting in the DS status how many nodes were restricted - not misbehaving by creating an excessive/insufficient number of pods.
If I am getting you right, this would happen after the fact, once those 1k pods have already been created? Also, when you restart the cluster, wouldn't you get 1k newly created Pods (and writes to etcd), possibly multiplied by all the DSs in your cluster at once? Managing this blacklist cache w.r.t. nodes, DSs and possibly admission seems error prone. |
My major concern is that we'd create a 'special' code path for such a case, and not show the configuration error. |
Just some more food for thought: An admission controller on Pods can control what Pods ultimately get created, but it can't control what Pods other controllers want/try to create. Maybe we should consider a pattern for letting admission controllers validate/modify the Pod controller's template before the controller ever tries to stamp out Pods from that template. As a workaround, for example, you could perhaps add an admission plugin on DaemonSet objects that applies the same rules to the DaemonSet's pod template. In the long term, we might want something more general, like making Pod controllers create explicit PodTemplate objects (#170), which you could then validate/modify in a PodTemplate admission plugin. |
I would say the difference is that RCs have replicas (scale) set explicitly upfront, while this effectively modifies the scale of the DS. |
@enisoc I have thought about this as well - it is problematic w.r.t. changes. If you directly modify the template and the namespace default node selector later changes, you are back where you started, now with a conflict. |
I had discussed similar approaches here: #51788 (comment). You can see the comments on them. |
That's why I proposed a different approach in my previous comment (#61886 (comment)) and would like feedback on what the drawbacks of that approach are, and whether it makes sense at all. |
@aveshagarwal I think this is similar to what I propose with b), but slightly different
Not necessarily - it might require scheduling 0 pods, but 1 excessive pod isn't that bad. One of my original ideas was that we could extend the API endpoint for pod creation to support dry-run. (Actually also fixing
Would it be bound to a node, or would you bind it later? It might need some kind of annotation to keep the default scheduler from binding it and running it somewhere.
So you would bind it to nodes before those pods are created? We should likely do it in batches.
We need to leave that to admission or we get races on changes. And admission will do it anyway, so there's no point in duplicating that.
I think we either need dry-run, or some way of reacting to a created Pod and checking the nodeSelector it ended up with. I slightly prefer b), as it avoids breakage when the nodeSelector changes, reacts faster, and avoids binding to a node where it shouldn't. And we should probably do it in batches anyway. Also, I think that by scheduling real pods (binding them after actual creation) instead of imaginary ones, the DS would be closer to reusing the default scheduler logic as a library, as it would have a similar queue of pods - or at least I think that was one of the concerns about not sharing the scheduling library implementation.
What do you have in mind for "node labels"? I am not yet sure how we would implement the sync loop without creating 1 pod to test on every pass with this approach. With dry-run that is straightforward. |
Agreed. That's why, later in my approach, I mentioned that if the number of pods turns (
If anti-affinity is used, then we can just leave it to the default scheduler to do the scheduling.
I don't think we need this. There are 2 cases here:
I mean that if we use anti-affinity, then we can just leave it to the default scheduler. We can surely use batching for
I agree. Was just proposing as an option.
I need to understand how your
As long as we use anti-affinity, my approach can be used with the default scheduler. And there are no imaginary pods; all are real pods that get scheduled as normal.
I would be happy to discuss |
@aveshagarwal thanks for discussing the options in detail.
In practice, setting anti-affinity to a particular node equals binding it to that node, minus the extra checks and the delegation of permission to the default scheduler. My issue is with determining a node (binding, or binding by delegation) for a pod that doesn't exist yet. That's the root cause of the issue we are seeing.
It depends on whether you set a node for that first pod using anti-affinity or not. If you set it on creation, you could have chosen one that is restricted by admission. If not, this is close enough to b).
Yeah, I wasn't very descriptive about that option, but I did imply the batching and starting from 1 pod there. So you would: b)
What I mean here is that you are binding it by delegation to the default scheduler using affinity before the pod is created. That's why I call it scheduling imaginary pods.
I don't see how we can implement the sync loop without dry-run. Creating and deleting 1 real pod on every sync for every DS would kill etcd, create unnecessary events, ... a) Dry-run
This still has the slight downside of binding pods before they are created, so let's combine both approaches: a) + b)
|
Yes, I meant node affinity. In my understanding, I don't see any issue with it. Node affinity lets the default scheduler do the actual binding/scheduling, and the DS controller only assigns node affinities.
I am not sure I understand here: once you have found out, in the previous step, what a final pod looks like, why does this situation arise?
Pardon my ignorance, but why is batching so important here? The main issue we are trying to solve is knowing the set of final nodes (based on the final node label selectors - and please note this set could be all nodes, too). So in my understanding, it should be (as I put it in my approach previously):
3a. If the first pod gets scheduled/goes pending on one of the final nodes, only n-1 more pods need to be created, and node affinities should be assigned accordingly. If the first pod does not get scheduled, that means no node matches the final node selectors, and the DS controller can just kill the pod or do whatever else it wants to convey the error to the user/admin.
3b. Batching, as you have been mentioning: I would say it's just an optimization, where the remaining pods could get scheduled in batches, or any other algorithm/heuristic could be used for rate limiting so as not to overwhelm the cluster. |
@tnozicka My first thought was to use dry-run create (server side) to simulate the admission chain before creating pods, too. It should solve this issue. Would you elaborate more on what problem "batch" solves and why dry-run isn't enough? |
IMO, the DS uses the nodeSelector in
Anyway, sig-apps makes the final decision :) |
@janetkuo Yes, doing dry-run - an API call returning the Pod that would have been created, except for writing it to storage - and using that to find the real nodeSelector solves 98% of the issue. There is still a subtle race, though. I know it's not a deal breaker, although separate pod creation, with setting the desired node (using node affinity) only as a consequence of it, seems a bit cleaner to me and more similar to a scheduler. I see the DS as a delegating scheduler now: it creates a pod, sets node affinity to a specific host based on the DS nodeSelector, and lets the default scheduler match the other predicates. And yes, if those don't match, there will be pods waiting in Pending, but that's correct in this case, because those nodes are targeted by the DS and the pods are just waiting for other (dynamic) conditions to be met. But back to the race:
Note that those would have been reconciled eventually, but excessive invalid pods would have been created. I'd prefer if we avoided creating a Pod that has node affinity set to a Node that isn't matched by its own nodeSelector, and I like the design that solving this leads to. It's even fairly easy to implement, I think. The main time will likely be spent on implementing dry-run in the API (maybe that will need to be done in a generic way so that all resources and kubectl benefit).
Batching is actually a separate issue, so it isn't solving this. I was just trying to demonstrate in my previous examples (mainly the ones without dry-run) that you'd start with 1 pod to find out if it matches and then continue with more. As it's common for a DS to target the whole cluster, and we have clusters of thousands of nodes, creating all pods at once isn't likely what we want. Doing batches, or generally having some kind of rate limit for pod creation/deletion to soften startup or other situations when extensive pod creation/deletion is necessary (including bugs), would be gentler on the cluster. |
We need dry run first, and from what I was told it won't be easy to get it in a reasonable time frame. I have created a separate issue for dry run here: #62197 |
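For illustration, here is a minimal client-go sketch of how such a server-side dry-run lookup could work once the API supports it. This is not the actual DS controller code; the helper name and wiring are made up, and error handling is reduced to the essentials.

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// effectiveNodeSelector submits a pod built from the DS template as a
// dry-run create. Admission plugins (e.g. PodNodeSelector) run and mutate
// the object, but nothing is persisted, so the returned pod carries the
// post-admission nodeSelector.
func effectiveNodeSelector(ctx context.Context, client kubernetes.Interface, namespace string, podFromTemplate *corev1.Pod) (map[string]string, error) {
	dryRunPod, err := client.CoreV1().Pods(namespace).Create(ctx, podFromTemplate, metav1.CreateOptions{
		DryRun: []string{metav1.DryRunAll},
	})
	if err != nil {
		// Webhooks with side effects can reject dry-run requests (see the
		// caveats later in this thread), so a caller needs a fallback here.
		return nil, err
	}
	return dryRunPod.Spec.NodeSelector, nil
}
```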
Side note: in @lavalamp's |
On a side note, I have used PodNodeSelector to only allow a single namespace's pods to be created on GPU nodes, but now when I create a DaemonSet its pods are not getting scheduled on the GPU nodes; they go into Pending state, terminate, and so on. Can someone please advise whether this is related, or whether there is a solution for this? |
@narender-singh Likely, although there is always the easy fix of giving your DS the same node selector the pods would have after admission. But that only fixes the hot loop of pending -> terminate -> create. The reason the pods are evicted by the kubelet is that they shouldn't run there at all. (If those pods are actually meant to run on those nodes, the root cause is elsewhere.) |
This has conformance implications - we have scheduling tests that assume they can schedule on all nodes, but OpenShift runs PodNodeSelector by default to prevent regular users from scheduling to masters (security), which causes these e2e conformance tests to fail. Since it doesn't look like we're close to a fix here, I think that rather than removing these from conformance, the two conformance tests need to be able to tolerate PodNodeSelector being set. A cluster using PodNodeSelector shouldn't fail conformance. |
@aveshagarwal if you can look at those e2e tests and determine whether they can be made to tolerate PodNodeSelector, that would be good. In the long run, a per-namespace scheduling policy object would also help resolve the specific failure that triggered this issue (limiting what nodes users can schedule to), but not the general "admission plugins and the scheduler can go rogue" problem. Has there been enough discussion around dry run to know whether that approach will solve this problem? |
Sure, looking. |
There is one more:
|
How about taints?
Can we handle the DaemonSet in an admission controller? With dry-run, the DaemonSet controller needs to check whether a pod can be scheduled, which should be done by the scheduler :)
Assigning a pod to a node is the scheduler's decision; I'd prefer not to have that kind of logic outside the scheduler. |
If you modify the DaemonSet itself in admission, you get inconsistencies - if e.g. the nodeSelector is changed, you have already modified the DaemonSet and can't fix it, because you can't know whether that part of the nodeSelector was set by the admission plugin or by a user. Also, we don't own all the admission controllers; users do.
I think it gets ambiguous with DaemonSet because it is a delegating scheduler. It doesn't need to check whether a pod can be scheduled, only what the nodeSelector will look like.
The DaemonSet controller is the one determining and "assigning" a pod to a node in the sense of choosing the node; the scheduler is the one deciding whether that choice should be applied now or has to wait. We should not create pod objects for nodes not matched by the nodeSelector - the DaemonSet controller is responsible for that part.
I am gonna revisit the design when we have dry run in master. I think dry-run will help, we just need a smart way to avoid calling it on every sync. |
Dry run is in alpha now, so you can try it and see if it helps. I personally am not convinced it's a silver bullet for this problem. Recall that admission controllers are not required to be idempotent. The pod you get back in a dry run isn't guaranteed to be exactly like one made for real: At the very least, UID, a generated name, creation timestamp, and any allocated ports (for service objects) will be different. Webhooks with side effects may also generate dummy data. Poorly written webhooks may generate different changes with each invocation. Webhooks may get turned on or off between the dry run and non dry run request. Perhaps most importantly, if a webhook with side effects is installed for pods, it will cause dry run creations to fail. So, you need to tolerate that condition. |
The admission controller only appends a nodeSelector to the DaemonSet; it does not need to care about whether it changed. The selector ends up as an ANDed result.
We can suggest that the user handle the DaemonSet themselves; it also highlights to the user what was done in admission.
The owner of the DaemonSet and the owner of the admission controller may be different, so they have different target node sets; that's similar to other kinds/types, e.g. no node can be scheduled because of an incorrect nodeSelector. |
If the admission plugin is reconfigured to no longer add the nodeSelector and it had already modified the DS previously, you are broken and can't recover. That's why it modifies only Pods. |
Thanks, that makes sense to me :) |
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle stale |
Let's take the specific example of the PodNodeSelector admission plugin, but this applies in general to admission plugins that modify pod properties influencing scheduling.
The issue is that the DaemonSet controller doesn't see those admission modifications when determining which nodes it should schedule pods to, so it schedules pods even to nodes where it shouldn't. Or, in theory, it fails to schedule pods where it should have, if the admission plugin loosens the nodeSelector instead of tightening it.
We need to account for this case, as those admission plugins are an important part of the Kubernetes security model.
If the DS specifies nodeName directly (which is still the current scheduling mechanism for DS), the kubelet fails the pod because it doesn't match the nodeSelector. So a pod with
restartPolicy: Always
gets failed, the DS removes it and creates a new one, and the result is a loop that can cripple the cluster. If we move to scheduling by affinity, the result is excessive pods being created and left in Pending state. It can be thousands of pods stuck in Pending, depending on the cluster size and how restrictive the nodeSelector is, degrading performance and possibly exhausting quota.
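For context, "scheduling by affinity" means the DS controller no longer sets spec.nodeName; instead it pins the pod to its intended node with a required node affinity term and leaves the bind to the default scheduler. A rough sketch of that pinning follows; it is illustrative only - it uses the well-known kubernetes.io/hostname label, which usually (but not always) equals the node name, and the real controller may use different selector terms.

```go
package main

import corev1 "k8s.io/api/core/v1"

// pinPodToNode adds a required node affinity term so the default scheduler
// can only place the pod on the given node. Sketch only: the real DaemonSet
// controller logic differs in details.
func pinPodToNode(pod *corev1.Pod, nodeName string) {
	pod.Spec.Affinity = &corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
				NodeSelectorTerms: []corev1.NodeSelectorTerm{{
					MatchExpressions: []corev1.NodeSelectorRequirement{{
						// Assumes the hostname label matches the node name,
						// which is typical but not guaranteed on all clusters.
						Key:      "kubernetes.io/hostname",
						Operator: corev1.NodeSelectorOpIn,
						Values:   []string{nodeName},
					}},
				}},
			},
		},
	}
}
```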
The issue in both cases is that the DaemonSet controller makes scheduling decisions before the pod is created, based purely on its template, with no idea of what the pod will look like once created - as in the case where the actual nodeSelector differs from what it accounted for.
We either need:
a) the DS controller to be able to simulate the admission chain for the pod, so it can see the modifications, or
b) to create the pods first and assign nodes only after that. This would still result in excessive pod(s) being created, but possibly only 1 (or batch_size), as you could stop at the point where no more nodes need a DS pod placed, based on the actual nodeSelector (and other properties) of the created pod (a rough sketch of this follows the list), or
c) ...
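For completeness, here is a rough sketch of what option b) could look like: create one real probe pod from the template, read the nodeSelector it ends up with after admission, and only then decide which nodes still need a DS pod. Function names and control flow are illustrative; batching, pinning the remaining pods, and cleanup of the probe are omitted.

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/kubernetes"
)

// nodesMatchingAfterAdmission creates a single pod so admission plugins
// (e.g. PodNodeSelector) mutate it, then lists the nodes whose labels match
// the persisted, post-admission nodeSelector.
func nodesMatchingAfterAdmission(ctx context.Context, client kubernetes.Interface, namespace string, podFromTemplate *corev1.Pod) ([]corev1.Node, error) {
	probe, err := client.CoreV1().Pods(namespace).Create(ctx, podFromTemplate, metav1.CreateOptions{})
	if err != nil {
		return nil, err
	}
	// An empty selector matches every node, mirroring a DS with no nodeSelector.
	selector := labels.SelectorFromSet(labels.Set(probe.Spec.NodeSelector))

	nodeList, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	var matching []corev1.Node
	for _, node := range nodeList.Items {
		if selector.Matches(labels.Set(node.Labels)) {
			matching = append(matching, node)
		}
	}
	return matching, nil
}
```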
This might also affect the current move of part of DS scheduling to the default scheduler.
Here are some of the issues showing collisions between admission plugins and DS scheduling:
/kind bug
/priority important-soon
/sig apps
/sig scheduling
@kow3ns @janetkuo @deads2k @liggitt @smarterclayton @mfojtik
@kubernetes/sig-apps-bugs @kubernetes/sig-scheduling-bugs