
DaemonSet scheduling is broken in combination with admission plugins #61886

Closed
tnozicka opened this issue Mar 29, 2018 · 36 comments
Labels
area/workload-api/daemonset kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@tnozicka
Contributor

Let's take the PodNodeSelector admission plugin as a specific example, but this applies in general to any admission plugin that modifies pod properties influencing scheduling.

The issue is that the DaemonSet controller doesn't see those admission modifications when determining which nodes it should schedule pods to, so it ends up scheduling pods even to nodes where it shouldn't. Or, in theory, it fails to schedule pods where it should have, if the admission plugin loosens the nodeSelector instead of tightening it.

We need to account for this case, as those admission plugins are an important part of the Kubernetes security model.
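
For illustration, a minimal sketch of the kind of mutation a PodNodeSelector-style plugin applies at pod creation time (this is an assumption-level sketch of the behavior, not the actual plugin code; the real plugin also validates conflicts):

```go
// Rough sketch: the namespace default selector is ANDed into the Pod's own
// nodeSelector at create time, after the DaemonSet controller has already made
// its node calculation from the template alone.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func mergeNamespaceSelector(pod *corev1.Pod, nsDefault map[string]string) {
	if pod.Spec.NodeSelector == nil {
		pod.Spec.NodeSelector = map[string]string{}
	}
	for k, v := range nsDefault {
		// The real plugin also rejects conflicting keys; this sketch just merges.
		pod.Spec.NodeSelector[k] = v
	}
}

func main() {
	pod := &corev1.Pod{}
	// Selector copied from the DS template.
	pod.Spec.NodeSelector = map[string]string{"role": "logging"}
	// Namespace default selector added by admission; the DS controller never sees this.
	mergeNamespaceSelector(pod, map[string]string{"node-role.kubernetes.io/infra": "true"})
	fmt.Println(pod.Spec.NodeSelector)
}
```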

When the DS specifies nodeName directly (which is still how DS scheduling currently works), this results in the kubelet failing the pod because it doesn't match the nodeSelector. So a pod with restartPolicy: Always gets failed, the DS removes it and creates a new one, producing a loop that can cripple the cluster.

If we move to scheduling by affinity, this results in excessive pods being created and left in Pending state. Depending on the cluster size and how restrictive the nodeSelector is, that can be thousands of pods stuck in Pending, degrading performance and possibly exhausting quota.

The issue in both cases is that the DaemonSet controller makes scheduling decisions before the pod is created, based purely on its template, with no idea of what the pod will actually look like once created - such as when the actual nodeSelector differs from what it accounted for.

We either need:
a) the DS controller to be able to simulate the admission chain for the pod, so it can see the modifications
b) creating the pods first and assigning nodes only after that, which would still result in excessive pod(s) being created, but possibly only 1 (or batch_size), since you could stop at the point where no more nodes need a DS pod based on the actual nodeSelector (and other properties) of the created pod
c) ...

This might also affect the current move of part of DS scheduling to the default scheduler.

Here are some of the issues showing collisions between admission plugins and DS scheduling:

/kind bug
/priority important-soon
/sig apps
/sig scheduling
@kow3ns @janetkuo @deads2k @liggitt @smarterclayton @mfojtik
@kubernetes/sig-apps-bugs @kubernetes/sig-scheduling-bugs

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. sig/apps Categorizes an issue or PR as relevant to SIG Apps. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Mar 29, 2018
@yastij
Member

yastij commented Mar 29, 2018

cc @k82cn

@bsalamat
Member

@tnozicka As you already know, we are moving away from the model where the DS controller creates DS Pods at the time it thinks they can be scheduled. Instead, we plan to let the DS controller create DS Pods, set their node affinity, and leave it to the scheduler to schedule them. That addresses part of the problem, but as you also noted, one problem will still remain: the DS controller may create excessive Pods that remain pending forever. Or worse, if the admission controller loosens the node selection restrictions, too few Pods will be created.
The idea of letting DS controller simulate the admission chain does not seem like a viable solution to me. It is heavy weight and requires a lot of changes to the structure of the DS controller.

Generally speaking, DS controller is not designed to run Pods whose scheduling requirements are changed during admission. As the name implies, the main goal of DS controller is to run daemons that are needed on all or a group of nodes which are statically determined (at least in theory). In other words, the group of nodes are identifiable statically and as a result, the DS template can specify them. Could you please give us more information about your use-case? Basically, I would like to know why these Pods should be created as DS Pods, as opposed to ReplicaSet, StatefulSet, etc.

@k82cn
Member

k82cn commented Mar 30, 2018

It seems the DS user is trying to start more Pods than they are allowed to, which is restricted by the cluster admin :). Should k8s handle this for the user, or let the user 'correct' their configuration?

Anyway, we should avoid the ping-pong case (the DS controller creates pods, the kubelet rejects them); that is handled by scheduling DS pods with the default scheduler.

@tnozicka
Contributor Author

tnozicka commented Apr 5, 2018

As the name implies, the main goal of DS controller is to run daemons that are needed on all or a group of nodes which are statically determined (at least in theory).

Given that nodes can change or be relabeled, I don't think statically is the important word here. Although the admission change is static in a way, since admission runs before the pod is created and the pod template is immutable.

The real issue is that the DS implementation is a scheduler that schedules (assigns nodes to) imaginary pods - it does the scheduling before those Pods are created, and that's what is causing these issues. Moving it to the default scheduler doesn't really fix this, as it still determines the node, now with an affinity, for an imaginary pod.

Basically, I would like to know why these Pods should be created as DS Pods, as opposed to ReplicaSet, StatefulSet, etc.

OpenShift uses a similar kind of admission that sets a default project node selector to forbid scheduling to masters, and additionally to infra nodes.

So you may have a DS that needs to run only on infra nodes, like fluentd or an infra database, or only on masters, like a DS collecting logs from the master controllers and sending them for aggregation.

The idea of letting DS controller simulate the admission chain does not seem like a viable solution to me. It is heavy weight and requires a lot of changes to the structure of the DS controller.

I am open to other approaches on how to fix that. b) requires significant changes as well.

It seems the DS user is trying to start more Pods than they are allowed to, which is restricted by the cluster admin :).
Should k8s handle this for the user, or let the user 'correct' their configuration?

Kubernetes should handle this; the DaemonSet controller shouldn't create pods for nodes that are not matched by the nodeSelector, including admission modifications. (Yes, it is a result of misconfiguration, but with different people involved.) AFAIK even a regular user can create a DS, although I think they shouldn't be allowed to.

Say you have 5 infra nodes in a 1000-node cluster. If your namespace has a node selector restricting it to only those 5 nodes and the DS doesn't see that, it will create 5 valid pods and 995 invalid ones stuck in Pending. Even if, after the change to the scheduler, it won't fail in a loop, it is still bad and we need to fix this.

@k82cn
Member

k82cn commented Apr 5, 2018

the DS user is trying to start more Pods than they are allowed to, which is restricted by the cluster admin

If that's the case, I'd like to let the user correct their configuration instead of having k8s handle it; the pending pods provide monitoring info to correct it. If we just 'ignore' part of the target nodes (by nodeSelector), it may be hard to check the health of the system; for example, the DS's status is correct, but another component is unhealthy (as that mismatches the 'design').

The idea of letting DS controller simulate the admission chain does not seem like a viable solution to me. It is heavy weight and requires a lot of changes to the structure of the DS controller.

I am open to other approaches on how to fix that. b) requires significant changes as well.

The other option in my mind is to check the Pod's condition; if it's unschedulable, the DS controller checks the Pod's nodeSelector against the node: if it doesn't match, move the node to a blacklist for that DS (refreshed when the DS/node is updated).

@aveshagarwal
Member

aveshagarwal commented Apr 5, 2018

One solution to address this issue that I have been thinking about is as follows:

  1. The DS controller creates only one pod (call it the first pod or anchor pod). (In any case, one pod has to be created anyway.)

  2. The first pod goes through the admission plugin chain (or anything else) as normal, and gets scheduled (or may not get scheduled; that perhaps does not matter much here). This means this pod has seen all the changes that are supposed to happen to it, including node selector changes.

  3. The DS controller checks the changes in the first pod (mostly node selector changes) and, based on that, creates the other pods. For example, if, based on the (modified) node selectors in the first pod, the DS controller computes that n copies need to be created, it creates n-1 pods, since one pod (the first pod) already exists. If n=0 (meaning no node matches), it can kill the first pod too.

  4. It could be decided whether the DS controller applies the node selector changes to all of the remaining (n-1) new pods, or does nothing, as the n-1 pods will get the node selector changes anyway.

I am just offering the above solution as food for thought; if it works, good, and if it does not, maybe some modification of it could be used.

Still need to think through what (if any) needs to be done if node labels change.
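
A rough sketch of how steps 1-3 could look, under the assumption that the controller can create pods and list nodes; the helper signatures (createPod, deletePod, matchingNodes) are hypothetical, not existing DaemonSet controller code:

```go
// Sketch of the anchor-pod flow (hypothetical helpers, not real controller code).
package sketch

import corev1 "k8s.io/api/core/v1"

func syncAnchorFirst(
	template corev1.PodTemplateSpec,
	nodes []corev1.Node,
	createPod func(corev1.PodTemplateSpec) (*corev1.Pod, error),
	deletePod func(*corev1.Pod) error,
) error {
	// Step 1: create only the first (anchor) pod.
	anchor, err := createPod(template)
	if err != nil {
		return err
	}
	// Step 2: the pod returned by the API server carries the post-admission nodeSelector.
	// Step 3: compute n from that selector and create the remaining n-1 pods.
	targets := matchingNodes(nodes, anchor.Spec.NodeSelector)
	if len(targets) == 0 {
		// n == 0: nothing should run, so remove the anchor pod again.
		return deletePod(anchor)
	}
	for i := 1; i < len(targets); i++ {
		if _, err := createPod(template); err != nil {
			return err
		}
	}
	return nil
}

// matchingNodes returns the nodes whose labels satisfy a plain equality-based selector.
func matchingNodes(nodes []corev1.Node, sel map[string]string) []corev1.Node {
	var out []corev1.Node
	for _, n := range nodes {
		matches := true
		for k, v := range sel {
			if n.Labels[k] != v {
				matches = false
				break
			}
		}
		if matches {
			out = append(out, n)
		}
	}
	return out
}
```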

@tnozicka
Contributor Author

tnozicka commented Apr 5, 2018

If that's the case, I'd like to let the user correct their configuration instead of having k8s handle it; the pending pods provide monitoring info to correct it. If we just 'ignore' part of the target nodes (by nodeSelector), it may be hard to check the health of the system; for example, the DS's status is correct, but another component is unhealthy (as that mismatches the 'design').

Creating potentially hundreds or thousands of pods, or not creating half of them at all, just to "let" the user know they need to update their configuration seems like a naive approach, and not really intuitive either :/ I wonder how many of them would even figure it out, rather than just saying DSs are broken.

Creating a warning event is the preferred communication channel to a user, possibly along with reporting in the DS status how many nodes were restricted - not misbehaving by creating an excessive/insufficient number of pods.

The other option in my mind is to check the Pod's condition; if it's unschedulable, the DS controller checks the Pod's nodeSelector against the node: if it doesn't match, move the node to a blacklist for that DS (refreshed when the DS/node is updated).

If I am getting you right, this would be after the fact that it has created those 1k pods? Also, when you restart the cluster, wouldn't you get 1k newly created Pods (and writes to etcd)? (Possibly multiplied by all the DSs in your cluster at once.)

Managing this blacklist cache w.r.t. nodes, DS and possibly admission seems error prone.

@k82cn
Member

k82cn commented Apr 5, 2018

Creating potentially hundreds or thousands of pods, or not creating half of them at all, just to "let" the user know they need to update their configuration seems like a naive approach, and not really intuitive either :/ I wonder how many of them would even figure it out, rather than just saying DSs are broken.

My major concern is creating a 'special' code path for such a case rather than surfacing the configuration error.
I prefer to handle DaemonSet similarly to other Kinds; DaemonSet's characteristic is one Pod per node. For example, if an RC's nodeSelector does not match any node after the admission controller, we just let the Pod stay pending.

@enisoc
Member

enisoc commented Apr 5, 2018

Just some more food for thought:

An admission controller on Pods can control what Pods ultimately get created, but it can't control what Pods other controllers want/try to create. Maybe we should consider a pattern for letting admission controllers validate/modify the Pod controller's template, before it ever tries to stamp out Pods from that template.

As a workaround, for example, you could perhaps add an admission plugin on DaemonSet objects that applies the same rules to spec.template that you enforce directly on Pods in the PodNodeSelector.

In the long term, we might want something more general, like making Pod controllers create explicit PodTemplate objects (#170), which you could then validate/modify in a PodTemplate admission plugin.
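
A minimal sketch of that workaround, assuming the DaemonSet-level admission step reuses the same namespace default selector that is applied to Pods; the function name and wiring below are hypothetical, not an existing plugin:

```go
// Hypothetical admission step on DaemonSet objects: AND the namespace default node
// selector into spec.template so the controller's node calculation matches the pods
// that will actually be created. Not an existing plugin; conflict handling omitted.
package sketch

import appsv1 "k8s.io/api/apps/v1"

func applyNamespaceSelectorToTemplate(ds *appsv1.DaemonSet, nsDefault map[string]string) {
	if ds.Spec.Template.Spec.NodeSelector == nil {
		ds.Spec.Template.Spec.NodeSelector = map[string]string{}
	}
	for k, v := range nsDefault {
		ds.Spec.Template.Spec.NodeSelector[k] = v
	}
}
```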

@tnozicka
Contributor Author

tnozicka commented Apr 5, 2018

if an RC's nodeSelector does not match any node after the admission controller, we just let the Pod stay pending.

I would say the difference is that RCs have specifically set replicas (scale) upfront while this modifies the scale for the DS

@tnozicka
Contributor Author

tnozicka commented Apr 5, 2018

Maybe we should consider a pattern for letting admission controllers validate/modify the Pod controller's template, before it ever tries to stamp out Pods from that template.
As a workaround, for example, you could perhaps add an admission plugin on DaemonSet objects that applies the same rules to spec.template that you enforce directly on Pods in the PodNodeSelector.

@enisoc I have thought about this as well - it is problematic w.r.t. changes. If you directly modify the template and the namespace default node selector later changes, you are back where you started, now with a conflict.

@aveshagarwal
Member

Just some more food for thought:

An admission controller on Pods can control what Pods ultimately get created, but it can't control what Pods other controllers want/try to create. Maybe we should consider a pattern for letting admission controllers validate/modify the Pod controller's template, before it ever tries to stamp out Pods from that template.

As a workaround, for example, you could perhaps add an admission plugin on DaemonSet objects that applies the same rules to spec.template that you enforce directly on Pods in the PodNodeSelector.

In the long term, we might want something more general, like making Pod controllers create explicit PodTemplate objects (#170), which you could then validate/modify in a PodTemplate admission plugin.

I had discussed similar approaches here: #51788 (comment)

You can see comments on them.

@aveshagarwal
Member

That's why I proposed a different approach in my previous comment: #61886 (comment), and I would like to get feedback on what the drawbacks of this approach are and whether it makes sense at all.

@tnozicka
Contributor Author

tnozicka commented Apr 5, 2018

@aveshagarwal I think this is similar to what I propose with b), but slightly different

The DS controller creates only one pod (call it the first pod or anchor pod). (In any case, one pod has to be created anyway.)

Not necessarily; it might be required to schedule 0 pods, but 1 excessive pod isn't that bad. One of my original ideas was that we could extend the API endpoint for pod creation to support dry-run. (Actually also fixing oc create --dry-run.)

The first pod goes through the admission plugin chain (or anything else) as normal, and gets scheduled (or may not get scheduled; that perhaps does not matter much here). This means this pod has seen all the changes that are supposed to happen to it, including node selector changes.

Would it be bound to a node, or would you bind it later? It might need some kind of annotation to prevent the default scheduler from binding it and running it somewhere.

The DS controller checks the changes in the first pod (mostly node selector changes) and, based on that, creates the other pods. For example, if, based on the (modified) node selectors in the first pod, the DS controller computes that n copies need to be created, it creates n-1 pods, since one pod (the first pod) already exists. If n=0 (meaning no node matches), it can kill the first pod too.

So you would bind it to nodes before those pods are created? Likely we should do it in batches.

It could be decided whether the DS controller applies the node selector changes to all of the remaining (n-1) new pods, or does nothing, as the n-1 pods will get the node selector changes anyway.

We need to leave that to admission or we get races on changes. And admission will do it anyways, no point in duplicating that.

I am just offering the above solution as food for thought; if it works, good, and if it does not, maybe some modification of it could be used.

I think we either need dry-run, or some form of reacting to a created Pod and checking its resulting nodeSelector. I slightly prefer b), as that avoids breakage when the nodeSelector changes, reacts faster, and avoids binding to a node where it shouldn't. And we should probably do it in batches anyway.

Also, I think that by scheduling real pods (binding them after actual creation) instead of imaginary ones, the DS would be closer to reusing the default scheduler logic as a library, as it would have a similar queue of pods. Or at least I think that was one of the concerns against sharing the scheduling library implementation.

Still need to think through what (if any) needs to be done if node labels change.

What do you have in mind for "node labels"?

I am not yet sure how we would implement the sync loop without creating 1 test pod on every pass with this approach. With dry-run that is straightforward.

@aveshagarwal
Member

Not necessarily; it might be required to schedule 0 pods, but 1 excessive pod isn't that bad.

Agreed. That's why, later in my approach, I mentioned that if the number of pods (n) turns out to be 0, the DS controller can just kill the first pod, so it's not that bad.

Would it be bound to a node, or would you bind it later?

If anti-affinity is used, then we can just leave it to the default scheduler to do the scheduling.

It might need some kind of annotation to prevent the default scheduler from binding it and running it somewhere.

I don't think we need this. There are 2 cases here:

  1. If the final number of pods turns out to be zero (0), that means even the first pod won't be scheduled, and the DS controller can just kill the first pod.

  2. If the final number of pods is >= 1, the first pod would need to be scheduled anyway.

So you would bind it to nodes before those pods are created? Likely we should do it in batches.

I mean if we use anti-affinity, then we can just leave it to default scheduler. We can surely use batching for rate limiting to not overwhelm the cluster.

We need to leave that to admission or we get races on changes. And admission will do it anyways, no point in duplicating that.

I agree. Was just proposing as an option.

I think we either need dry-run, or some form of reacting to a created Pod and checking its resulting nodeSelector. I slightly prefer b), as that avoids breakage when the nodeSelector changes, reacts faster, and avoids binding to a node where it shouldn't. And we should probably do it in batches anyway.

I need to understand how your b) approach solves nodeSelector changes; I am not sure right now. But one thing that seems to be an issue with b) is creating all the pods at the beginning. I am also not sure how it helps with lots of pods being created, unless you monitor nodeSelector updates to pods. That's why my approach creates only one pod first, to learn what changes are going to happen to it, and compute the number of pods afterwards.

Also, I think that by scheduling real pods (binding them after actual creation) instead of imaginary ones, the DS would be closer to reusing the default scheduler logic as a library, as it would have a similar queue of pods. Or at least I think that was one of the concerns against sharing the scheduling library implementation.

As long as we use anti-affinity, my approach can be used with the default scheduler. And there are no imaginary pods; all are real pods that would get scheduled as normal.

What do you have in mind for "node labels"? I am not yet sure how we would implement the sync loop without creating 1 test pod on every pass with this approach. With dry-run that is straightforward.

I would be happy to discuss dry-run approach with you. If dry-run approach works better, I would be happy to go with it.

@tnozicka
Contributor Author

tnozicka commented Apr 5, 2018

@aveshagarwal thanks for discussing the options in detail.

If anti-affinity is used, then we can just leave it to the default scheduler to do the scheduling.

In practice, setting anti-affinity to a particular node equals binding it to that node, minus the extra checks and permission delegation to the default scheduler. My issue is with determining a node (binding/binding by delegation) for a pod that doesn't exist yet. That's the root cause of the issue we are seeing.

I don't think we need this. There are 2 cases here: ...

Depends on whether you set a node for that first pod using anti-affinity or not. If you set it on creation, you could have chosen one that would be restricted by admission. If not, this is close enough to b).

I need to understand how your b) approach solves nodeSelector changes; I am not sure right now. But one thing that seems to be an issue with b) is creating all the pods at the beginning. I am also not sure how it helps with lots of pods being created, unless you monitor nodeSelector updates to pods. That's why my approach creates only one pod first, to learn what changes are going to happen to it, and compute the number of pods afterwards.

Yeah, I wasn't very descriptive about that option, but I had implied the batching and starting from 1 pod there. So you would:

b)

  1. batch_size=1
  2. create pods in batch
  3. Check which nodes match the nodeSelector (and other properties) for pods in batch and bind (or bind by delegation) to nodes
  4. Estimate how many pods are needed based on properties of the last pod in batch
  5. Use an exponential function to determine next batch size, (upper bound is from 3.)
  6. goto 1.

As long as we use anti-affinity, my approach can be used with the default scheduler. And there are no imaginary pods; all are real pods that would get scheduled as normal.

What I mean here is that you are binding it by delegation to the default scheduler using affinity before the pod is created. That's why I call it scheduling imaginary pods.

What do you have in mind for "node labels"? I am not yet sure how we would implement the sync loop without creating 1 test pod on every pass with this approach. With dry-run that is straightforward.

I would be happy to discuss dry-run approach with you. If dry-run approach works better, I would be happy to go with it.

I don't see how we can make the sync loop work without dry-run. Creating and deleting 1 real pod on every sync for every DS would kill etcd, create unnecessary events, ...

a) Dry-run

  1. Assume batching :)
  2. Call Pod create API with dry-run to find out how pod would look like after admission
  3. Determine if pods need to be created and to which nodes [as is done now]
  4. Bind it or bind it by delegation (before CREATE) [as is done now]
  5. Create Pods [as is done now]

This still has a slight downside of binding pods before they are created so let's combine both approaches:

a) + b)

  1. batch_size=1
  2. Call Pod create API with dry-run to find out how pod would look like after admission
  3. Determine if such pods need to be scheduled somewhere they are not running
  4. Create batch_size of pods
  5. Determine to which nodes the pods should be bound and bind them (or bind them by delegation)
  6. Estimate how many pods are needed based on properties of the newest Pod in the batch
  7. Use an exponential function to determine next batch size, (upper bound is from 6.)
  8. goto 3.
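
To make step 2 concrete, a minimal sketch of the dry-run call, assuming a client-go version where Create accepts CreateOptions with a DryRun field (dry-run does not exist in the API yet at this point, see below):

```go
// Sketch only: ask the API server to run the full admission chain for a pod built
// from the DS template without persisting it, and return the resulting object.
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func dryRunPod(ctx context.Context, client kubernetes.Interface, namespace string, pod *corev1.Pod) (*corev1.Pod, error) {
	// The returned pod carries the post-admission nodeSelector (and other mutations)
	// even though nothing was written to etcd.
	return client.CoreV1().Pods(namespace).Create(ctx, pod, metav1.CreateOptions{
		DryRun: []string{metav1.DryRunAll},
	})
}
```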

@aveshagarwal
Member

In practice, setting anti-affinity to a particular node equals binding it to that node, minus the extra checks and permission delegation to the default scheduler. My issue is with determining a node (binding/binding by delegation) for a pod that doesn't exist yet. That's the root cause of the issue we are seeing.

Yes, I meant node affinity. In my understanding, I don't see any issue with it. Node affinity allows the default scheduler to do the actual binding/scheduling, and the DS controller only assigns node affinities.

a) + b)
batch_size=1
Call Pod create API with dry-run to find out how pod would look like after admission
Determine if such pods need to be scheduled somewhere they are not running

I am not sure I understand here: after the previous step, if you have found out how the final pod looks, why does this situation arise?

Create batch_size of pods
Determine to which nodes the pods should be bound and bind them (or bind them by delegation)
Estimate how many pods are needed based on properties of the newest Pod in the batch
Use an exponential function to determine next batch size, (upper bound is from 6.)
goto 3.

Pardon my ignorance, but why is batching so important here? The main issue we are trying to solve is knowing the set of final nodes (based on the final node label selectors; and please note this set could be all nodes too). So, in my understanding, it should be (as I put it in my approach previously):

  1. Start with one pod (or, as I called it, the first pod/anchor pod, or whatever). You could call it a dry-run or anything else.
  2. Once the first pod gets scheduled (or does not get scheduled), the DS controller obtains the first pod's final nodeSelectors.
  3. The DS controller then calculates the final nodes and creates pods with node affinity, so that the default scheduler can schedule them. Now there are 2 things we need to consider here:

3a. If the first pod gets scheduled/is pending on one of the final nodes, only n-1 pods need to be created, and node affinities should be assigned accordingly. If the first pod does not get scheduled, that means no node matches the final node selectors, and the DS controller can just kill the pod, or do whatever else it wants to convey the error to the user/admin.

3b. Batching, as you have been mentioning: I would say it's just an optimization, where the remaining pods could get created in batches, or any other algorithm/heuristic could be used for rate limiting so as not to overwhelm the cluster.

@janetkuo
Member

janetkuo commented Apr 5, 2018

@tnozicka My first thought was to use dry-run create (server side) to simulate the admission chain before creating pods, too. It should solve this issue. Would you elaborate more on what problem "batch" solves and why dry-run isn't enough?

@k82cn
Member

k82cn commented Apr 6, 2018

if an RC's nodeSelector does not match any node after the admission controller, we just let the Pod stay pending.

I would say the difference is that RCs have specifically set replicas (scale) upfront while this modifies the scale for the DS

IMO, the DS uses the nodeSelector in the YAML file (without admission) to define the replicas dynamically; for admission modifications, I prefer to handle it similarly to other Kinds, e.g. RC.

Anyway, sig-apps makes the final decision :)

@tnozicka
Contributor Author

tnozicka commented Apr 6, 2018

@tnozicka My first thought was to use dry-run create (server side) to simulate the admission chain before creating pods, too. It should solve this issue.

@janetkuo Yes, doing a dry-run (an API call returning the Pod that would have been created, except for writing it to storage) and using that to find the real nodeSelector solves 98% of the issue.

There is still a subtle race, though. I know it's not a deal breaker, although creating the pod first and then setting the desired node (using node affinity) as a consequence seems a bit cleaner to me, and more similar to a scheduler. I see the DS as a delegating scheduler now: it creates a pod, sets node affinity to a specific host based on the DS nodeSelector, and lets the default scheduler match the other predicates. And yes, if those don't match there will be pods waiting in Pending, but that's correct in this case, because those nodes are targeted by the DS and are just waiting for other (dynamic) conditions to be met.

But back to the race:

  1. Do a dry-run on a Pod, see the real Pod as it would have been created and its nodeSelector
  2. Admin changes the namespace annotation (or any other configuration in general) affecting the admission
  3. You create all the Pods with node affinity set to the nodes you think they are targeted at, but their nodeSelector will be different, and you have created Pod(s) with node affinity contradicting their nodeSelector.

Note that those would be reconciled eventually, but excessive invalid pods would have been created.

I'd prefer that we avoid creating a Pod that has node affinity set to a Node that isn't matched by its own nodeSelector, and I like the design that solving this leads to. It's even fairly easy to implement, I think. Most of the time will likely be spent on implementing dry-run in the API (maybe that will need to be done in a generic way so that all resources and kubectl benefit).

Would you elaborate more on what problem "batch" solves and why dry-run isn't enough?

Batching is actually a separate issue, so it isn't solving this. I was just trying to demonstrate in my previous examples (mainly without dry run) that you'd start with 1 pod to find out if it matches and then continue with more.

As it's common for a DS to target the whole cluster, and we have clusters of thousands of nodes, creating all the pods at once isn't likely what we want. Doing batches, or generally having some kind of rate limit on pod creation/deletion, to soften the startup or other situations when extensive pod creation/deletion is necessary (including bugs), would be gentler on the cluster.
(Like this bug: One single daemonset controller is responsible for 1/2 of all write workload on the cluster.)

@tnozicka
Contributor Author

tnozicka commented Apr 6, 2018

We need dry run first, and from what I was told it won't be easy to get it in a reasonable time frame. I have created a separate issue for dry run here: #62197

@janetkuo
Member

janetkuo commented Apr 6, 2018

Side note: in @lavalamp's kubectl apply reboot (server-side apply) design doc, supporting dry-run is one of the required changes.

@narender-singh

On a side note, I have used PodNodeSelector to only allow pods of a single namespace to be created on GPU nodes, but now when I create a DaemonSet, its pods are not getting scheduled on the GPU nodes. They go into Pending state, terminate, and so on. Can someone please advise whether this is related, and whether there is any solution for it?

@tnozicka
Contributor Author

tnozicka commented Jul 2, 2018

@narender-singh Likely, although there is always the easy workaround of fixing your DS to have the same node selector as the pods would have after admission. But that only fixes the hot loop of pending -> terminate -> create. The reason the pods are evicted by the kubelet is that they shouldn't run there at all. (If those pods are actually meant to run on those nodes, the root cause is elsewhere.)

@smarterclayton
Contributor

smarterclayton commented Sep 18, 2018

This has conformance implications - we have scheduling tests that assume they can schedule on all nodes, but openshift runs PodNodeSelector by default to prevent regular users from scheduling to masters (security), which causes these e2e conformance tests

[sig-apps] Daemon set [Serial] should run and stop simple daemon [Conformance] 5m17s
[sig-apps] Daemon set [Serial] should retry creating failed daemon pods [Conformance] 5m19s
[sig-apps] Daemon set [Serial] should update pod when spec was updated and update strategy is RollingUpdate [Conformance] 5m14s

to fail. Since it doesn't look like we're close to a fix here, I think that rather than removing these from conformance, these conformance tests need to be able to tolerate PodNodeSelector being set. A cluster using PodNodeSelector shouldn't fail conformance.

@smarterclayton
Contributor

@aveshagarwal if you can look at those e2e tests and determine whether they can be made to tolerate podnodeselector that would be good.

In the long run, a per namespace scheduling policy object would also help resolve the specific failure that triggered this issue (limiting what nodes users can schedule to) but not the general "admission plugins and scheduler can go rogue".

Has there been enough discussion around dry run to know whether the approach will work to solve this problem yet?

@aveshagarwal
Member

@aveshagarwal if you can look at those e2e tests and determine whether they can be made to tolerate podnodeselector that would be good.

Sure, looking.

@aveshagarwal
Member

This has conformance implications - we have two scheduling tests that assume they can schedule on all nodes, but openshift runs PodNodeSelector by default to prevent regular users from scheduling to masters (security), which causes two e2e conformance tests

There is one more:

[sig-apps] Daemon set [Serial] should update pod when spec was updated and update strategy is RollingUpdate [Conformance]

@k82cn
Member

k82cn commented Sep 19, 2018

but openshift runs PodNodeSelector by default to prevent regular users from scheduling to masters (security), which causes these e2e conformance tests

How about taints?

My first thought was to use dry-run create (server side) to simulate the admission chain before creating pods, too. It should solve this issue.

Can we handle the DaemonSet in the admission controller? With dry-run, the DaemonSet needs to check whether the pod can be scheduled, which should be done by the scheduler :)

I'd prefer that we avoid creating a Pod that has node affinity set to a Node that isn't matched by its own nodeSelector, and I like the design that solving this leads to.

Assigning a pod to a node is decided by the scheduler; I prefer not to have that kind of logic outside the scheduler.

@tnozicka
Contributor Author

Can we handle the DaemonSet in the admission controller?

If you modify the DaemonSet in admission you get inconsistencies - if e.g. the nodeSelector is changed, you have already modified the DaemonSet and can't fix it, as you can't know whether it was the admission plugin or a user that set that part of the nodeSelector.

Also we don't own all the admission controllers, users do.

With dry-run, the DaemonSet needs to check whether the pod can be scheduled, which should be done by the scheduler :)

I think it gets ambiguous with the DaemonSet, as it is a delegating scheduler. It doesn't need to check whether the pod can be scheduled, only what the nodeSelector would look like.

Assigning a pod to a node is decided by the scheduler; I prefer not to have that kind of logic outside the scheduler.

The DaemonSet is the one determining and "assigning" a pod to a node, in the sense of choosing the node; the scheduler is the one deciding whether that choice should be applied or has to wait. We should not create pod objects for nodes not matched by the nodeSelector - DaemonSet is responsible for that part.

Has there been enough discussion around dry run to know whether the approach will work to solve this problem yet?

I am gonna revisit the design when we have dry run in master. I think dry-run will help, we just need a smart way to avoid calling it on every sync.

@lavalamp
Member

Dry run is in alpha now, so you can try it and see if it helps.

I personally am not convinced it's a silver bullet for this problem. Recall that admission controllers are not required to be idempotent. The pod you get back in a dry run isn't guaranteed to be exactly like one made for real: At the very least, UID, a generated name, creation timestamp, and any allocated ports (for service objects) will be different. Webhooks with side effects may also generate dummy data. Poorly written webhooks may generate different changes with each invocation. Webhooks may get turned on or off between the dry run and non dry run request.

Perhaps most importantly, if a webhook with side effects is installed for pods, it will cause dry run creations to fail. So, you need to tolerate that condition.
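
For illustration, a sketch of tolerating that failure mode, assuming the controller builds the pod from the template first; the helper name below is hypothetical:

```go
// Sketch: try the dry-run create and, if the server refuses it (for example because a
// pod webhook without dry-run support is installed), fall back to the template-only
// view the controller uses today so it can still make progress.
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func podAfterAdmissionOrTemplate(ctx context.Context, client kubernetes.Interface, ns string, pod *corev1.Pod) *corev1.Pod {
	out, err := client.CoreV1().Pods(ns).Create(ctx, pod, metav1.CreateOptions{
		DryRun: []string{metav1.DryRunAll},
	})
	if err != nil {
		return pod // dry run unsupported or rejected: use the unmodified template pod
	}
	return out
}
```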

@k82cn
Member

k82cn commented Sep 20, 2018

If you modify the DaemonSet in admission you get inconsistencies - if e.g. the nodeSelector is changed, you have already modified the DaemonSet and can't fix it, as you can't know whether it was the admission plugin or a user that set that part of the nodeSelector.

The admission controller only appends the nodeSelector to the DaemonSet; it does not need to care about whether it's changed. The selector will get an ANDed result.

Also we don't own all the admission controllers, users do.

We can suggest that users handle the DaemonSet; it also highlights to the user what is done in admission.

We should not create pod objects for nodes not matched by the nodeSelector - DaemonSet is responsible for that part.

The owner of the DaemonSet and the admission controller may be different, so they have different target node sets; that's a similar case to other kinds/types, e.g. no node can be scheduled because of an incorrect nodeSelector.

@tnozicka
Contributor Author

The admission controller only appends the nodeSelector to the DaemonSet; it does not need to care about whether it's changed. The selector will get an ANDed result.

If the admission is reconfigured to no longer add the nodeSelector, and it had already modified the DS previously, you are broken and can't recover. That's why it modifies only Pods.

@k82cn
Member

k82cn commented Sep 21, 2018

If the admission is reconfigured to no longer add the nodeSelector, and it had already modified the DS previously, you are broken and can't recover. That's why it modifies only Pods.

Thanks, that makes sense to me :)

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 20, 2018
@tnozicka
Contributor Author

tnozicka commented Jan 7, 2019

/remove-lifecycle stale
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 7, 2019