DaemonSet doesn't run in all nodes #23013
Can I see:
Do the nodes have enough resources? Are there host port conflicts?
Output at https://gist.github.com/Calpicow/c25057f849061deb5f2e. Resources look fine, no port conflicts. There is another DaemonSet in the dump, called …
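A sketch of the usual way to check both of those things, with the node name as a placeholder:

```sh
# Inspect capacity, allocatable resources, and what is already allocated on the node.
kubectl describe node <node-name>

# A crude check for host-port usage across all pods in the cluster.
kubectl get pods --all-namespaces -o yaml | grep -B2 hostPort
```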
I am experiencing the same issue with 10 nodes, running the latest release, Kubernetes 1.2.0.
This is the only DaemonSet I have running.
Do the nodes have capacity? Are you using host ports?
Two of the nodes DO NOT have capacity; all the others do. I am not using host ports. Will it be possible to reserve resources for DaemonSets?
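Resource reservation happens at the kubelet level rather than per DaemonSet; a minimal sketch of the flags involved, with illustrative values (the --system-reserved line is an addition for illustration, not something mentioned in this thread):

```sh
# Illustrative kubelet flags: reserve resources for Kubernetes and system
# daemons so that regular pods cannot consume the entire node.
kubelet \
  --kube-reserved=cpu=1000m,memory=1G \
  --system-reserved=cpu=500m,memory=512Mi
```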
The kube version across the cluster is 1.2, as is my local kubectl version. I started using --kube-reserved=cpu=1000m,memory=1G and launched a DS for fluentd.
I then launched 3 pods onto each node that consumed about 80% RAM and about 70% CPU; the DS pods stayed there and all the logs were in ES. Today I added a second DS for sensu; only the fluentd DS pods, heapster, and dashboard pods were running when I added this second DS.
The result was all the pods being created and everything seemed to work. Then I again created 3 pods on each node consuming about 80% RAM and about 70% CPU, and the result was that all the DS pods were killed, for both DaemonSets.
This is the description of one of the nodes where all DS pods were killed:
Some updates here: after the pods were killed and I removed all the pods on the nodes, the DS pods were not recreated once resources were available for them. I had to delete the DS and create it again for the pods to come back.
Automatic merge from submit-queue: add an event for when a daemonset can't place a pod due to insufficient resource or port conflict. #23013 (comment)
I'm seeing this same behavior in my 1.2 cluster. I have 4 nodes in the cluster, all of which have sufficient space available, but the DS is reporting "desired" and "current" counts of 2. What's worse is that things were working properly a few days ago when I rolled this out, but sometime in the last few days, 2 of the nodes lost their DS pods and they haven't come back.
I gave up and moved to manifests on the nodes.
After playing around with label selectors, restarting the API server and controller, and deleting/adding a node, I finally got another of the 4 nodes to have a DS pod on it after waiting ~5 minutes.
I marked this P0 for triage for now.
OK, more troubleshooting with Kelsey on Slack: deleting the problem nodes by hand, then restarting the kubelet on those nodes, seemed to fix the issue. The DS scheduled onto the remaining node once the kubelet had re-registered itself. Guessing a bad cache somewhere.
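A sketch of that workaround, assuming a systemd-managed kubelet (the node name is a placeholder):

```sh
# Remove the stale Node object from the API server.
kubectl delete node <node-name>

# On the affected node, restart the kubelet so it re-registers itself.
sudo systemctl restart kubelet

# The node should reappear shortly, and the DaemonSet controller should
# then place its pod there.
kubectl get nodes
kubectl get pods -o wide
```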
I'd love to kubectl get node -o json on a bad node.
The last node in the cluster (which came back after re-registering the kubelet) had recently undergone a CoreOS automatic update; it looks like it's running Docker 1.9.1 now. The third node (which came online after playing around with label selectors and re-registration) did not receive the upgrade and is still running Docker 1.8.3. Perhaps the reboot/registration that happened broke a cache somewhere?
I have this behavior in another cluster as well; the bad node went through a reboot/upgrade cycle early this morning. Here's the node description:
Do you have any failed or succeeded pods on those nodes? Are you using the Job resource for anything? You can see them when you run kubectl get pod -o wide --show-all. #23463 adds an event that should give us insight into where these pods are getting caught up. I also opened #23929, which looks like a bug.
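A sketch of that check (the node name is a placeholder; --show-all was required on kubectl of that era, while newer versions show terminated pods by default and have deprecated the flag):

```sh
# List pods across all namespaces, including Succeeded/Failed ones.
kubectl get pods --all-namespaces -o wide --show-all

# Narrow it down to a single node to see what is occupying it.
kubectl get pods --all-namespaces -o wide --show-all | grep <node-name>
```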
Lots of happy pods on these nodes, driven from RCs and Deployments. I'm …
cc @mml
I had two clusters, both with 2 nodes: one where DaemonSets are working and one (older) where they have stopped working. Both have happy (kube-system) pods on them, and both have sufficient free capacity (roughly 20 percent of CPU and memory allocated on each). Both have used Jobs in the past but don't have any Jobs at the moment. I tried restarting the kube-controller-manager with no luck, and also using the 1.2.1 kcm with no luck.
And if I enable
with the key line I believe being:
I'm more than happy to debug this interactively with someone on Slack (handle jknight) if they want more info. I was looking at the code last night, and my feeling is it might be related to this call, but without more logging output in there it's hard to tell why.
Can you paste the output of
Ohhh... I think I see now. So this would be a result of #23929 — the nodes being considered full due to the completed pods? I can verify this on my other cluster by scheduling some jobs; stay tuned. Also, if anyone has a nightly build with #23929 in there, I can test that.
Yup: on my working cluster I started with the two DaemonSets successfully running on both nodes, then after running some jobs I saw that go down to 1 and then 0. Can I request a cherry-pick for #23929?
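If completed Job pods are what is crowding out the DaemonSet, cleaning them up is a possible stopgap while waiting for the fix; a sketch (the namespace is a placeholder, and the field-selector form requires a reasonably recent kubectl):

```sh
# Inspect pods that have run to completion in a given namespace.
kubectl get pods -n <namespace> --field-selector=status.phase=Succeeded

# Delete them once you are sure they are no longer needed.
kubectl delete pods -n <namespace> --field-selector=status.phase=Succeeded
```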
@mikedanese It is still an issue for me: I have 4 nodes, 1 master and 3 slaves.
I deploy the following DaemonSet:
It deploys, but has DESIRED set to 3, not 4:
Weirdest of all, it deploys on the master node and just two of the slaves.
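The reporter's manifest wasn't captured in this thread; a minimal illustrative DaemonSet with no nodeSelector (names and image are hypothetical, and it uses the current apps/v1 API rather than the extensions/v1beta1 API of that era) would look something like this:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-agent        # hypothetical name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: example-agent
  template:
    metadata:
      labels:
        app: example-agent
    spec:
      # No nodeSelector, so the DaemonSet controller should target every
      # schedulable node; tolerations would be needed for tainted nodes.
      containers:
      - name: agent
        image: busybox:1.36   # placeholder image
        command: ["sh", "-c", "sleep infinity"]
```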
Forgot to add:
This seems to be related to #28535, because describe on the DaemonSet reports:
when there's no actual conflict.
I'm seeing a similar issue. I had a disk-full problem on a bunch of nodes (unrelated). Some nodes had their DaemonSet pods removed; others didn't. The issue is that once I'd fixed this, I couldn't get the nodes to reschedule DaemonSets, short of deleting the node and then restarting the kubelet, which isn't much fun.
I've just run into a similar-looking issue. I have a DS called
There are three nodes with that label:
But Kubernetes thinks the DS should only run on two nodes:
The DS doesn't run on
After running
This was working fine for months; it only broke recently, after we upgraded from Kubernetes 1.3 to 1.5. I can't say for sure those things are related, though, because I don't know exactly when the DS stopped working correctly.
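A sketch of how to cross-check a nodeSelector-scoped DaemonSet against the node labels, with placeholder label key/value and names since the actual ones weren't captured here:

```sh
# Show which nodes carry the label the DaemonSet selects on.
kubectl get nodes -l <label-key>=<label-value>

# Show all labels, in case the value differs slightly on one node.
kubectl get nodes --show-labels

# Compare with what the DaemonSet thinks it should cover.
kubectl describe ds <ds-name> -n <namespace>
```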
I'll leave the milestone on this one since it looks like it's still getting attention. @mikedanese, clear the milestone if it is not release blocking.
I actually expect that this is very much improved in 1.6, so I'll drop the milestone and wait for feedback once people start testing 1.6. cc @janetkuo
Let's just open new issues when people give feedback on 1.6.
For people ending up here with a 1.5 cluster and dreading having to replace nodes: it might help to just recreate the DaemonSet itself, using something like:

```sh
kubectl get -o yaml ds NAME > ds.yml
kubectl delete --cascade=false ds NAME
kubectl apply -f ds.yml
```

(--cascade=false removes the DaemonSet object itself while leaving its pods in place.) This worked for me to bring back a missing kube2iam pod on a node. Unfortunately I don't have the logs any more to see why it got lost in the first place.
Just ran into the same issue with K8s 1.6.4. I had a node go OOD (out of disk) and repaired it manually; when it came back healthy the DS was not scheduled there, and the DS controller did not even try. Fixed it using @ankon's comment above (#23013 (comment)). This issue is really bad when the DS in question is, for example, Calico, which is needed for the pod networking.
I see this happening on 1.6 clusters. How should I debug?
Seeing this with OpenShift 3.7 / K8s 1.7. EDIT: the root cause for me was related to a taint on some nodes.
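A sketch for checking whether node taints are what's keeping the DaemonSet off a node (the node name is a placeholder):

```sh
# Show a single node's taints.
kubectl describe node <node-name> | grep -A3 -i taints

# Or list taints for all nodes at once.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'
```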
We ran into this problem with v1.6.13. The instructions in this comment make the DaemonSet pods start on all nodes, but even after a delete and recreate, I think the DaemonSet is still left in a wrong state:
even though all the pods have started, the DaemonSet thinks there are 0 desired replicas.
Just hit this problem with v1.14.1. Deployed some identical servers (apart from hostname/IP, obviously) from the same configuration management, but one was not getting DaemonSets scheduled on it. Comment #23013 (comment) resolved the issue for us. It's still strange that it happened, and that it only happened to one of them.
@ross-w Same problem with v1.13.5; I've experienced it over a dozen times. I had to either delete the node from the apiserver and restart the kubelet, or kill the kube-controller-manager that is the leader.
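A sketch for finding which kube-controller-manager instance currently holds the leader lock, assuming the endpoints-based leader election that releases of that era used (newer clusters record it on a Lease object instead):

```sh
# Older clusters: the leader identity is stored in an annotation on the
# kube-controller-manager Endpoints object in kube-system.
kubectl -n kube-system get endpoints kube-controller-manager \
  -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'

# Newer clusters: the same record lives on a coordination.k8s.io Lease.
kubectl -n kube-system get lease kube-controller-manager -o yaml
```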
UPSTREAM: 78595: Add dangling volumes as uncertain. Origin-commit: 4ed130610c4f2ef849b54f6c53f5689fd0175ad1
We had the same problem on v1.15.10, with a DaemonSet only using 3 of the 9 available nodes. After attempting all sorts of solutions, the fix was to roll the nodes and let the DaemonSet find the new ones.
I had this same problem and managed to solve it. The root cause for me was that one of my nodes had a taint, due to problematic mounting of an AWS EBS volume (related to this other issue: #36450 (comment)).
The simplest solution was to "force detach" the problematic volume in the AWS console and afterwards remove the taint from the node: Another solution, if everything else fails, is to drain the node and remove it from the cluster.
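The exact commands weren't captured in this thread; a sketch, with the taint key and node name as placeholders:

```sh
# Remove a taint from the node (the trailing "-" removes it).
kubectl taint nodes <node-name> <taint-key>-

# If all else fails: drain the node, remove it from the cluster,
# and let it (or a replacement) re-register.
kubectl drain <node-name> --ignore-daemonsets
kubectl delete node <node-name>
```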
Using v1.2.0-beta.1. Deployed a DaemonSet with no node selector, but it's not running on all of the nodes.
The two that are running are the ones with SchedulingDisabled.
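A sketch for checking which nodes are cordoned and making them schedulable again if that's not intended; note that on releases of that era the DaemonSet controller placed pods itself and generally ignored the SchedulingDisabled (cordon) flag, which is why cordoned nodes can still carry DS pods:

```sh
# Nodes marked SchedulingDisabled show it in the STATUS column.
kubectl get nodes

# Make a cordoned node schedulable again, if that is actually intended.
kubectl uncordon <node-name>
```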