
DaemonSet doesn't run in all nodes #23013

Closed
Calpicow opened this issue Mar 15, 2016 · 44 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/apps Categorizes an issue or PR as relevant to SIG Apps.

Comments

@Calpicow
Contributor

Using v1.2.0-beta.1. I deployed a DaemonSet with no node selector, but it's not running on all nodes.

The only two nodes where it is running are the ones marked SchedulingDisabled.

$ kubectl get nodes
NAME            STATUS                     AGE
100.64.32.234   Ready                      8d
100.64.32.71    Ready,SchedulingDisabled   5m
100.64.33.77    Ready,SchedulingDisabled   19m
100.64.33.82    Ready                      2d
$ kubectl describe daemonset kube-proxy
Name:       kube-proxy
Image(s):   calpicow/hyperkube:v1.2.0-beta.1-custom
Selector:   name in (kube-proxy)
Node-Selector:  <none>
Labels:     name=kube-proxy
Desired Number of Nodes Scheduled: 2
Current Number of Nodes Scheduled: 2
Number of Nodes Misscheduled: 0
Pods Status:    2 Running / 0 Waiting / 0 Succeeded / 0 Failed
@mikedanese mikedanese self-assigned this Mar 15, 2016
@mikedanese
Member

Can I see:

kubectl get no,ds,po -o json

Do the nodes have enough resources? Are there host port conflicts?
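
(For anyone triaging the same symptoms, a rough way to answer both questions from the CLI; <node-name> is a placeholder:)

# what is already requested on the node vs. its capacity
kubectl describe node <node-name> | grep -A 6 'Allocated resources'
# crude scan for host ports already claimed by existing pods
kubectl get pods --all-namespaces -o yaml | grep hostPort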

@Calpicow
Contributor Author

Output at https://gist.github.com/Calpicow/c25057f849061deb5f2e. The cmd and env parameters were removed to avoid potential leaks.

Resources look fine, and there are no port conflicts. There is another DaemonSet in the dump, called aws-node-labels, that is experiencing the same issue.

@lacion

lacion commented Mar 24, 2016

I am experiencing the same issue with 10 nodes.

I'm running the latest release, Kubernetes 1.2.0.

kubectl get nodes
NAME      STATUS    AGE
node1     Ready     2d
node10    Ready     2d
node2     Ready     2d
node3     Ready     2d
node4     Ready     2d
node5     Ready     2d
node6     Ready     2d
node7     Ready     2d
node8     Ready     2d
node9     Ready     2d

Name: fluentd
Image(s): dronedeploy/fluentd:v0.1.0
Selector: app=fluentd
Node-Selector:
Labels: name=fluentd
Desired Number of Nodes Scheduled: 3
Current Number of Nodes Scheduled: 3
Number of Nodes Misscheduled: 0
Pods Status: 3 Running / 0 Waiting / 0 Succeeded / 0 Failed

This is the only DaemonSet I have running.

@mikedanese
Member

Do the nodes have capacity? Are you using host ports?

@lacion

lacion commented Mar 24, 2016

Two of the nodes DO NOT have capacity; all the others do. I am not using host ports.

Will it be possible to reserve resources for DaemonSets?
Will they get created on the nodes with no capacity once capacity frees up?

@lacion

lacion commented Mar 29, 2016

The Kubernetes version across the cluster is 1.2, as is my local kubectl version.

I started using --kube-reserved=cpu=1000m,memory=1G and launched a DS for fluentd:

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-system
spec:
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      dnsPolicy: "ClusterFirst"
      restartPolicy: "Always"
      containers:
      - name: fluentd
        image: dronedeploy/fluentd:v0.1.0
        imagePullPolicy: "Always"
        env:
        - name: KUBE_API
          value: http://kube-master
        - name: ES_HOST
          value: http://elasticsearch
        - name: ES_PORT
          value: "9200"
        - name: ES_USER
          valueFrom:
            secretKeyRef:
              name: "elasticsearch-found-logs"
              key: user
        - name: ES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: "elasticsearch-found-logs"
              key: password
        - name: AWS_KEY
          valueFrom:
            secretKeyRef:
              name: "aws-s3"
              key: key
        - name: AWS_SECRET
          valueFrom:
            secretKeyRef:
              name: "aws-s3"
              key: secret
        - name: S3_BUCKET
          value: drone-kube-logs
        - name: S3_REGION
          value: us-east-1
        - name: S3_TIMESLICE
          value: 10m
        volumeMounts:
          - mountPath: /var/log
            name: varlog
          - mountPath: /var/lib/docker/containers
            name: containers
      volumes:
        - hostPath:
            path: /var/log
          name: varlog
        - hostPath:
            path: /var/lib/docker/containers
          name: containers

I then launched 3 pods onto each node, consuming about 80% RAM and about 70% CPU; the DS pods stayed put and all the logs were in ES.

Today I added a second DS, for Sensu. Only the fluentd DS pods, Heapster, and the Dashboard pods were running when I added this second DS:

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: sensu-client
  namespace: kube-system
spec:
  template:
    metadata:
      labels:
        app: sensu-client
    spec:
      dnsPolicy: "ClusterFirst"
      restartPolicy: "Always"
      hostNetwork: true
      containers:
      - name: sensu-client
        image: dronedeploy/sensu-client:v0.1.0
        imagePullPolicy: "Always"
        securityContext:
          privileged: true
        env:
        - name: CERT_SECRET
          value: "true"
        - name: CLIENT_SUBSCRIPTIONS
          value: kubenode
        - name: RABBITMQ_PORT
          value: "5671"
        - name: RABBITMQ_HOST
          value: xx.xx.xx.xx
        - name: RABBITMQ_USER
          valueFrom:
            secretKeyRef:
              name: "sensu-credentials"
              key: user
        - name: RABBITMQ_PASSWORD
          valueFrom:
            secretKeyRef:
              name: "sensu-credentials"
              key: password
        - name: RABBITMQ_VHOST
          value: sensu
        - name: SENSU_SSL_KEY
          valueFrom:
            secretKeyRef:
              name: "sensu-ssl"
              key: key
        - name: SENSU_SSL_CERT
          valueFrom:
            secretKeyRef:
              name: "sensu-ssl"
              key: cert

The result was that all the pods were created and everything seemed to work. Then I again created 3 pods on each node, consuming about 80% RAM and about 70% CPU, and this time all the DS pods were killed, for both DaemonSets:

kubectl describe ds sensu-client --namespace=kube-system
Name:       sensu-client
Image(s):   dronedeploy/sensu-client:latest
Selector:   app=sensu-client
Node-Selector:  <none>
Labels:     app=sensu-client
Desired Number of Nodes Scheduled: 0
Current Number of Nodes Scheduled: 0
Number of Nodes Misscheduled: 0
Pods Status:    0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Events:
  FirstSeen LastSeen    Count   From        SubobjectPath   Type        Reason          Message
  --------- --------    -----   ----        -------------   --------    ------          -------
  37m       37m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: sensu-client-gk52x
  37m       37m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: sensu-client-gk52x
  37m       37m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: sensu-client-gk52x
  36m       36m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: sensu-client-hd5ms
  36m       36m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: sensu-client-hd5ms
  35m       35m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: sensu-client-hd5ms
  34m       34m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: sensu-client-n0ue6
  34m       34m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: sensu-client-n0ue6
  34m       34m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: sensu-client-n0ue6
  32m       32m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: sensu-client-8255q
  32m       32m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: sensu-client-8255q
  32m       32m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: sensu-client-8255q
  31m       31m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: sensu-client-2x4my
  31m       31m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: sensu-client-2x4my
  31m       31m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: sensu-client-2x4my
  31m       31m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: sensu-client-nienw
  31m       31m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: sensu-client-2y90n
  31m       31m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: sensu-client-2y90n
  31m       31m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: sensu-client-nienw
  31m       31m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: sensu-client-nienw
  31m       31m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: sensu-client-2y90n
  29m       29m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: sensu-client-4jfl1
  29m       29m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: sensu-client-4jfl1
  29m       29m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: sensu-client-4jfl1
  29m       29m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: sensu-client-34lvc
  29m       29m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: sensu-client-34lvc
  29m       29m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: sensu-client-34lvc
kubectl describe ds fluentd --namespace=kube-system
Name:       fluentd
Image(s):   dronedeploy/fluentd:latest
Selector:   app=fluentd
Node-Selector:  <none>
Labels:     app=fluentd
Desired Number of Nodes Scheduled: 0
Current Number of Nodes Scheduled: 0
Number of Nodes Misscheduled: 0
Pods Status:    0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Events:
  FirstSeen LastSeen    Count   From        SubobjectPath   Type        Reason          Message
  --------- --------    -----   ----        -------------   --------    ------          -------
  38m       38m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: fluentd-3qg9z
  38m       38m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: fluentd-3qg9z
  38m       38m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: fluentd-3qg9z
  36m       36m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: fluentd-08fnk
  36m       36m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: fluentd-08fnk
  36m       36m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: fluentd-08fnk
  35m       35m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: fluentd-cv096
  35m       35m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: fluentd-cv096
  35m       35m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: fluentd-cv096
  33m       33m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: fluentd-6ysry
  33m       33m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: fluentd-6ysry
  33m       33m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: fluentd-6ysry
  32m       32m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: fluentd-01gb5
  32m       32m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: fluentd-01gb5
  31m       31m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: fluentd-01gb5
  31m       31m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: fluentd-eii4s
  31m       31m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: fluentd-eii4s
  31m       31m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: fluentd-eii4s
  31m       31m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: fluentd-y27g1
  31m       31m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: fluentd-y27g1
  31m       31m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: fluentd-y27g1
  30m       30m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: fluentd-j21r9
  30m       30m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: fluentd-j21r9
  30m       30m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: fluentd-j21r9
  30m       30m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: fluentd-avpi7
  30m       30m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: fluentd-avpi7
  29m       29m     1   {daemon-set }           Normal      SuccessfulDelete    Deleted pod: fluentd-avpi7

This is the description of one of the nodes where all the DS pods were killed:

kubectl describe node kube-node2
Name:           kube-node2
Labels:         kubernetes.io/hostname=kube-node2
CreationTimestamp:  Tue, 22 Mar 2016 17:47:34 +0100
Phase:
Conditions:
  Type      Status  LastHeartbeatTime           LastTransitionTime          Reason              Message
  ----      ------  -----------------           ------------------          ------              -------
  OutOfDisk     False   Tue, 29 Mar 2016 20:19:40 +0200     Mon, 28 Mar 2016 08:55:10 +0200     KubeletHasSufficientDisk    kubelet has sufficient disk space available
  Ready     True    Tue, 29 Mar 2016 20:19:40 +0200     Mon, 28 Mar 2016 08:55:10 +0200     KubeletReady            kubelet is posting ready status
Addresses:  xx.xx.xx.xx,xx.xx.xx.xx
Capacity:
 memory:    65965120Ki
 pods:      110
 cpu:       24
System Info:
 Machine ID:            xxxxxxx
 System UUID:           xxxxxxx
 Boot ID:           xxxxxxx
 Kernel Version:        3.19.0-56-generic
 OS Image:          Ubuntu 14.04.4 LTS
 Container Runtime Version: docker://1.10.3
 Kubelet Version:       v1.2.0
 Kube-Proxy Version:        v1.2.0
ExternalID:         kube-node2
Non-terminated Pods:        (3 in total)
  Namespace         Name                                    CPU Requests    CPU Limits  Memory Requests Memory Limits
  ---------         ----                                    ------------    ----------  --------------- -------------
  default           xxxxxxx     7 (29%)     7 (29%)     20000Mi (31%)   26000Mi (40%)
  default           xxxxxxx     7 (29%)     7 (29%)     20000Mi (31%)   26000Mi (40%)
  default           xxxxxxx     7 (29%)     7 (29%)     20000Mi (31%)   26000Mi (40%)
Allocated resources:
  (Total limits may be over 100%, i.e., overcommitted. More info: http://releases.k8s.io/HEAD/docs/user-guide/compute-resources.md)
  CPU Requests  CPU Limits  Memory Requests Memory Limits
  ------------  ----------  --------------- -------------
  21 (87%)  21 (87%)    60000Mi (93%)   78000Mi (121%)
Events:
  FirstSeen LastSeen    Count   From                SubobjectPath   Type        Reason          Message
  --------- --------    -----   ----                -------------   --------    ------          -------
  1m        1m      1   {kubelet kube-node2}            Normal      Starting        Starting kubelet.
  1m        1m      1   {kubelet kube-node2}            Normal      NodeHasSufficientDisk   Node kube-node2 status is now: NodeHasSufficientDisk
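
(For anyone comparing their own nodes: the "Allocated resources" block above is roughly what the DaemonSet controller reasons about when it decides whether a daemon pod still fits on a node. A sketch to dump the same numbers for every node at once:)

for n in $(kubectl get nodes -o name); do
  echo "== $n"
  # per-node totals of CPU/memory requests and limits
  kubectl describe "$n" | grep -A 6 'Allocated resources'
done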

@lacion

lacion commented Mar 30, 2016

Some updates here: after the DS pods were killed and I removed all the other pods on the nodes, the DS pods were not recreated even once there were resources available for them.

I had to delete the DS and create it again for the pods to come back.

k8s-github-robot pushed a commit that referenced this issue Apr 3, 2016
Automatic merge from submit-queue

add an event for when a daemonset can't place a pod due to insufficient resource or port conflict

#23013 (comment)
@chris-codaio
Contributor

I'm seeing this same behavior in my 1.2 cluster. I have 4 nodes in the cluster, all of which have sufficient space available, but the DS is reporting "desired" and "current" counts of 2. What's worse is that things were working properly a few days ago when I rolled this out, but sometime in the last few days, 2 of the nodes lost their DS pods and they haven't come back.

@lacion

lacion commented Apr 6, 2016

I gave up and moved to manifests on the nodes.

@chris-codaio
Contributor

chris-codaio commented Apr 6, 2016

After playing around with label selectors, restarting the API server and controller, and deleting/adding a node, I finally got another of the 4 nodes to have a DS pod on it, after waiting ~5 minutes.

kubectl describe ds loggly-fluentd
Name:       loggly-fluentd
Image(s):   krypton/fluentd:latest
Selector:   app=loggly-fluentd
Node-Selector:  <none>
Labels:     app=loggly-fluentd
Desired Number of Nodes Scheduled: 3
Current Number of Nodes Scheduled: 3
Number of Nodes Misscheduled: 0
Pods Status:    3 Running / 0 Waiting / 0 Succeeded / 0 Failed
Events:
  FirstSeen LastSeen    Count   From        SubobjectPath   Type        Reason          Message
  --------- --------    -----   ----        -------------   --------    ------          -------
  5m        5m      1   {daemon-set }           Normal      SuccessfulCreate    Created pod: loggly-fluentd-2vtwj
  5m        5m      1   {daemon-set }           Normal      SuccessfulCreate    Created pod: loggly-fluentd-xsuk3
  1m        1m      1   {daemon-set }           Normal      SuccessfulCreate    Created pod: loggly-fluentd-nbhb4

@dchen1107 dchen1107 added kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Apr 6, 2016
@dchen1107
Member

I marked this p0 for triage for now.

@chris-codaio
Contributor

OK, more troubleshooting with Kelsey on Slack: deleting the problem nodes by hand, then restarting the kubelet on those nodes, seemed to fix the issue. The DS scheduled onto the remaining node once the kubelet had re-registered itself. I'm guessing a bad cache somewhere.

kubectl delete node <node-name>
ssh <node ip> sudo systemctl restart kube-kubelet
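
(A quick way to confirm the workaround took, as a sketch; the label comes from the DS selector shown earlier:)

kubectl get nodes                                 # the node should reappear once the kubelet re-registers
kubectl get pods -o wide -l app=loggly-fluentd    # a DS pod should land on it shortly afterwards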

@mikedanese
Member

I'd love to see kubectl get node -o json from a bad node.

@chris-codaio
Contributor

The last node in the cluster (which came back with re-registering the kubelet) had recently undergone a coreos automatic update - looks like it's running docker 1.9.1 on that node now. The third node (which came online after playing around with label selectors and re-registration) did not receive the upgrade and is still running docker 1.8.3.

Perhaps the reboot/registration that happened broke a cache somewhere?

@chris-codaio
Contributor

I have this behavior in another cluster as well - the bad node went through a reboot/upgrade cycle early this morning. Here's the node description:

$ kubectl get node -o json ip-172-16-66-89.us-west-2.compute.internal
{
    "kind": "Node",
    "apiVersion": "v1",
    "metadata": {
        "name": "ip-172-16-66-89.us-west-2.compute.internal",
        "selfLink": "/api/v1/nodes/ip-172-16-66-89.us-west-2.compute.internal",
        "uid": "38203556-f06b-11e5-b3aa-060bea9fb465",
        "resourceVersion": "9985208",
        "creationTimestamp": "2016-03-22T20:18:19Z",
        "labels": {
            "beta.kubernetes.io/instance-type": "m4.2xlarge",
            "failure-domain.beta.kubernetes.io/region": "us-west-2",
            "failure-domain.beta.kubernetes.io/zone": "us-west-2a",
            "kubernetes.io/hostname": "172.16.66.89"
        }
    },
    "spec": {
        "externalID": "i-1f056cc7",
        "providerID": "aws:///us-west-2a/i-1f056cc7"
    },
    "status": {
        "capacity": {
            "cpu": "8",
            "memory": "32951376Ki",
            "pods": "110"
        },
        "allocatable": {
            "cpu": "8",
            "memory": "32951376Ki",
            "pods": "110"
        },
        "conditions": [
            {
                "type": "OutOfDisk",
                "status": "False",
                "lastHeartbeatTime": "2016-04-06T18:26:09Z",
                "lastTransitionTime": "2016-04-06T09:32:41Z",
                "reason": "KubeletHasSufficientDisk",
                "message": "kubelet has sufficient disk space available"
            },
            {
                "type": "Ready",
                "status": "True",
                "lastHeartbeatTime": "2016-04-06T18:26:09Z",
                "lastTransitionTime": "2016-04-06T09:32:52Z",
                "reason": "KubeletReady",
                "message": "kubelet is posting ready status"
            }
        ],
        "addresses": [
            {
                "type": "InternalIP",
                "address": "172.16.66.89"
            },
            {
                "type": "LegacyHostIP",
                "address": "172.16.66.89"
            }
        ],
        "daemonEndpoints": {
            "kubeletEndpoint": {
                "Port": 10250
            }
        },
        "nodeInfo": {
            "machineID": "e2b32c710bed4c1897d0f1b2d9c6ea4e",
            "systemUUID": "EC2D7286-5FE7-A635-ED34-D6C197A487FF",
            "bootID": "51faeeef-0abe-483b-91db-33a983367ad8",
            "kernelVersion": "4.3.6-coreos",
            "osImage": "CoreOS 899.15.0",
            "containerRuntimeVersion": "docker://1.9.1",
            "kubeletVersion": "v1.2.0",
            "kubeProxyVersion": "v1.2.0"
        },
        "images": [
…
        ]
    }
}

@mikedanese
Member

Do you have any failed or succeeded pods on those nodes? Are you using the Job resource for anything? You can see them when you run kubectl get pod -o wide --show-all.

#23463 adds an event that should give us insight into where these pods are getting caught up.

I also opened #23929, which looks like a bug.
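
Once that event is in a build, placement failures should become visible directly, along the lines of (names are placeholders):

kubectl describe ds <name> --namespace=<namespace>
# or scan the namespace events for the daemonset controller
kubectl get events --namespace=<namespace> | grep -i daemon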

@chris-codaio
Contributor

Lots of happy pods on these nodes, driven from RCs and Deployments. I'm not using Jobs at all yet.

@bgrant0607
Member

cc @mml

@binarybana

I have two clusters, both with 2 nodes: one where DaemonSets are working and one (older) where they have stopped working. Both have happy (kube-system) pods on them, and both have sufficient free capacity (only around 20% of CPU and memory allocated on each). Both have used Jobs in the past, but neither has any Jobs at the moment.

I tried restarting the kube-controller-manager with no luck, and also tried a 1.2.1 KCM with no luck.
kubectl get ds --namespace=kube-system gives:

NAME         DESIRED   CURRENT   NODE-SELECTOR   AGE
docker-reg   0         0         <none>          14h
fluentd      0         0         <none>          14h

And if I enable --v=6 on the KCM (where docker-reg and fluentd are the two DS) I see things like:

controller.go:610] Waiting for pods controller to sync, requeuing ds kube-system/docker-reg                                                                                                                                                                                               
controller.go:604] Finished syncing daemon set "kube-system/docker-reg" (107.40488ms)                                                                                                                                                                                                     
controller_utils.go:160] Controller kube-system/docker-reg either never recorded expectations, or the ttl expired.                                                                                                                                                                        
controller.go:277] No daemon sets found for pod monitoring-influx-grafana-v2-ewpms, daemon set controller will avoid syncing                                                                                                                                                              
controller.go:277] No daemon sets found for pod nginx-ingress-3rdpartycfg-sq7zu, daemon set controller will avoid syncing                                                                                                                                                                 
controller.go:277] No daemon sets found for pod job-b6faa90ca91a5a60ec63ff8626df54bc-c4k22, daemon set controller will avoid syncing                                                                                                                                                      
controller.go:277] No daemon sets found for pod job-e31777d34ca12fe19f17884d3f1f0ae9-2x0oa, daemon set controller will avoid syncing                                                                                                                                                      
controller.go:277] No daemon sets found for pod job-906a1803742c2c2a94bbe4b7c05bf091-qx7tj, daemon set controller will avoid syncing                                                                                                                                                      
controller.go:277] No daemon sets found for pod job-3481b2beea547fb88dadf54db11a6011-261i7, daemon set controller will avoid syncing                                                                                                                                                      
controller.go:277] No daemon sets found for pod job-1702ab34a8480abc553033153878dac6-ouk6j, daemon set controller will avoid syncing                                                                                                                                                      
controller.go:277] No daemon sets found for pod heapster-v1.0.0-btiod, daemon set controller will avoid syncing                                                                                                                                                                           
controller.go:277] No daemon sets found for pod kibana-logging-v1-tie1v, daemon set controller will avoid syncing                                                                                                                                                                         
controller.go:277] No daemon sets found for pod kubernetes-dashboard-v0.1.0-jn24t, daemon set controller will avoid syncing                                                                                                                                                               
controller.go:277] No daemon sets found for pod job-d5ce809e3634e9149f342dc99f5effaa-u3vll, daemon set controller will avoid syncing                                                                                                                                                      
controller.go:277] No daemon sets found for pod hli-annotation-nbserver-mnsjb, daemon set controller will avoid syncing                                                                                                                                                                   
controller.go:277] No daemon sets found for pod job-6e9e7f732da3bea73100b75d11571608-z800l, daemon set controller will avoid syncing                                                                                                                                                      
controller.go:277] No daemon sets found for pod job-cae92f28bba5380e06e2033960880b97-h0i13, daemon set controller will avoid syncing                                                                                                                                                      
controller.go:277] No daemon sets found for pod job-15b749338087f526fc4e99d97354f79b-faicp, daemon set controller will avoid syncing                                                                                                                                                      
controller.go:277] No daemon sets found for pod default-http-backend-91vuy, daemon set controller will avoid syncing                                                                                                                                                                      
controller.go:277] No daemon sets found for pod hli-annotation-docs-zbihw, daemon set controller will avoid syncing                                                                                                                                                                       
controller.go:277] No daemon sets found for pod job-131b74070c08f246e5c4bbdef2e3d6f8-82ofu, daemon set controller will avoid syncing                                                                                                                                                      
controller.go:277] No daemon sets found for pod kube-dns-v9-fqovd, daemon set controller will avoid syncing                                                                                                                                                                               
controller.go:277] No daemon sets found for pod job-f50882b69ed8f5df65032991ac1d6937-90cx0, daemon set controller will avoid syncing                                                                                                                                                      
controller.go:277] No daemon sets found for pod job-58bbd8ed55ed57c4cc38abfdbb4d0f67-g0dfd, daemon set controller will avoid syncing                                                                                                                                                      
controller_utils.go:177] Setting expectations &{add:0 del:0 key:kube-system/docker-reg timestamp:{sec:63595594968 nsec:775713592 loc:0x262fd60}}                                                                                                                                          
controller.go:507] Nodes needing daemon pods for daemon set docker-reg: [], creating 0                                                                                                                                                                                                    
controller.go:522] Pods to delete for daemon set docker-reg: [], deleting 0                                                                                                                                                                                                               
controller.go:565] Updating daemon set status                                                                                                                                                                                                                                             
controller.go:277] No daemon sets found for pod kibana-logging-v1-tie1v, daemon set controller will avoid syncing                                                                                                                                                                         
controller.go:277] No daemon sets found for pod kubernetes-dashboard-v0.1.0-jn24t, daemon set controller will avoid syncing                                                                                                                                                               
controller.go:277] No daemon sets found for pod job-d5ce809e3634e9149f342dc99f5effaa-u3vll, daemon set controller will avoid syncing                                                                                                                                                      
controller.go:277] No daemon sets found for pod job-1702ab34a8480abc553033153878dac6-ouk6j, daemon set controller will avoid syncing                                                                                                                                                      
controller.go:277] No daemon sets found for pod heapster-v1.0.0-btiod, daemon set controller will avoid syncing                                                                                                                                                                           
controller.go:277] No daemon sets found for pod hli-annotation-nbserver-mnsjb, daemon set controller will avoid syncing                                                                                                                                                                   
controller.go:277] No daemon sets found for pod job-6e9e7f732da3bea73100b75d11571608-z800l, daemon set controller will avoid syncing                                                                                                                                                      
controller.go:277] No daemon sets found for pod job-cae92f28bba5380e06e2033960880b97-h0i13, daemon set controller will avoid syncing                                                                                                                                                      
controller.go:277] No daemon sets found for pod monitoring-influx-grafana-v2-ewpms, daemon set controller will avoid syncing                                                                                                                                                              
controller.go:277] No daemon sets found for pod nginx-ingress-3rdpartycfg-sq7zu, daemon set controller will avoid syncing                                                                                                                                                                 
controller.go:277] No daemon sets found for pod job-906a1803742c2c2a94bbe4b7c05bf091-qx7tj, daemon set controller will avoid syncing                                                                                                                                                      
controller.go:277] No daemon sets found for pod job-3481b2beea547fb88dadf54db11a6011-261i7, daemon set controller will avoid syncing                                                                                                                                                      
controller.go:277] No daemon sets found for pod job-b6faa90ca91a5a60ec63ff8626df54bc-c4k22, daemon set controller will avoid syncing                                                                                                                                                      
controller.go:277] No daemon sets found for pod job-e31777d34ca12fe19f17884d3f1f0ae9-2x0oa, daemon set controller will avoid syncing                                                                                                                                                      
controller.go:277] No daemon sets found for pod job-58bbd8ed55ed57c4cc38abfdbb4d0f67-g0dfd, daemon set controller will avoid syncing                                                                                                                                                      
controller.go:277] No daemon sets found for pod kube-dns-v9-fqovd, daemon set controller will avoid syncing                                                                                                                                                                               
controller.go:277] No daemon sets found for pod job-f50882b69ed8f5df65032991ac1d6937-90cx0, daemon set controller will avoid syncing                                                                                                                                                      
controller.go:277] No daemon sets found for pod job-15b749338087f526fc4e99d97354f79b-faicp, daemon set controller will avoid syncing                                                                                                                                                      
controller.go:277] No daemon sets found for pod default-http-backend-91vuy, daemon set controller will avoid syncing                                                                                                                                                                      
controller.go:277] No daemon sets found for pod hli-annotation-docs-zbihw, daemon set controller will avoid syncing                                                                                                                                                                       
controller.go:277] No daemon sets found for pod job-131b74070c08f246e5c4bbdef2e3d6f8-82ofu, daemon set controller will avoid syncing                                                                                                                                                      
controller.go:604] Finished syncing daemon set "kube-system/docker-reg" (1.325079ms)                                                                                                                                                                                                      
controller.go:610] Waiting for pods controller to sync, requeuing ds kube-system/fluentd                                                                                                                                                                                                  
controller.go:604] Finished syncing daemon set "kube-system/fluentd" (108.773848ms)                                                                                                                                                                                                       
controller_utils.go:160] Controller kube-system/fluentd either never recorded expectations, or the ttl expired.                                                                                                                                                                           
controller.go:277] No daemon sets found for pod monitoring-influx-grafana-v2-ewpms, daemon set controller will avoid syncing

with the key line, I believe, being:
controller.go:507] Nodes needing daemon pods for daemon set docker-reg: [], creating 0
This is despite the fact that these DaemonSets have no node selectors (here's one, lightly edited, as an example):

$ kg ds ks -o yaml
apiVersion: v1
items:
- apiVersion: extensions/v1beta1
  kind: DaemonSet
  metadata:
    creationTimestamp: 2016-04-07T03:22:22Z
    generation: 1
    labels:
      component: docker-reg
    name: docker-reg
    namespace: kube-system
    resourceVersion: "2466044"
    selfLink: /apis/extensions/v1beta1/namespaces/kube-system/daemonsets/docker-reg
    uid: f1262897-fc6f-11e5-92da-06965e83d425
  spec:
    selector:
      matchLabels:
        component: docker-reg
    template:
      metadata:
        creationTimestamp: null
        labels:
          component: docker-reg
      spec:
        containers:
        - env:
          - name: REGISTRY_STORAGE
            value: s3
          image: registry:2
          imagePullPolicy: IfNotPresent
          name: local-docker-reg
          resources:
            requests:
              cpu: 100m
          terminationMessagePath: /dev/termination-log
        dnsPolicy: ClusterFirst
        hostNetwork: true
        restartPolicy: Always
        securityContext: {}
        terminationGracePeriodSeconds: 30
  status:
    currentNumberScheduled: 0
    desiredNumberScheduled: 0
    numberMisscheduled: 0

I'm more than happy to debug this interactively with someone on Slack (handle jknight) if they want more info. I was looking at the code last night, and my feeling is it might be related to this call, but without more logging output inside there it's hard to tell why.

@mikedanese
Member

Can you paste the output of kubectl get pod --all-namespaces --show-all -o wide?

@binarybana

$ kubectl get pod --all-namespaces --show-all -o wide
NAMESPACE     NAME                                         READY     STATUS      RESTARTS   AGE       NODE
default       default-http-backend-91vuy                   1/1       Running     0          20d       ip-10-2-94-157.us-west-2.compute.internal
default       hli-annotation-docs-zbihw                    2/2       Running     0          15d       ip-10-2-94-157.us-west-2.compute.internal
default       hli-annotation-nbserver-mnsjb                1/1       Running     0          9d        ip-10-2-51-248.us-west-2.compute.internal
default       job-131b74070c08f246e5c4bbdef2e3d6f8-82ofu   0/1       Completed   0          20d       ip-10-2-94-157.us-west-2.compute.internal
default       job-15b749338087f526fc4e99d97354f79b-faicp   0/1       Completed   0          20d       ip-10-2-94-157.us-west-2.compute.internal
default       job-1702ab34a8480abc553033153878dac6-ouk6j   0/1       Completed   0          20d       ip-10-2-51-248.us-west-2.compute.internal
default       job-3481b2beea547fb88dadf54db11a6011-261i7   0/1       Completed   0          20d       ip-10-2-51-248.us-west-2.compute.internal
default       job-58bbd8ed55ed57c4cc38abfdbb4d0f67-g0dfd   0/1       Completed   0          20d       ip-10-2-94-157.us-west-2.compute.internal
default       job-6e9e7f732da3bea73100b75d11571608-z800l   0/1       Completed   0          20d       ip-10-2-51-248.us-west-2.compute.internal
default       job-906a1803742c2c2a94bbe4b7c05bf091-qx7tj   0/1       Completed   0          20d       ip-10-2-51-248.us-west-2.compute.internal
default       job-b6faa90ca91a5a60ec63ff8626df54bc-c4k22   0/1       Completed   0          20d       ip-10-2-51-248.us-west-2.compute.internal
default       job-cae92f28bba5380e06e2033960880b97-h0i13   0/1       Completed   0          20d       ip-10-2-51-248.us-west-2.compute.internal
default       job-d5ce809e3634e9149f342dc99f5effaa-u3vll   0/1       Completed   0          20d       ip-10-2-51-248.us-west-2.compute.internal
default       job-e31777d34ca12fe19f17884d3f1f0ae9-2x0oa   0/1       Completed   0          20d       ip-10-2-51-248.us-west-2.compute.internal
default       job-f50882b69ed8f5df65032991ac1d6937-90cx0   0/1       Completed   0          20d       ip-10-2-94-157.us-west-2.compute.internal
default       nginx-ingress-3rdpartycfg-sq7zu              1/1       Running     0          8d        ip-10-2-51-248.us-west-2.compute.internal
kube-system   heapster-v1.0.0-btiod                        2/2       Running     0          20d       ip-10-2-51-248.us-west-2.compute.internal
kube-system   kibana-logging-v1-tie1v                      1/1       Running     0          20d       ip-10-2-51-248.us-west-2.compute.internal
kube-system   kube-dns-v9-fqovd                            4/4       Running     0          20d       ip-10-2-94-157.us-west-2.compute.internal
kube-system   kubernetes-dashboard-v0.1.0-jn24t            1/1       Running     0          20d       ip-10-2-51-248.us-west-2.compute.internal
kube-system   monitoring-influx-grafana-v2-ewpms           2/2       Running     0          20d       ip-10-2-51-248.us-west-2.compute.internal

Ohhh... I think I see now. So this would be a result of #23929, with the controller thinking the nodes are full because of the completed pods? I can verify this on my other cluster by scheduling some Jobs... stay tuned.

Also, if anyone has a nightly build with the fix for #23929 in it, I can test that.

@binarybana

Yup, on my working cluster I started with the two DaemonSets successfully running on both nodes; after running some Jobs, I saw that go down to 1 and then 0. Can I request a cherry-pick for #23929?
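
(Untested sketch while waiting for the cherry-pick: if the controller really is counting the terminated Job pods, cleaning up the finished Jobs, which also removes their Completed pods, might unstick it; names are placeholders.)

kubectl get jobs --all-namespaces
kubectl delete job <job-name> --namespace=<namespace>
# or remove the leftover Completed pods directly
kubectl get pods --all-namespaces --show-all | grep Completed
kubectl delete pod <pod-name> --namespace=<namespace>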

@sheerun

sheerun commented Jan 17, 2017

@mikedanese It is still an issue for me:

I have 4 nodes, 1 master and 3 slaves.

kubectl get nodes
NAME          STATUS         AGE
api-master1   Ready,master   10d
api-node1     Ready          10d
api-node2     Ready          10d
api-node3     Ready          10d

I deploy the following DaemonSet:

apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-volume-config
data:
  nginx.conf: |-
    worker_processes 1;

    events {
      worker_connections 1024;
    }

    stream {
      error_log stderr;

      resolver 127.0.0.1 ipv6=off;

      server {
        listen 80;
        proxy_pass traefik-ingress-service.default.svc.cluster.local:80;
      }

      server {
        listen 443;
        proxy_pass traefik-ingress-service.default.svc.cluster.local:443;
      }

      server {
        listen 2222;
        proxy_pass deis-router.deis.svc.cluster.local:2222;
      }
    }
---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: nginx-ingress-proxy
spec:
  template:
    metadata:
      labels:
        name: nginx-ingress-proxy
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - name: dnsmasq
        image: "janeczku/go-dnsmasq:release-1.0.5"
        args:
          - --listen
          - "127.0.0.1:53"
          - --default-resolver
          - --nameservers
          - "10.96.0.10,8.8.8.8"
          - --hostsfile=/etc/hosts
          - --verbose
        ports:
        - name: http
          containerPort: 53
          hostPort: 53
          protocol: UDP
      - image: nginx
        name: nginx
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            cpu: 400m
            memory: 300Mi
          requests:
            cpu: 200m
            memory: 200Mi
        volumeMounts:
        - mountPath: /etc/nginx
          name: config
          readOnly: false
        ports:
        - name: http
          containerPort: 80
          hostPort: 80
          protocol: TCP
        - name: https
          containerPort: 443
          hostPort: 443
          protocol: TCP
        - name: builder
          containerPort: 2222
          hostPort: 2222
          protocol: TCP
      hostNetwork: true
      restartPolicy: Always
      securityContext: {}
      volumes:
      - name: config
        configMap:
          name: nginx-volume-config
          items:
          - key: nginx.conf
            path: nginx.conf

It deploys, but has DESIRED set to 3, and not 4:

NAME                  DESIRED   CURRENT   READY     NODE-SELECTOR   AGE
nginx-ingress-proxy   3         3         3         <none>          5m

Weirdest of all, it deploys on the master node and just two of the slaves:

NAME                                          READY     STATUS    RESTARTS   AGE       IP          NODE
nginx-ingress-proxy-0958x                     2/2       Running   0          6m        10.0.1.4    api-node3
nginx-ingress-proxy-r3dcs                     2/2       Running   0          6m        10.0.1.6    api-node2
nginx-ingress-proxy-zk50w                     2/2       Running   0          6m        10.0.1.7    api-master1

@sheerun

sheerun commented Jan 17, 2017

Forgot to add:

Client Version: version.Info{Major:"1", Minor:"5+", GitVersion:"v1.5.2-beta.0-dirty", GitCommit:"5f332aab13e58173f85fd204a2c77731f7a2573f", GitTreeState:"dirty", BuildDate:"2017-01-02T00:07:18Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.1", GitCommit:"82450d03cb057bab0950214ef122b67c83fb11df", GitTreeState:"clean", BuildDate:"2016-12-14T00:52:01Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}

@sheerun

sheerun commented Jan 17, 2017

This seems to be related to #28535, because describe on the DaemonSet reports:

Events:
  FirstSeen     LastSeen        Count   From                    SubObjectPath   Type    Reason                   Message
  ---------     --------        -----   ----                    -------------   --------------                   -------
  3m            3m              1       {daemon-set }                           Normal  SuccessfulCreate Created pod: nginx-ingress-proxy-5c886
  3m            3m              1       {daemon-set }                           Normal  SuccessfulCreate Created pod: nginx-ingress-proxy-llrqn
  3m            3m              1       {daemon-set }                           Normal  SuccessfulCreate Created pod: nginx-ingress-proxy-5pscr
  3m            3m              26      {daemonset-controller }                 Normal  FailedPlacement          failed to place pod on "api-node1": host port conflict

When there's no actual conflict.
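
(One thing worth checking here, as a sketch: with hostNetwork: true the pod binds directly on the node, so anything else already using ports 53, 80, 443 or 2222 on api-node1, including another hostNetwork pod, could be what the controller is flagging. Something like:)

# list what is running on the node named in the event
kubectl get pods --all-namespaces -o wide | grep api-node1
# then inspect the suspects for host ports / host networking
kubectl get pod <pod-name> --namespace=<namespace> -o yaml | grep -E 'hostPort|hostNetwork'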

@jaxxstorm

I'm seeing a similar issue.

I had a disk-full issue on a bunch of nodes (unrelated). Some nodes had their DaemonSet pods removed; others didn't.

The issue is that once I've fixed this, I can't get the DaemonSet pods rescheduled on those nodes, short of deleting the node and then restarting the kubelet, which isn't much fun.

@unixwitch

unixwitch commented Feb 23, 2017

I've just run into a similar-looking issue. I have a DS called trafficserver with a node selector:

      nodeSelector:
        type: head

There are three nodes with that label:

> kubectl get nodes --show-labels
NAME                             STATUS    AGE       LABELS
app-1-1.itl.rslon.torchbox.net   Ready     175d      beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=app-1-1.itl.rslon.torchbox.net,type=head
app-1-2.itl.rslon.torchbox.net   Ready     175d      beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=app-1-2.itl.rslon.torchbox.net,type=head
app-1-3.itl.rslon.torchbox.net   Ready     175d      beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=app-1-3.itl.rslon.torchbox.net,type=head

But Kubernetes thinks the DS should only run on two nodes:

> kubectl --namespace=kube-system describe ds trafficserver
Name:           trafficserver
Image(s):       docker.io/torchbox/trafficserver-ingress-controller:latest
Selector:       app=trafficserver
Node-Selector:  type=head
Labels:         app=trafficserver
Desired Number of Nodes Scheduled: 2
Current Number of Nodes Scheduled: 2
Number of Nodes Misscheduled: 0
Pods Status:    2 Running / 0 Waiting / 0 Succeeded / 0 Failed
No events.

The DS doesn't run on app-1-1.itl.rslon.torchbox.net:

> kubectl describe node app-1-1.itl.rslon.torchbox.net
Name:                   app-1-1.itl.rslon.torchbox.net
Role:
Labels:                 beta.kubernetes.io/arch=amd64
                        beta.kubernetes.io/os=linux
                        kubernetes.io/hostname=app-1-1.itl.rslon.torchbox.net
                        type=head
Taints:                 <none>
CreationTimestamp:      Wed, 31 Aug 2016 21:29:16 +0100
Phase:
Conditions:
  Type                  Status  LastHeartbeatTime                       LastTransitionTime              Reason
Message
  ----                  ------  -----------------                       ------------------              ------
-------
  OutOfDisk             False   Thu, 23 Feb 2017 19:59:29 +0000         Fri, 03 Feb 2017 01:49:08 +0000         KubeletHasSufficientDisk      kubelet has sufficient disk space available
  MemoryPressure        False   Thu, 23 Feb 2017 19:59:29 +0000         Wed, 31 Aug 2016 21:29:16 +0100         KubeletHasSufficientMemory    kubelet has sufficient memory available
  Ready                 True    Thu, 23 Feb 2017 19:59:29 +0000         Fri, 03 Feb 2017 01:49:20 +0000         KubeletReady
        kubelet is posting ready status
  DiskPressure          False   Thu, 23 Feb 2017 19:59:29 +0000         Thu, 13 Oct 2016 09:11:15 +0100         KubeletHasNoDiskPressure      kubelet has no disk pressure
Addresses:              172.31.240.27,172.31.240.27,app-1-1.itl.rslon.torchbox.net
Capacity:
 alpha.kubernetes.io/nvidia-gpu:        0
 cpu:                                   4
 memory:                                4044960Ki
 pods:                                  110
Allocatable:
 alpha.kubernetes.io/nvidia-gpu:        0
 cpu:                                   4
 memory:                                4044960Ki
 pods:                                  110
System Info:
 Machine ID:                    0db7e1fdc1604f4da7dd4f0f169fc20c
 System UUID:                   184CE027-FA1D-320B-28FC-460968D84BC4
 Boot ID:                       ddbaa84c-f42f-4763-be98-c11d09245790
 Kernel Version:                4.9.0-0.bpo.1-amd64
 OS Image:                      Debian GNU/Linux 8 (jessie)
 Operating System:              linux
 Architecture:                  amd64
 Container Runtime Version:     docker://1.12.1
 Kubelet Version:               v1.5.2
 Kube-Proxy Version:            v1.5.2
ExternalID:                     app-1-1.itl.rslon.torchbox.net
Non-terminated Pods:            (11 in total)
  Namespace                     Name                                            CPU Requests    CPU Limits      Memory Requests Memory Limits
  ---------                     ----                                            ------------    ----------      --------------- -------------
  default                       carers-688996725-6qk15                          100m (2%)       1 (25%)         768Mi (19%)     768Mi (19%)
  default                       docker-registry-339607264-927kp                 100m (2%)       1 (25%)         64Mi (1%)       64Mi (1%)
  default                       gitlab-runner-1051148307-6cw6l                  0 (0%)          0 (0%)          0 (0%)          0 (0%)

  default                       mencap-3744808569-n30d8                         100m (2%)       1 (25%)         768Mi (19%)     768Mi (19%)
  default                       sentry-cron-845455488-r4blx                     100m (2%)       1 (25%)         96Mi (2%)       96Mi (2%)
  default                       sentry-worker-417112904-1chfl                   100m (2%)       1 (25%)         512Mi (12%)     512Mi (12%)
  gitlab                        gitlab-3779041866-2v0l8                         200m (5%)       2200m (55%)     1120Mi (28%)    2244Mi (56%)
  gitlab                        gitlab-runner-1051148307-ldhtd                  0 (0%)          0 (0%)          0 (0%)          0 (0%)

  kube-system                   kube-dns-2814314096-4vr7f                       110m (2%)       0 (0%)          160Mi (4%)      250Mi (6%)
  kube-system                   kube-lego-3245961140-b9r4g                      100m (2%)       1 (25%)         64Mi (1%)       64Mi (1%)
  kube-system                   kubernetes-dashboard-3203831700-8q1b9           0 (0%)          0 (0%)          0 (0%)          0 (0%)

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.
  CPU Requests  CPU Limits      Memory Requests Memory Limits
  ------------  ----------      --------------- -------------
  910m (22%)    8200m (204%)    3552Mi (89%)    4766Mi (120%)
No events.

After running kubectl delete node app-1-1.itl.rslon.torchbox.net and rebooting the node, the DS runs as expected on all three nodes:

> kubectl --namespace=kube-system describe ds trafficserver
Name:           trafficserver
Image(s):       docker.io/torchbox/trafficserver-ingress-controller:latest
Selector:       app=trafficserver
Node-Selector:  type=head
Labels:         app=trafficserver
Desired Number of Nodes Scheduled: 3
Current Number of Nodes Scheduled: 3
Number of Nodes Misscheduled: 0
Pods Status:    3 Running / 0 Waiting / 0 Succeeded / 0 Failed

This was working fine for months; it only broke recently, after we upgraded from Kubernetes 1.3 to 1.5. I can't say for sure those things are related, though, because I don't know exactly when the DS stopped working correctly.

@grodrigues3
Contributor

I'll leave the milestone on this one since it looks like it's still getting attention. @mikedanese, clear the milestone if it is not release-blocking.

@liggitt liggitt added sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed team/control-plane (deprecated - do not use) labels Mar 13, 2017
@mikedanese
Member

I actually expect that this is very much improved in 1.6, so I'll drop the milestone and wait for feedback once people start testing 1.6.

cc @janetkuo

@mikedanese mikedanese removed this from the v1.6 milestone Mar 14, 2017
@mikedanese
Member

Let's just open new issues when people give feedback on 1.6.

@ankon
Contributor

ankon commented Apr 24, 2017

For people ending up here with a 1.5 cluster and dreading having to replace nodes: it might help to just recreate the DaemonSet itself, using something like:

kubectl get -o yaml ds NAME > ds.yml
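# --cascade=false deletes only the DaemonSet object and leaves its existing pods running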
kubectl delete --cascade=false ds NAME
kubectl apply -f ds.yml

This worked for me to bring back a missing kube2iam pod on a node. Unfortunately I don't have the logs any more to see why it got lost in the first place.

@puja108
Member

puja108 commented Jun 8, 2017

Just ran into the same issue with K8s 1.6.4. I had a node go OOD (out of disk) and repaired it manually; when it came back healthy, the DS was not scheduled there, and the DS controller did not even try.

Fixed it using @ankon's comment above (#23013 (comment)).

This issue is really bad when the DS in question is, for example, Calico, which is needed for pod networking.

@mfornasa

mfornasa commented Jul 3, 2017

I see this happening on 1.6 clusters. How should I debug?

@puja108
Member

puja108 commented Jul 3, 2017

@mfornasa, like I said, #23013 (comment) helps on 1.6.

Otherwise, now that we have DS updates in 1.7, an update might trigger the logic, too. I haven't tested that yet, though.

@amine250

amine250 commented Apr 30, 2018

Seeing this with OpenShift 3.7 / K8s 1.7.

EDIT: the root cause for me was related to a taint on some nodes.

@ahakanbaba
Contributor

ahakanbaba commented May 3, 2018

We ran into this problem with v1.6.13.

The instructions in this comment make the DaemonSet pods start on all nodes. But even after a delete and recreate, I think the DaemonSet is still left in a wrong state:

$ kubectl get ds --namespace calico-prod
NAME            DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE-SELECTOR             AGE
calico          0         0         0         0            0           box.com/calico-pod=true   4m

Even though all the pods have started, the DaemonSet thinks there are 0 desired replicas.
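
(Before blaming the controller, it may be worth ruling out a plain label mismatch; a quick check using the node selector from the output above:)

kubectl get nodes -l box.com/calico-pod=true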

@ross-w

ross-w commented May 10, 2019

Just hit this problem with v1.14.1. Deployed some identical servers (apart from hostname/IP, obviously) from the same configuration management, but one was not getting DaemonSets scheduled on it.

Comment #23013 (comment) resolved the issue for us. It's still strange that it happened, and that it only happened to one of them.

@ottoyiu

ottoyiu commented May 27, 2019

@ross-w Same problem with v1.13.5; I've experienced it over a dozen times. I had to either delete the node from the apiserver and restart the kubelet, or kill the kube-controller-manager instance that is the current leader.
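
(If it helps anyone else doing the same: with the default endpoints-based leader election, the current kube-controller-manager leader is recorded on the kube-controller-manager endpoints object in kube-system. A sketch:)

kubectl -n kube-system get endpoints kube-controller-manager -o yaml | grep holderIdentity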

openshift-publish-robot pushed a commit to openshift/kubernetes that referenced this issue Jun 4, 2019
UPSTREAM: 78595: Add dangling volumes as uncertain

Origin-commit: 4ed130610c4f2ef849b54f6c53f5689fd0175ad1
@jasonraimondi

We had the same problem on v1.15.10, with a DaemonSet only using 3 of 9 available nodes. After attempting all sorts of solutions, the fix was to roll the nodes and let the DaemonSet find the new nodes.

@rpagliuca

I had this same problem and managed to solve it.

The root cause for me was that one of my nodes had a taint, due to problematic mounting of an AWS EBS volume (related to this other issue: #36450 (comment)).

$ kubectl get nodes -o json
...
"taints": [                                                                  
    {                                                                        
        "effect": "NoSchedule",                                              
        "key": "NodeWithImpairedVolumes",                                    
        "value": "true"                                                      
    }                                                                        
] 
...

The simplest solution was to "force detach" the problematic volume on the AWS console, and afterwards remove the taint from the node:
kubectl taint nodes <my_node_name> NodeWithImpairedVolumes-
(watch out for the trailing dash symbol in the command above, which means to remove a taint).
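
(A compact way to spot taints like this across all nodes, as a sketch using jsonpath:)

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'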

Another solution if everything else fails is to drain the node and remove it from the cluster.
