
Crashlooping containers sometimes not GC'd #21085

Closed
bprashanth opened this issue Feb 11, 2016 · 30 comments
Labels: kind/bug, priority/backlog, sig/node

@bprashanth
Contributor

In the process of debugging #20096 I disabled crash loop backoff and ran 100 pods that were crashing on my node. It looks like I ran out of space. This may just be because the kubelet was too busy restarting containers to run background GC.
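For reference, a rough sketch of how to generate comparable load (the image and command here are hypothetical, not necessarily what was actually run; it assumes a 2016-era kubectl where run creates a ReplicationController and accepts --replicas):

# Create an RC of pods that exit almost immediately, so the kubelet keeps restarting them.
kubectl run pinger --image=busybox --replicas=100 -- sh -c 'sleep 1; exit 1'
# Watch the restart counts climb on the node.
kubectl get pods -o wide | grep pinger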

My kubelet logs are full of:

I0211 17:17:04.229739   29218 kubelet.go:2300] skipping pod synchronization - [Failed to start cAdvisor inotify_add_watch /sys/fs/cgroup/cpuacct/1f444e28ac3bc15e328945c409fc9c4eaab6cc3fac41c107aabd2fc0f2729a3e: no space left on device]
I0211 17:17:05.073544   29218 docker.go:353] Docker Container: /sleepy_noyce is not managed by kubelet.
I0211 17:17:05.073586   29218 docker.go:353] Docker Container: /jolly_cori1 is not managed by kubelet.
I0211 17:17:05.073596   29218 docker.go:353] Docker Container: /tender_mcnulty is not managed by kubelet.
I0211 17:17:05.073603   29218 docker.go:353] Docker Container: /grave_turing is not managed by kubelet.
I0211 17:17:05.073609   29218 docker.go:353] Docker Container: /grave_davinci is not managed by kubelet.

I'm puzzled by the pet-sounding names of the containers. I'm not running any other docker stress test on the node. docker inspect shows that the container belongs to a Kubernetes pod:

$ docker inspect grave_davinci
        "Labels": {
            "io.kubernetes.container.hash": "6059dfa2",
            "io.kubernetes.container.name": "POD",
            "io.kubernetes.container.restartCount": "0",
            "io.kubernetes.container.terminationMessagePath": "",
            "io.kubernetes.pod.name": "pinger-5nofn",
            "io.kubernetes.pod.namespace": "default",
            "io.kubernetes.pod.terminationGracePeriod": "0",
            "io.kubernetes.pod.uid": "051e4464-d05c-11e5-acac-42010af00002"
        }
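For a quicker check than reading the full inspect output, the Kubernetes labels can be pulled out directly (a sketch; --format uses docker's standard Go templating for inspect):

# Print only the container's labels, as JSON.
docker inspect --format '{{json .Config.Labels}}' grave_davinci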

The only real modification to the kubelet was https://github.com/kubernetes/kubernetes/pull/20991/files#diff-be35452540b93977b6dedb1d1d04670dR85

@bprashanth added the sig/node label on Feb 11, 2016
@bgrant0607
Member

cc @kubernetes/goog-node

@dchen1107 added the priority/critical-urgent and kind/bug labels on Feb 11, 2016
@dchen1107
Member

I believe this is a regression we introduced recently. @Random-Liu can you take a look?

@Random-Liu
Member

@dchen1107 Sure. I'll look into it.

@yujuhong
Contributor

@bprashanth, what's the version/commit of k8s and docker in your cluster?

@Random-Liu
Member

It looks like either docker or we messed up the container names, so the started containers are out of our control, because we check whether a container is managed by the kubelet by its name here.
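Assuming the convention in question is the k8s_ name prefix used by the kubelet's dockertools, a rough way to list containers the kubelet would treat as unmanaged even though they still carry Kubernetes labels (a sketch, not the kubelet's actual code path; docker ps --format needs docker >= 1.8):

# List containers whose docker names lack the k8s_ prefix but that still carry a
# kubernetes pod label, i.e. the orphans described above.
for name in $(docker ps -a --format '{{.Names}}' | grep -v '^k8s_'); do
  pod=$(docker inspect --format '{{index .Config.Labels "io.kubernetes.pod.name"}}' "$name" 2>/dev/null)
  [ -n "$pod" ] && echo "$name -> pod $pod"
done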

@bprashanth
Contributor Author

@bprashanth, what's the version/commit of k8s and docker in your cluster?

184779b is an approximation from git history. I'm guessing docker 1.9, given the timeframe.

@bprashanth
Contributor Author

@Random-Liu this might have something to do with load on the node and the fact that both docker and kubelet restarted multiple times. If you try to reproduce it, set the hairpin flag or you'll run into the kernel bug https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/config-test.sh#L144.

@Random-Liu
Member

@bprashanth Thanks a lot! I'll try to reproduce this.

@bprashanth
Contributor Author

Suggest just leaving the cluster in the background and ignoring it. If it doesn't repro, close this out as something random that requires a lot of hammering.

@Random-Liu
Member

I've started a cluster to run the test. I hope I can hit that bug. :)

@Random-Liu
Member

OK. I hit the error even with hairpin mode turned off. "unregister_netdevice: waiting for veth09d8b93 to become free. Usage count = 1"

@Random-Liu
Member

It turns out that hairpin mode was not actually turned off. I changed https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/config-test.sh, but because I'm using a regular cluster (not an e2e cluster), I should have changed https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/config-default.sh. :)

@bprashanth
Contributor Author

Let's all try to reproduce it!

Also FYI, there are environment differences between a kube-up cluster and an e2e cluster (the master kubelet runs in standalone mode, e2e clusters have a bunch of firewall rules, etc.). Starting an e2e cluster is as easy as:

go run hack/e2e.go -pushup -check_version_skew=false -v

@Random-Liu
Member

@bprashanth Thanks for your help!

@bprashanth
Contributor Author

FWIW, 2/2 nodes I've tried to repro this on have run into #20995. This makes me think I might've been on docker 1.8 the first time around.

@Random-Liu
Member

After running overnight, I didn't see the issue @bprashanth hit yesterday. However, I found another issue: when I came in and checked my machine this morning, the docker daemon on one of my nodes wasn't responding.

lantaol@e2e-test-lantaol-minion-13mj:~$ sudo docker ps
Cannot connect to the Docker daemon. Is the docker daemon running on this host?

But in fact, it was running:

lantaol@e2e-test-lantaol-minion-13mj:~$ ps -ef | grep docker
root      4060     1  6 04:07 ?        00:52:58 /usr/bin/docker daemon -p /var/run/docker.pid --bridge=cbr0
 --iptables=false --ip-masq=false --log-level=warn
root     17767  3761  0 18:41 ?        00:00:00 /bin/bash /usr/sbin/docker-checker.sh
lantaol  17811 13453  0 18:41 pts/1    00:00:00 grep docker

Kubelet is also running:

lantaol@e2e-test-lantaol-minion-13mj:~$ sudo ps aux | grep kubelet
root     13992  0.0  0.0  17668  2640 ?        S    18:24   0:00 /bin/bash /usr/sbin/kubelet-checker.sh
root     14007  0.6  0.5 197616 42560 ?        Sl   18:24   0:01 /usr/local/bin/kubelet --api-servers=https
://e2e-test-lantaol-master --enable-debugging-handlers=true --cloud-provider=gce --config=/etc/kubernetes/m
anifests --allow-privileged=True --v=4 --cluster-dns=10.0.0.10 --cluster-domain=cluster.local --configure-c
br0=true --cgroup-root=/ --system-container=/system --configure-hairpin-mode=false --max-pods=110 --seriali
ze-image-pulls=false --outofdisk-transition-frequency=0
lantaol  14943  0.0  0.0   7844  1916 pts/1    S+   18:29   0:00 grep kubelet

In /var/log/docker, I saw tons of warnings:

lantaol@lantaol0:~$ tail -15 docker.log 
/usr/bin/docker already running.
/usr/bin/docker already running.
/usr/bin/docker already running.
/usr/bin/docker already running.
/usr/bin/docker already running.
/usr/bin/docker already running.
/usr/bin/docker already running.
/usr/bin/docker already running.
/usr/bin/docker already running.
/usr/bin/docker already running.
/usr/bin/docker already running.
/usr/bin/docker already running.
/usr/bin/docker already running.
/usr/bin/docker already running.
/usr/bin/docker already running.

Supervisord keeps trying to restart docker, because the docker daemon doesn't respond:

lantaol@e2e-test-lantaol-minion-13mj:/var/log/supervisor$ tail docker-stdout.log 
Starting Docker: docker failed!
waiting a minute for startup
Docker failed!
Stopping Docker: docker.
Starting Docker: docker failed!
waiting a minute for startup
Docker failed!
Stopping Docker: docker.
Starting Docker: docker failed!
waiting a minute for startup

I tried to restart the docker daemon and got:

lantaol@e2e-test-lantaol-minion-13mj:~$ sudo service docker restart
[ ok ] Stopping Docker: docker.
[FAIL] Starting Docker: docker failed!

I force-killed it and got:

lantaol@e2e-test-lantaol-minion-13mj:~$ sudo kill -9 4060
lantaol@e2e-test-lantaol-minion-13mj:~$ sudo ps aux | grep docker
root      4060  6.0  0.0      0     0 ?        Zl   04:07  52:58 [docker] <defunct>
root     18365  0.0  0.0  17664  2816 ?        S    18:43   0:00 /bin/bash /usr/sbin/docker-checker.sh
lantaol  18443  0.0  0.0   7848  2032 pts/1    S+   18:43   0:00 grep docker

After a while, another docker daemon was started:

lantaol@e2e-test-lantaol-minion-13mj:~$ ps aux | grep docker
root      4060  5.9  0.0      0     0 ?        Zl   04:07  52:58 [docker] <defunct>
root     18636  0.1  0.3  93368 27276 ?        Dl   18:44   0:01 /usr/bin/docker daemon -p /var/run/docker.
pid --bridge=cbr0 --iptables=false --ip-masq=false --log-level=warn
root     20941  0.0  0.0  17664  2720 ?        S    18:56   0:00 /bin/bash /usr/sbin/docker-checker.sh
lantaol  21062  0.0  0.0   7848  1976 pts/1    S+   18:57   0:00 grep docker

But it doesn't even reply to anything now; it just hangs:

lantaol@e2e-test-lantaol-minion-13mj:~$ sudo docker ps

lantaol@e2e-test-lantaol-minion-13mj:~$ sudo docker version

@dchen1107 @yujuhong

@vishh
Contributor

vishh commented Feb 12, 2016

Were you restarting docker manually? I have seen this whenever docker was put into a restart loop.


@bprashanth
Contributor Author

After running overnight, I didn't see the issue @bprashanth hit yesterday. However, I found another issue: when I came in and checked my machine this morning, the docker daemon on one of my nodes was down.

That's also what I observed in #21085 (comment), and you won't hit the node out-of-disk issue if docker goes down when the disk is at 70%.

@Random-Liu
Member

@vishh No, I was just running @bprashanth's script.

@yujuhong
Contributor

That's also what I observed in #21085 (comment), and you won't hit the node out-of-disk issue if docker goes down when the disk is at 70%.

@Random-Liu's case seems different. The docker daemon process was still running but was unresponsive. There were no errors about starting the docker daemon in the log, other than "/usr/bin/docker already running", because supervisord repeatedly tried to kill and restart docker.

@Random-Liu
Member

I removed everything under /var/lib/docker/network/files/ and it still doesn't work:

lantaol@e2e-test-lantaol-minion-13mj:~$ sudo rm -rf /var/lib/docker/network/files/
lantaol@e2e-test-lantaol-minion-13mj:~$ sudo docker version
^Clantaol@e2e-test-lantaol-minion-13mj:~$ sudo service docker restart
[ ok ] Stopping Docker: docker.
[FAIL] Starting Docker: docker failed!

And I can't even kill it this time:

lantaol@e2e-test-lantaol-minion-13mj:~$ ps aux | grep docker
root      4060  5.8  0.0      0     0 ?        Zl   04:07  52:58 [docker] <defunct>
root     18636  0.1  0.3  93368 27284 ?        D    18:44   0:02 /usr/bin/docker daemon -p /var/run/docker.
pid --bridge=cbr0 --iptables=false --ip-masq=false --log-level=warn
root     23058  0.0  0.0  17664  2812 ?        S    19:07   0:00 /bin/bash /usr/sbin/docker-checker.sh
lantaol  23282  0.0  0.0   7848  1996 pts/1    S+   19:07   0:00 grep docker
lantaol@e2e-test-lantaol-minion-13mj:~$ sudo kill -9 18636
lantaol@e2e-test-lantaol-minion-13mj:~$ ps aux | grep docker
root      4060  5.8  0.0      0     0 ?        Zl   04:07  52:58 [docker] <defunct>
root     18636  0.1  0.3  93368 27284 ?        D    18:44   0:02 /usr/bin/docker daemon -p /var/run/docker.
pid --bridge=cbr0 --iptables=false --ip-masq=false --log-level=warn
root     23058  0.0  0.0  17664  2812 ?        S    19:07   0:00 /bin/bash /usr/sbin/docker-checker.sh
lantaol  23311  0.0  0.0   7848  1996 pts/1    S+   19:08   0:00 grep docker

@bprashanth
Contributor Author

Your docker daemon is in uninterruptible sleep. Do dmesg or the docker logs (from before it got this way) show anything suspicious?

@bprashanth
Contributor Author

For the record, I walked up to Lantao's computer and the docker daemon is hung. The call stack does have some suspicious calls to functions like copy_net_ns, which we saw in #20096, but:

  1. No associated kworker
  2. No unregister_netdevice log lines in dmesg
  3. Docker was hosed on various errors for a while before entering the uninterruptible state

So it doesn't look like we're hitting the kernel bug, and this appears to be reproducible only after restarting crashlooping containers under heavy stress for a while.
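For anyone hitting a similar hang, roughly the checks described above can be run like this (the PID is a placeholder for the hung daemon's PID; /proc/&lt;pid&gt;/stack needs root and a kernel built with stack tracing):

# Is the daemon in uninterruptible sleep ('D' in the STAT column)?
ps -o pid,stat,wchan:32,cmd -p <docker-pid>
# Kernel call stack of the hung task (look for copy_net_ns and friends).
sudo cat /proc/<docker-pid>/stack
# Telltale line of the veth/unregister_netdevice kernel bug.
dmesg | grep -i unregister_netdevice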

@dchen1107
Member

This is not the first time I've seen this issue, and it is a relatively rare case: moby/moby#9605

The only reason I marked this as a P0 bug for triage is @bprashanth's initial description: the containers' names don't follow our k8s naming pattern, and the kubelet also declares them unknown containers. On the other hand, docker inspect clearly shows that they are k8s containers.

@Random-Liu If we can't reproduce the initial issue reported here, let's just close this one. Thanks!

@Random-Liu
Member

@dchen1107 I've run the test for almost a whole day and still haven't reproduced the issue. I think we can close this one. :) Thanks

@Random-Liu
Member

I'll close this one. Feel free to reopen it if a similar error occurs again.

@yujuhong removed the priority/critical-urgent label on Feb 17, 2016
@yujuhong reopened this on Feb 17, 2016
@yujuhong added the priority/backlog label on Feb 17, 2016
@yujuhong
Contributor

I just encountered this issue. It might have something to do with docker being killed repeatedly during that window of time.

The error messages afterwards:

W0217 18:52:52.080230    3890 container_gc.go:111] Failed to remove dead container "/k8s_POD.cf58006d_density300-eefd6c59-d5a6-11e5-b867-c4346b658736-7p9up_e2e-tests-density-skkqq_f486626a-d5a6-11e5-8eda-42010af00002_420a4f55": API error (500): Driver aufs failed to remove root filesystem 479b1ec89f096230b9e0436d3c43300641957e3905bbb1074b0aa9fd1f0a6975: rename /var/lib/docker/aufs/mnt/479b1ec89f096230b9e0436d3c43300641957e3905bbb1074b0aa9fd1f0a6975 /var/lib/docker/aufs/mnt/479b1ec89f096230b9e0436d3c43300641957e3905bbb1074b0aa9fd1f0a6975-removing: device or resource busy

Reopening this issue so that we can keep track of the problem and post updates. I've lowered the priority until we can reliably reproduce it.
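A hedged way to check what is still holding one of these aufs mounts busy (using the layer ID from the error above; lsof availability depends on the node image):

# Which processes still have the layer mounted in their mount namespace?
grep 479b1ec89f09 /proc/*/mountinfo
# Any open file handles under the layer's mount point?
sudo lsof +D /var/lib/docker/aufs/mnt/479b1ec89f096230b9e0436d3c43300641957e3905bbb1074b0aa9fd1f0a6975 2>/dev/null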

@yujuhong
Contributor

FWIW, I hit the "no available IPv4 addresses" problem again after this.

I0217 19:35:57.712273    3890 server.go:587] Event(api.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-density-h5i54", Name:"density300-41affca7-d5ac-11e5-bd83-c4346b658736-q9p5w", UID:"49527dd1-d5ac-11e5-8eda-42010af00002", APIVersion:"v1", ResourceVersion:"25640", FieldPath:""}): type: 'Warning' reason: 'FailedSync' Error syncing pod, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: API error (500): Cannot start container ae1131b610fc7c27cfc76f9f0e07eb46530b2bdb126cd808de67fc5462b6ab6d: no available IPv4 addresses on this network's address pools: bridge (7381fd28301fbd41155b2adb26eaf54c63c919e249b42bc5506f3c60dafd9f4b)\n"
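A rough way to see how many endpoints are still holding addresses on the default bridge pool (docker network inspect needs docker >= 1.9; a sketch, not a definitive diagnosis of the leak):

# Count container endpoints still attached to the default bridge network.
docker network inspect bridge | grep -c '"IPv4Address"'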

@yujuhong
Contributor

Again, I hit the same problem after the "no available IPv4 addresses" problem (#21523).

After running into the IPv4 problem, I ran docker ps -a and saw ~10 infra containers in the "dead" status with no name.
According to http://stackoverflow.com/questions/30550472/docker-container-with-status-dead-after-consul-healthcheck-runs, a container is set to dead after a failed "stop".
Running docker rm on one of these containers results in an error:

Error response from daemon: Driver aufs failed to remove root filesystem b7b5b8f0f12e4d43c17df8142f65886bcf5f578d7f322d62abb0b07357c49224: rename /var/lib/docker/aufs/mnt/b7b5b8f0f12e4d43c17df8142f65886bcf5f578d7f322d62abb0b07357c49224 /var/lib/docker/aufs/mnt/b7b5b8f0f12e4d43c17df8142f65886bcf5f578d7f322d62abb0b07357c49224-removing: device or resource busy
Error: failed to remove containers: [b7b5b8f0f12e]

I deleted /var/lib/docker/network and restarted docker. This time, the "dead" containers were randomly renamed by docker.
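For cleanup once docker is responsive again, something like the following can sweep the dead containers (assuming this docker version accepts status=dead as a ps filter; whether rm succeeds still depends on the busy aufs mount above):

# Remove every container docker reports as dead; xargs -r skips the call if there are none.
docker ps -a --filter status=dead -q | xargs -r docker rm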

@Random-Liu
Member

We've already handled this issue in the new dockershim (#35930), and this issue is more than a year old. Closing this one.

