
Crashlooping containers sometimes not GC'd #21085

Closed
bprashanth opened this issue Feb 11, 2016 · 30 comments
Labels: kind/bug, priority/backlog, sig/node

@bprashanth
Contributor

In the process of debugging #20096 I disabled crash loop backoff and ran 100 pods that were crashing on my node. It looks like I ran out of space. This may just be because the kubelet was too busy restarting containers to run background GC.
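For reference, a rough sketch of how to generate comparable load (the image and command here are hypothetical, not necessarily what was actually run; it assumes a 2016-era kubectl where run creates a ReplicationController and accepts --replicas):

# Create an RC of pods that exit almost immediately, so the kubelet keeps restarting them.
kubectl run pinger --image=busybox --replicas=100 -- sh -c 'sleep 1; exit 1'
# Watch the restart counts climb on the node.
kubectl get pods -o wide | grep pinger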

My kubelet logs are full of:

I0211 17:17:04.229739   29218 kubelet.go:2300] skipping pod synchronization - [Failed to start cAdvisor inotify_add_watch /sys/fs/cgroup/cpuacct/1f444e28ac3bc15e328945c409fc9c4eaab6cc3fac41c107aabd2fc0f2729a3e: no space left on device]
I0211 17:17:05.073544   29218 docker.go:353] Docker Container: /sleepy_noyce is not managed by kubelet.
I0211 17:17:05.073586   29218 docker.go:353] Docker Container: /jolly_cori1 is not managed by kubelet.
I0211 17:17:05.073596   29218 docker.go:353] Docker Container: /tender_mcnulty is not managed by kubelet.
I0211 17:17:05.073603   29218 docker.go:353] Docker Container: /grave_turing is not managed by kubelet.
I0211 17:17:05.073609   29218 docker.go:353] Docker Container: /grave_davinci is not managed by kubelet.

I'm puzzled by the pet-sounding names of the containers. I'm not running any other docker stress test on the node. docker inspect shows that the container belongs to a Kubernetes pod:

$ docker inspect grave_davinci
        "Labels": {
            "io.kubernetes.container.hash": "6059dfa2",
            "io.kubernetes.container.name": "POD",
            "io.kubernetes.container.restartCount": "0",
            "io.kubernetes.container.terminationMessagePath": "",
            "io.kubernetes.pod.name": "pinger-5nofn",
            "io.kubernetes.pod.namespace": "default",
            "io.kubernetes.pod.terminationGracePeriod": "0",
            "io.kubernetes.pod.uid": "051e4464-d05c-11e5-acac-42010af00002"
        }
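For a quicker check than reading the full inspect output, the Kubernetes labels can be pulled out directly (a sketch; --format uses docker's standard Go templating for inspect):

# Print only the container's labels, as JSON.
docker inspect --format '{{json .Config.Labels}}' grave_davinci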

The only real modification to the kubelet was https://github.com/kubernetes/kubernetes/pull/20991/files#diff-be35452540b93977b6dedb1d1d04670dR85

@bprashanth added the sig/node label on Feb 11, 2016
@bgrant0607
Member

cc @kubernetes/goog-node

@dchen1107 added the priority/critical-urgent and kind/bug labels on Feb 11, 2016
@dchen1107
Member

I believe this is a regression we introduced recently. @Random-Liu can you take a look?

@Random-Liu
Member

@dchen1107 Sure. I'll look into it.

@yujuhong
Contributor

@bprashanth, what's the version/commit of k8s and docker in your cluster?

@Random-Liu
Member

It looks like either docker or we messed up the container names, so the started containers are out of our control, because we check whether a container is managed by the kubelet by its name here.
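Assuming the convention in question is the k8s_ name prefix used by the kubelet's dockertools, a rough way to list containers the kubelet would treat as unmanaged even though they still carry Kubernetes labels (a sketch, not the kubelet's actual code path; docker ps --format needs docker >= 1.8):

# List containers whose docker names lack the k8s_ prefix but that still carry a
# kubernetes pod label, i.e. the orphans described above.
for name in $(docker ps -a --format '{{.Names}}' | grep -v '^k8s_'); do
  pod=$(docker inspect --format '{{index .Config.Labels "io.kubernetes.pod.name"}}' "$name" 2>/dev/null)
  [ -n "$pod" ] && echo "$name -> pod $pod"
done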

@bprashanth
Contributor Author

@bprashanth, what's the version/commit of k8s and docker in your cluster?

184779b is an approximation from git history. I'm guessing docker 1.9, given the timeframe.

@bprashanth
Contributor Author

@Random-Liu this might have something to do with load on the node and the fact that both docker and kubelet restarted multiple times. If you try to reproduce it, set the hairpin flag or you'll run into the kernel bug https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/config-test.sh#L144.

@Random-Liu
Member

@bprashanth Thanks a lot! I'll try to reproduce this.

@bprashanth
Contributor Author

Suggest just leaving the cluster in the background and ignoring it. If it doesn't repro, close this out as something random that requires a lot of hammering.

@Random-Liu
Member

I've started a cluster to run the test. I hope I can hit that bug. :)

@Random-Liu
Member

OK. I hit the error even with hairpin mode turned off. "unregister_netdevice: waiting for veth09d8b93 to become free. Usage count = 1"

@Random-Liu
Member

It turns out that hairpin mode was not actually turned off. I changed https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/config-test.sh, but because I'm using a regular cluster (not an e2e cluster), I should have changed https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/config-default.sh. :)

@bprashanth
Contributor Author

Let's all try to reproduce it!

Also FYI, there are environment differences between a kube-up cluster and an e2e cluster (the master kubelet runs in standalone mode, e2e clusters have a bunch of firewall rules, etc.). Starting an e2e cluster is as easy as:

go run hack/e2e.go -pushup -check_version_skew=false -v

@Random-Liu
Member

@bprashanth Thanks for your help!

@bprashanth
Contributor Author

FWIW, 2/2 nodes I've tried to repro this on have run into #20995. This makes me think I might've been on docker 1.8 the first time around.

@Random-Liu
Member

After running overnight, I didn't see the issue @bprashanth hit yesterday. However, I found another issue: when I came in and checked my machine this morning, the docker daemon on one of my nodes wasn't responding.

lantaol@e2e-test-lantaol-minion-13mj:~$ sudo docker ps
Cannot connect to the Docker daemon. Is the docker daemon running on this host?

But in fact, it was running:

lantaol@e2e-test-lantaol-minion-13mj:~$ ps -ef | grep docker
root      4060     1  6 04:07 ?        00:52:58 /usr/bin/docker daemon -p /var/run/docker.pid --bridge=cbr0
 --iptables=false --ip-masq=false --log-level=warn
root     17767  3761  0 18:41 ?        00:00:00 /bin/bash /usr/sbin/docker-checker.sh
lantaol  17811 13453  0 18:41 pts/1    00:00:00 grep docker

Kubelet is also running:

lantaol@e2e-test-lantaol-minion-13mj:~$ sudo ps aux | grep kubelet
root     13992  0.0  0.0  17668  2640 ?        S    18:24   0:00 /bin/bash /usr/sbin/kubelet-checker.sh
root     14007  0.6  0.5 197616 42560 ?        Sl   18:24   0:01 /usr/local/bin/kubelet --api-servers=https
://e2e-test-lantaol-master --enable-debugging-handlers=true --cloud-provider=gce --config=/etc/kubernetes/m
anifests --allow-privileged=True --v=4 --cluster-dns=10.0.0.10 --cluster-domain=cluster.local --configure-c
br0=true --cgroup-root=/ --system-container=/system --configure-hairpin-mode=false --max-pods=110 --seriali
ze-image-pulls=false --outofdisk-transition-frequency=0
lantaol  14943  0.0  0.0   7844  1916 pts/1    S+   18:29   0:00 grep kubelet

In /var/log/docker, I saw tons of warnings:

lantaol@lantaol0:~$ tail -15 docker.log 
/usr/bin/docker already running.
/usr/bin/docker already running.
/usr/bin/docker already running.
/usr/bin/docker already running.
/usr/bin/docker already running.
/usr/bin/docker already running.
/usr/bin/docker already running.
/usr/bin/docker already running.
/usr/bin/docker already running.
/usr/bin/docker already running.
/usr/bin/docker already running.
/usr/bin/docker already running.
/usr/bin/docker already running.
/usr/bin/docker already running.
/usr/bin/docker already running.

Supervisord keeps trying to restart docker, because the docker daemon doesn't respond:

lantaol@e2e-test-lantaol-minion-13mj:/var/log/supervisor$ tail docker-stdout.log 
Starting Docker: docker failed!
waiting a minute for startup
Docker failed!
Stopping Docker: docker.
Starting Docker: docker failed!
waiting a minute for startup
Docker failed!
Stopping Docker: docker.
Starting Docker: docker failed!
waiting a minute for startup

I tried to restart the docker daemon and got:

lantaol@e2e-test-lantaol-minion-13mj:~$ sudo service docker restart
[ ok ] Stopping Docker: docker.
[FAIL] Starting Docker: docker failed!

I force-killed it and got:

lantaol@e2e-test-lantaol-minion-13mj:~$ sudo kill -9 4060
lantaol@e2e-test-lantaol-minion-13mj:~$ sudo ps aux | grep docker
root      4060  6.0  0.0      0     0 ?        Zl   04:07  52:58 [docker] <defunct>
root     18365  0.0  0.0  17664  2816 ?        S    18:43   0:00 /bin/bash /usr/sbin/docker-checker.sh
lantaol  18443  0.0  0.0   7848  2032 pts/1    S+   18:43   0:00 grep docker

After a while, another docker daemon was started:

lantaol@e2e-test-lantaol-minion-13mj:~$ ps aux | grep docker
root      4060  5.9  0.0      0     0 ?        Zl   04:07  52:58 [docker] <defunct>
root     18636  0.1  0.3  93368 27276 ?        Dl   18:44   0:01 /usr/bin/docker daemon -p /var/run/docker.
pid --bridge=cbr0 --iptables=false --ip-masq=false --log-level=warn
root     20941  0.0  0.0  17664  2720 ?        S    18:56   0:00 /bin/bash /usr/sbin/docker-checker.sh
lantaol  21062  0.0  0.0   7848  1976 pts/1    S+   18:57   0:00 grep docker

But it doesn't even reply to anything now; it just hangs:

lantaol@e2e-test-lantaol-minion-13mj:~$ sudo docker ps

lantaol@e2e-test-lantaol-minion-13mj:~$ sudo docker version

@dchen1107 @yujuhong

@vishh
Contributor

vishh commented Feb 12, 2016

Were you restarting docker manually? I have seen this whenever docker was put into a restart loop.


@bprashanth
Contributor Author

After running overnight, I didn't see the issue @bprashanth hit yesterday. However, I found another issue: when I came in and checked my machine this morning, the docker daemon on one of my nodes was down.

That's also what I observed in #21085 (comment), and you won't hit the node out-of-disk issue if docker goes down when the disk is at 70%.

@Random-Liu
Member

@vishh No, I was just running @bprashanth's script.

@yujuhong
Contributor

That's also what I observed in #21085 (comment), and you won't hit the node out-of-disk issue if docker goes down when the disk is at 70%.

@Random-Liu's case seems different. The docker daemon process was still running but was unresponsive. There were no errors about starting the docker daemon in the log, other than "/usr/bin/docker already running", because supervisord repeatedly tried to kill and restart docker.

@Random-Liu
Member

I removed everything under /var/lib/docker/network/files/ and it still doesn't work:

lantaol@e2e-test-lantaol-minion-13mj:~$ sudo rm -rf /var/lib/docker/network/files/
lantaol@e2e-test-lantaol-minion-13mj:~$ sudo docker version
^Clantaol@e2e-test-lantaol-minion-13mj:~$ sudo service docker restart
[ ok ] Stopping Docker: docker.
[FAIL] Starting Docker: docker failed!

And I can't even kill it this time:

lantaol@e2e-test-lantaol-minion-13mj:~$ ps aux | grep docker
root      4060  5.8  0.0      0     0 ?        Zl   04:07  52:58 [docker] <defunct>
root     18636  0.1  0.3  93368 27284 ?        D    18:44   0:02 /usr/bin/docker daemon -p /var/run/docker.
pid --bridge=cbr0 --iptables=false --ip-masq=false --log-level=warn
root     23058  0.0  0.0  17664  2812 ?        S    19:07   0:00 /bin/bash /usr/sbin/docker-checker.sh
lantaol  23282  0.0  0.0   7848  1996 pts/1    S+   19:07   0:00 grep docker
lantaol@e2e-test-lantaol-minion-13mj:~$ sudo kill -9 18636
lantaol@e2e-test-lantaol-minion-13mj:~$ ps aux | grep docker
root      4060  5.8  0.0      0     0 ?        Zl   04:07  52:58 [docker] <defunct>
root     18636  0.1  0.3  93368 27284 ?        D    18:44   0:02 /usr/bin/docker daemon -p /var/run/docker.
pid --bridge=cbr0 --iptables=false --ip-masq=false --log-level=warn
root     23058  0.0  0.0  17664  2812 ?        S    19:07   0:00 /bin/bash /usr/sbin/docker-checker.sh
lantaol  23311  0.0  0.0   7848  1996 pts/1    S+   19:08   0:00 grep docker

@bprashanth
Contributor Author

Your docker daemon is in uninterruptible sleep. Do dmesg or the docker logs (from before it got this way) show anything suspicious?

@bprashanth
Contributor Author

For the record, I walked up to Lantao's computer and the docker daemon is hung. The call stack does have some suspicious calls to functions like copy_net_ns, which we saw in #20096, but:

  1. No associated kworker
  2. No unregister_netdevice log lines in dmesg
  3. Docker was hosed on various errors for a while before entering the uninterruptible state

So it doesn't look like we're hitting the kernel bug, and this appears to be reproducible only after restarting crashlooping containers under heavy stress for a while.
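For anyone hitting a similar hang, roughly the checks described above can be run like this (the PID is a placeholder for the hung daemon's PID; /proc/&lt;pid&gt;/stack needs root and a kernel built with stack tracing):

# Is the daemon in uninterruptible sleep ('D' in the STAT column)?
ps -o pid,stat,wchan:32,cmd -p <docker-pid>
# Kernel call stack of the hung task (look for copy_net_ns and friends).
sudo cat /proc/<docker-pid>/stack
# Telltale line of the veth/unregister_netdevice kernel bug.
dmesg | grep -i unregister_netdevice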

@dchen1107
Member

This is not the first time I've seen this issue, and it is a relatively rare case: moby/moby#9605

The only reason I marked this as a P0 bug for triage is @bprashanth's initial description: the containers' names don't follow our k8s naming pattern, and the kubelet also declares them unknown containers. On the other hand, docker inspect clearly shows that they are k8s containers.

@Random-Liu If we can't reproduce the initial issue reported here, let's just close this one. Thanks!

@Random-Liu
Member

@dchen1107 I've run the test for almost a whole day and still haven't reproduced the issue. I think we can close this one. :) Thanks

@Random-Liu
Member

I'll close this one. Feel free to reopen it if a similar error occurs again.

@yujuhong removed the priority/critical-urgent label on Feb 17, 2016
@yujuhong reopened this on Feb 17, 2016
@yujuhong added the priority/backlog label on Feb 17, 2016
@yujuhong
Contributor

I just encountered this issue. It might have something to do with docker being killed repeatedly during that window of time.

The error messages afterwards:

W0217 18:52:52.080230    3890 container_gc.go:111] Failed to remove dead container "/k8s_POD.cf58006d_density300-eefd6c59-d5a6-11e5-b867-c4346b658736-7p9up_e2e-tests-density-skkqq_f486626a-d5a6-11e5-8eda-42010af00002_420a4f55": API error (500): Driver aufs failed to remove root filesystem 479b1ec89f096230b9e0436d3c43300641957e3905bbb1074b0aa9fd1f0a6975: rename /var/lib/docker/aufs/mnt/479b1ec89f096230b9e0436d3c43300641957e3905bbb1074b0aa9fd1f0a6975 /var/lib/docker/aufs/mnt/479b1ec89f096230b9e0436d3c43300641957e3905bbb1074b0aa9fd1f0a6975-removing: device or resource busy

Reopening this issue so that we can keep track of the problem and post updates. I've lowered the priority until we can reliably reproduce it.
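A hedged way to check what is still holding one of these aufs mounts busy (using the layer ID from the error above; lsof availability depends on the node image):

# Which processes still have the layer mounted in their mount namespace?
grep 479b1ec89f09 /proc/*/mountinfo
# Any open file handles under the layer's mount point?
sudo lsof +D /var/lib/docker/aufs/mnt/479b1ec89f096230b9e0436d3c43300641957e3905bbb1074b0aa9fd1f0a6975 2>/dev/null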

@yujuhong
Contributor

FWIW, I hit the "no available IPv4 addresses" problem again after this.

I0217 19:35:57.712273    3890 server.go:587] Event(api.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-density-h5i54", Name:"density300-41affca7-d5ac-11e5-bd83-c4346b658736-q9p5w", UID:"49527dd1-d5ac-11e5-8eda-42010af00002", APIVersion:"v1", ResourceVersion:"25640", FieldPath:""}): type: 'Warning' reason: 'FailedSync' Error syncing pod, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: API error (500): Cannot start container ae1131b610fc7c27cfc76f9f0e07eb46530b2bdb126cd808de67fc5462b6ab6d: no available IPv4 addresses on this network's address pools: bridge (7381fd28301fbd41155b2adb26eaf54c63c919e249b42bc5506f3c60dafd9f4b)\n"
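A rough way to see how many endpoints are still holding addresses on the default bridge pool (docker network inspect needs docker >= 1.9; a sketch, not a definitive diagnosis of the leak):

# Count container endpoints still attached to the default bridge network.
docker network inspect bridge | grep -c '"IPv4Address"'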

@yujuhong
Contributor

Again, I hit the same problem after the "no available IPv4 addresses" problem (#21523).

After running into the IPv4 problem, I ran docker ps -a and saw ~10 infra containers in the "dead" status with no name.
According to http://stackoverflow.com/questions/30550472/docker-container-with-status-dead-after-consul-healthcheck-runs, a container is set to dead after a failed "stop".
Running docker rm on one of these containers results in an error:

Error response from daemon: Driver aufs failed to remove root filesystem b7b5b8f0f12e4d43c17df8142f65886bcf5f578d7f322d62abb0b07357c49224: rename /var/lib/docker/aufs/mnt/b7b5b8f0f12e4d43c17df8142f65886bcf5f578d7f322d62abb0b07357c49224 /var/lib/docker/aufs/mnt/b7b5b8f0f12e4d43c17df8142f65886bcf5f578d7f322d62abb0b07357c49224-removing: device or resource busy
Error: failed to remove containers: [b7b5b8f0f12e]

I deleted /var/lib/docker/network and restarted docker. This time, the "dead" containers were randomly renamed by docker.
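For cleanup once docker is responsive again, something like the following can sweep the dead containers (assuming this docker version accepts status=dead as a ps filter; whether rm succeeds still depends on the busy aufs mount above):

# Remove every container docker reports as dead; xargs -r skips the call if there are none.
docker ps -a --filter status=dead -q | xargs -r docker rm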

@Random-Liu
Member

We've already handled this issue in the new dockershim (#35930), and this issue is more than a year old. Closing this one.

