Crashlooping containers sometimes not GC'd #21085
cc @kubernetes/goog-node
I believe this is a regression we introduced recently. @Random-Liu, can you take a look?
@dchen1107 Sure. I'll look into it.
@bprashanth, what's the version/commit of k8s and docker in your cluster?
It looks like either docker or we messed up the container names, so the started containers are outside our control, because we check whether a container is managed by kubelet by its name here.
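To make the name-based check concrete, here is a minimal sketch of the idea (my own illustration, not the actual kubelet code; the example container names are made up): kubelet names the docker containers it creates with a `k8s_` prefix, and anything without that prefix is treated as not managed by kubelet.

```go
// Sketch of the name-based ownership check described above (illustrative only,
// not the real kubelet implementation).
package main

import (
	"fmt"
	"strings"
)

const kubeletNamePrefix = "k8s_"

// isKubeletManaged reports whether a docker container name (without the
// leading "/") looks like one that kubelet created.
func isKubeletManaged(name string) bool {
	return strings.HasPrefix(name, kubeletNamePrefix)
}

func main() {
	// Made-up example names.
	for _, name := range []string{
		"k8s_POD.a1b2c3_mypod_default_1234_0", // kubelet-style name
		"boring_curie",                        // docker auto-generated style name
	} {
		fmt.Printf("%-40s kubelet-managed: %v\n", name, isKubeletManaged(name))
	}
}
```

If something (docker or kubelet) mangles or renames a container so that its name no longer starts with `k8s_`, a check like this classifies it as unknown and never garbage-collects it, which matches the symptom described in this issue.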
184779b is an approximation from git history. I'm guessing docker 1.9, given the timeframe.
@Random-Liu this might have something to do with load on the node and the fact that both docker and kubelet restarted multiple times. If you try to reproduce it, set the hairpin flag or you'll run into the kernel bug https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/config-test.sh#L144.
@bprashanth Thanks a lot! I'll try to reproduce this.
Suggest just leaving the cluster in the background and ignoring it. If it doesn't repro, close this out as something random that requires a lot of hammering.
I've started a cluster to run the test. Hope I can hit that bug. :)
OK. I hit the error even with hairpin mode turned off. "unregister_netdevice: waiting for veth09d8b93 to become free. Usage count = 1"
It turns out that hairpin mode was not actually turned off. I changed the file https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/config-test.sh, but because I'm using a regular cluster (not an e2e cluster), I should have changed https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/config-default.sh instead. :)
Let's all try to reproduce it! Also FYI, there are environment differences between a kube-up cluster and an e2e cluster (the master kubelet runs in standalone mode, e2e clusters have a bunch of firewall rules, etc.). Starting an e2e cluster is as easy as:
@bprashanth Thanks for your help!
FWIW, 2/2 nodes I've tried to repro this on have run into #20995. This makes me think that I might've been on docker 1.8 the first time around.
After running overnight, I didn't see the issue @bprashanth hit yesterday. However, I found another issue: when I came in and checked my machine this morning, the docker daemon on one of my nodes wasn't responding.
But in fact, it was running:
Kubelet was also running:
In
Supervisord kept trying to restart docker, because the docker daemon wasn't responding:
I tried to restart the docker daemon and got:
I killed it forcefully, and got:
After a while, another docker daemon was started:
But now it doesn't reply to anything at all; it just hangs there:
Were you restarting docker manually? I have seen this whenever docker was …
That's also what I observed in #21085 (comment), and you won't hit the node out-of-disk issue if docker goes down when the disk is at 70%.
@vishh No, I was just running @bprashanth's script.
@Random-Liu's case seems different. The docker daemon process was still running, but was unresponsive. There were no errors regarding starting the docker daemon in the log, other than …
I removed everything under …
And I can't even kill it this time:
Your docker daemon is in uninterruptible sleep. Do dmesg or the docker logs (from before it got this way) have anything suspicious?
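As background on what "uninterruptible sleep" means here, this is a small standalone sketch (mine, not from the thread) that reads a process's state out of /proc/<pid>/stat; a state of "D" means the process is blocked inside the kernel and won't react even to SIGKILL until the blocking call returns, which is consistent with a daemon that cannot be killed.

```go
// Sketch: print a process's scheduler state from /proc/<pid>/stat.
// "D" = uninterruptible sleep (blocked in the kernel).
package main

import (
	"fmt"
	"os"
	"strings"
)

func procState(pid string) (string, error) {
	data, err := os.ReadFile("/proc/" + pid + "/stat")
	if err != nil {
		return "", err
	}
	// The state is the field right after the comm field; comm is wrapped in
	// parentheses and may contain spaces, so split after the last ')'.
	rest := string(data)
	rest = rest[strings.LastIndex(rest, ")")+1:]
	fields := strings.Fields(rest)
	if len(fields) == 0 {
		return "", fmt.Errorf("unexpected format in /proc/%s/stat", pid)
	}
	return fields[0], nil // e.g. "R", "S", "D", "Z"
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: procstate <pid>")
		os.Exit(1)
	}
	state, err := procState(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("state:", state)
	if state == "D" {
		fmt.Println("uninterruptible sleep: check dmesg for what the kernel is stuck on")
	}
}
```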
For the record, I walked up to Lantao's computer and the docker daemon is hung. The call stack does have some suspicious calls to functions like copy_net_ns, which we saw in #20096, but:
So it doesn't look like we're hitting the kernel bug, and this appears to be reproducible only under heavy stress from restarting crashlooping containers for a while.
This is not the first time I've seen this issue, and it is a relatively rare case: moby/moby#9605. The only reason I marked this bug P0 for triage is @bprashanth's initial description: the containers' names don't follow our k8s name pattern, and kubelet also declares them unknown containers. But on the other hand, docker inspect clearly shows they are k8s containers. @Random-Liu, if we can't reproduce the initial issue reported here, let's just close this one. Thanks!
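To illustrate the mismatch @dchen1107 describes, here is a hedged sketch (the `io.kubernetes.*` label keys are the ones kubelet attaches to the containers it creates, but treat the exact key set as an assumption of this example): a container that docker has renamed fails the `k8s_` name check, yet its labels, as shown by docker inspect, still mark it as a kubernetes container.

```go
// Sketch of the mismatch described above: the name check says "not ours",
// while the labels kubelet attached at creation time (key names assumed for
// this illustration) still identify the container as a kubernetes container.
package main

import (
	"fmt"
	"strings"
)

func managedByName(name string) bool {
	return strings.HasPrefix(name, "k8s_")
}

func managedByLabels(labels map[string]string) bool {
	_, ok := labels["io.kubernetes.pod.name"]
	return ok
}

func main() {
	// A made-up docker-generated style name, standing in for the
	// "pet sounding" names seen in this issue.
	name := "boring_curie"
	labels := map[string]string{
		"io.kubernetes.pod.name":       "mypod",
		"io.kubernetes.pod.namespace":  "default",
		"io.kubernetes.container.name": "app",
	}
	fmt.Println("name check:", managedByName(name))      // false -> kubelet GC skips it
	fmt.Println("label check:", managedByLabels(labels)) // true  -> docker inspect says it's a k8s container
}
```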
@dchen1107 I've run the test for almost a whole day and still haven't reproduced the issue. I think we can close this one. :) Thanks
I'll close this one. Feel free to reopen it if a similar error occurs again.
I just encountered this issue. It might have something to do with docker being killed repeatedly during that window of time. The error messages afterwards:
Reopening this issue so that we can keep track of the problem and post updates. Lowered the priority until we can reliably reproduce it.
FWIW, I hit the "no available IPv4 addresses" problem again after this.
Again, I hit the same problem after the "no available IPv4 addresses" problem (#21523). After running into the IPv4 problem, I ran …
I deleted /var/lib/docker/network and restarted docker. This time, the "dead" containers were randomly renamed by docker.
We've already handled this issue in the new dockershim (#35930), and this issue is more than a year old. Closing this one.
In the process of debugging #20096, I disabled crash-loop backoff and ran 100 pods that were crashing on my node. It looks like I ran out of space. This may just be because the kubelet was too busy restarting containers to run background GC.
My kubelet logs are full of:
I'm puzzled by the pet-sounding names of the containers. I'm not running any other docker stress test on the node. Docker inspect shows that it is a kubernetes pod:
The only real modification to the kubelet was https://github.com/kubernetes/kubernetes/pull/20991/files#diff-be35452540b93977b6dedb1d1d04670dR85