Many orphaned volumes in hyperkube #22911
To reproduce this behavior I cleaned up everything related to kubernetes (containers, /var/lib/kubernetes contents, etc.), started hyperkube, and then started a bunch of containers; everything is good up to this point. Then I delete a pod (kubectl delete pod frontend), and this happens:
More observations. Here is my default token:
Here is the error message I get:
This path is changing all the time:
(Note the subdirectory names are different.) Pod deletion code was recently touched by @caesarxuchao, and authentication tokens (whose directories can't be removed) by @liggitt and @jlowdermilk. Any ideas what it could be? I would ignore this problem if it didn't consume 15% of CPU in a very tight orphan-removal loop.
Those tokens are just secrets as far as the kubelet is concerned. Not sure how secret volume cleanup differs in hyperkube. cc @pmorie
We have the same problem on nodes with CoreOS alpha 991.0.0 and hyperkube 1.2, with a lot of messages like:
Same issue here.
Same on Ubuntu 14.04, hyperkube v1.2.3, Docker 1.9.1 with the aufs backend, kernel 3.13.0-48-generic. I also see pod state transitions taking a long time; they hang in "Pending" and "Terminating" for minutes. Does anyone else see that in conjunction with this issue?
For context, this issue doesn't show up immediately in my hyperkube setup. It turns up after a few days. During that time, I update 6 or 7 deployments simultaneously, several times per day.
Sorry I missed the ping on this. Assigned to self, will try to triage by the end of the week.
Got the same problem on CentOS.
I am wondering if this depends on the Docker version, and not so much the graph driver.
xref #26421
Same problem here, any ideas?
@bgrant0607, could you please re-triage this bug? I'm not sure "team/ux" is the right tag for it; it doesn't have anything in common with UX. It looks like a problem with some low-level container machinery. And given the number of +1's, maybe raise the priority? Thanks.
@alml What problem does this cause for you?
The error messages will confuse log/alert monitoring infrastructure, and I really hope this can be fixed.
@bgrant0607 As I wrote before:
I have an 8-core laptop that can work ~4 hours on battery (enough for a mid-range flight). When I run a Kubernetes cluster via hyperkube, it loads 15% of my CPU resources (more than 1 core!) in a tight loop, draining my battery in less than an hour.
After setting up a hyperkube and digging through the logs, I think the following happened and caused the issue. Hyperkube automatically creates a few pods for the dns, dashboard, proxy services, etc., but something goes wrong when setting up and mounting the volumes. The following shows some logs from one such pod:

(1) 21:20:03.214957 kubelet.go:2579] SyncLoop (ADD, "api"): "kubernetes-dashboard-fnivx_kube-system(efa7ceb6-29d0-11e6-bafc-5065f351b2ea)"
(3) 21:20:07.170736 nsenter_mount.go:196] IsLikelyNotMountPoint findmnt output for path /var/lib/kubelet/pods/efa7ceb6-29d0-11e6-bafc-5065f351b2ea/volumes/kubernetes.io
(4) 21:20:07.218321 nsenter_mount.go:117] Mount command: nsenter [--mount=/rootfs/proc/1/ns/mnt -- /bin/mount -t tmpfs tmpfs /var/lib/kubelet/pods/efa7ceb6-29d0-11e6-bafc-5065f351b2ea/volumes/kubernetes.io~secret/default-token-z7ul7]

From this point on, each time SyncPod checks the volume, it mounts it again, which is confirmed by the findmnt output in the log. After a few hours of running, there are hundreds of mounts. I am working on finding the reason why the last letter of the returned mount point is missing. I manually tested the findmnt command inside of the container and it returned the correct result.

There is also an issue with umount. The umount code first moves the mount dir to a temporary dir, but fails with a "device or resource busy" error. I also tried to mount tmpfs in the container with the command manually. Will continue working on solving this issue. Please let me know if you have any comments.
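To make the failure mode concrete, here is a minimal Go sketch, my own reconstruction rather than the kubelet's actual code, of a findmnt-based mount-point check; the path is taken from the log above, and one missing byte in findmnt's output flips the result to "not mounted", so the caller mounts the tmpfs again on every sync:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// isLikelyNotMountPoint mimics the fragile check: ask findmnt for the mount
// target that contains path, then compare it byte-for-byte with path.
// findmnt prints path itself when path is a mount point; if the printed
// target loses its last letter, the comparison fails and the caller
// wrongly concludes "not mounted" and mounts the tmpfs again.
func isLikelyNotMountPoint(path string) (bool, error) {
	out, err := exec.Command("findmnt",
		"-o", "target", "--noheadings", "--target", path).CombinedOutput()
	if err != nil {
		return true, err
	}
	target := strings.TrimSpace(string(out))
	// ".../default-token-z7ul" != ".../default-token-z7ul7" -> remount loop.
	return target != path, nil
}

func main() {
	p := "/var/lib/kubelet/pods/efa7ceb6-29d0-11e6-bafc-5065f351b2ea" +
		"/volumes/kubernetes.io~secret/default-token-z7ul7"
	notMnt, err := isLikelyNotMountPoint(p)
	fmt.Printf("notMnt=%v err=%v\n", notMnt, err)
}
```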
Excellent debugging @jingxu97. Looking forward to seeing what the root cause is for both issues. |
Some updates on this bug:
I will try to modify IsLikelyNotMountPoint() to make it more robust. Please let me know if you have any comments or suggestions. Thanks!
PR #21486 and PR #23435 both modified the behavior of IsLikelyNotMountPoint(). Also, have you observed the missing-character issue in a non-hyperkube environment, or is there something unique about running locally inside a Docker container that causes this to happen?
I see the isLikelyNotMountPoint() function is implemented in two places: nsenter_mount.go and mount_linux.go. My understanding is that nsenterMounter is only used when the kubelet is running inside a container. nsenter_mount uses the findmnt command result to check the mount point. mount_linux.go uses a different approach: it checks whether the directory is on a different device than its parent; if so, it is a mount point. I also wrote a small Go program to test the findmnt command in my local environment, and it also truncates the last letter if the mount path is relatively long. So this findmnt behavior is not related to the hyperkube environment, I think.
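For comparison, here is a minimal Linux-only Go sketch of the device-number approach described above; the helper name is mine, and edge cases such as the root directory are deliberately left out:

```go
package main

import (
	"fmt"
	"path/filepath"
	"syscall"
)

// notMountPointByDevice reports whether path is likely NOT a mount point by
// comparing device IDs: a mount point sits on a different device than its
// parent directory. Bind mounts on the same filesystem escape this test,
// which is why the real check can only ever be "likely".
func notMountPointByDevice(path string) (bool, error) {
	var self, parent syscall.Stat_t
	if err := syscall.Lstat(path, &self); err != nil {
		return true, err
	}
	if err := syscall.Lstat(filepath.Dir(path), &parent); err != nil {
		return true, err
	}
	return self.Dev == parent.Dev, nil
}

func main() {
	for _, p := range []string{"/proc", "/tmp"} {
		notMnt, err := notMountPointByDevice(p)
		fmt.Printf("%-6s notMnt=%v err=%v\n", p, notMnt, err)
	}
}
```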
Thanks for the clarification.
Thanks for debugging this. |
Sure. Thanks! Jing
In the isLikelyNotMountPoint function in nsenter_mount.go, the output returned from the findmnt command is missing the last letter. Modify the code to make sure that the output has the full target path. Fixes kubernetes#26421 kubernetes#25056 kubernetes#22911
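The exact patch isn't shown here, so the following is only my reconstruction of the idea in that commit message: trust findmnt only when its reported target matches the path in full, and otherwise fall back to a stat-based device comparison instead of concluding "not mounted":

```go
package main

import (
	"fmt"
	"os/exec"
	"path/filepath"
	"strings"
	"syscall"
)

// isLikelyNotMountPoint, hardened (my sketch, not the merged fix): a findmnt
// mismatch may just be a truncated last letter, so instead of taking it at
// face value and remounting, double-check with a device-ID comparison.
func isLikelyNotMountPoint(path string) (bool, error) {
	out, err := exec.Command("findmnt",
		"-o", "target", "--noheadings", "--target", path).CombinedOutput()
	if err == nil && strings.TrimSpace(string(out)) == path {
		return false, nil // findmnt confirms path itself is the mount target
	}
	// findmnt failed or its output may be truncated: fall back to stat.
	var self, parent syscall.Stat_t
	if err := syscall.Lstat(path, &self); err != nil {
		return true, err
	}
	if err := syscall.Lstat(filepath.Dir(path), &parent); err != nil {
		return true, err
	}
	return self.Dev == parent.Dev, nil
}

func main() {
	notMnt, err := isLikelyNotMountPoint("/proc")
	fmt.Printf("notMnt=%v err=%v\n", notMnt, err)
}
```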
Please note that the findmnt command does not exist on the docker host VM for the virtualbox/windows setup (i.e. boot2docker). I'm pretty sure this will be the case for Mac too. The errors that I get are:
nsenter_mount.go:185] Failed findmnt command for path /var/lib/kubelet/pods/9602feb4-31ba-11e6-bdbe-f62284e3ec86/volumes/kubernetes.io~secret/default-token-4hz3v: exit status 1
and these don't seem to get cleaned up.
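A hedged sketch of one way to handle the missing binary; note that the real kubelet runs findmnt on the host through nsenter, so this simple PATH probe is only an approximation of the check that would actually be needed:

```go
package main

import (
	"fmt"
	"os/exec"
)

// hasFindmnt reports whether a findmnt binary is reachable via PATH. The
// idea: detect the missing binary once at startup and switch to a stat-based
// mount check, instead of logging "exit status 1" on every sync iteration.
func hasFindmnt() bool {
	_, err := exec.LookPath("findmnt")
	return err == nil
}

func main() {
	if hasFindmnt() {
		fmt.Println("findmnt available: use it for mount-point checks")
	} else {
		fmt.Println("findmnt missing (e.g. boot2docker): fall back to stat-based check")
	}
}
```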
This problem is still there. Right now I'm observing it on my GKE instances (managed VMs, Kubernetes 1.3.5):
I noticed it after the CPU load jumped from a typical 15% to a crazy 100% on both my nodes:
@amlinux Thank you for providing this information. After checking the log offline, I believe this problem is fixed in v1.3.6. Please upgrade to v1.3.6 and let us know if you have any problems. Thanks!
Kubelet continuously throws lots of errors:
I'm starting the cluster this way: