Many orphaned volumes in hyperkube #22911

Closed
amlinux opened this issue Mar 12, 2016 · 29 comments · Fixed by #27054
Labels: kind/bug, priority/important-soon

Comments

@amlinux

amlinux commented Mar 12, 2016

Kubelet continuously throws lots of errors:

W0312 22:53:09.750535   13347 kubelet.go:1935] Orphaned volume "50da8d22-e89b-11e5-952b-5c514f64b257/default-token-rse42" found, tearing down volume
E0312 22:53:09.783931   13347 kubelet.go:1946] Could not tear down volume "50da8d22-e89b-11e5-952b-5c514f64b257/default-token-rse42": rename /var/lib/kubelet/pods/50da8d22-e89b-11e5-952b-5c514f64b257/volumes/kubernetes.io~secret/default-token-rse42 /var/lib/kubelet/pods/50da8d22-e89b-11e5-952b-5c514f64b257/volumes/kubernetes.io~secret/wrapped_default-token-rse42.deleting~720717277: device or resource busy
W0312 22:53:11.731998   13347 kubelet.go:1935] Orphaned volume "50da8d22-e89b-11e5-952b-5c514f64b257/wrapped_default-token-rse42.deleting~720717277" found, tearing down volume
W0312 22:53:11.734984   13347 mount.go:98] could not determine device for path: "/var/lib/kubelet/pods/50da8d22-e89b-11e5-952b-5c514f64b257/volumes/kubernetes.io~secret/wrapped_default-token-rse42.deleting~720717277"
W0312 22:53:11.747966   13347 kubelet.go:1935] Orphaned volume "50da8d22-e89b-11e5-952b-5c514f64b257/default-token-rse42" found, tearing down volume
E0312 22:53:11.764105   13347 kubelet.go:1946] Could not tear down volume "50da8d22-e89b-11e5-952b-5c514f64b257/default-token-rse42": rename /var/lib/kubelet/pods/50da8d22-e89b-11e5-952b-5c514f64b257/volumes/kubernetes.io~secret/default-token-rse42 /var/lib/kubelet/pods/50da8d22-e89b-11e5-952b-5c514f64b257/volumes/kubernetes.io~secret/wrapped_default-token-rse42.deleting~060988951: device or resource busy
W0312 22:53:13.730480   13347 kubelet.go:1935] Orphaned volume "50da8d22-e89b-11e5-952b-5c514f64b257/wrapped_default-token-rse42.deleting~060988951" found, tearing down volume
W0312 22:53:13.733530   13347 mount.go:98] could not determine device for path: "/var/lib/kubelet/pods/50da8d22-e89b-11e5-952b-5c514f64b257/volumes/kubernetes.io~secret/wrapped_default-token-rse42.deleting~060988951"
W0312 22:53:13.762582   13347 kubelet.go:1935] Orphaned volume "50da8d22-e89b-11e5-952b-5c514f64b257/default-token-rse42" found, tearing down volume
E0312 22:53:13.787960   13347 kubelet.go:1946] Could not tear down volume "50da8d22-e89b-11e5-952b-5c514f64b257/default-token-rse42": rename /var/lib/kubelet/pods/50da8d22-e89b-11e5-952b-5c514f64b257/volumes/kubernetes.io~secret/default-token-rse42 /var/lib/kubelet/pods/50da8d22-e89b-11e5-952b-5c514f64b257/volumes/kubernetes.io~secret/wrapped_default-token-rse42.deleting~997836385: device or resource busy
W0312 22:53:15.732156   13347 kubelet.go:1935] Orphaned volume "50da8d22-e89b-11e5-952b-5c514f64b257/wrapped_default-token-rse42.deleting~997836385" found, tearing down volume
W0312 22:53:15.735224   13347 mount.go:98] could not determine device for path: "/var/lib/kubelet/pods/50da8d22-e89b-11e5-952b-5c514f64b257/volumes/kubernetes.io~secret/wrapped_default-token-rse42.deleting~997836385"
W0312 22:53:15.771894   13347 kubelet.go:1935] Orphaned volume "50da8d22-e89b-11e5-952b-5c514f64b257/default-token-rse42" found, tearing down volume
E0312 22:53:15.796077   13347 kubelet.go:1946] Could not tear down volume "50da8d22-e89b-11e5-952b-5c514f64b257/default-token-rse42": rename /var/lib/kubelet/pods/50da8d22-e89b-11e5-952b-5c514f64b257/volumes/kubernetes.io~secret/default-token-rse42 /var/lib/kubelet/pods/50da8d22-e89b-11e5-952b-5c514f64b257/volumes/kubernetes.io~secret/wrapped_default-token-rse42.deleting~884779835: device or resource busy

I'm starting the cluster this way:

docker run \
    --name=kube_kubelet \
    --volume=/:/rootfs:ro \
    --volume=/sys:/sys:ro \
    --volume=/var/lib/docker/:/var/lib/docker:rw \
    --volume=/var/lib/kubelet/:/var/lib/kubelet:rw \
    --volume=/var/run:/var/run:rw \
    --net=host \
    --pid=host \
    --privileged=true \
    -d \
    gcr.io/google_containers/hyperkube:v1.2.0-beta.0 \
    /hyperkube kubelet \
    --containerized \
    --hostname-override="127.0.0.1" \
    --address="0.0.0.0" \
    --api-servers=http://localhost:8080 \
    --cluster-dns=10.0.0.10 \
    --cluster-domain=cluster.local \
    --allow-privileged=true \
    --v=2 \
    --config=/etc/kubernetes/manifests || exit $?

docker run \
    --name=kube_kubeproxy \
    -d \
    --net=host \
    --privileged \
    gcr.io/google_containers/hyperkube:v1.2.0-beta.0 \
    /hyperkube proxy \
    --master=http://127.0.0.1:8080 \
    --v=2 || exit $?
@amlinux
Author

amlinux commented Mar 19, 2016

To reproduce this behavior, I cleaned up everything related to Kubernetes (containers, /var/lib/kubernetes contents, etc.), started hyperkube, then started a bunch of containers; everything is fine up to this point.

Then I delete a pod (with kubectl delete pod frontend), and this happens:

I0319 15:21:25.936936    9224 kubelet.go:2391] SyncLoop (UPDATE, "api"): "frontend-1jmyn_default(dbe5be50-ede5-11e5-9681-5c514f64b257)"
I0319 15:21:25.937489    9224 manager.go:1368] Killing container "e0f0f65106cd3e4165066d010f77366c95f8e96361441c7d846e4f4455e34368 frontend default/frontend-1jmyn" with 30 second grace period
I0319 15:21:25.969848    9224 kubelet.go:2388] SyncLoop (ADD, "api"): "frontend-koi5j_default(3ef8ed1e-ede6-11e5-9681-5c514f64b257)"
I0319 15:21:25.989511    9224 nsenter_mount.go:179] Failed findmnt command: exit status 1
I0319 15:21:25.990450    9224 server.go:1096] GET /containerLogs/default/frontend-1jmyn/frontend?follow=true: (2m1.56788367s) 200 [[Go 1.1 package http] 192.168.192.25:57244]
I0319 15:21:26.005163    9224 nsenter_mount.go:179] Failed findmnt command: exit status 1
I0319 15:21:26.015950    9224 manager.go:1400] Container "e0f0f65106cd3e4165066d010f77366c95f8e96361441c7d846e4f4455e34368 frontend default/frontend-1jmyn" exited after 78.410518ms
I0319 15:21:26.018012    9224 manager.go:1368] Killing container "44d793d313e3676f293b51dd940edb21a91b565bc00f1c1ef5f179ca1e7c8415 default/frontend-1jmyn" with 30 second grace period
I0319 15:21:26.154601    9224 manager.go:1688] Need to restart pod infra container for "frontend-koi5j_default(3ef8ed1e-ede6-11e5-9681-5c514f64b257)" because it is not found
I0319 15:21:26.430028    9224 manager.go:1400] Container "44d793d313e3676f293b51dd940edb21a91b565bc00f1c1ef5f179ca1e7c8415 default/frontend-1jmyn" exited after 411.995092ms
I0319 15:21:27.453218    9224 kubelet.go:2410] SyncLoop (PLEG): "frontend-koi5j_default(3ef8ed1e-ede6-11e5-9681-5c514f64b257)", event: &pleg.PodLifecycleEvent{ID:"3ef8ed1e-ede6-11e5-9681-5c514f64b257", Type:"ContainerStarted", Data:"35b226753961be3decf567780735d46a11298fe9ef980024fb560435d3811058"}
I0319 15:21:27.730298    9224 kubelet.go:2410] SyncLoop (PLEG): "frontend-1jmyn_default(dbe5be50-ede5-11e5-9681-5c514f64b257)", event: &pleg.PodLifecycleEvent{ID:"dbe5be50-ede5-11e5-9681-5c514f64b257", Type:"ContainerDied", Data:"e0f0f65106cd3e4165066d010f77366c95f8e96361441c7d846e4f4455e34368"}
I0319 15:21:27.733627    9224 kubelet.go:2410] SyncLoop (PLEG): "frontend-1jmyn_default(dbe5be50-ede5-11e5-9681-5c514f64b257)", event: &pleg.PodLifecycleEvent{ID:"dbe5be50-ede5-11e5-9681-5c514f64b257", Type:"ContainerDied", Data:"44d793d313e3676f293b51dd940edb21a91b565bc00f1c1ef5f179ca1e7c8415"}
I0319 15:21:27.767601    9224 kubelet.go:2391] SyncLoop (UPDATE, "api"): "frontend-1jmyn_default(dbe5be50-ede5-11e5-9681-5c514f64b257)"
I0319 15:21:27.822477    9224 kubelet.go:2394] SyncLoop (REMOVE, "api"): "frontend-1jmyn_default(dbe5be50-ede5-11e5-9681-5c514f64b257)"
I0319 15:21:27.822521    9224 kubelet.go:2538] Failed to delete pod "frontend-1jmyn_default(dbe5be50-ede5-11e5-9681-5c514f64b257)", err: pod not found
W0319 15:21:28.646963    9224 kubelet.go:1979] Orphaned volume "dbe5be50-ede5-11e5-9681-5c514f64b257/default-token-c6gc1" found, tearing down volume
W0319 15:21:28.649524    9224 mount.go:98] could not determine device for path: "/var/lib/kubelet/pods/dbe5be50-ede5-11e5-9681-5c514f64b257/volumes/kubernetes.io~secret/default-token-c6gc1"
E0319 15:21:28.665859    9224 kubelet.go:1990] Could not tear down volume "dbe5be50-ede5-11e5-9681-5c514f64b257/default-token-c6gc1": rename /var/lib/kubelet/pods/dbe5be50-ede5-11e5-9681-5c514f64b257/volumes/kubernetes.io~secret/default-token-c6gc1 /var/lib/kubelet/pods/dbe5be50-ede5-11e5-9681-5c514f64b257/volumes/kubernetes.io~secret/wrapped_default-token-c6gc1.deleting~655762369: device or resource busy
I0319 15:21:28.736742    9224 kubelet.go:2410] SyncLoop (PLEG): "frontend-koi5j_default(3ef8ed1e-ede6-11e5-9681-5c514f64b257)", event: &pleg.PodLifecycleEvent{ID:"3ef8ed1e-ede6-11e5-9681-5c514f64b257", Type:"ContainerStarted", Data:"aca06e4796eb721791f4d0cdc63c362746e9e301c8b12f5609197cddd0e2ab5b"}
I0319 15:21:28.736849    9224 manager.go:363] Status for pod "dbe5be50-ede5-11e5-9681-5c514f64b257" is up-to-date; skipping
W0319 15:21:30.645863    9224 kubelet.go:1979] Orphaned volume "dbe5be50-ede5-11e5-9681-5c514f64b257/wrapped_default-token-c6gc1.deleting~655762369" found, tearing down volume
W0319 15:21:30.648390    9224 mount.go:98] could not determine device for path: "/var/lib/kubelet/pods/dbe5be50-ede5-11e5-9681-5c514f64b257/volumes/kubernetes.io~secret/wrapped_default-token-c6gc1.deleting~655762369"
W0319 15:21:30.673478    9224 kubelet.go:1979] Orphaned volume "dbe5be50-ede5-11e5-9681-5c514f64b257/default-token-c6gc1" found, tearing down volume

@amlinux
Author

amlinux commented Mar 19, 2016

More observations. Here is my default token:

# kubectl get secrets | grep default
default-token-rjwc9   kubernetes.io/service-account-token   3         17m

Here is the error message I get:

E0319 16:14:53.642626   23448 kubelet.go:1990] Could not tear down volume "7cd0d6fc-edeb-11e5-bb95-5c514f64b257/default-token-rjwc9": rename /var/lib/kubelet/pods/7cd0d6fc-edeb-11e5-bb95-5c514f64b257/volumes/kubernetes.io~secret/default-token-rjwc9 /var/lib/kubelet/pods/7cd0d6fc-edeb-11e5-bb95-5c514f64b257/volumes/kubernetes.io~secret/wrapped_default-token-rjwc9.deleting~210881875: device or resource busy

This path is changing all the time:

root@amlie:~# ls -l /var/lib/kubelet/pods/7cd0d6fc-edeb-11e5-bb95-5c514f64b257/volumes/kubernetes.io~secret/
total 4
drwxrwxrwt 2 root root  100 Mar 19 16:58 default-token-rjwc9
drwx------ 2 root root 4096 Mar 19 17:16 wrapped_default-token-rjwc9.deleting~055097388
root@amlie:~# ls -l /var/lib/kubelet/pods/7cd0d6fc-edeb-11e5-bb95-5c514f64b257/volumes/kubernetes.io~secret/
total 4
drwxrwxrwt 2 root root  100 Mar 19 16:58 default-token-rjwc9
drwx------ 2 root root 4096 Mar 19 17:16 wrapped_default-token-rjwc9.deleting~525414462

(Note that the subdirectory names are different.)

Pod deletion code was recently touched by @caesarxuchao, and authentication tokens (whose directories can't be removed) by @liggitt and @jlowdermilk. Any ideas what it could be?

I would ignore this problem if it didn't consume 15% of CPU in a very tight orphan removal loop.

@liggitt
Member

liggitt commented Mar 19, 2016

Those tokens are just secrets as far as the kubelet is concerned. Not sure how secret volume cleanup differs in hyperkube.

cc @pmorie

@NikolayMurha

We have the same problem on nodes with CoreOS alpha 991.0.0 and hyperkube 1.2.

A lot of messages like:

Mar 28 14:02:41 az1 kubelet-wrapper[8417]: W0328 11:02:41.899982    8417 mount.go:98] could not determine device for path: "/var/lib/kubelet/pods/e170d3bf-f27a-11e5-b08c-90b11c4d3a00/volumes/kubernetes.io~secret/wrapped_default-token-3bfau.deleting~864258571"
Mar 28 14:02:41 az1 kubelet-wrapper[8417]: W0328 11:02:41.901281    8417 mount.go:98] could not determine device for path: "/var/lib/kubelet/pods/e170d3bf-f27a-11e5-b08c-90b11c4d3a00/volumes/kubernetes.io~secret/default-token-3bfau"
Mar 28 14:02:41 az1 kubelet-wrapper[8417]: E0328 11:02:41.901463    8417 kubelet.go:1990] Could not tear down volume "e170d3bf-f27a-11e5-b08c-90b11c4d3a00/default-token-3bfau": rename /var/lib/kubelet/pods/e170d3bf-f27a-11e5-b08c-90b11c4d3a00/volumes/kubernetes.io~secret/default-token-3bfau /var/lib/kubelet/pods/e170d3bf-f27a-11e5-b08c-90b11c4d3a00/volumes/kubernetes.io~secret/wrapped_default-token-3bfau.deleting~272239891: device or resource busy

@saturnism
Contributor

Same issue here.

@apatil

apatil commented May 4, 2016

Same on Ubuntu 14.04, hyperkube v1.2.3, docker 1.9.1 with aufs backend, kernel 3.13.0-48-generic.

I also see pod state transitions taking a long time; they hang in "Pending" and "Terminating" for minutes. Does anyone else see that in conjunction with this issue?

@apatil

apatil commented May 4, 2016

For context, this issue doesn't show up immediately in my hyperkube setup. It turns up after a few days. During that time, I update 6 or 7 deployments simultaneously, several times per day.

@pmorie pmorie self-assigned this May 4, 2016
@pmorie
Member

pmorie commented May 4, 2016

Sorry I missed the ping on this. Assigned to self, will try to triage by the end of the week.

@bgrant0607 bgrant0607 added kind/bug Categorizes issue or PR as related to a bug. priority/backlog Higher priority than priority/awaiting-more-evidence. area/devexp labels May 6, 2016
@bgrant0607
Member

cc @dlorenc @vishh

@AaronTao1990

Got the same problem on CentOS 7.2:

W0519 06:43:26.715097   49123 kubelet.go:1995] Orphaned volume "0640a1c8-1d8a-11e6-a9b1-1402ec3ed5c0/default-token-gofxs" found, tearing down volume
W0519 06:43:26.720166   49123 mount.go:99] could not determine device for path: "/var/lib/kubelet/pods/0640a1c8-1d8a-11e6-a9b1-1402ec3ed5c0/volumes/kubernetes.io~secret/default-token-gofxs"
E0519 06:43:26.738842   49123 kubelet.go:2006] Could not tear down volume "0640a1c8-1d8a-11e6-a9b1-1402ec3ed5c0/default-token-gofxs": rename /var/lib/kubelet/pods/0640a1c8-1d8a-11e6-a9b1-1402ec3ed5c0/volumes/kubernetes.io~secret/default-token-gofxs /var/lib/kubelet/pods/0640a1c8-1d8a-11e6-a9b1-1402ec3ed5c0/volumes/kubernetes.io~secret/wrapped_default-token-gofxs.deleting~000822711: device or resource busy
W0519 06:43:28.655343   49123 kubelet.go:1995] Orphaned volume "0640a1c8-1d8a-11e6-a9b1-1402ec3ed5c0/default-token-gofxs" found, tearing down volume
W0519 06:43:28.660379   49123 mount.go:99] could not determine device for path: "/var/lib/kubelet/pods/0640a1c8-1d8a-11e6-a9b1-1402ec3ed5c0/volumes/kubernetes.io~secret/default-token-gofxs"
E0519 06:43:28.675859   49123 kubelet.go:2006] Could not tear down volume "0640a1c8-1d8a-11e6-a9b1-1402ec3ed5c0/default-token-gofxs": rename /var/lib/kubelet/pods/0640a1c8-1d8a-11e6-a9b1-1402ec3ed5c0/volumes/kubernetes.io~secret/default-token-gofxs /var/lib/kubelet/pods/0640a1c8-1d8a-11e6-a9b1-1402ec3ed5c0/volumes/kubernetes.io~secret/wrapped_default-token-gofxs.deleting~550898602: device or resource busy
W0519 06:43:28.675911   49123 kubelet.go:1995] Orphaned volume "0640a1c8-1d8a-11e6-a9b1-1402ec3ed5c0/wrapped_default-token-gofxs.deleting~000822711" found, tearing down volume
W0519 06:43:28.680160   49123 mount.go:99] could not determine device for path: "/var/lib/kubelet/pods/0640a1c8-1d8a-11e6-a9b1-1402ec3ed5c0/volumes/kubernetes.io~secret/wrapped_default-token-gofxs.deleting~000822711"

@errordeveloper
Member

I am wondering if this depends on the Docker version, and not so much on the graph driver.

@errordeveloper
Member

xref #26421

@king6cong

{"log":"W0530 08:51:01.922523  200496 kubelet.go:1979] Orphaned volume \"75e97c2a-1e56-11e6-b226-e84dd0c219e4/default-token-5p5vg\" found, tearing down volume\n","stream":"stderr","time":"2016-05-30T08:51:01.922661105Z"}
{"log":"W0530 08:51:01.924846  200496 mount.go:98] could not determine device for path: \"/var/lib/kubelet/pods/75e97c2a-1e56-11e6-b226-e84dd0c219e4/volumes/kubernetes.io~secret/default-token-5p5vg\"\n","stream":"stderr","time":"2016-05-30T08:51:01.924981918Z"}
{"log":"E0530 08:51:01.944604  200496 kubelet.go:1990] Could not tear down volume \"75e97c2a-1e56-11e6-b226-e84dd0c219e4/default-token-5p5vg\": rename /var/lib/kubelet/pods/75e97c2a-1e56-11e6-b226-e84dd0c219e4/volumes/kubernetes.io~secret/default-token-5p5vg /var/lib/kubelet/pods/75e97c2a-1e56-11e6-b226-e84dd0c219e4/volumes/kubernetes.io~secret/wrapped_default-token-5p5vg.deleting~503183222: device or resource busy\n","stream":"stderr","time":"2016-05-30T08:51:01.944784947Z"}
{"log":"W0530 08:51:01.944634  200496 kubelet.go:1979] Orphaned volume \"76e8bf5f-128c-11e6-9469-e84dd0c219e4/wrapped_default-token-5p5vg.deleting~986070007\" found, tearing down volume\n","stream":"stderr","time":"2016-05-30T08:51:01.944793936Z"}

Same problem here, any ideas?

@alml
Contributor

alml commented May 30, 2016

@bgrant0607, could you please re-triage this bug? I'm not sure "team/ux" is the right tag for it; it doesn't have anything in common with UX. It looks like a problem with some low-level container machinery. And given the number of +1's, maybe raise the priority? Thanks.

@bgrant0607
Member

@alml What problem does this cause for you?

@king6cong

king6cong commented May 31, 2016

@bgrant0607


The error messages confuse log/alert monitoring infrastructure, and I really hope this can be fixed.

@alml
Contributor

alml commented May 31, 2016

@bgrant0607 As I wrote before:

I would ignore this problem if it didn't consume 15% of CPU in a very tight orphan removal loop.

I have an 8-core laptop that can run for about 4 hours on battery (enough for a mid-range flight). When I run a Kubernetes cluster via hyperkube, it loads 15% of my CPU resources (more than one core!) in a tight loop, draining my battery in less than an hour.

@saad-ali
Member

Assigning #26421 #25056 #22911 (all seem related--secret volume tear down issues with hyperkube) to @jingxu97 for investigation.

@jingxu97
Contributor

jingxu97 commented Jun 4, 2016

After setting up hyperkube and digging through the logs, I think the following happened and caused the issue.

Hyperkube automatically creates a few pods for DNS, dashboard, proxy services, etc., but something goes wrong when setting up and mounting their volumes. The following shows some logs from one such pod.

(1) 21:20:03.214957 kubelet.go:2579] SyncLoop (ADD, "api"): "kubernetes-dashboard-fnivx_kube-system(efa7ceb6-29d0-11e6-bafc-5065f351b2ea)"
(2) 21:20:03.269801 nsenter_mount.go:117] Mount command: nsenter [--mount=/rootfs/proc/1/ns/mnt -- /bin/mount -t tmpfs tmpfs /var/lib/kubelet/pods/efa7ceb6-29d0-11e6-bafc-5065f351b2ea/volumes/kubernetes.io~secret/default-token-z7ul7]
****** These two steps show that the volume is mounted.

(3) 21:20:07.170736 nsenter_mount.go:196] IsLikelyNotMountPoint findmnt output for path /var/lib/kubelet/pods/efa7ceb6-29d0-11e6-bafc-5065f351b2ea/volumes/kubernetes.io~secret/default-token-z7ul7: /var/lib/kubelet/pods/efa7ceb6-29d0-11e6-bafc-5065f351b2ea/volumes/kubernetes.io~secret/default-token-z7ul
****** The very strange thing here is that when SyncPod checks the mount path, the last letter "7" of the path is missing. Step (4) then shows the volume being mounted again.

(4) 21:20:07.218321 nsenter_mount.go:117] Mount command: nsenter [--mount=/rootfs/proc/1/ns/mnt -- /bin/mount -t tmpfs tmpfs /var/lib/kubelet/pods/efa7ceb6-29d0-11e6-bafc-5065f351b2ea/volumes/kubernetes.io~secret/default-token-z7ul7]

From this point on, each time SyncPod checks the volume, it mounts it again, which is confirmed by the findmnt output in the log. After a few hours of running there are hundreds of mounts. I am working on finding the reason why the last letter of the returned mount point is missing; when I manually tested the findmnt command inside the container, it returned the correct result.
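For anyone who wants to poke at this outside the kubelet, here is a minimal standalone Go sketch of the same style of check: it asks findmnt for the mount target of a path via exec.Command().Output() and compares the result to the path it asked about. The path below is made up for illustration, nsenter is left out for simplicity, and the exact flags the kubelet passes may differ.

package main

import (
    "fmt"
    "os/exec"
    "strings"
)

func main() {
    // Hypothetical long secret-volume path, similar to the ones in the logs above.
    path := "/var/lib/kubelet/pods/efa7ceb6-29d0-11e6-bafc-5065f351b2ea/volumes/kubernetes.io~secret/default-token-z7ul7"

    // Ask findmnt which mount target the path belongs to, reading its output
    // the same way the kubelet does (exec + Output()).
    out, err := exec.Command("findmnt", "-o", "target", "--noheadings", "--target", path).Output()
    if err != nil {
        fmt.Println("findmnt failed:", err)
        return
    }

    target := strings.TrimSpace(string(out))
    fmt.Printf("asked for:   %q\n", path)
    fmt.Printf("reported:    %q\n", target)
    fmt.Printf("exact match: %v\n", target == path)
}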

There is also an issue with unmounting. The unmount code first moves the mount dir to a temporary dir, but this fails with a "device or resource busy" error. I also tried to mount a tmpfs in the container with the command
mount -t tmpfs tmpfs userdir
and then trying to mv userdir failed with the same error.
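The rename failure itself can also be reproduced from a tiny Go program; a rough sketch (must be run as root on Linux, and the directory name is made up for illustration):

package main

import (
    "fmt"
    "os"
    "os/exec"
)

func main() {
    dir := "/tmp/userdir"
    if err := os.MkdirAll(dir, 0755); err != nil {
        panic(err)
    }

    // Mount a tmpfs on the directory, similar to what the secret volume plugin does.
    if out, err := exec.Command("mount", "-t", "tmpfs", "tmpfs", dir).CombinedOutput(); err != nil {
        panic(fmt.Sprintf("mount failed: %v: %s", err, out))
    }

    // Renaming an active mount point fails with EBUSY ("device or resource busy"),
    // the same error the teardown code runs into when it treats the tmpfs as a plain directory.
    fmt.Println("rename error:", os.Rename(dir, dir+".deleting~123456"))

    // Clean up the experiment.
    exec.Command("umount", dir).Run()
    os.Remove(dir)
}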

I will continue working on solving this issue. Please let me know if you have any comments.

@saad-ali
Member

saad-ali commented Jun 6, 2016

Excellent debugging @jingxu97. Looking forward to seeing what the root cause is for both issues.

@jingxu97
Contributor

jingxu97 commented Jun 6, 2016

Some updates on this bug:

  1. About the first issue, the path name missing its last character:
    I tried running the "findmnt" command directly in a terminal and the output is fine, but when the command is executed through exec.Command().Output(), the last character of the target path is always missing. I tried different findmnt options and it made no difference. Only when the mount path is relatively short (roughly under 90–100 characters) is no character missing, and no matter how long the path is, only the last character is dropped.
    Our current code in nsenter_mount.IsLikelyNotMountPoint() uses an exact string match to detect whether the path is already mounted, which is not robust. We can change it to a substring match instead (see the sketch just after this list).
  2. About the umount issue: TearDown also invokes IsLikelyNotMountPoint(). If that returns an error, GetMountMedium() treats the mount medium as the default when it should be memory. TearDown then invokes tearDownDefault, which tries to rename the directories and delete them, which in turn results in the "device or resource busy" error. It also does not handle the case where the path is mounted multiple times (which should not happen in a normal situation).
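To make point 1 concrete, here is a rough sketch of the more tolerant comparison. This is not the actual kubelet code; it only assumes that the target reported by findmnt may come back missing its final character.

package main

import (
    "fmt"
    "strings"
)

// looksMounted decides whether the target reported by findmnt refers to path.
// findmntOutput is assumed to be the target column from
// `findmnt -o target --noheadings --target <path>`, which, per the observation
// above, can come back with its last character dropped for long paths.
func looksMounted(path, findmntOutput string) bool {
    out := strings.TrimSpace(findmntOutput)
    if out == "" {
        // strings.Contains(x, "") is always true, so treat empty output as "not mounted".
        return false
    }
    // Exact equality fails when the last character is truncated; a substring
    // check tolerates that while still rejecting unrelated paths.
    return strings.Contains(path, out)
}

func main() {
    path := "/var/lib/kubelet/pods/efa7ceb6-29d0-11e6-bafc-5065f351b2ea/volumes/kubernetes.io~secret/default-token-z7ul7"
    truncated := path[:len(path)-1] // what findmnt was observed to return

    fmt.Println(looksMounted(path, path))      // true
    fmt.Println(looksMounted(path, truncated)) // true; an exact match would say false
    fmt.Println(looksMounted(path, ""))        // false
}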

I will try to modify IsLikelyNotMountPoint() to make it more robust. Please let me know if you have any comments or suggestions. Thanks!

@saad-ali
Member

saad-ali commented Jun 6, 2016

PR #21486 and PR #23435 both modified the behavior of NsenterMounter.IsLikelyNotMountPoint() to fix issues and evidently introduced different issues. So watch out for regressions in any potential fix. CC @jsafrane @fgrzadkowski

Also, have you observed the missing character issue in a non-hyperkube environment, or is there something unique about running locally inside a docker container that causes this to happen?

@jingxu97
Contributor

jingxu97 commented Jun 7, 2016

I see that the isLikelyNotMountPoint() function is implemented in two places: nsenter_mount.go and mount_linux.go. My understanding is that NsenterMounter is only used when the kubelet is running inside a container. nsenter_mount uses the output of the findmnt command to check the mount point. mount_linux.go uses a different approach: it checks whether the directory is on a different device than its parent; if so, it is a mount point.

I also wrote a small Go program to test the "findmnt" command in my local environment; it also truncates the last letter if the mount path is relatively long. So I think this findmnt behavior is not related to the hyperkube environment.
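For comparison, the mount_linux.go style of check described above can be sketched roughly like this (again, not the actual kubelet code): stat the directory and its parent, and treat a differing device number as "this is a mount point".

package main

import (
    "fmt"
    "os"
    "path/filepath"
    "syscall"
)

// isLikelyNotMountPoint is a rough sketch of the device-comparison approach:
// a directory that sits on a different device than its parent is very likely
// a mount point. Bind mounts within the same filesystem are not detected this way.
func isLikelyNotMountPoint(file string) (bool, error) {
    stat, err := os.Stat(file)
    if err != nil {
        return true, err
    }
    parentStat, err := os.Lstat(filepath.Dir(filepath.Clean(file)))
    if err != nil {
        return true, err
    }
    // Same device as the parent directory means it is not a mount point.
    return stat.Sys().(*syscall.Stat_t).Dev == parentStat.Sys().(*syscall.Stat_t).Dev, nil
}

func main() {
    notMnt, err := isLikelyNotMountPoint("/proc")
    fmt.Println(notMnt, err) // typically "false <nil>", since /proc is a mount point
}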

@saad-ali
Member

saad-ali commented Jun 7, 2016

Thanks for the clarification

@luxas
Member

luxas commented Jun 7, 2016

Thanks for debugging this.
May I set the v1.3 milestone, as I think we should fix this before release?

@jingxu97
Contributor

jingxu97 commented Jun 7, 2016

Sure. Thanks!

Jing


@luxas luxas added this to the v1.3 milestone Jun 7, 2016
@luxas luxas added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/backlog Higher priority than priority/awaiting-more-evidence. labels Jun 7, 2016
jingxu97 added a commit to jingxu97/kubernetes that referenced this issue Jun 14, 2016
In nsenter_mount.go/isLikelyNotMountPoint function, the returned output
from findmnt command misses the last letter. Modify the code to make sure
that output has the full target path. fix kubernetes#26421 kubernetes#25056 kubernetes#22911
@dsanders1234

Please note that the findmnt command does not exist on the Docker host VM for the VirtualBox/Windows setup (i.e. boot2docker). I'm pretty sure this will be the case for Mac too. The errors that I get are:

nsenter_mount.go:185] Failed findmnt command for path /var/lib/kubelet/pods/9602feb4-31ba-11e6-bdbe-f62284e3ec86/volumes/kubernetes.io~secret/default-token-4hz3v: exit status 1

and these don't seem to get cleaned up.

k8s-github-robot pushed a commit that referenced this issue Jun 19, 2016
Automatic merge from submit-queue

Fix bug in isLikelyNotMountPoint function

In nsenter_mount.go/isLikelyNotMountPoint function, the returned output
from findmnt command misses the last letter. Modify the code to use
String.contains instead of string matching. fixes #26421 fixes #25056 fixes #22911
@amlinux
Author

amlinux commented Sep 7, 2016

This problem is still there. Right now I'm observing it on my GKE instances (managed VM, Kubernetes 1.3.5):

E0907 00:20:19.615448    3457 kubelet.go:2275] Failed cleaning up orphaned pod directories: remove /var/lib/kubelet/pods/85e13d83-711e-11e6-9db3-42010af0003f/volumes/kubernetes.io~secret/authdb-dsn: device or resource busy
E0907 00:20:19.615520    3457 kubelet.go:2596] Failed cleaning pods: remove /var/lib/kubelet/pods/85e13d83-711e-11e6-9db3-42010af0003f/volumes/kubernetes.io~secret/authdb-dsn: device or resource busy
W0907 00:20:20.638008    3457 container_gc.go:116] Failed to remove dead container "/k8s_POD.d8dbe16c_frontend-511d9_default_e70b06ac-6c7d-11e6-b07d-42010af00048_528d48e3": Error response from daemon: Unable to remove filesystem for f273e3cb9cb9b65e44ded8de5e4dd3215eac08ccc4abc0402d7d4fb849de3e1c: remove /var/lib/docker/containers/f273e3cb9cb9b65e44ded8de5e4dd3215eac08ccc4abc0402d7d4fb849de3e1c/shm: device or resource busy
W0907 00:20:20.641453    3457 container_gc.go:116] Failed to remove dead container "/k8s_POD.d8dbe16c_stationd-v20160831-2-k1xyz_orbital-prod_096e406c-6f3f-11e6-b07d-42010af00048_de6b8b72": Error response from daemon: Unable to remove filesystem for f73afb5773a5321412340d6e29d3936470b9784a98af3874d599a071f2196d5a: remove /var/lib/docker/containers/f73afb5773a5321412340d6e29d3936470b9784a98af3874d599a071f2196d5a/shm: device or resource busy
W0907 00:20:20.649780    3457 container_gc.go:116] Failed to remove dead container "/k8s_POD.d8dbe16c_frontend-v20160831-2-ixopu_orbital-prod_7a4ce4d0-6f3c-11e6-b07d-42010af00048_124ba6ee": Error response from daemon: Unable to remove filesystem for 7616add0f17d3a6c65be02a5f65dfb2585d69062c80193249d90a38bd0068479: remove /var/lib/docker/containers/7616add0f17d3a6c65be02a5f65dfb2585d69062c80193249d90a38bd0068479/shm: device or resource busy
W0907 00:20:20.651825    3457 container_gc.go:116] Failed to remove dead container "/k8s_POD.d8dbe16c_sessionsd-v20160831-2-4srld_orbital-prod_fb6a70d2-6f3f-11e6-b07d-42010af00048_e7b5b4b9": Error response from daemon: Unable to remove filesystem for d37a415eb2f303631297bee47c38dd23f0cb0e214b727d337109ed0b4513af11: remove /var/lib/docker/containers/d37a415eb2f303631297bee47c38dd23f0cb0e214b727d337109ed0b4513af11/shm: device or resource busy
W0907 00:20:20.653428    3457 container_gc.go:116] Failed to remove dead container "/k8s_POD.d8dbe16c_version-assigner-m0gc8_default_8e4e34e0-5d64-11e6-b07d-42010af00048_69f10933": Error response from daemon: Unable to remove filesystem for d55b1f61bbacd01572205b48b3d085030a47352f5889933f303cc3fd7abe1308: remove /var/lib/docker/containers/d55b1f61bbacd01572205b48b3d085030a47352f5889933f303cc3fd7abe1308/shm: device or resource busy
I0907 00:20:21.399689    3457 kubelet.go:2029] Failed to remove orphaned pod "85e13d83-711e-11e6-9db3-42010af0003f" dir; err: remove /var/lib/kubelet/pods/85e13d83-711e-11e6-9db3-42010af0003f/volumes/kubernetes.io~secret/authdb-dsn: device or resource busy
E0907 00:20:21.399941    3457 kubelet.go:2275] Failed cleaning up orphaned pod directories: remove /var/lib/kubelet/pods/85e13d83-711e-11e6-9db3-42010af0003f/volumes/kubernetes.io~secret/authdb-dsn: device or resource busy
E0907 00:20:21.400817    3457 kubelet.go:2596] Failed cleaning pods: remove /var/lib/kubelet/pods/85e13d83-711e-11e6-9db3-42010af0003f/volumes/kubernetes.io~secret/authdb-dsn: device or resource busy

I noticed it after the CPU load jumped from a typical 15% to a crazy 100% on both my nodes:

[CPU utilization graph]

@jingxu97
Contributor

jingxu97 commented Sep 7, 2016

@amlinux Thank you for providing this information. After checking the log offline, the problem should be fixed in v1.3.6. Please upgrade to v1.3.6 and let us know if you have any problem. Thanks!
