Error while tearing down pod, "device or resource busy" on service account secret #28750
This delays pod shutdown by over a minute, which means that deletes take a lot longer to complete.
Kernel:
@kubernetes/sig-storage
This is also happening on pod startup. I think roughly 10% of pods are failing their initial setup.
It appears that this did not resolve within 1m, which caused another timeout to fail. Not sure if it eventually clears up.
Hi Clayton, thanks for posting the log messages. Do you also have the full log file so we can dig into it further? Jing
@kubernetes/sig-storage Targeting this to 1.3 until we know more about the root cause.
We're trying to get a loglevel=5 reproducer; we had log level 4 at the time. One consideration: this is from a single-node run, so we have a lot of pods concentrated on one node.
And we have lots more secrets per namespace.
This might be a dup of #28616.
But it does not explain this failure:
It does, except we have that patch in place. I'm currently trying to get a reproducer. Is the parallelization issue already prioritized, or is it something we can assist with?
Thanks Clayton. I'm working on it and have a good handle on it. Tracking it with #28616; the patch will be included in 1.3.1. I'll make sure to loop you in on the fix.
Facing the same issue in 1.3. I also notice that the files under /var/lib/kubelet/ are not removed even after deleting the pod, and I am still seeing the message a day after the pod was deleted. Pod startup is also failing in the default namespace with the following message. So I removed all pods running on the cluster and found something interesting: there are no pods running on the cluster, yet the following logs show up in the kubelet.
I'm seeing this problem on CentOS 7 + flannel + Kubernetes in Docker. For me this error always happens on the kubernetes-dashboard pod. I don't know how hyperkube runs all the pods (e.g. in serial or in parallel).
Hi ShashankYadav, thank you for posting the log message. In a normal situation, those secret files should be removed when the pod is deleted. Jing
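To check whether that cleanup actually happened on a node, a minimal standalone sketch (not part of Kubernetes itself, and assuming the default kubelet root of /var/lib/kubelet; adjust if the kubelet runs with a different --root-dir) can list whatever secret volume directories remain under the pods directory:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Lists per-pod secret volume directories under the kubelet's data dir.
// After a pod is deleted and its volumes are torn down, its
// .../volumes/kubernetes.io~secret/<name> directories should disappear;
// entries that linger for pods that no longer exist point at the cleanup
// problem described above.
func main() {
	root := "/var/lib/kubelet/pods" // default kubelet root; assumption for this sketch
	matches, err := filepath.Glob(filepath.Join(root, "*", "volumes", "kubernetes.io~secret", "*"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, m := range matches {
		fmt.Println(m)
	}
}
```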
To be clear, based on what Clayton reported above, there are two issues:
1. Mount/unmount of the secret volume sometimes fails with "device or resource busy".
2. When that happens, other pods are prevented from mounting the same secret.
We can leave this issue open to track the first issue.
Debugged this some more (thanks to @justinsb for repro logs).
@jingxu97 I am attaching the full logs. I did not encounter this issue when using a different namespace (other than default) for the pods.
Until kubernetes/kubernetes#28616 and kubernetes/kubernetes#28750 are fixed, Deis Workflow cannot start on Kubernetes 1.3.x.
Now that the fix for 2 has been merged, the bite of this bug should be greatly reduced (it won't prevent other pods from mounting the same secret). However, we still need to understand 1: what causes secret mounts to sometimes fail with "device or resource busy".
Since we can reproduce this reliably, we're happy to dump any info on unmount during our runs and gather it if we know what to add.
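One kind of info that could be dumped when the unmount reports EBUSY is which processes still reference the mount point in their mount namespaces (open files, a cwd inside the mount, or the mount having propagated into another namespace are the usual reasons a tmpfs stays busy). A rough standalone sketch of that scan, not kubelet code and with an invented program name, reading /proc/<pid>/mountinfo:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: busyscan <mount-path>")
		os.Exit(1)
	}
	target := os.Args[1]

	procDirs, _ := filepath.Glob("/proc/[0-9]*")
	for _, dir := range procDirs {
		f, err := os.Open(filepath.Join(dir, "mountinfo"))
		if err != nil {
			continue // process may have exited, or we lack permission
		}
		scanner := bufio.NewScanner(f)
		for scanner.Scan() {
			// The 5th field of /proc/<pid>/mountinfo is the mount point.
			fields := strings.Fields(scanner.Text())
			if len(fields) >= 5 && strings.HasPrefix(fields[4], target) {
				fmt.Printf("pid %s still sees mount %s\n", filepath.Base(dir), fields[4])
				break
			}
		}
		f.Close()
	}
}
```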
I'm thinking this may be the same as, or similar to, #22911 (comment), which would mean the underlying medium is being detected incorrectly, causing the wrong tear-down routine to be used. So to debug this I would start with extra logging around how the volume's medium is detected during tear down.
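As a rough illustration of what "detecting the underlying medium" means here (this is not the kubelet's actual code path; the helper below is invented for the example): a secret volume is expected to be memory-backed, so a statfs on its mount point should report tmpfs, and a mismatch would point at the misdetection suspected above.

```go
package main

import (
	"fmt"
	"log"
	"os"

	"golang.org/x/sys/unix"
)

// isTmpfs reports whether the filesystem backing path is tmpfs. A secret
// volume should be memory-backed, so a mismatch here would suggest the
// medium is being misdetected and the wrong tear-down path could be taken.
func isTmpfs(path string) (bool, error) {
	var st unix.Statfs_t
	if err := unix.Statfs(path, &st); err != nil {
		return false, err
	}
	return st.Type == unix.TMPFS_MAGIC, nil
}

func main() {
	if len(os.Args) != 2 {
		log.Fatal("usage: mediumcheck <path>")
	}
	tmpfs, err := isTmpfs(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s on tmpfs: %v\n", os.Args[1], tmpfs)
}
```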
@smarterclayton, I am working on this bug. Could you please list the steps to reproduce this error so I can debug it further? Thanks a lot!
We've run into this same issue. Does anyone have a temporary workaround they've found until the fix is available? We're having a hard time deploying due to certain services not being able to create new pods. Any help is greatly appreciated :)
Hi Kyle, I am trying to investigate this problem. Could you please send us the logs? Best, Jing
We are still in the process of investigating the problem, but I want to add that the behavior of preventing subsequent mount operations has been fixed. Even if the error still occurs, after this fix you should be able to start new pods that mount the same secret. Jing
@pmorie was looking into this on our end (so folks know) - he says he has some findings.
Great! @pmorie could you share the findings you have? Thank you! Jing
So far I've spent most of my time looking at the mount timeouts. I have findings on that to report (still figuring out the right place), but not yet on the umount issues.
@jingxu97 From a cursory glance at the logs I couldn't find anything, and it was a couple of days ago. I'll try to recreate it if I get some time later and capture it. We're on GKE and don't have the ability to roll back the master, so we're kind of stranded. Some of our services are stuck and unable to deploy, and we have scheduled tasks (daily reports, etc.) that are very intermittent. It seems to depend on which node they get scheduled on. I might try to find a healthy node and force jobs to run there in the meantime to see if that will work as a stopgap until a fix is available.
@kyleferguson, we fixed a few issues recently in version 1.3.4. Could you please try this new version and let us know if you still see any issues? Thanks!
@jingxu97 Thanks for the update! Unfortunately we're on GKE at the moment, so I will have to wait until 1.3.4 is available there. Looks like the latest available currently is 1.3.3.
We believe that 1 was caused by issue #29555, which was fixed by PR #29673. Patches for both have been cherry-picked to the v1.3 release branch and are part of the v1.3.4 release. Closing this issue as resolved.
I'm seeing errors in the kubelet now on 1.3.0+ that indicate a timeout on unmount of secrets mounted into the pod, which looks like it causes some delay in shutdown (although I'm not positive, since this was a long run).
Occurred 4 times for roughly 30 pods. This is a rhel72 system, but I don't have the kernel version handy. @kubernetes/rh-storage, what sort of frequency should we expect for a "device or resource busy" on unmount of a tmpfs?
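For context on why a busy tmpfs delays teardown by so long (this is an illustrative standalone sketch, not the kubelet's actual unmount code; the retry count and backoff are made-up numbers): an EBUSY from umount means something still references the mount, so the caller can only retry with backoff until the reference goes away or the retry budget runs out.

```go
package main

import (
	"fmt"
	"os"
	"time"

	"golang.org/x/sys/unix"
)

// unmountWithRetry retries an unmount that fails with EBUSY, which is
// roughly the situation described in this issue: the tmpfs backing a
// secret volume stays busy for a while, so tear-down is delayed until
// either the unmount succeeds or the retry budget is exhausted.
func unmountWithRetry(path string, attempts int, backoff time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		err = unix.Unmount(path, 0)
		if err == nil {
			return nil
		}
		if err != unix.EBUSY {
			return fmt.Errorf("unmount %s: %v", path, err)
		}
		fmt.Fprintf(os.Stderr, "unmount %s busy, retrying in %v\n", path, backoff)
		time.Sleep(backoff)
	}
	return fmt.Errorf("unmount %s: still busy after %d attempts: %v", path, attempts, err)
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: unmount-retry <mount-path>")
		os.Exit(1)
	}
	if err := unmountWithRetry(os.Args[1], 10, 6*time.Second); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```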