GCE PD Volumes already attached to a node fail with "Error 400: The disk resource is already being used by" node #19953
Comments
Saad's already got 3 flakes; I can do some initial digging. @saad-ali let me know if you already started on this. |
Taking a look |
The test moves a PD between two nodes. The first node attached/detached the PD successfully.
On the second node, the call to attach the PD failed with a backend error:
Despite the failure, the call actually attached the disk to [node-v4cj], so further attempts by
Fix: "disk is already attached" errors should result in success, not failure. |
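In rough terms, that fix amounts to something like the sketch below: the attach path treats GCE's "already being used by this node" error as success. The interface and helper names here are illustrative, not the actual gce_pd plugin code.

```go
package gcepd

import (
	"fmt"
	"strings"
)

// cloudProvider abstracts the single call this sketch needs; the real
// plugin talks to the GCE API.
type cloudProvider interface {
	AttachDisk(diskName, nodeName string) error
}

// attachDisk treats "already attached to this same node" as success,
// since the desired state (disk attached to nodeName) already holds.
func attachDisk(cloud cloudProvider, diskName, nodeName string) error {
	err := cloud.AttachDisk(diskName, nodeName)
	if err == nil {
		return nil
	}
	// GCE's HTTP 400 message names the instance currently using the disk;
	// if that instance is the node we are attaching to, report success.
	if strings.Contains(err.Error(), "already being used by") &&
		strings.Contains(err.Error(), nodeName) {
		return nil
	}
	return fmt.Errorf("failed to attach %s to %s: %v", diskName, nodeName, err)
}
```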
Right |
Saad, how deep is your queue? |
@bprashanth Ya, feel free to take #19574 |
We have a new failure at kubernetes-soak-continuous-e2e-gce/4396/; related output from the failing tests:
|
Another failure instance for a PD test: http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-soak-continuous-e2e-gce/4397/ , but a different test case:

Pod Disks should schedule a pod w/two RW PDs both mounted to one container, write to PD, verify contents, delete pod, recreate pod, verify contents, and repeat in rapid succession [Slow]
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/pd.go:266
Expected error:
    <*errors.errorString | 0xc2083d9e50>: {
        s: "gave up waiting for pod 'pd-test-f247fa94-c3bb-11e5-8603-42010af01555' to be 'running' after 15m0s",
    }
    gave up waiting for pod 'pd-test-f247fa94-c3bb-11e5-8603-42010af01555' to be 'running' after 15m0s
not to have occurred

I guess they might have the same cause, so I didn't create a new issue. Feel free to open another one though. |
@kubernetes/rh-storage |
Thanks @dchen1107. Confirmed that both those continuous E2E builds hit the same issue.
Basically this bug exposes the house of cards that is the current state of the attach/detach logic in kubelet. In the existing code there is a race condition between the asynchronous pod creation loop and the orphaned-pod cleanup loop. Although there is some logic to make sure the actual attach/detach operations don't step on each other's toes, there is basically no synchronization between the remaining logic in the two loops responsible for setting up and tearing down volumes, so the loops end up racing.

As currently written, the "fail attach when the disk is already attached" behavior actually masks a very common race that happens when a pod is rapidly created, deleted, and recreated: the second attach operation tends to get triggered slightly before the detach, and in the current code it eventually fails out (since being attached to the same node is considered an error); the detach then goes through and pulls the PD, the attach loop comes back around, retries, and all is well. If we change just this logic ("disk is already attached errors should result in success not failure"), we'll end up exposing a far nastier bug that pulls disks out from under users while they are using them: the disk gets attached, the second attach operation comes in just before the detach and succeeds, then the detach comes along and pulls the disk while it is in use (which looks like data loss to the user).

The correct fix is larger: interlock attach and detach so that they do not operate independently of each other. I'm tackling this as part of the larger attach/detach controller redesign, which likely will not be part of 1.2. Therefore a fix for this is unlikely to be part of 1.2. |
There is similar logic on the GCE PD side to prevent the attach/detach operations from interrupting each other, but that does nothing to guarantee the order of operations, which is the key issue here. That will be addressed by unifying the loops. |
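As a rough illustration of the interlock being described, here is a minimal sketch (hypothetical names, not the eventual controller code) that serializes attach and detach operations on the same volume so they can no longer overlap. The actual redesign goes further: a single loop reconciles desired against actual state and decides which operation should run at all.

```go
package volumeops

import "sync"

// perVolumeSerializer forces attach and detach operations for the same
// volume to run one at a time instead of concurrently. This is a sketch
// of the interlock idea only; ordering decisions still belong to a single
// reconciliation loop comparing desired vs. actual state.
type perVolumeSerializer struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex // volume name -> its serializing lock
}

func newPerVolumeSerializer() *perVolumeSerializer {
	return &perVolumeSerializer{locks: map[string]*sync.Mutex{}}
}

// run executes op while holding the per-volume lock, so an attach and a
// detach of the same volume can never be in flight at the same time.
func (s *perVolumeSerializer) run(volumeName string, op func() error) error {
	s.mu.Lock()
	l, ok := s.locks[volumeName]
	if !ok {
		l = &sync.Mutex{}
		s.locks[volumeName] = l
	}
	s.mu.Unlock()

	l.Lock()
	defer l.Unlock()
	return op()
}
```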
IIUC, the FIX is O(hard). Is there a short term mitigation? |
Looking into it |
any thoughts on this? If the answer is "not until the real fix", that's what we need to know. |
I played with a bunch of hacks, like skipping the teardown step if there are pending setups in progress for a given volume. But they all end up destabilizing the system even more. I'm tempted to just implement the "real fix", but that seems like a bad idea to get in at the last minute. I'll keep experimenting, and see if I can get something before Friday, but let's go with "not until the real fix" for now. |
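For context, the kind of hack being described might look like the sketch below: the cleanup loop skips TearDown for a volume while a newer SetUp is still in flight. All names are illustrative, and, as noted above, this approach was abandoned because it destabilized the system further.

```go
package volumeops

import "sync"

// pendingSetups is a sketch of the discarded mitigation: track volumes
// with a SetUp in flight, and have the orphaned-pod cleanup loop skip
// TearDown for those volumes.
type pendingSetups struct {
	mu     sync.Mutex
	counts map[string]int // volume name -> in-flight SetUp count
}

func (p *pendingSetups) beginSetUp(volume string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.counts == nil {
		p.counts = map[string]int{}
	}
	p.counts[volume]++
}

func (p *pendingSetups) endSetUp(volume string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.counts[volume] > 0 {
		p.counts[volume]--
	}
}

// shouldSkipTearDown reports whether the cleanup loop should leave the
// volume alone because a newer SetUp is racing with it.
func (p *pendingSetups) shouldSkipTearDown(volume string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.counts[volume] > 0
}
```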
Consider flipping this to a "[Feature:PD]" test if you need time, since it sounds like the problem is well understood |
Proposed design here: #21931. Implementation in progress. |
@saad-ali - any update on this? |
This is breaking often enough that it's bypassing my submit queue optimization - can we move this test to flaky until we get it fixed? |
Specifically the test "Pod Disks should schedule a pod w/two RW PDs both mounted to one container, write to PD, verify contents, delete pod, recreate pod, verify contents, and repeat in rapid succession" |
Automatic merge from submit-queue

Attach/Detach Controller Kubelet Changes

This PR contains changes to enable the attach/detach controller proposed in #20262. Specifically it:
* Introduces a new `enable-controller-attach-detach` kubelet flag to enable control by the attach/detach controller. Default enabled.
* Removes all references to the `SafeToDetach` annotation from the controller.
* Adds the new `VolumesInUse` field to the Node Status API object.
* Modifies the controller to use `VolumesInUse` instead of the `SafeToDetach` annotation to gate detachment.
* Modifies kubelet to set `VolumesInUse` before Mount and after Unmount.
  * There is a bug in the `node-problem-detector` binary that causes `VolumesInUse` to get reset to nil every 30 seconds. Issue kubernetes/node-problem-detector#9 (comment) opened to fix that.
  * There is a bug here in the mount/unmount code that prevents resetting `VolumesInUse` in some cases; this will be fixed by the mount/unmount refactor.
* Has the controller process detaches before attaches, so that volumes referenced by pods that are rescheduled to a different node are detached first.
* Fixes misc bugs in the controller.
* Modifies the GCE attacher to remove retries, remove the mutex, and not fail if the volume is already attached or already detached.

Fixes #14642, #19953

```release-note
Kubernetes v1.3 introduces a new Attach/Detach Controller. This controller manages attaching and detaching volumes on behalf of nodes that have the "volumes.kubernetes.io/controller-managed-attach-detach" annotation.

A kubelet flag, "enable-controller-attach-detach" (default true), controls whether a node sets the "controller-managed-attach-detach" annotation or not.
```
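To make the `VolumesInUse` gating concrete, here is a hedged sketch of the check the controller performs before issuing a detach, written against today's `k8s.io/api/core/v1` types for readability; the helper name is illustrative, not the actual controller code.

```go
package attachdetach

import v1 "k8s.io/api/core/v1"

// safeToDetach sketches the gate described above: only detach a volume
// once the kubelet has removed it from the node's Status.VolumesInUse
// list, i.e. it is unmounted and no longer in use on that node.
func safeToDetach(node *v1.Node, volumeName v1.UniqueVolumeName) bool {
	for _, inUse := range node.Status.VolumesInUse {
		if inUse == volumeName {
			// Kubelet still reports the volume as in use; detaching now
			// could pull the disk out from under a running pod.
			return false
		}
	}
	return true
}
```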
Is this one now fixed since #26351 went in? |
Yes, this is fixed with #26351 |
Just got this error with a 2-node cluster. The number of replicas is 1, and I've used the deployment update type Recreate. From the events I've noticed that the pod is created before the disk gets detached from one node and attached to the other (it gets attached after some time). The documentation doesn't state whether there's a retry mechanism in place. I've checked the attachment of the disk by running:
It may be a Google Container Engine issue. |
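The exact command used above isn't shown, but for reference, one way to inspect which instances a GCE PD is currently attached to is the disk's `users` field in the Compute API. A minimal Go sketch follows; the project, zone, and disk names are placeholders.

```go
package main

import (
	"context"
	"fmt"
	"log"

	compute "google.golang.org/api/compute/v1"
)

func main() {
	ctx := context.Background()
	// Uses Application Default Credentials; project, zone, and disk
	// names below are placeholders.
	svc, err := compute.NewService(ctx)
	if err != nil {
		log.Fatalf("creating compute client: %v", err)
	}
	disk, err := svc.Disks.Get("my-project", "us-central1-a", "my-pd").Context(ctx).Do()
	if err != nil {
		log.Fatalf("getting disk: %v", err)
	}
	// The users field lists the instances the disk is currently attached to.
	fmt.Println("attached to:", disk.Users)
}
```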
I am also seeing this issue with a 1.5 cluster on GCE |
To pile on, I am seeing this issue on a 1.6.2 GKE cluster. We have a persistent group and a preemptible one, and we've started seeing this quite a bit recently. I believe this may be a regression in 1.6; I have seen it happen before, but it usually cleared up on its own. Now it requires manual intervention. |
We're also seeing this on a 1.6.4 GKE cluster. When a pod with a PD is destroyed, the re-created one fails with "The disk resource [...] is already being used by [...]". |
Also seeing this. In my case, it's being attached to a pod run by a job, and when that job runs to completion, the disk isn't being detached. Resource is here: https://github.com/andrewhowdencom/m2onk8s/blob/master/deploy/helm/charts/magento/templates/install.job.yaml It's being consumed by the deployment in that same folder.
@bbhoss @shimizuf perhaps we should open a new issue rather than trying to necro this one? |
Please re-open, this is still happening with v1.6.4 |
Instead of re-opening this bug, I've created a new bug: #48968 Please carry on conversation there |
Original title: e2e failure: Pod Disks should schedule a pod w/ a RW PD, remove it, then schedule it on another host #19953
From http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gke-1.1/1506/
Pod Disks should schedule a pod w/ a RW PD, remove it, then schedule it on another host
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/pd.go:124
Expected error:
    <*errors.errorString | 0xc208ce6950>: {
        s: "gave up waiting for pod 'pd-test-893001f0-c07f-11e5-9b35-42010af01555' to be 'running' after 15m0s",
    }
    gave up waiting for pod 'pd-test-893001f0-c07f-11e5-9b35-42010af01555' to be 'running' after 15m0s
not to have occurred
@saad-ali as team/cluster PD expert?