Skip safe to detach check if node API object no longer exists #30737
Conversation
Force-pushed from d173e03 to 0c72568 (compare)
    volumeToDetach.NodeName,
    volumeToDetach.VolumeName,
    volumeToDetach.VolumeSpec.Name())
return nil
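For context, the pattern under discussion can be sketched as follows: if the node API object has been deleted, there is no node status left to consult, so the safe-to-detach check is skipped rather than blocking the detach forever. This is a minimal sketch against the current client-go API, not the actual PR code; the function shape and names are illustrative.

```go
package reconciler

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/klog/v2"
)

// verifySafeToDetach sketches the check this PR changes: before detaching, it
// confirms the node no longer reports the volume as in use. If the node API
// object itself is gone, the check is skipped by returning nil.
func verifySafeToDetach(ctx context.Context, client kubernetes.Interface, nodeName, volumeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		// The node object no longer exists, so its status can never be
		// updated to report the volume as unmounted; waiting would block
		// the detach forever. Skip the check.
		klog.V(2).Infof("node %q not found; skipping safe-to-detach check for volume %q", nodeName, volumeName)
		return nil
	}
	if err != nil {
		// Any other API error keeps the detach blocked until the next retry.
		return fmt.Errorf("failed to fetch node %q: %w", nodeName, err)
	}
	for _, inUse := range node.Status.VolumesInUse {
		if string(inUse) == volumeName {
			return fmt.Errorf("volume %q is still in use on node %q", volumeName, nodeName)
		}
	}
	return nil
}
```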
By returning nil here, we can unblock DetachVolume in operation_executor. But from our current workflow, something is still not very clear to me. When the reconciler tries to attach the volume through the GCE service, the following cases might happen:
- If the volume is still attached to another node, the attach will fail.
- If the node to which the volume is originally attached is removed from the API server (by kubectl delete), the attach fails at first but eventually succeeds, because the volume is detached after a timeout (waiting for the unmount status update). In this case verifySafeToDetach is set to false.
- If the node to which the volume is originally attached is deleted (by a gcloud command), will the attach succeed? I am not sure whether verifySafeToDetach is set to true or false in the reconciler in this situation.
- If the previous operation on the volume failed, it will be blocked by the exponential backoff.
So it is not clear to me why, in issue #29903, during a node upgrade, the subsequent attach to the new node should succeed once the old node is deleted. Also, from the kubelet log I checked, kubelet's verifyVolumeAttached succeeded because the volume was in the new node's attached list, but it failed when checking the devicePath before trying to mount the volume. I suspect there might be a race condition in the attach/detach controller.
So I just want to make sure we cover all the cases.
> By returning nil here, we can unblock DetachVolume in operation_executor.
Returning nil will not block the operation_executor; returning an error will.
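A small self-contained sketch of that nil-versus-error distinction, folded together with the backoff point raised above: an error records a failure and blocks retries for an exponentially growing window, while nil (including the deliberate nil from the skipped check) clears it. The types and function here are illustrative models, not the actual operation_executor or goroutinemap API.

```go
package main

import (
	"fmt"
	"time"
)

// opState tracks the failure count and backoff window for one volume.
type opState struct {
	failures int
	nextTry  time.Time
}

// runOperation models an executor with per-volume exponential backoff:
// an error doubles the wait before the next attempt, nil resets it.
func runOperation(states map[string]*opState, volume string, op func() error) error {
	st, ok := states[volume]
	if !ok {
		st = &opState{}
		states[volume] = st
	}
	if time.Now().Before(st.nextTry) {
		// A recent attempt failed; this attempt is blocked by backoff.
		return fmt.Errorf("operation on %q in backoff until %v", volume, st.nextTry)
	}
	if err := op(); err != nil {
		st.failures++
		st.nextTry = time.Now().Add(time.Second << uint(st.failures-1)) // 1s, 2s, 4s, ...
		return err
	}
	// Success, including a deliberate nil from a skipped check, clears backoff.
	st.failures = 0
	st.nextTry = time.Time{}
	return nil
}

func main() {
	states := map[string]*opState{}
	detach := func() error { return nil } // e.g. the skipped safe-to-detach check returning nil
	fmt.Println(runOperation(states, "vol-1", detach)) // prints <nil>: the detach proceeds
}
```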
> If the volume is still attached to another node, the attach will fail.
Correct.
> If the node to which the volume is originally attached is removed from the API server (by kubectl delete), the attach fails at first but eventually succeeds, because the volume is detached after a timeout (waiting for the unmount status update). In this case verifySafeToDetach is set to false.
Detach from the original node would fail because of the bug this PR is fixing. After this PR, what you stated will be correct.
> If the node to which the volume is originally attached is deleted (by a gcloud command), will the attach succeed? I am not sure whether verifySafeToDetach is set to true or false in the reconciler in this situation.
Yes, the volume will be detached thanks to PR #29485, and it will be OK to reattach to another node.
> If the previous operation on the volume failed, it will be blocked by the exponential backoff.
Not sure what you mean here.
> So it is not clear to me why, in issue #29903, during a node upgrade, the subsequent attach to the new node should succeed once the old node is deleted. Also, from the kubelet log I checked, kubelet's verifyVolumeAttached succeeded because the volume was in the new node's attached list, but it failed when checking the devicePath before trying to mount the volume. I suspect there might be a race condition in the attach/detach controller.
The issue was that when the cluster is upgraded, the old machine is deleted/removed from the API server. Then the master would 1) be unable to correctly update the list of attached volumes and 2) not be able to detach the volume with the node object missing from the API server. This PR fixes those bugs.
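As a rough illustration of how a controller can notice that a node object was removed from the API server, a generic client-go informer sketch follows; the delete handler body is illustrative and not the attach/detach controller's actual code.

```go
package controller

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// watchNodeDeletes wires a delete handler so a controller can react when a
// node object disappears from the API server (e.g. deleted during an upgrade).
func watchNodeDeletes(client kubernetes.Interface, stopCh <-chan struct{}) {
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	nodeInformer := factory.Core().V1().Nodes().Informer()
	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		DeleteFunc: func(obj interface{}) {
			node, ok := obj.(*v1.Node)
			if !ok {
				// Deletes can arrive wrapped in a tombstone if an event
				// was missed while disconnected.
				tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
				if !ok {
					return
				}
				if node, ok = tombstone.Obj.(*v1.Node); !ok {
					return
				}
			}
			// Here the attach/detach controller would mark the volumes
			// attached to this node as detachable without waiting for the
			// (now impossible) safe-to-detach status update.
			fmt.Printf("node %s deleted; its volumes may be detached\n", node.Name)
		},
	})
	factory.Start(stopCh)
	cache.WaitForCacheSync(stopCh, nodeInformer.HasSynced)
}
```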
GCE e2e build/test passed for commit 0c72568.
The PR LGTM, but I am not very sure about the logic of setting verifySafeToDetach: in which situations we should verify and in which we don't need to. But we can discuss that in a separate thread.
Yes, we can follow up offline. Adding the LGTM label to get this merged.
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]
GCE e2e build/test passed for commit 0c72568.
Automatic merge from submit-queue
@saad-ali Given the comment above, do you still want to cherry-pick this PR? Or should I wait?
Yes, this PR fixes an existing issue and needs to be cherry-picked (I'll prepare one). We'll debug and triage any other issues that we discover independently of this.
Commit found in the "release-1.3" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked.
…ck-of-#30737-upstream-release-1.3
Automatic merge from submit-queue

Automated cherry pick of kubernetes#30737 upstream release 1.3

Automated cherry pick of PR kubernetes#30737 ("Skip safe to detach if node api obj doesn't exist") to upstream release 1.3.
Fixes #29358