
Skip safe to detach check if node API object no longer exists #30737

Merged
k8s-github-robot merged 1 commit into kubernetes:master from saad-ali:fix29358Round2 on Aug 18, 2016

Conversation

@saad-ali (Member) commented Aug 17, 2016

Fixes #29358



@saad-ali saad-ali added release-note Denotes a PR that will be considered when it comes time to generate release notes. cherrypick-candidate labels Aug 17, 2016
@saad-ali saad-ali added this to the v1.3 milestone Aug 17, 2016
@k8s-github-robot k8s-github-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Aug 17, 2016
@matchstick (Contributor)

👍

Inline review comment on the following code context (the tail of the early return this PR adds):

		volumeToDetach.NodeName,
		volumeToDetach.VolumeName,
		volumeToDetach.VolumeSpec.Name())
	return nil
@jingxu97 (Contributor) commented Aug 17, 2016

By returning nil here, it can unblock DetachVolume in operation_executor. But from our current workflow, something is still not very clear to me. When the reconciler tries to attach the volume through the GCE service, the following cases might happen:

  1. If the volume is still attached to another node, attach will fail.
  2. If the node to which the volume was originally attached is removed from the API server (by kubectl delete), attach fails at first but eventually succeeds, because the volume is detached after a timeout (waiting for the unmount status update). In this case verifySafeToDetach is set to false.
  3. If the node to which the volume was originally attached is deleted (by a gcloud command), will attach succeed? I am not quite sure whether verifySafeToDetach is set to true or false in the reconciler in this situation.
  4. If the previous operation on the volume failed, it will block on backoff.

So it is not clear to me whether, in issue #29903, the subsequent attach to the new node should succeed if the old node is deleted during a node upgrade. Also, from the kubelet log I checked, kubelet verifyVolumeAttached succeeded because the volume was in the attached list of the new node, but it failed when checking the devicePath before trying to mount the volume. I am thinking there might be a race condition in the attach/detach controller.

So I just want to make sure we cover all the cases.

@saad-ali (Member, Author) replied:

> By returning nil here, it can unblock DetachVolume in operation_executor.

Returning nil will not block the operation_executor; returning an error will.

> 1. If the volume is still attached to another node, attach will fail.

Correct.

> 2. If the node to which the volume was originally attached is removed from the API server (by kubectl delete), attach fails at first but eventually succeeds, because the volume is detached after a timeout (waiting for the unmount status update). In this case verifySafeToDetach is set to false.

Detach from the original node would fail because of the bug this PR is fixing. After this PR, what you stated will be correct.

> 3. If the node to which the volume was originally attached is deleted (by a gcloud command), will attach succeed? I am not quite sure whether verifySafeToDetach is set to true or false in the reconciler in this situation.

Yes, the volume will be detached thanks to PR #29485, and it will be OK to reattach it to another node.

> 4. If the previous operation on the volume failed, it will block on backoff.

Not sure what you mean here.

> So it is not clear to me whether, in issue #29903, the subsequent attach to the new node should succeed if the old node is deleted during a node upgrade. Also, from the kubelet log I checked, kubelet verifyVolumeAttached succeeded because the volume was in the attached list of the new node, but it failed when checking the devicePath before trying to mount the volume. I am thinking there might be a race condition in the attach/detach controller.

The issue was that when the cluster was upgraded, the old machine was deleted/removed from the API server. The master was then 1) unable to correctly update the list of attached volumes, and 2) unable to detach the volume because the node object was missing from the API server. This PR fixes both bugs.
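
(For context: a minimal Go sketch of the early-return pattern described above. The fetchNode helper, the errNodeNotFound sentinel, the struct fields, and the log wording are illustrative assumptions, not the PR's exact code.)

```go
package main

import (
	"errors"
	"fmt"
)

// errNodeNotFound stands in for the API server's NotFound error
// (the real controller checks a typed NotFound error).
var errNodeNotFound = errors.New("node not found")

// volumeToDetach mirrors the fields visible in the review snippet above.
type volumeToDetach struct {
	NodeName   string
	VolumeName string
}

// fetchNode is a hypothetical stand-in for the controller's node GET;
// here it always reports the node object as gone.
func fetchNode(name string) (interface{}, error) {
	return nil, errNodeNotFound
}

// verifySafeToDetach sketches the pattern under discussion: if the node
// API object no longer exists, skip the safe-to-detach check and return
// nil so DetachVolume can proceed; any other error blocks and retries.
func verifySafeToDetach(v volumeToDetach) error {
	node, err := fetchNode(v.NodeName)
	if err == errNodeNotFound {
		fmt.Printf("node %q not found; skipping safe-to-detach check for volume %q\n",
			v.NodeName, v.VolumeName)
		return nil // returning nil unblocks DetachVolume in operation_executor
	}
	if err != nil {
		return err // returning an error blocks the operation until retried
	}
	_ = node // the normal path would consult node.Status.VolumesInUse here
	return nil
}

func main() {
	_ = verifySafeToDetach(volumeToDetach{NodeName: "node-1", VolumeName: "pd-1"})
}
```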

@saad-ali (Member, Author)

@k8s-bot e2e test this issue: #27524

@k8s-bot commented Aug 17, 2016

GCE e2e build/test passed for commit 0c72568.

@jingxu97 (Contributor)

The PR LGTM, but I am not very sure about the logic of setting verifySafeToDetach: in which situations should we verify, and in which don't we need to? But we can discuss that in a different thread.

@saad-ali (Member, Author)

> The PR LGTM, but I am not very sure about the logic of setting verifySafeToDetach: in which situations should we verify, and in which don't we need to? But we can discuss that in a different thread.

Yes, we can follow up offline. Adding the LGTM label to get this merged.
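
(For context: a minimal Go sketch of the verifySafeToDetach decision as this thread describes it, i.e. case 2 in the earlier comment: verify while still within the unmount wait window, then force the detach without verification once the wait times out. maxWaitForUnmount, maybeDetach, and detachVolume are illustrative names, not the controller's exact identifiers.)

```go
package main

import (
	"fmt"
	"time"
)

// maxWaitForUnmount is an illustrative stand-in for the controller's
// unmount-status timeout.
const maxWaitForUnmount = 6 * time.Minute

type attachedVolume struct {
	VolumeName          string
	detachRequestedTime time.Time
}

// detachVolume is a hypothetical stand-in for operation_executor.DetachVolume.
func detachVolume(v attachedVolume, verifySafeToDetach bool) error {
	fmt.Printf("detaching %q (verifySafeToDetach=%v)\n", v.VolumeName, verifySafeToDetach)
	return nil
}

// maybeDetach sketches the decision discussed above: verify safe-to-detach
// while still waiting for the node to report the volume unmounted, then
// force the detach with verification disabled once the wait has timed out.
func maybeDetach(v attachedVolume) error {
	verifySafeToDetach := time.Since(v.detachRequestedTime) <= maxWaitForUnmount
	return detachVolume(v, verifySafeToDetach)
}

func main() {
	// Detach requested 10 minutes ago: past the timeout, so verification is skipped.
	_ = maybeDetach(attachedVolume{
		VolumeName:          "pd-1",
		detachRequestedTime: time.Now().Add(-10 * time.Minute),
	})
}
```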

@saad-ali saad-ali added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 18, 2016
@k8s-github-robot

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-bot commented Aug 18, 2016

GCE e2e build/test passed for commit 0c72568.

@k8s-github-robot

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit 9696a27 into kubernetes:master Aug 18, 2016
@fabioy (Contributor) commented Aug 18, 2016

@saad-ali Given the comment above, do you still want to cherry-pick this PR? Or should I wait?

@saad-ali (Member, Author)

Yes, this PR fixes an existing issue and needs to be cherry-picked (I'll prepare one). We'll debug and triage any other issues that we discover independently of this.

@fabioy fabioy added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Aug 18, 2016
k8s-github-robot pushed a commit that referenced this pull request Aug 19, 2016
…37-upstream-release-1.3

Automatic merge from submit-queue

Automated cherry pick of #30737 upstream release 1.3

Automated cherry pick of PR #30737 ("Skip safe to detach if node api obj doesn't exist") to upstream release 1.3.
@k8s-cherrypick-bot

Commit found in the "release-1.3" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error, find help to get your PR picked.

@saad-ali saad-ali deleted the fix29358Round2 branch November 19, 2016 01:07
shyamjvs pushed a commit to shyamjvs/kubernetes that referenced this pull request Dec 1, 2016
…ck-of-#30737-upstream-release-1.3

Automatic merge from submit-queue

Automated cherry pick of kubernetes#30737 upstream release 1.3

Automated cherry pick of PR kubernetes#30737 ("Skip safe to detach if node api obj doesn't exist") to upstream release 1.3.