Skip safe to detach check if node API object no longer exists #30737
Conversation
Force-pushed from d173e03 to 0c72568 (compare)
    volumeToDetach.NodeName,
    volumeToDetach.VolumeName,
    volumeToDetach.VolumeSpec.Name())
return nil
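For context, the pattern under discussion can be sketched as follows: if the node API object has been deleted, there is no node status left to consult, so the safe-to-detach check is skipped rather than blocking the detach forever. This is a minimal sketch against the current client-go API, not the actual PR code; the function shape and names are illustrative.

```go
package reconciler

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/klog/v2"
)

// verifySafeToDetach sketches the check this PR changes: before detaching, it
// confirms the node no longer reports the volume as in use. If the node API
// object itself is gone, the check is skipped by returning nil.
func verifySafeToDetach(ctx context.Context, client kubernetes.Interface, nodeName, volumeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		// The node object no longer exists, so its status can never be
		// updated to report the volume as unmounted; waiting would block
		// the detach forever. Skip the check.
		klog.V(2).Infof("node %q not found; skipping safe-to-detach check for volume %q", nodeName, volumeName)
		return nil
	}
	if err != nil {
		// Any other API error keeps the detach blocked until the next retry.
		return fmt.Errorf("failed to fetch node %q: %w", nodeName, err)
	}
	for _, inUse := range node.Status.VolumesInUse {
		if string(inUse) == volumeName {
			return fmt.Errorf("volume %q is still in use on node %q", volumeName, nodeName)
		}
	}
	return nil
}
```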
By returning nil here, we can unblock DetachVolume in operation_executor. But from our current workflow, something is still not very clear to me. When the reconciler tries to attach the volume through the GCE service, the following cases might happen:
- If the volume is still attached to another node, the attach will fail.
- If the node to which the volume is originally attached is removed from the API server (by kubectl delete), the attach fails at first but eventually succeeds, because the volume is detached after a timeout (waiting for the unmount status update). In this case verifySafeToDetach is set to false.
- If the node to which the volume is originally attached is deleted (by a gcloud command), will the attach succeed? I am not sure whether verifySafeToDetach is set to true or false in the reconciler in this situation.
- If the previous operation on the volume failed, it will be blocked by the exponential backoff.
So it is not clear to me why, in issue #29903, during a node upgrade, the subsequent attach to the new node should succeed once the old node is deleted. Also, from the kubelet log I checked, kubelet's verifyVolumeAttached succeeded because the volume was in the new node's attached list, but it failed when checking the devicePath before trying to mount the volume. I suspect there might be a race condition in the attach/detach controller.
So I just want to make sure we cover all the cases.
> By returning nil here, we can unblock DetachVolume in operation_executor.
Returning nil will not block the operation_executor; returning an error will.
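A small self-contained sketch of that nil-versus-error distinction, folded together with the backoff point raised above: an error records a failure and blocks retries for an exponentially growing window, while nil (including the deliberate nil from the skipped check) clears it. The types and function here are illustrative models, not the actual operation_executor or goroutinemap API.

```go
package main

import (
	"fmt"
	"time"
)

// opState tracks the failure count and backoff window for one volume.
type opState struct {
	failures int
	nextTry  time.Time
}

// runOperation models an executor with per-volume exponential backoff:
// an error doubles the wait before the next attempt, nil resets it.
func runOperation(states map[string]*opState, volume string, op func() error) error {
	st, ok := states[volume]
	if !ok {
		st = &opState{}
		states[volume] = st
	}
	if time.Now().Before(st.nextTry) {
		// A recent attempt failed; this attempt is blocked by backoff.
		return fmt.Errorf("operation on %q in backoff until %v", volume, st.nextTry)
	}
	if err := op(); err != nil {
		st.failures++
		st.nextTry = time.Now().Add(time.Second << uint(st.failures-1)) // 1s, 2s, 4s, ...
		return err
	}
	// Success, including a deliberate nil from a skipped check, clears backoff.
	st.failures = 0
	st.nextTry = time.Time{}
	return nil
}

func main() {
	states := map[string]*opState{}
	detach := func() error { return nil } // e.g. the skipped safe-to-detach check returning nil
	fmt.Println(runOperation(states, "vol-1", detach)) // prints <nil>: the detach proceeds
}
```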
> If the volume is still attached to another node, the attach will fail.
Correct.
> If the node to which the volume is originally attached is removed from the API server (by kubectl delete), the attach fails at first but eventually succeeds, because the volume is detached after a timeout (waiting for the unmount status update). In this case verifySafeToDetach is set to false.
Detach from the original node would fail because of the bug this PR is fixing. After this PR, what you stated will be correct.
> If the node to which the volume is originally attached is deleted (by a gcloud command), will the attach succeed? I am not sure whether verifySafeToDetach is set to true or false in the reconciler in this situation.
Yes, the volume will be detached thanks to PR #29485, and it will be OK to reattach to another node.
> If the previous operation on the volume failed, it will be blocked by the exponential backoff.
Not sure what you mean here.
> So it is not clear to me why, in issue #29903, during a node upgrade, the subsequent attach to the new node should succeed once the old node is deleted. Also, from the kubelet log I checked, kubelet's verifyVolumeAttached succeeded because the volume was in the new node's attached list, but it failed when checking the devicePath before trying to mount the volume. I suspect there might be a race condition in the attach/detach controller.
The issue was that when the cluster is upgraded, the old machine is deleted/removed from the API server. Then the master would 1) be unable to correctly update the list of attached volumes and 2) not be able to detach the volume with the node object missing from the API server. This PR fixes those bugs.
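As a rough illustration of how a controller can notice that a node object was removed from the API server, a generic client-go informer sketch follows; the delete handler body is illustrative and not the attach/detach controller's actual code.

```go
package controller

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// watchNodeDeletes wires a delete handler so a controller can react when a
// node object disappears from the API server (e.g. deleted during an upgrade).
func watchNodeDeletes(client kubernetes.Interface, stopCh <-chan struct{}) {
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	nodeInformer := factory.Core().V1().Nodes().Informer()
	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		DeleteFunc: func(obj interface{}) {
			node, ok := obj.(*v1.Node)
			if !ok {
				// Deletes can arrive wrapped in a tombstone if an event
				// was missed while disconnected.
				tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
				if !ok {
					return
				}
				if node, ok = tombstone.Obj.(*v1.Node); !ok {
					return
				}
			}
			// Here the attach/detach controller would mark the volumes
			// attached to this node as detachable without waiting for the
			// (now impossible) safe-to-detach status update.
			fmt.Printf("node %s deleted; its volumes may be detached\n", node.Name)
		},
	})
	factory.Start(stopCh)
	cache.WaitForCacheSync(stopCh, nodeInformer.HasSynced)
}
```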
GCE e2e build/test passed for commit 0c72568.
The PR LGTM, but I am not very sure about the logic of setting verifySafeToDetach: in which situations we should verify and in which we don't need to. But we can discuss that in a separate thread.
Yes, we can follow up offline. Adding the LGTM label to get this merged.
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]
GCE e2e build/test passed for commit 0c72568.
Automatic merge from submit-queue
@saad-ali Given the comment above, do you still want to cherry-pick this PR? Or should I wait?
Yes, this PR fixes an existing issue and needs to be cherry-picked (I'll prepare one). We'll debug and triage any other issues that we discover independently of this.
Commit found in the "release-1.3" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked.
…ck-of-#30737-upstream-release-1.3
Automatic merge from submit-queue

Automated cherry pick of kubernetes#30737 upstream release 1.3

Automated cherry pick of PR kubernetes#30737 ("Skip safe to detach if node api obj doesn't exist") to upstream release 1.3.
Fixes #29358