-
Notifications
You must be signed in to change notification settings - Fork 40.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix race condition issue when detaching azure disk #60183
fix race condition issue when detaching azure disk #60183
Conversation
@khenidak I think lots of |
fix build error
630269a
to
f3324a6
Compare
Looking good on my side. |
@djsly just replace the controller manager, thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with the changes.
Let's wait a while for @djsly's validation.
@@ -268,7 +268,7 @@ func (d *azureDiskDetacher) Detach(diskURI string, nodeName types.NodeName) erro | |||
return fmt.Errorf("invalid disk to detach: %q", diskURI) | |||
} | |||
|
|||
_, err := d.cloud.InstanceID(context.TODO(), nodeName) | |||
instanceid, err := d.cloud.InstanceID(context.TODO(), nodeName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
context.TODO() seems like a mistake here? is there a context we can attach to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@brendanburns it's related to #59287, there is some future work item for context.TODO()
@andyzhangx here are the logs for my latest test after I applied the patch on the 1.7 branch. https://gist.github.com/djsly/d04edbd3239373f345ca27616ddd5132 Things are looking good! Except the Multi-Attach error. This seems to be caused by the reconciler. It last ~2m30sec until the disks are unmounted and the first detach kicks in. Could you confirm that those warnings do not involve the ARM api ?
|
Also, could we have this PR cherry picked all the way to 1.7 ? thanks! |
/lgtm @jdumars can you shepard this into 1.7.x and 1.8.x? Thanks! |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andyzhangx, brendandburns The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Removing label |
@djsly I checked the code again, current I will cherry pick this PR to v1.7-1.9. |
My pleasure and thanks for the quick PR. |
@karataliu @rootfs FYI |
Automatic merge from submit-queue (batch tested with PRs 60208, 60084, 60183, 59713, 60096). If you want to cherry-pick this change to another branch, please follow the instructions here. |
Removing label |
Commit found in the "release-1.9" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked. |
What this PR does / why we need it:
add lock before detaching azure disk, without this PR, there would be lots of
Multi-Attach error
when scheduling one pod from one node to another.Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Fixes #60101
Special notes for your reviewer:
@feiskyer @djsly @khenidak
Since we are using
getLunMutex.LockKey(instanceid)
for both AttachDisk and DetachDisk, there would be only one VM.update operation at a time for both AttachDisk and DetachDisk.Release note:
/assign @feiskyer
Could you also mark as v1.10 milestone @feiskyer thanks.
/sig azure