Fix race between MountVolume and UnmountDevice #71074
Conversation
/priority important-soon
@@ -166,6 +166,10 @@ func (rc *reconciler) reconcile() {
 	// Ensure volumes that should be unmounted are unmounted.
 	for _, mountedVolume := range rc.actualStateOfWorld.GetMountedVolumes() {
 		if !rc.desiredStateOfWorld.PodExistsInVolume(mountedVolume.PodName, mountedVolume.VolumeName) {
+			if rc.operationExecutor.IsOperationPending(mountedVolume.VolumeName, nestedpendingoperations.EmptyUniquePodName) {
+				klog.V(5).Infof("Skipping UnmountVolume, device operation is in progress")
+				continue
+			}
How is this different from the check that the executor will do internally? See here:
kubernetes/pkg/volume/util/nestedpendingoperations/nestedpendingoperations.go
Lines 94 to 108 in b7e2980
func (grm *nestedPendingOperations) Run(
	volumeName v1.UniqueVolumeName,
	podName types.UniquePodName,
	generatedOperations types.GeneratedOperations) error {
	grm.lock.Lock()
	defer grm.lock.Unlock()
	opExists, previousOpIndex := grm.isOperationExists(volumeName, podName)
	if opExists {
		previousOp := grm.operations[previousOpIndex]
		// Operation already exists
		if previousOp.operationPending {
			// Operation is pending
			operationName := getOperationName(volumeName, podName)
			return NewAlreadyExistsError(operationName)
		}
When I asked about it in #70319 and later in a CSI WG meeting, @saad-ali said that the executor will prevent running multiple operations against the same volume in parallel.

It wasn't obvious how that works and I was still worried that the executor would merely serialize operations. But after looking at the code above, my interpretation is that it will discard new operations instead of queuing them, so it should do something similar to what you propose here.

But I am not entirely sure - perhaps your check is stricter, or I misunderstood something.
The difference is podName - (Un)MountDevice operation uses empty pod name, it's global for all pods, while (Un)MountVolume uses pod names. IMO, (volumeName, podName) is the primary key in the operation map, so multiple MountVolumes can run in parallel on the same volume.
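For illustration, here is a minimal, self-contained sketch of a pending-operation map keyed on (volumeName, podName); it is a simplified toy model, not the real nestedpendingoperations code, and the names opKey and pendingOps are made up. It only shows why a device-wide operation (empty pod name) and a per-pod operation on the same volume do not collide under this key; the reconciler check added by this PR therefore asks for the device-wide key explicitly via IsOperationPending(volumeName, nestedpendingoperations.EmptyUniquePodName).

package main

import "fmt"

// opKey models the (volumeName, podName) pair that identifies a pending operation.
// An empty podName stands for a device-wide operation such as MountDevice/UnmountDevice.
type opKey struct {
	volumeName string
	podName    string
}

type pendingOps map[opKey]bool

// isPending reports whether an operation with exactly this key is in flight.
func (p pendingOps) isPending(volumeName, podName string) bool {
	return p[opKey{volumeName, podName}]
}

func main() {
	ops := pendingOps{}

	// A device-wide UnmountDevice is running (empty pod name).
	ops[opKey{volumeName: "vol-1", podName: ""}] = true

	// A per-pod MountVolume check uses a different key, so nothing is
	// pending under it - which is why both can be in flight at once.
	fmt.Println(ops.isPending("vol-1", "pod-a")) // false
	fmt.Println(ops.isPending("vol-1", ""))      // true
}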
/retest
Jan Šafránek <notifications@github.com> writes:
> The difference is `podName` - (Un)MountDevice operation uses empty pod
> name, it's global for all pods, while (Un)MountVolume uses pod
> names. IMO, (volumeName, podName) is the primary key in the operation
> map, so multiple MountVolumes can run in parallel on the same volume.

Then let me come back to my question from #70319 (comment): is there ever a situation where there should be more than one operation in flight *per volume*? Note that I meant "should be" as in "should be allowed to be".

@saad-ali's answer was no, but in practice it was allowed because the check was based on volume + pod. That check has been shown to be insufficient for one case. Can we be sure that this case (which gets addressed with your PR) is the only one?
I genuinely don't know.
-	volumeObj.devicePath = devicePath
+	if devicePath != "" {
+		volumeObj.devicePath = devicePath
+	}
Do we really need this? If we prevent concurrent operations on the same volume, won't a pod that uses the same volume pretty much cause another MountDevice operation?
Yes, we need it. Consider UnmountDevice is in progress and a new pod arrives.

- UnmountDevice finishes. This calls MarkDeviceAsUnmounted, which clears the devicePath, i.e. calls SetVolumeGloballyMounted(..., devicePath="", ...).
- New MountDevice (WaitForAttach) starts and sees an empty devicePath -> problems.
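To make the sequence concrete, here is a minimal, self-contained sketch; volumeState and setGloballyMounted are made-up stand-ins for the real actual-state-of-world code, whose exact signatures are not shown in this thread, but the guard is the same one as in the diff above: an empty devicePath does not overwrite the recorded one, so a subsequent MountDevice still finds a usable path.

package main

import "fmt"

// volumeState is a toy stand-in for the actual-state-of-world entry of one volume.
type volumeState struct {
	globallyMounted bool
	devicePath      string
}

// setGloballyMounted mirrors the guarded update from the diff above:
// an empty devicePath must not overwrite a previously recorded one.
func (v *volumeState) setGloballyMounted(mounted bool, devicePath string) {
	v.globallyMounted = mounted
	if devicePath != "" {
		v.devicePath = devicePath
	}
}

func main() {
	v := &volumeState{}

	// MountDevice recorded the device path earlier.
	v.setGloballyMounted(true, "/dev/xvdf")

	// UnmountDevice finishes and marks the device as unmounted; it has no
	// device path to report, so it passes "".
	v.setGloballyMounted(false, "")

	// A subsequent MountDevice (WaitForAttach) still sees the recorded
	// path instead of an empty string.
	fmt.Println(v.devicePath) // /dev/xvdf
}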
Sorry to put a comment this late. Very good catch of this bug. I think that to avoid such bugs, it is better not to reuse SetVolumeGloballyMounted for both MarkDeviceAsMounted and MarkDeviceAsUnmounted (in the unmounted case, the function only needs to set the bool globallyMounted to false, so there is no need to pass devicePath or deviceMountPath).
I am not sure. However, in this case we did not see two operations in parallel, we saw UnmountDevice running and MountVolume enqueued. MountVolume was started after UnmountDevice finished and that is wrong - the device is not mounted, so MountVolume has nothing to bind-mount. We can:

- Either not enqueue MountVolume operation when UnmountDevice is in progress (that's this PR)
- Or update MountVolume operation to check if MountDevice was completed.
Jan Šafránek <notifications@github.com> writes:
>> is there ever a situation where there should be more than one
>> operation in flight *per volume*? Note that I meant "should be" as in
>> "should be allowed to be".
>
> I am not sure. However, in this case we did not see two operations in
> parallel, we saw UnmountDevice running and MountVolume
> enqueued. MountVolume was started after UnmountDevice finished and
> *that* is wrong - the device is not mounted, so MountVolume has
> nothing to bind-mount. We can:
>
> * Either not enqueue MountVolume operation when UnmountDevice is in progress (that's this PR)
> * Or update MountVolume operation to check if MountDevice was completed.
Third option:

* make the check in the executor stricter by only using the volume as key when checking for existing operations (a rough sketch follows below)

The result in this case would have been that the MountVolume operation would have been discarded by the executor itself, instead of having to add that logic to the reconciler.

This would also catch other errors where operations are enqueued that shouldn't be enqueued because the pending operation will change the state of the world. I'd expect the reconciler to simply re-create operations that are still needed once the running operation is done.

But perhaps such a change has negative effects in cases where (currently) pending or parallel operations are possible and desirable. <shrug>
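As a rough sketch of that third option, here is a minimal, self-contained toy model; pendingByVolume, tryStart and finish are made-up names for illustration, not a proposal for the actual nestedpendingoperations API. It only shows the behavior being discussed: any in-flight operation on a volume causes new operations on that volume to be discarded rather than queued.

package main

import (
	"errors"
	"fmt"
)

var errAlreadyExists = errors.New("operation already pending for this volume")

// pendingByVolume is a toy executor that keys pending operations on the
// volume name alone and ignores the pod name entirely.
type pendingByVolume struct {
	pending map[string]bool
}

// tryStart discards (returns an error for) a new operation whenever any
// operation on the same volume is still in flight, instead of queuing it.
func (e *pendingByVolume) tryStart(volumeName string) error {
	if e.pending[volumeName] {
		return errAlreadyExists
	}
	e.pending[volumeName] = true
	return nil
}

// finish marks the volume as free again once the running operation is done.
func (e *pendingByVolume) finish(volumeName string) {
	delete(e.pending, volumeName)
}

func main() {
	e := &pendingByVolume{pending: map[string]bool{}}

	_ = e.tryStart("vol-1")    // UnmountDevice starts
	err := e.tryStart("vol-1") // MountVolume for a new pod is discarded
	fmt.Println(err)           // operation already pending for this volume

	e.finish("vol-1")                       // UnmountDevice is done
	fmt.Println(e.tryStart("vol-1") == nil) // true: the reconciler would simply retry later
}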
UnmountDevice must not clear devicePath, because such devicePath may come from node.status (e.g. on AWS) and a subsequent MountDevice operation (that may already be enqueued) needs it.
Force-pushed from fe6a419 to de9689b.
It seems I was wrong. Only one [...]

So operation checks in this PR are useless and only `devicePath` check is needed. I reworked the PR, now it's one `if`.
Jan Šafránek <notifications@github.com> writes:
> So operation checks in this PR are useless and only `devicePath` check
> is needed. I reworked the PR, now it's one `if`.

Is the effect different from what I have in #70746?

SetVolumeGloballyMounted is a function with poorly defined semantics, and adding yet another special case doesn't make it better. IMHO deleting it and moving the relevant code to MarkDeviceAsMounted and MarkDeviceAsUnmounted is cleaner. Just my 2 cents of course, I don't care which PR gets merged (if any).
Please add unit tests to this function and reconciler like I suggested here: #70746 (comment)
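To illustrate the kind of test being requested, here is a minimal, self-contained sketch in Go; it exercises a toy stand-in (setDevicePath) for the guarded devicePath update rather than the real actualStateOfWorld or reconciler APIs, whose exact signatures are not quoted in this thread, so all names in it are assumptions.

package volumestate

import "testing"

// setDevicePath mirrors the guarded update under test: an empty devicePath
// must not overwrite a previously recorded one.
func setDevicePath(current, updated string) string {
	if updated != "" {
		return updated
	}
	return current
}

func TestDevicePathPreservedAcrossUnmountDevice(t *testing.T) {
	tests := []struct {
		name     string
		current  string
		updated  string
		expected string
	}{
		{"MountDevice records the path", "", "/dev/xvdf", "/dev/xvdf"},
		{"UnmountDevice passes an empty path, old one is kept", "/dev/xvdf", "", "/dev/xvdf"},
		{"a new path replaces the old one", "/dev/xvdf", "/dev/xvdg", "/dev/xvdg"},
	}
	for _, tc := range tests {
		if got := setDevicePath(tc.current, tc.updated); got != tc.expected {
			t.Errorf("%s: got %q, want %q", tc.name, got, tc.expected)
		}
	}
}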
Force-pushed from de9689b to 5283537.
Unit test added. It fails without changes in [...]
/lgtm but I also think that we should also merge #71095 because, while we have fixed one race here, not relying on [...]
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jsafrane, saad-ali

The full list of commands accepted by this bot can be found here. The pull request process is described here.
/milestone v1.13

sgtm
…4-upstream-release-1.12 Automated cherry pick of #71074: Fixed clearing of devicePath after UnmountDevice
…4-upstream-release-1.11 Automated cherry pick of #71074: Fixed clearing of devicePath after UnmountDevice
Is there any chance of getting this merged to 1.10?
No, Kubernetes supports only 3 releases at a time (1.11 - 1.13).
Fix race between MountVolume and UnmountDevice (non-official backport to 1.10.5)

FYI, this issue is fixed by kubernetes#71074.

k8s version | fixed version
v1.10       | no fix
v1.11       | 1.11.7
v1.12       | 1.12.5
v1.13       | no such issue

See merge request !55
When kubelet receives a new pod while UnmountDevice is in progress, it should not enqueue a MountVolume operation - the volume device is being unmounted right now.

In addition, don't clear devicePath after UnmountDevice, because a subsequent MountDevice may need it.
/kind bug
/sig storage
Fixes #65246
Does this PR introduce a user-facing change?: