Fix race between MountVolume and UnmountDevice #71074
Conversation
/priority important-soon
@@ -166,6 +166,10 @@ func (rc *reconciler) reconcile() {
 	// Ensure volumes that should be unmounted are unmounted.
 	for _, mountedVolume := range rc.actualStateOfWorld.GetMountedVolumes() {
 		if !rc.desiredStateOfWorld.PodExistsInVolume(mountedVolume.PodName, mountedVolume.VolumeName) {
+			if rc.operationExecutor.IsOperationPending(mountedVolume.VolumeName, nestedpendingoperations.EmptyUniquePodName) {
+				klog.V(5).Infof("Skipping UnmountVolume, device operation is in progress")
+				continue
+			}
How is this different from the check that the executor will do internally? See here:
kubernetes/pkg/volume/util/nestedpendingoperations/nestedpendingoperations.go
Lines 94 to 108 in b7e2980
func (grm *nestedPendingOperations) Run(
	volumeName v1.UniqueVolumeName,
	podName types.UniquePodName,
	generatedOperations types.GeneratedOperations) error {
	grm.lock.Lock()
	defer grm.lock.Unlock()
	opExists, previousOpIndex := grm.isOperationExists(volumeName, podName)
	if opExists {
		previousOp := grm.operations[previousOpIndex]
		// Operation already exists
		if previousOp.operationPending {
			// Operation is pending
			operationName := getOperationName(volumeName, podName)
			return NewAlreadyExistsError(operationName)
		}
When I asked about it in #70319 and later in a CSI WG meeting, @saad-ali said that the executor will prevent running multiple operations against the same volume in parallel.

It wasn't obvious how that works and I was still worried that the executor would merely serialize operations. But after looking at the code above, my interpretation is that it will discard new operations instead of queuing them, so it should do something similar to what you propose here.

But I am not entirely sure - perhaps your check is stricter, or I misunderstood something.
The difference is podName - (Un)MountDevice operation uses empty pod name, it's global for all pods, while (Un)MountVolume uses pod names. IMO, (volumeName, podName) is the primary key in the operation map, so multiple MountVolumes can run in parallel on the same volume.
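For illustration, here is a minimal, self-contained sketch of a pending-operation map keyed on (volumeName, podName); it is a simplified toy model, not the real nestedpendingoperations code, and the names opKey and pendingOps are made up. It only shows why a device-wide operation (empty pod name) and a per-pod operation on the same volume do not collide under this key; the reconciler check added by this PR therefore asks for the device-wide key explicitly via IsOperationPending(volumeName, nestedpendingoperations.EmptyUniquePodName).

package main

import "fmt"

// opKey models the (volumeName, podName) pair that identifies a pending operation.
// An empty podName stands for a device-wide operation such as MountDevice/UnmountDevice.
type opKey struct {
	volumeName string
	podName    string
}

type pendingOps map[opKey]bool

// isPending reports whether an operation with exactly this key is in flight.
func (p pendingOps) isPending(volumeName, podName string) bool {
	return p[opKey{volumeName, podName}]
}

func main() {
	ops := pendingOps{}

	// A device-wide UnmountDevice is running (empty pod name).
	ops[opKey{volumeName: "vol-1", podName: ""}] = true

	// A per-pod MountVolume check uses a different key, so nothing is
	// pending under it - which is why both can be in flight at once.
	fmt.Println(ops.isPending("vol-1", "pod-a")) // false
	fmt.Println(ops.isPending("vol-1", ""))      // true
}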
/retest
Jan Šafránek <notifications@github.com> writes:
> The difference is `podName` - (Un)MountDevice operation uses empty pod
> name, it's global for all pods, while (Un)MountVolume uses pod
> names. IMO, (volumeName, podName) is the primary key in the operation
> map, so multiple MountVolumes can run in parallel on the same volume.

Then let me come back to my question from #70319 (comment): is there ever a situation where there should be more than one operation in flight *per volume*? Note that I meant "should be" as in "should be allowed to be".

@saad-ali's answer was no, but in practice it was allowed because the check was based on volume + pod. That check has been shown to be insufficient for one case. Can we be sure that this case (which gets addressed with your PR) is the only one?
I genuinely don't know.
-	volumeObj.devicePath = devicePath
+	if devicePath != "" {
+		volumeObj.devicePath = devicePath
+	}
Do we really need this? If we prevent concurrent operations on the same volume, won't a pod that uses the same volume pretty much cause another MountDevice operation?
Yes, we need it. Consider UnmountDevice is in progress and a new pod arrives.

- UnmountDevice finishes. This calls MarkDeviceAsUnmounted, which clears the devicePath, i.e. calls SetVolumeGloballyMounted(..., devicePath="", ...).
- New MountDevice (WaitForAttach) starts and sees an empty devicePath -> problems.
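To make the sequence concrete, here is a minimal, self-contained sketch; volumeState and setGloballyMounted are made-up stand-ins for the real actual-state-of-world code, whose exact signatures are not shown in this thread, but the guard is the same one as in the diff above: an empty devicePath does not overwrite the recorded one, so a subsequent MountDevice still finds a usable path.

package main

import "fmt"

// volumeState is a toy stand-in for the actual-state-of-world entry of one volume.
type volumeState struct {
	globallyMounted bool
	devicePath      string
}

// setGloballyMounted mirrors the guarded update from the diff above:
// an empty devicePath must not overwrite a previously recorded one.
func (v *volumeState) setGloballyMounted(mounted bool, devicePath string) {
	v.globallyMounted = mounted
	if devicePath != "" {
		v.devicePath = devicePath
	}
}

func main() {
	v := &volumeState{}

	// MountDevice recorded the device path earlier.
	v.setGloballyMounted(true, "/dev/xvdf")

	// UnmountDevice finishes and marks the device as unmounted; it has no
	// device path to report, so it passes "".
	v.setGloballyMounted(false, "")

	// A subsequent MountDevice (WaitForAttach) still sees the recorded
	// path instead of an empty string.
	fmt.Println(v.devicePath) // /dev/xvdf
}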
Sorry to put a comment this late. Very good catch of this bug. I think that to avoid such bugs, it is better not to reuse SetVolumeGloballyMounted for both MarkDeviceAsMounted and MarkDeviceAsUnmounted (in the unmounted case, the function only needs to set the bool globallyMounted to false, so there is no need to pass devicePath or deviceMountPath).
I am not sure. However, in this case we did not see two operations in parallel, we saw UnmountDevice running and MountVolume enqueued. MountVolume was started after UnmountDevice finished and that is wrong - the device is not mounted, so MountVolume has nothing to bind-mount. We can:

- Either not enqueue MountVolume operation when UnmountDevice is in progress (that's this PR)
- Or update MountVolume operation to check if MountDevice was completed.
Jan Šafránek <notifications@github.com> writes:
>> is there ever a situation where there should be more than one
>> operation in flight *per volume*? Note that I meant "should be" as in
>> "should be allowed to be".
>
> I am not sure. However, in this case we did not see two operations in
> parallel, we saw UnmountDevice running and MountVolume
> enqueued. MountVolume was started after UnmountDevice finished and
> *that* is wrong - the device is not mounted, so MountVolume has
> nothing to bind-mount. We can:
>
> * Either not enqueue MountVolume operation when UnmountDevice is in progress (that's this PR)
> * Or update MountVolume operation to check if MountDevice was completed.
Third option:

* make the check in the executor stricter by only using the volume as key when checking for existing operations (a rough sketch follows below)

The result in this case would have been that the MountVolume operation would have been discarded by the executor itself, instead of having to add that logic to the reconciler.

This would also catch other errors where operations are enqueued that shouldn't be enqueued because the pending operation will change the state of the world. I'd expect the reconciler to simply re-create operations that are still needed once the running operation is done.

But perhaps such a change has negative effects in cases where (currently) pending or parallel operations are possible and desirable. <shrug>
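As a rough sketch of that third option, here is a minimal, self-contained toy model; pendingByVolume, tryStart and finish are made-up names for illustration, not a proposal for the actual nestedpendingoperations API. It only shows the behavior being discussed: any in-flight operation on a volume causes new operations on that volume to be discarded rather than queued.

package main

import (
	"errors"
	"fmt"
)

var errAlreadyExists = errors.New("operation already pending for this volume")

// pendingByVolume is a toy executor that keys pending operations on the
// volume name alone and ignores the pod name entirely.
type pendingByVolume struct {
	pending map[string]bool
}

// tryStart discards (returns an error for) a new operation whenever any
// operation on the same volume is still in flight, instead of queuing it.
func (e *pendingByVolume) tryStart(volumeName string) error {
	if e.pending[volumeName] {
		return errAlreadyExists
	}
	e.pending[volumeName] = true
	return nil
}

// finish marks the volume as free again once the running operation is done.
func (e *pendingByVolume) finish(volumeName string) {
	delete(e.pending, volumeName)
}

func main() {
	e := &pendingByVolume{pending: map[string]bool{}}

	_ = e.tryStart("vol-1")    // UnmountDevice starts
	err := e.tryStart("vol-1") // MountVolume for a new pod is discarded
	fmt.Println(err)           // operation already pending for this volume

	e.finish("vol-1")                       // UnmountDevice is done
	fmt.Println(e.tryStart("vol-1") == nil) // true: the reconciler would simply retry later
}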
UnmountDevice must not clear devicePath, because such devicePath may come from node.status (e.g. on AWS) and a subsequent MountDevice operation (that may already be enqueued) needs it.
Force-pushed from fe6a419 to de9689b.
It seems I was wrong. Only one [...]

So operation checks in this PR are useless and only `devicePath` check is needed. I reworked the PR, now it's one `if`.
Jan Šafránek <notifications@github.com> writes:
> So operation checks in this PR are useless and only `devicePath` check
> is needed. I reworked the PR, now it's one `if`.

Is the effect different from what I have in #70746?

SetVolumeGloballyMounted is a function with poorly defined semantics, and adding yet another special case doesn't make it better. IMHO deleting it and moving the relevant code to MarkDeviceAsMounted and MarkDeviceAsUnmounted is cleaner. Just my 2 cents of course, I don't care which PR gets merged (if any).
Please add unit tests to this function and reconciler like I suggested here: #70746 (comment)
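To illustrate the kind of test being requested, here is a minimal, self-contained sketch in Go; it exercises a toy stand-in (setDevicePath) for the guarded devicePath update rather than the real actualStateOfWorld or reconciler APIs, whose exact signatures are not quoted in this thread, so all names in it are assumptions.

package volumestate

import "testing"

// setDevicePath mirrors the guarded update under test: an empty devicePath
// must not overwrite a previously recorded one.
func setDevicePath(current, updated string) string {
	if updated != "" {
		return updated
	}
	return current
}

func TestDevicePathPreservedAcrossUnmountDevice(t *testing.T) {
	tests := []struct {
		name     string
		current  string
		updated  string
		expected string
	}{
		{"MountDevice records the path", "", "/dev/xvdf", "/dev/xvdf"},
		{"UnmountDevice passes an empty path, old one is kept", "/dev/xvdf", "", "/dev/xvdf"},
		{"a new path replaces the old one", "/dev/xvdf", "/dev/xvdg", "/dev/xvdg"},
	}
	for _, tc := range tests {
		if got := setDevicePath(tc.current, tc.updated); got != tc.expected {
			t.Errorf("%s: got %q, want %q", tc.name, got, tc.expected)
		}
	}
}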
Force-pushed from de9689b to 5283537.
Unit test added. It fails without changes in [...]
/lgtm but I also think that we should also merge #71095 because, while we have fixed one race here, not relying on [...]
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jsafrane, saad-ali

The full list of commands accepted by this bot can be found here. The pull request process is described here.
/milestone v1.13

sgtm
…4-upstream-release-1.12 Automated cherry pick of #71074: Fixed clearing of devicePath after UnmountDevice
…4-upstream-release-1.11 Automated cherry pick of #71074: Fixed clearing of devicePath after UnmountDevice
Is there any chance of getting this merged to 1.10?
No, Kubernetes supports only 3 releases at a time (1.11 - 1.13).
Fix race between MountVolume and UnmountDevice (non-official backport to 1.10.5)

FYI, this issue is fixed by kubernetes#71074.

k8s version | fixed version
v1.10       | no fix
v1.11       | 1.11.7
v1.12       | 1.12.5
v1.13       | no such issue

See merge request !55
When kubelet receives a new pod while UnmountDevice is in progress, it should not enqueue a MountVolume operation - the volume device is being unmounted right now.

In addition, don't clear devicePath after UnmountDevice, because a subsequent MountDevice may need it.
/kind bug
/sig storage
Fixes #65246
Does this PR introduce a user-facing change?: