fix unmount failure for SMB volume in `host is down` state #101305

andyzhangx · 2021-04-21T03:51:27Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

fix unmount failure for SMB volume in Host is down state

When the SMB server is down, there is no way to terminate a pod that is using the SMB mount; kubelet reports the error shown below. This PR treats `host is down` as a corrupted mount dir and skips the rest of the UnmountVolume.TearDown process in that case.

original error

Apr 20 11:11:52 aks-nonzone-17963928-vmss000000 kubelet[8516]: E0420 11:11:52.618206    8516 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/smb.csi.k8s.io^pvc-da830ab0-e8d2-4e0e-89f4-906a4e4398a4 podName:90ee5993-7f80-4fe3-9d88-7bc89b19cbfe nodeName:}" failed. No retries permitted until 2021-04-20 11:11:56.61815874 +0000 UTC m=+337.879942743 (durationBeforeRetry 4s). Error: "UnmountVolume.TearDown failed for volume \"persistent-storage\" (UniqueName: \"kubernetes.io/csi/smb.csi.k8s.io^pvc-da830ab0-e8d2-4e0e-89f4-906a4e4398a4\") pod \"90ee5993-7f80-4fe3-9d88-7bc89b19cbfe\" (UID: \"90ee5993-7f80-4fe3-9d88-7bc89b19cbfe\") : kubernetes.io/csi: mounter.TearDownAt failed to clean mount dir [/var/lib/kubelet/pods/90ee5993-7f80-4fe3-9d88-7bc89b19cbfe/volumes/kubernetes.io~csi/pvc-da830ab0-e8d2-4e0e-89f4-906a4e4398a4/mount]: stat /var/lib/kubelet/pods/90ee5993-7f80-4fe3-9d88-7bc89b19cbfe/volumes/kubernetes.io~csi/pvc-da830ab0-e8d2-4e0e-89f4-906a4e4398a4/mount: host is down"

With this PR

Apr 20 11:24:25 aks-nonzone-17963928-vmss000000 kubelet[18337]: E0420 11:24:25.770173   18337 kubelet_volumes.go:65] pod "90ee5993-7f80-4fe3-9d88-7bc89b19cbfe" found, but error fail to check mount point "/var/lib/kubelet/pods/90ee5993-7f80-4fe3-9d88-7bc89b19cbfe/volumes/kubernetes.io~csi/pvc-da830ab0-e8d2-4e0e-89f4-906a4e4398a4/mount": stat /var/lib/kubelet/pods/90ee5993-7f80-4fe3-9d88-7bc89b19cbfe/volumes/kubernetes.io~csi/pvc-da830ab0-e8d2-4e0e-89f4-906a4e4398a4/mount: host is down occurred during checking mounted volumes from disk
Apr 20 11:24:32 aks-nonzone-17963928-vmss000000 kubelet[18337]: E0420 11:24:32.938143   18337 csi_mounter.go:409] kubernetes.io/csi: isDirMounted IsLikelyNotMountPoint test failed for dir [/var/lib/kubelet/pods/90ee5993-7f80-4fe3-9d88-7bc89b19cbfe/volumes/kubernetes.io~csi/pvc-da830ab0-e8d2-4e0e-89f4-906a4e4398a4/mount]
Apr 20 11:24:43 aks-nonzone-17963928-vmss000000 kubelet[18337]: W0420 11:24:43.178229   18337 csi_mounter.go:368] kubernetes.io/csi: dir[/var/lib/kubelet/pods/90ee5993-7f80-4fe3-9d88-7bc89b19cbfe/volumes/kubernetes.io~csi/pvc-da830ab0-e8d2-4e0e-89f4-906a4e4398a4/mount] is corrupted, error: stat /var/lib/kubelet/pods/90ee5993-7f80-4fe3-9d88-7bc89b19cbfe/volumes/kubernetes.io~csi/pvc-da830ab0-e8d2-4e0e-89f4-906a4e4398a4/mount: host is down, skip mount dir removal

Which issue(s) this PR fixes:

Fixes kubernetes-csi/csi-driver-smb#64

Special notes for your reviewer:

Does this PR introduce a user-facing change?

fix unmount failure for SMB volume in `Host is down` state

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

fix unmount failure for SMB volume in `Host is down` state

/assign @msau42
/kind bug
/priority important-soon
/sig storage
/triage accepted

k8s-ci-robot · 2021-04-21T03:52:02Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: andyzhangx
To complete the pull request process, please assign jsafrane after the PR has been reviewed.
You can assign the PR to them by writing /assign @jsafrane in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

andyzhangx · 2021-04-21T04:59:21Z

/retest
/test pull-kubernetes-e2e-aks-engine-azure-disk-vmss
/test pull-kubernetes-e2e-aks-engine-azure-disk-windows-dockershim
/test pull-kubernetes-e2e-aks-engine-azure-file-windows-containerd
/test pull-kubernetes-e2e-aks-engine-azure-file

andyzhangx · 2021-04-21T06:58:00Z

/test pull-kubernetes-e2e-aks-engine-azure-file-windows-dockershim

andyzhangx · 2021-04-21T06:58:16Z

/test pull-kubernetes-e2e-kind-ipv6

k8s-ci-robot · 2021-04-21T07:31:35Z

@andyzhangx: The following tests failed, say /retest to rerun all failed tests:

Test name	Commit	Details	Rerun command
pull-kubernetes-e2e-aks-engine-azure-file-windows-containerd	`e1e94cf`	link	`/test pull-kubernetes-e2e-aks-engine-azure-file-windows-containerd`
pull-kubernetes-e2e-kind-ipv6	`e1e94cf`	link	`/test pull-kubernetes-e2e-kind-ipv6`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

gnufied · 2021-04-21T15:14:59Z

pkg/volume/csi/csi_mounter.go

+		if isCorruptedDir(dir) {
+			klog.Warningf(log("dir[%s] is corrupted, error: %v, skip mount dir removal", dir, err))
+			return nil
+		}
 		return errors.New(log("mounter.TearDownAt failed to clean mount dir [%s]: %v", dir, err))


In other places - if we can't clean up the mountDir because of stat failures - then we don't usually finish the teardown process and we are in fact letting the pod hang on purpose.

I think at the very minimum this would result in orphaned pod directories left over on the node.

That's the original behavior if the SMB server is down: the pod would be stuck in the terminating state forever. Shall we fix this issue or mark it as by design?

Yeah, I am not sure. We do the same thing (i.e. leaving the pod hanging forever) for NFS too. I feel like we should do a better job and somehow standardize around this. This also means that code that touches unmount should have some safety built in.

@drigz @andyzhangx the question is - do we even have to change this code at all? This code affects all CSI drivers and as per CSI spec - the removal of mount point directory should be done by the driver. I know traditionally before we fixed it, it was the kubelet which was creating the node-publish path, but it was a bug.

So after we fix that - shouldn't the CSI driver remove the publish path on NodeUnpublish and then just leave the cleanup of parent directories to the kubelet? And in which case - kubelet should have no trouble removing the path if NodeUnpublish succeeded.

see the CSI spec - https://github.com/container-storage-interface/spec/blob/master/spec.md#nodeunpublishvolume

The SP MUST delete the file or directory it created at this path. // This is a REQUIRED field.

so with #101332, we should remove Line 369 in this PR?

following mount dir removal is not necessary?

if err := removeMountDir(c.plugin, dir); err != nil { ... }

that call also removes the parent directory, but removal of dir by the kubelet is not necessary and should be done by the driver. Having said that - I took a look at existing drivers and I think most of them don't remove the publish_path correctly on NodeUnpublish - so we can't remove that code right away but we have to capture this via release notes and give time to users.

drigz · 2021-04-21T16:15:53Z

Hope you don't mind a drive-by comment: in the case of "host is down", I believe stat will fail while the volume is mounted, but umount will work, and stat will work after that point. Could kubelet umount and then stat/delete in this case?

If the existing behavior can't be changed, I think we'll need a potentially-dangerous workaround that automatically does the umount outside of kubelet, as the terminating pods block our rollouts from proceeding.

gnufied · 2021-04-21T17:45:37Z

I filed kubernetes-csi/csi-test#336 and #101332 as follow up items to move some of this responsibility into the driver.

andyzhangx · 2021-04-23T02:56:17Z

Closing this PR; will fix IsCorruptedMnt in the mount-utils folder first: #101398

fix unmount failure for SMB volume in Host is down state

e1e94cf

k8s-ci-robot assigned msau42 Apr 21, 2021

andyzhangx mentioned this pull request Apr 21, 2021

fix: set "host is down" as corrupted mount kubernetes/utils#203

Closed

andyzhangx mentioned this pull request Apr 21, 2021

When mount dies, it is not remounted kubernetes-csi/csi-driver-smb#164

Closed

k8s-ci-robot requested review from gnufied and humblec April 21, 2021 03:52

andyzhangx mentioned this pull request Apr 21, 2021

failed to unmount due to "host is down" kubernetes-csi/csi-driver-smb#64

Closed

andyzhangx changed the title ~~fix unmount failure for SMB volume in Host is down state~~ fix unmount failure for SMB volume in host is down state Apr 21, 2021

gnufied reviewed Apr 21, 2021

gnufied mentioned this pull request Apr 21, 2021

Deprecate and remove behaviour of kubelet removing CSI nodepublish path #101332

Closed

andyzhangx closed this Apr 23, 2021


Inline review comments, in order: gnufied (Apr 21, 2021), andyzhangx (Apr 21, 2021), gnufied (Apr 21, 2021), gnufied (Apr 21, 2021, edited), gnufied (Apr 21, 2021), andyzhangx (Apr 22, 2021, edited), gnufied (Apr 23, 2021)
