Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix unmount failure for SMB volume in host is down state #101305

Closed
wants to merge 1 commit into from

Conversation

andyzhangx
Copy link
Member

@andyzhangx andyzhangx commented Apr 21, 2021

What type of PR is this?

/kind bug

What this PR does / why we need it:

fix unmount failure for SMB volume in Host is down state

When SMB server is down, there is no way to terminate pod which is using SMB mount, would get following error. This PR regard host is down as corrupted mount dir, and then skip UnmountVolume.TearDown process if host is down

  • original error
Apr 20 11:11:52 aks-nonzone-17963928-vmss000000 kubelet[8516]: E0420 11:11:52.618206    8516 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/smb.csi.k8s.io^pvc-da830ab0-e8d2-4e0e-89f4-906a4e4398a4 podName:90ee5993-7f80-4fe3-9d88-7bc89b19cbfe nodeName:}" failed. No retries permitted until 2021-04-20 11:11:56.61815874 +0000 UTC m=+337.879942743 (durationBeforeRetry 4s). Error: "UnmountVolume.TearDown failed for volume \"persistent-storage\" (UniqueName: \"kubernetes.io/csi/smb.csi.k8s.io^pvc-da830ab0-e8d2-4e0e-89f4-906a4e4398a4\") pod \"90ee5993-7f80-4fe3-9d88-7bc89b19cbfe\" (UID: \"90ee5993-7f80-4fe3-9d88-7bc89b19cbfe\") : kubernetes.io/csi: mounter.TearDownAt failed to clean mount dir [/var/lib/kubelet/pods/90ee5993-7f80-4fe3-9d88-7bc89b19cbfe/volumes/kubernetes.io~csi/pvc-da830ab0-e8d2-4e0e-89f4-906a4e4398a4/mount]: stat /var/lib/kubelet/pods/90ee5993-7f80-4fe3-9d88-7bc89b19cbfe/volumes/kubernetes.io~csi/pvc-da830ab0-e8d2-4e0e-89f4-906a4e4398a4/mount: host is down"
  • With this PR
Apr 20 11:24:25 aks-nonzone-17963928-vmss000000 kubelet[18337]: E0420 11:24:25.770173   18337 kubelet_volumes.go:65] pod "90ee5993-7f80-4fe3-9d88-7bc89b19cbfe" found, but error fail to check mount point "/var/lib/kubelet/pods/90ee5993-7f80-4fe3-9d88-7bc89b19cbfe/volumes/kubernetes.io~csi/pvc-da830ab0-e8d2-4e0e-89f4-906a4e4398a4/mount": stat /var/lib/kubelet/pods/90ee5993-7f80-4fe3-9d88-7bc89b19cbfe/volumes/kubernetes.io~csi/pvc-da830ab0-e8d2-4e0e-89f4-906a4e4398a4/mount: host is down occurred during checking mounted volumes from disk
Apr 20 11:24:32 aks-nonzone-17963928-vmss000000 kubelet[18337]: E0420 11:24:32.938143   18337 csi_mounter.go:409] kubernetes.io/csi: isDirMounted IsLikelyNotMountPoint test failed for dir [/var/lib/kubelet/pods/90ee5993-7f80-4fe3-9d88-7bc89b19cbfe/volumes/kubernetes.io~csi/pvc-da830ab0-e8d2-4e0e-89f4-906a4e4398a4/mount]
Apr 20 11:24:43 aks-nonzone-17963928-vmss000000 kubelet[18337]: W0420 11:24:43.178229   18337 csi_mounter.go:368] kubernetes.io/csi: dir[/var/lib/kubelet/pods/90ee5993-7f80-4fe3-9d88-7bc89b19cbfe/volumes/kubernetes.io~csi/pvc-da830ab0-e8d2-4e0e-89f4-906a4e4398a4/mount] is corrupted, error: stat /var/lib/kubelet/pods/90ee5993-7f80-4fe3-9d88-7bc89b19cbfe/volumes/kubernetes.io~csi/pvc-da830ab0-e8d2-4e0e-89f4-906a4e4398a4/mount: host is down, skip mount dir removal

Which issue(s) this PR fixes:

Fixes kubernetes-csi/csi-driver-smb#64

Special notes for your reviewer:

Does this PR introduce a user-facing change?

fix unmount failure for SMB volume in `Host is down` state

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

fix unmount failure for SMB volume in `Host is down` state

/assign @msau42
/kind bug
/priority important-soon
/sig storage
/triage accepted

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/storage Categorizes an issue or PR as relevant to SIG Storage. triage/accepted Indicates an issue or PR is ready to be actively worked on. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 21, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: andyzhangx
To complete the pull request process, please assign jsafrane after the PR has been reviewed.
You can assign the PR to them by writing /assign @jsafrane in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@andyzhangx
Copy link
Member Author

/retest
/test pull-kubernetes-e2e-aks-engine-azure-disk-vmss
/test pull-kubernetes-e2e-aks-engine-azure-disk-windows-dockershim
/test pull-kubernetes-e2e-aks-engine-azure-file-windows-containerd
/test pull-kubernetes-e2e-aks-engine-azure-file

@andyzhangx andyzhangx changed the title fix unmount failure for SMB volume in Host is down state fix unmount failure for SMB volume in host is down state Apr 21, 2021
@andyzhangx
Copy link
Member Author

/test pull-kubernetes-e2e-aks-engine-azure-file-windows-dockershim

@andyzhangx
Copy link
Member Author

/test pull-kubernetes-e2e-kind-ipv6

@k8s-ci-robot
Copy link
Contributor

@andyzhangx: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
pull-kubernetes-e2e-aks-engine-azure-file-windows-containerd e1e94cf link /test pull-kubernetes-e2e-aks-engine-azure-file-windows-containerd
pull-kubernetes-e2e-kind-ipv6 e1e94cf link /test pull-kubernetes-e2e-kind-ipv6

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

if isCorruptedDir(dir) {
klog.Warningf(log("dir[%s] is corrupted, error: %v, skip mount dir removal", dir, err))
return nil
}
return errors.New(log("mounter.TearDownAt failed to clean mount dir [%s]: %v", dir, err))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other places - if we can't cleanup the mountDir because of stat failures - then we don't usually finish teardown process and we are in fact letting the pod hang on purpose.

I think at very minimum this would result in - orphan pod directories left over on the node.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's original behavior if SMB server is down, and pod would be be terminating state forever, shall we fix this issue or mark as by design?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I am not sure. We do same thing (i.e leaving the pod hanging forever) for NFS too. I feel like we should do better job and somehow standardize around this. This also means that - code that touches unmount should have some safety built-in.

Copy link
Member

@gnufied gnufied Apr 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@drigz @andyzhangx the question is - do we even have to change this code at all? This code affects all CSI drivers and as per CSI spec - the removal of mount point directory should be done by the driver. I know traditionally before we fixed it, it was the kubelet which was creating the node-publish path, but it was a bug.

So after we fix that - shouldn't the CSI driver remove the publish path on NodeUnpublish and then just leave the cleanup of parent directories to the kubelet? And in which case - kubelet should have no trouble removing the path if NodeUnpublish succeeded.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see the CSI spec - https://github.com/container-storage-interface/spec/blob/master/spec.md#nodeunpublishvolume

 The SP MUST delete the file or directory it created at this path.
  // This is a REQUIRED field.

Copy link
Member Author

@andyzhangx andyzhangx Apr 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so with #101332, we should remove Line 369 in this PR?

following mount dir removal is not necessary?

if err := removeMountDir(c.plugin, dir); err != nil {
 ...
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that call also removes the parent directory, but removal of dir by the kubelet is not necessary and should be done by the driver. Having said that - I took a look at existing drivers and I think most of them don't remove the publish_path correctly on NodeUnpublish - so we can't remove that code right away but we have to capture this via release notes and give time to users.

@drigz
Copy link
Contributor

drigz commented Apr 21, 2021

Hope you don't mind a drive-by comment: in the case of "host is down", I believe stat will fail while the volume is mounted, but umount will work, and stat will work after that point. Could kubelet umount and then stat/delete in this case?

If the existing behavior can't be changed, I think we'll need a potentially-dangerous workaround that automatically does the umount outside of kubelet, as the terminating pods block our rollouts from proceeding.

@gnufied
Copy link
Member

gnufied commented Apr 21, 2021

I filed kubernetes-csi/csi-test#336 and #101332 as follow up items to move some of this responsibility into the driver.

@andyzhangx
Copy link
Member Author

close this PR, would fix the IsCorruptedMnt in mount-utils folder first: #101398

@andyzhangx andyzhangx closed this Apr 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

failed to unmount due to "host is down"
5 participants