-
Notifications
You must be signed in to change notification settings - Fork 40.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix unmount failure for SMB volume in host is down
state
#101305
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: andyzhangx The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest |
Host is down
statehost is down
state
/test pull-kubernetes-e2e-aks-engine-azure-file-windows-dockershim |
/test pull-kubernetes-e2e-kind-ipv6 |
@andyzhangx: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
if isCorruptedDir(dir) { | ||
klog.Warningf(log("dir[%s] is corrupted, error: %v, skip mount dir removal", dir, err)) | ||
return nil | ||
} | ||
return errors.New(log("mounter.TearDownAt failed to clean mount dir [%s]: %v", dir, err)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In other places - if we can't cleanup the mountDir because of stat
failures - then we don't usually finish teardown process and we are in fact letting the pod hang on purpose.
I think at very minimum this would result in - orphan pod directories left over on the node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's original behavior if SMB server is down, and pod would be be terminating state forever, shall we fix this issue or mark as by design?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I am not sure. We do same thing (i.e leaving the pod hanging forever) for NFS too. I feel like we should do better job and somehow standardize around this. This also means that - code that touches unmount should have some safety built-in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@drigz @andyzhangx the question is - do we even have to change this code at all? This code affects all CSI drivers and as per CSI spec - the removal of mount point directory should be done by the driver. I know traditionally before we fixed it, it was the kubelet which was creating the node-publish path, but it was a bug.
So after we fix that - shouldn't the CSI driver remove the publish path on NodeUnpublish
and then just leave the cleanup of parent directories to the kubelet? And in which case - kubelet should have no trouble removing the path if NodeUnpublish
succeeded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see the CSI spec - https://github.com/container-storage-interface/spec/blob/master/spec.md#nodeunpublishvolume
The SP MUST delete the file or directory it created at this path.
// This is a REQUIRED field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so with #101332, we should remove Line 369 in this PR?
following mount dir removal is not necessary?
if err := removeMountDir(c.plugin, dir); err != nil {
...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that call also removes the parent directory, but removal of dir
by the kubelet is not necessary and should be done by the driver. Having said that - I took a look at existing drivers and I think most of them don't remove the publish_path correctly on NodeUnpublish
- so we can't remove that code right away but we have to capture this via release notes and give time to users.
Hope you don't mind a drive-by comment: in the case of "host is down", I believe stat will fail while the volume is mounted, but If the existing behavior can't be changed, I think we'll need a potentially-dangerous workaround that automatically does the |
I filed kubernetes-csi/csi-test#336 and #101332 as follow up items to move some of this responsibility into the driver. |
close this PR, would fix the |
What type of PR is this?
/kind bug
What this PR does / why we need it:
fix unmount failure for SMB volume in
Host is down
stateWhen SMB server is down, there is no way to terminate pod which is using SMB mount, would get following error. This PR regard
host is down
as corrupted mount dir, and then skipUnmountVolume.TearDown
process ifhost is down
Which issue(s) this PR fixes:
Fixes kubernetes-csi/csi-driver-smb#64
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
/assign @msau42
/kind bug
/priority important-soon
/sig storage
/triage accepted