fix: set "host is down" as corrupted mount #101398

Merged
merged 1 commit into from
Apr 27, 2021

Conversation

andyzhangx
Member

@andyzhangx andyzhangx commented Apr 23, 2021

What type of PR is this?

/kind bug

What this PR does / why we need it:

fix: set "host is down" as corrupted mount

When the SMB server is down, there is no way to terminate a pod that is using an SMB mount; kubelet reports the following error. This PR treats "host is down" as a corrupted mount directory.

Apr 20 11:11:52 aks-nonzone-17963928-vmss000000 kubelet[8516]: E0420 11:11:52.618206    8516 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/smb.csi.k8s.io^pvc-da830ab0-e8d2-4e0e-89f4-906a4e4398a4 podName:90ee5993-7f80-4fe3-9d88-7bc89b19cbfe nodeName:}" failed. No retries permitted until 2021-04-20 11:11:56.61815874 +0000 UTC m=+337.879942743 (durationBeforeRetry 4s). Error: "UnmountVolume.TearDown failed for volume \"persistent-storage\" (UniqueName: \"kubernetes.io/csi/smb.csi.k8s.io^pvc-da830ab0-e8d2-4e0e-89f4-906a4e4398a4\") pod \"90ee5993-7f80-4fe3-9d88-7bc89b19cbfe\" (UID: \"90ee5993-7f80-4fe3-9d88-7bc89b19cbfe\") : kubernetes.io/csi: mounter.TearDownAt failed to clean mount dir [/var/lib/kubelet/pods/90ee5993-7f80-4fe3-9d88-7bc89b19cbfe/volumes/kubernetes.io~csi/pvc-da830ab0-e8d2-4e0e-89f4-906a4e4398a4/mount]: stat /var/lib/kubelet/pods/90ee5993-7f80-4fe3-9d88-7bc89b19cbfe/volumes/kubernetes.io~csi/pvc-da830ab0-e8d2-4e0e-89f4-906a4e4398a4/mount: host is down"
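
The trailing "host is down" in the log above is the text form of the EHOSTDOWN errno carried by the failed stat call. A minimal sketch of that relationship, assuming Linux; the helper names `simulatedStatError` and `isHostDown` and the path `/mnt/smb` are illustrative and do not come from this PR:

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// simulatedStatError builds the kind of error os.Stat returns for a path on an
// unreachable SMB share. It is constructed by hand; no real mount is touched.
func simulatedStatError(path string) error {
	return &os.PathError{Op: "stat", Path: path, Err: syscall.EHOSTDOWN}
}

// isHostDown reports whether err wraps the EHOSTDOWN errno.
func isHostDown(err error) bool {
	return errors.Is(err, syscall.EHOSTDOWN)
}

func main() {
	err := simulatedStatError("/mnt/smb")
	fmt.Println(err)             // on Linux: stat /mnt/smb: host is down
	fmt.Println(isHostDown(err)) // true
}
```

Because the errno sits inside an *os.PathError, a check has to look at the wrapped error rather than compare the top-level error directly.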

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

fix: set "host is down" as corrupted mount

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

fix: set "host is down" as corrupted mount

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/storage Categorizes an issue or PR as relevant to SIG Storage. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 23, 2021
@andyzhangx
Member Author

/kind bug
/assign @msau42
/priority important-soon
/sig cloud-provider
/area provider/azure
/triage accepted

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. area/provider/azure Issues or PRs related to azure provider triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 23, 2021
@andyzhangx
Member Author

/retest

@andyzhangx
Member Author

/test pull-kubernetes-e2e-aks-engine-azure-file
/test pull-kubernetes-e2e-aks-engine-azure-disk-vmss

@andyzhangx
Member Author

/retest

@gnufied
Member

gnufied commented Apr 23, 2021

@andyzhangx The follow-up PR will be in the CSI driver repo, right? So that on NodeUnpublish the CSI driver removes the publish directory entirely, and hence that operation will be a no-op in kubelet.

@andyzhangx
Member Author

andyzhangx commented Apr 23, 2021

> @andyzhangx The follow up PR will be in the CSI driver repo right? So as on NodeUnpublish the CSI driver removes the publish directory entirely and hence that operation will be a NO-OP in kubelet.

@gnufied I found that this should be the only fix needed, since the SMB CSI driver also uses this IsCorruptedMnt func. With this PR merged, and after the mount-utils vendor is updated in the CSI driver, the SMB CSI driver can unmount a "host is down" mount point correctly.

@gnufied
Member

gnufied commented Apr 23, 2021

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 23, 2021
@@ -52,7 +52,7 @@ func IsCorruptedMnt(err error) bool {
 		underlyingError = pe.Err
 	}

-	return underlyingError == syscall.ENOTCONN || underlyingError == syscall.ESTALE || underlyingError == syscall.EIO || underlyingError == syscall.EACCES
+	return underlyingError == syscall.ENOTCONN || underlyingError == syscall.ESTALE || underlyingError == syscall.EIO || underlyingError == syscall.EACCES || underlyingError == syscall.EHOSTDOWN
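
The effect of the added errno can be seen in isolation. A minimal sketch of the check after the change, assuming a Unix platform where syscall.EHOSTDOWN is defined; the lowercase `isCorruptedMnt` and the two sample errors are illustrative stand-ins, and this version only unwraps *os.PathError, which may be narrower than the real helper:

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// Sample errors shaped like what os.Stat returns for a mount path.
// The path is hypothetical.
var (
	errHostDown = &os.PathError{Op: "stat", Path: "/mnt/smb", Err: syscall.EHOSTDOWN}
	errNotExist = &os.PathError{Op: "stat", Path: "/mnt/smb", Err: syscall.ENOENT}
)

// isCorruptedMnt is an illustrative stand-in for IsCorruptedMnt after this PR:
// it unwraps a PathError and compares the errno against the accepted set.
func isCorruptedMnt(err error) bool {
	if err == nil {
		return false
	}
	var underlyingError error = err
	if pe, ok := err.(*os.PathError); ok {
		underlyingError = pe.Err
	}
	return underlyingError == syscall.ENOTCONN ||
		underlyingError == syscall.ESTALE ||
		underlyingError == syscall.EIO ||
		underlyingError == syscall.EACCES ||
		underlyingError == syscall.EHOSTDOWN // the errno added by this PR
}

func main() {
	fmt.Println(isCorruptedMnt(errHostDown)) // true
	fmt.Println(isCorruptedMnt(errNotExist)) // false
}
```

A missing path (ENOENT) is deliberately not in the set, so it keeps being handled as "does not exist" rather than as a corrupted mount.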
Member

I wonder, though, whether we should separate corrupted mounts from socket/host-not-connected errors. But maybe we could consider them the same. cc @chakri-nelluri

Member

When the host is down, does that actually mean the mount is gone and cleaned up on the client side? I am wondering whether this will cause stale/leaking mounts, and whether we need to do something like a force unmount instead.

Member Author

@andyzhangx andyzhangx Apr 24, 2021

Here is how this "host is down" mount point gets unmounted in the CSI driver. The SMB CSI driver invokes the CleanupMountPoint func to unmount the mount point:

  • Without this PR, PathExists returns false, err directly, and then CleanupMountPoint returns "Warning: Unmount skipped because path does not exist" (it actually does nothing).
  • With this PR, PathExists returns true, err, and CleanupMountPoint finally invokes doCleanupMountPoint to unmount, which fixes this issue.

So with this PR, the CSI driver should unmount a corrupted mount point if it uses the CleanupMountPoint func. @jingxu97

func PathExists(path string) (bool, error) {
	_, err := os.Stat(path)
	if err == nil {
		return true, nil
	} else if os.IsNotExist(err) {
		return false, nil
	} else if IsCorruptedMnt(err) {
		// A corrupted mount still counts as existing; the error is passed along.
		return true, err
	}
	return false, err
}

func CleanupMountPoint(mountPath string, mounter Interface, extensiveMountPointCheck bool) error {
	pathExists, pathErr := PathExists(mountPath)
	if !pathExists && pathErr == nil {
		klog.Warningf("Warning: Unmount skipped because path does not exist: %v", mountPath)
		return nil
	}
	corruptedMnt := IsCorruptedMnt(pathErr)
	if pathErr != nil && !corruptedMnt {
		return fmt.Errorf("Error checking path: %v", pathErr)
	}
	// Reached for healthy paths and for corrupted mounts.
	return doCleanupMountPoint(mountPath, mounter, extensiveMountPointCheck, corruptedMnt)
}
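
The branching of the two functions quoted above can be traced without a real mount. The following is a rough sketch under stated assumptions: `statFunc`, `hostDownStat`, `missingStat`, `isCorrupted`, and `cleanupDecision` are hypothetical names, the stat call is faked, and the `treatHostDown` flag stands in for "with this PR" versus "without this PR". It follows the CleanupMountPoint body as quoted here; other versions of the helper may take a different branch before the fix.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// statFunc stands in for os.Stat so a dead SMB server can be simulated.
type statFunc func(path string) error

func hostDownStat(path string) error {
	return &os.PathError{Op: "stat", Path: path, Err: syscall.EHOSTDOWN}
}

func missingStat(path string) error {
	return &os.PathError{Op: "stat", Path: path, Err: syscall.ENOENT}
}

// isCorrupted models the errno check; treatHostDown toggles the PR's change.
func isCorrupted(err error, treatHostDown bool) bool {
	if errors.Is(err, syscall.ENOTCONN) || errors.Is(err, syscall.ESTALE) ||
		errors.Is(err, syscall.EIO) || errors.Is(err, syscall.EACCES) {
		return true
	}
	return treatHostDown && errors.Is(err, syscall.EHOSTDOWN)
}

// pathExists follows the same branches as the quoted PathExists.
func pathExists(path string, stat statFunc, treatHostDown bool) (bool, error) {
	err := stat(path)
	if err == nil {
		return true, nil
	} else if os.IsNotExist(err) {
		return false, nil
	} else if isCorrupted(err, treatHostDown) {
		return true, err
	}
	return false, err
}

// cleanupDecision follows the quoted CleanupMountPoint but only reports
// which branch would run instead of unmounting anything.
func cleanupDecision(path string, stat statFunc, treatHostDown bool) string {
	exists, err := pathExists(path, stat, treatHostDown)
	if !exists && err == nil {
		return "skip: path does not exist"
	}
	corrupted := isCorrupted(err, treatHostDown)
	if err != nil && !corrupted {
		return "error checking path"
	}
	return "unmount"
}

func main() {
	p := "/mnt/smb" // hypothetical mount path
	fmt.Println(cleanupDecision(p, hostDownStat, false)) // error checking path
	fmt.Println(cleanupDecision(p, hostDownStat, true))  // unmount
	fmt.Println(cleanupDecision(p, missingStat, true))   // skip: path does not exist
}
```

In this sketch the unmount branch is only reached for a "host is down" path once EHOSTDOWN is in the corrupted set, which is the behavior the comment above describes.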

Contributor

How about the Windows case? Can the Windows version of the function already handle this case?

Member Author

@jingxu97 good point, I added a Windows fix.
This specific SMB unmount issue does not actually exist on Windows, since CleanupMountPoint on Windows just removes the target directory directly. Anyway, we should also fix the IsCorruptedMnt issue on Windows.

func CleanupMountPoint(m *mount.SafeFormatAndMount, target string, extensiveMountCheck bool) error {
	proxy, ok := m.Interface.(*mounter.CSIProxyMounter)
	if !ok {
		return fmt.Errorf("could not cast to csi proxy class")
	}
	return proxy.Rmdir(target)
}

add HOSTDOWN code for Windows
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 24, 2021
@andyzhangx
Member Author

/retest

@msau42
Member

msau42 commented Apr 27, 2021

/lgtm

/assign @jingxu97

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 27, 2021
@msau42
Member

msau42 commented Apr 27, 2021

/retest

@jingxu97
Contributor

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx, jingxu97

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 27, 2021
@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@andyzhangx
Member Author

This PR is cherry-picked to the 1.20.7, 1.21.1, and 1.22.0 releases.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/provider/azure Issues or PRs related to azure provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
6 participants