
Return more useful error information when a persistent volume fails to mount #23122

Merged (1 commit into kubernetes:master, Apr 13, 2016)
Conversation

screeley44 (Contributor)

Fixes #22992
Fixes #23048

Resolves the following UX issue: volume plugin errors are not surfaced properly to pod describe events when a failure occurs while building a PV/PVC volume. For example, the plugin builder fails if secrets or endpoints are missing (rbd or glusterfs), but the user only sees a generic error with an incorrect meaning: "unsupported volume type".

See issues #22992 and #23048 for more details.

k8s-bot commented Mar 17, 2016

Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist")

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

2 similar comments from k8s-bot followed on Mar 17, 2016.

k8s-github-robot
Labelling this PR as size/XS

k8s-github-robot added the size/XS label (changes 0-9 lines, ignoring generated files) on Mar 17, 2016
screeley44 (Contributor, Author)

@smarterclayton @pmorie @pweil- please review

@@ -34,7 +35,7 @@ import (
 	"k8s.io/kubernetes/pkg/volume"
 )
 
-var errUnsupportedVolumeType = fmt.Errorf("unsupported volume type")
+var errUnsupportedVolumeType = fmt.Errorf("failed volume mount, check previous pod describe events or logs for more details")
Review comment (Contributor):

@screeley44 @pmorie - I took a look at the usages of this error. It looks like some of the usages (NewWrapperBuilder, NewWrapperCleaner) will actually hide the underlying error from the New* call and some don't (mountExternalVolumes). Should we go ahead and standardize that too?

I'll defer to folks who are in this code more than me.

pmorie (Member) commented Mar 18, 2016

There are a few different problems to untangle here:

  1. NewWrapperBuilder and NewWrapperCleaner in volumes.go both swallow errors when making their respective result objects; we should instead be passing those back.
  2. newVolumeBuilderFromPlugins should return either nil and an error, or a builder and no error; not locating a volume plugin should be considered an error. Remove the check for a nil builder in mountExternalVolumes (see the sketch after this comment).
  3. "unsupported volume type" doesn't mean anything to anyone -- the error message for this case should be something like "Unable to find a volume plugin that can handle volume of type X". There's some overlap with the work @pweil- is doing with volumes and OpenShift SCCs (not yet ported upstream), which has code for getting user-facing labels from Volume specs that could provide a meaningful guess at the volume type, but in practice this error message should almost never happen.

I believe if we make these changes both the error message in the pod status and the message in the event will have the accurate root cause.

How does that sound @screeley44 @pweil-?

cc @saad-ali @thockin
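
A minimal sketch of suggestions 1 and 2 combined, using the names discussed above; the exact signatures and messages are illustrative assumptions, not the final patch:

func (kl *Kubelet) newVolumeBuilderFromPlugins(spec *volume.Spec, pod *api.Pod, opts volume.VolumeOptions) (volume.Builder, error) {
	plugin, err := kl.volumePluginMgr.FindPluginBySpec(spec)
	if err != nil {
		return nil, fmt.Errorf("can't use volume plugins for %s: %v", spec.Name(), err)
	}
	if plugin == nil {
		// Suggestion 2: not locating a plugin is an error, so callers like
		// mountExternalVolumes no longer need a nil-builder check.
		return nil, fmt.Errorf("unable to find a volume plugin that can handle volume %s", spec.Name())
	}
	builder, err := plugin.NewBuilder(spec, pod, opts)
	if err != nil {
		// Suggestion 1: pass the plugin's error back instead of swallowing it.
		return nil, fmt.Errorf("failed to instantiate volume builder for %s, plugin %s: %v", spec.Name(), plugin.Name(), err)
	}
	return builder, nil
}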

pweil- (Contributor) commented Mar 18, 2016

NewWrapperBuilder and NewWrapperCleaner in volumes.go both swallow errors when making their respective result objects; we should instead be passing those back

Yes, agree

newVolumeBuilderFromPlugins should return a nil and an error, or a builder and no error, and not locating a volume plugin should be considered an error. Remove the check for a nil builder in mountExternalVolumes

I like this better. I noticed in the review (#23122 (comment)) that we have to go through extra gyrations because methods can return nil, nil, which is "acceptable" but then just gets turned into an error anyway. I'd rather bubble up a "plugin not registered" (or similar) error and simplify the calling code.

saad-ali (Member)

👍 to @pmorie's suggestions

k8s-github-robot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Mar 26, 2016
k8s-github-robot
@screeley44 PR needs rebase

dchen1107 (Member)

+1 on @pmorie's comment: #23122 (comment)

@screeley44 Are you going to continue with this PR, following @pmorie's suggestions?

screeley44 (Contributor, Author)

@dchen1107 @pmorie - Yes, I'd like to continue. I've been off for spring break and Easter, but I'm back now and plan to pick this back up.

k8s-github-robot removed the needs-rebase label on Mar 29, 2016
k8s-github-robot
Labelling this PR as size/M

k8s-github-robot added the size/M label (changes 30-99 lines, ignoring generated files) and removed the size/XS label on Mar 30, 2016
screeley44 (Contributor, Author)

@pmorie @pweil- please review what I did. I believe this exposes all root-cause errors to the user in the describe event; I tried to test all conditions thoroughly, but I may have missed something or there may be a better approach.

Events:
  FirstSeen LastSeen    Count   From            SubobjectPath   Type        Reason      Message
  --------- --------    -----   ----            -------------   --------    ------      -------
  12s       12s     1   {default-scheduler }            Normal      Scheduled   Successfully assigned bb-gluster-pod2 to 127.0.0.1
  12s       1s      2   {kubelet 127.0.0.1}         Warning     FailedMount Unable to mount volumes for pod "bb-gluster-pod2_default(9fd3f3a2-f61a-11e5-b038-52540092b5fb)": failed to instantiate volume: glustervol for plugin: kubernetes.io/persistent-claim with root cause: failed to instantiate volume: pv-gluster for plugin: kubernetes.io/glusterfs with root cause: endpoints "glusterfs-cluster" not found

I also removed the generic errUnsupportedVolumeType error and added a method in its place. I tried to replicate error scenarios where it might get called but couldn't find any; maybe it doesn't hurt to keep it around just in case?

For this part:

func (kl *Kubelet) newVolumeMounterFromPlugins(spec *volume.Spec, pod *api.Pod, opts volume.VolumeOptions) (volume.Mounter, error) {
    plugin, err := kl.volumePluginMgr.FindPluginBySpec(spec)
    if err != nil {
        return nil, fmt.Errorf("can't use volume plugins for %s: %v", spec.Name(), err)
    }
    if plugin == nil {
        // Not found but not an error
        return nil, nil
    }

the if plugin == nil branch (I'm not sure how this would ever happen; we would either get an error or get the plugin, correct?), so I'm not sure if we should remove it, leave it as is, or do:

return nil, createVolumeTypeError(volName) 
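
For reference, that option in context would look like the following sketch; passing spec.Name() to createVolumeTypeError is an assumption, since the helper's signature isn't shown here:

	if plugin == nil {
		// Instead of silently returning nil, nil, fail loudly so the caller
		// never has to special-case a missing mounter.
		return nil, createVolumeTypeError(spec.Name())
	}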

screeley44 (Contributor, Author)

@dchen1107 - any thoughts on this so far?

erinboyd
@pmorie @saad-ali @pweil- @smarterclayton where are we with this? Can we please merge this before ANOTHER rebase?

smarterclayton (Contributor)

@kubernetes/sig-storage

@@ -34,8 +34,6 @@ import (
 	"k8s.io/kubernetes/pkg/volume"
 )
 
-var errUnsupportedVolumeType = fmt.Errorf("unsupported volume type")
-
Review comment:

Hmm... do we really not need this, or can we move it somewhere where we are sure that the volume is not supported?

Reply (Contributor, Author):

This is removed and replaced with a more generic error, createVolumeTypeError, which in theory should never be called from what I can tell. I was asking whether we can remove it altogether (errUnsupportedVolumeType was the original, and I removed it in this PR), but I replaced it with a function to eventually help generate a better error. As I wondered in the PR, I don't know if that error would ever get hit, since if the mounter fails in any way the error is now surfaced where before it wasn't. So: can we remove it altogether, or should we keep it around for some unforeseen scenario? @swagiaal

Reply:

Okay I am convinced it is not needed after reading the rest of the patch :)

Reply (Member):

It should not be needed, and there's no utility to keeping it around 'just in case'. YAGNI

// - an attacher if one exists
// - an error if no plugin was found for the volume
// or the attacher failed to instantiate
// - nil if there is no appropriate attacher for this volume
Review comment (Member):

This is kind of the exception to the whole sentence rule -- this is fine as is re: the specification of return values

screeley44 (Contributor, Author)

The goal of this PR was to surface plugin errors that were previously being eaten (i.e. errors that come from nfs.go, glusterfs.go, etc.) and have them show properly in 'describe pod' so users can see them without having to search the logs. This accomplishes that, and all other errors that traditionally showed in the 'describe' output are not affected. After this I plan to continue with additional PRs and issues to address other usability concerns around error handling and other types of errors being eaten.
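
The nested messages in the examples below come from each layer wrapping the error of the layer beneath it before returning. A minimal sketch of the pattern, inferred from the log output rather than copied from the patch:

	mounter, err := plugin.NewMounter(spec, pod, opts)
	if err != nil {
		// For a PVC, the persistent-claim plugin fails first and its error
		// already contains the underlying plugin's (e.g. glusterfs) root
		// cause, so the final message chains two "root cause" clauses.
		return nil, fmt.Errorf("failed to instantiate mounter for volume: %s using plugin: %s with a root cause: %v",
			spec.Name(), plugin.Name(), err)
	}
	return mounter, nil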

Below are a few example outputs from these changes: a plugin-generated error that is now surfaced with its true root cause, and an existing non-plugin error that has always been surfaced and is unchanged:

Missing endpoints from GlusterFS using PV (previously being eaten):

Events:
  FirstSeen LastSeen    Count   From            SubobjectPath   Type        Reason      Message
  --------- --------    -----   ----            -------------   --------    ------      -------
  15s       15s     1   {default-scheduler }            Normal      Scheduled   Successfully assigned bb-gluster-pod2 to 127.0.0.1
  15s       4s      2   {kubelet 127.0.0.1}         Warning     FailedMount Unable to mount volumes for pod "bb-gluster-pod2_default(72a91ba0-fccb-11e5-9f13-52540092b5fb)": failed to instantiate mounter for volume: glustervol using plugin: kubernetes.io/persistent-claim with a root cause: failed to instantiate mounter for volume: pv-gluster using plugin: kubernetes.io/glusterfs with a root cause: endpoints "glusterfs-cluster" not found

kubelet log output:
E0407 10:17:23.715015   17134 glusterfs.go:83] glusterfs: failed to get endpoints glusterfs-cluster[endpoints "glusterfs-cluster" not found]
E0407 10:17:23.715030   17134 persistent_claim.go:85] Error creating mounter for claim: glusterfs-claim
E0407 10:17:23.715037   17134 volumes.go:125] Could not create volume mounter for pod 72a91ba0-fccb-11e5-9f13-52540092b5fb: failed to instantiate mounter for volume: glustervol using plugin: kubernetes.io/persistent-claim with a root cause: failed to instantiate mounter for volume: pv-gluster using plugin: kubernetes.io/glusterfs with a root cause: endpoints "glusterfs-cluster" not found
E0407 10:17:23.715056   17134 kubelet.go:1780] Unable to mount volumes for pod "bb-gluster-pod2_default(72a91ba0-fccb-11e5-9f13-52540092b5fb)": failed to instantiate mounter for volume: glustervol using plugin: kubernetes.io/persistent-claim with a root cause: failed to instantiate mounter for volume: pv-gluster using plugin: kubernetes.io/glusterfs with a root cause: endpoints "glusterfs-cluster" not found; skipping pod
E0407 10:17:23.715062   17134 pod_workers.go:138] Error syncing pod 72a91ba0-fccb-11e5-9f13-52540092b5fb, skipping: failed to instantiate mounter for volume: glustervol using plugin: kubernetes.io/persistent-claim with a root cause: failed to instantiate mounter for volume: pv-gluster using plugin: kubernetes.io/glusterfs with a root cause: endpoints "glusterfs-cluster" not found
I0407 10:17:23.715099   17134 server.go:608] Event(api.ObjectReference{Kind:"Pod", Namespace:"default", Name:"bb-gluster-pod2", UID:"72a91ba0-fccb-11e5-9f13-52540092b5fb", APIVersion:"v1", ResourceVersion:"149", FieldPath:""}): type: 'Warning' reason: 'FailedMount' Unable to mount volumes for pod "bb-gluster-pod2_default(72a91ba0-fccb-11e5-9f13-52540092b5fb)": failed to instantiate mounter for volume: glustervol using plugin: kubernetes.io/persistent-claim with a root cause: failed to instantiate mounter for volume: pv-gluster using plugin: kubernetes.io/glusterfs with a root cause: endpoints "glusterfs-cluster" not found
I0407 10:17:23.715140   17134 server.go:608] Event(api.ObjectReference{Kind:"Pod", Namespace:"default", Name:"bb-gluster-pod2", UID:"72a91ba0-fccb-11e5-9f13-52540092b5fb", APIVersion:"v1", ResourceVersion:"149", FieldPath:""}): type: 'Warning' reason: 'FailedSync' Error syncing pod, skipping: failed to instantiate mounter for volume: glustervol using plugin: kubernetes.io/persistent-claim with a root cause: failed to instantiate mounter for volume: pv-gluster using plugin: kubernetes.io/glusterfs with a root cause: endpoints "glusterfs-cluster" not found

NFS Connection Timeout using PV (timeout error remains the same as this does not stem from plugins directly):

Events:
  FirstSeen LastSeen    Count   From            SubobjectPath   Type        Reason      Message
  --------- --------    -----   ----            -------------   --------    ------      -------
  4m        4m      1   {default-scheduler }            Normal      Scheduled   Successfully assigned nfs-bb-pod2-pvc to 127.0.0.1
  2m        8s      2   {kubelet 127.0.0.1}         Warning     FailedMount Unable to mount volumes for pod "nfs-bb-pod2-pvc_default(0304472c-fcca-11e5-9f13-52540092b5fb)": Mount failed: exit status 32
Mounting arguments: nfs1.rhs:/opt/data12 /var/lib/kubelet/pods/0304472c-fcca-11e5-9f13-52540092b5fb/volumes/kubernetes.io~nfs/pv-nfs nfs []
Output: mount.nfs: Connection timed out


kubelet log output:
I0407 10:07:06.898345   17134 kubelet.go:3242] Generating status for "nfs-bb-pod2-pvc_default(0304472c-fcca-11e5-9f13-52540092b5fb)"
I0407 10:07:06.911292   17134 manager.go:386] Status for pod "nfs-bb-pod2-pvc_default(0304472c-fcca-11e5-9f13-52540092b5fb)" updated successfully: {status:{Phase:Pending Conditions:[{Type:Ready Status:False LastProbeTime:{Time:{sec:0 nsec:0 loc:0x2e4da80}} LastTransitionTime:{Time:{sec:63595634826 nsec:0 loc:0x2e70c60}} Reason:ContainersNotReady Message:containers with unready status: [nfs-bb-pod2-pvc]}] Message: Reason: HostIP:127.0.0.1 PodIP: StartTime:0xc2089fde60 ContainerStatuses:[{Name:nfs-bb-pod2-pvc State:{Waiting:0xc2089fde00 Running:<nil> Terminated:<nil>} LastTerminationState:{Waiting:<nil> Running:<nil> Terminated:<nil>} Ready:false RestartCount:0 Image:busybox ImageID: ContainerID:}]} version:1 podName:nfs-bb-pod2-pvc podNamespace:default}
E0407 10:09:17.994381   17134 kubelet.go:1780] Unable to mount volumes for pod "nfs-bb-pod2-pvc_default(0304472c-fcca-11e5-9f13-52540092b5fb)": Mount failed: exit status 32
Mounting arguments: nfs1.rhs:/opt/data12 /var/lib/kubelet/pods/0304472c-fcca-11e5-9f13-52540092b5fb/volumes/kubernetes.io~nfs/pv-nfs nfs []
Output: mount.nfs: Connection timed out

; skipping pod
E0407 10:09:17.994399   17134 pod_workers.go:138] Error syncing pod 0304472c-fcca-11e5-9f13-52540092b5fb, skipping: Mount failed: exit status 32
Mounting arguments: nfs1.rhs:/opt/data12 /var/lib/kubelet/pods/0304472c-fcca-11e5-9f13-52540092b5fb/volumes/kubernetes.io~nfs/pv-nfs nfs []
Output: mount.nfs: Connection timed out

I0407 10:09:17.994429   17134 server.go:608] Event(api.ObjectReference{Kind:"Pod", Namespace:"default", Name:"nfs-bb-pod2-pvc", UID:"0304472c-fcca-11e5-9f13-52540092b5fb", APIVersion:"v1", ResourceVersion:"68", FieldPath:""}): type: 'Warning' reason: 'FailedMount' Unable to mount volumes for pod "nfs-bb-pod2-pvc_default(0304472c-fcca-11e5-9f13-52540092b5fb)": Mount failed: exit status 32
Mounting arguments: nfs1.rhs:/opt/data12 /var/lib/kubelet/pods/0304472c-fcca-11e5-9f13-52540092b5fb/volumes/kubernetes.io~nfs/pv-nfs nfs []
Output: mount.nfs: Connection timed out

I0407 10:09:17.994468   17134 server.go:608] Event(api.ObjectReference{Kind:"Pod", Namespace:"default", Name:"nfs-bb-pod2-pvc", UID:"0304472c-fcca-11e5-9f13-52540092b5fb", APIVersion:"v1", ResourceVersion:"68", FieldPath:""}): type: 'Warning' reason: 'FailedSync' Error syncing pod, skipping: Mount failed: exit status 32
Mounting arguments: nfs1.rhs:/opt/data12 /var/lib/kubelet/pods/0304472c-fcca-11e5-9f13-52540092b5fb/volumes/kubernetes.io~nfs/pv-nfs nfs []
Output: mount.nfs: Connection timed out

I0407 10:09:29.817521   17134 kubelet.go:3242] Generating status for "nfs-bb-pod2-pvc_default(0304472c-fcca-11e5-9f13-52540092b5fb)"
I0407 10:09:29.817652   17134 manager.go:277] Ignoring same status for pod "nfs-bb-pod2-pvc_default(0304472c-fcca-11e5-9f13-52540092b5fb)", status: {Phase:Pending Conditions:[{Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2016-04-07 10:07:06 -0400 EDT Reason:ContainersNotReady Message:containers with unready status: [nfs-bb-pod2-pvc]}] Message: Reason: HostIP:127.0.0.1 PodIP: StartTime:2016-04-07 10:07:06 -0400 EDT ContainerStatuses:[{Name:nfs-bb-pod2-pvc State:{Waiting:0xc208ab6740 Running:<nil> Terminated:<nil>} LastTerminationState:{Waiting:<nil> Running:<nil> Terminated:<nil>} Ready:false RestartCount:0 Image:busybox ImageID: ContainerID:}]}
E0407 10:11:40.906401   17134 kubelet.go:1780] Unable to mount volumes for pod "nfs-bb-pod2-pvc_default(0304472c-fcca-11e5-9f13-52540092b5fb)": Mount failed: exit status 32
Mounting arguments: nfs1.rhs:/opt/data12 /var/lib/kubelet/pods/0304472c-fcca-11e5-9f13-52540092b5fb/volumes/kubernetes.io~nfs/pv-nfs nfs []
Output: mount.nfs: Connection timed out

@pmorie

pmorie (Member) commented Apr 7, 2016

@screeley44 I would change the title of this PR (since it will become an entry in the release notes file) to be something like 'Return more useful error information when a persistent volume fails to mount'.

@saad-ali Do you have bandwidth to give this a review and then one of us can tag?

pmorie assigned pmorie and unassigned dchen1107 on Apr 7, 2016
saad-ali (Member) commented Apr 7, 2016

@saad-ali Do you have bandwidth to give this a review and then one of us can tag?

Will do.

screeley44 changed the title from "reword volume mount errUnsupportedVolume error and surface true error in describe event" to "Return more useful error information when a persistent volume fails to mount from volume type plugin generated error" on Apr 7, 2016
screeley44 changed the title to "Return more useful error information when a persistent volume fails to mount" on Apr 7, 2016
saad-ali assigned saad-ali and unassigned pmorie on Apr 7, 2016
saad-ali (Member) commented Apr 7, 2016

LGTM

saad-ali added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged) on Apr 7, 2016
pmorie (Member) commented Apr 7, 2016

Thanks for the patch @screeley44. I think this is one of the big usability issues w/ PVs today.

k8s-github-robot
@k8s-bot ok to test
@k8s-bot test this

pr builder appears to be missing, activating due to 'lgtm' label.

k8s-bot commented Apr 7, 2016

GCE e2e build/test passed for commit 36970de.

k8s-github-robot
The author of this PR is not in the whitelist for merge, can one of the admins add the 'ok-to-merge' label?

markturansky (Contributor)

Thanks for the patch @screeley44. I think this is one of the big usability issues w/ PVs today.

+1

k8s-github-robot
@k8s-bot test this

Tests are more than 48 hours old. Re-running tests.

k8s-bot commented Apr 9, 2016

GCE e2e build/test failed for commit 36970de.

Please reference the list of currently known flakes when examining this failure. If you request a re-test, you must reference the issue describing the flake.

screeley44 (Contributor, Author)

@pmorie - am I supposed to do something about the failed Jenkins run? I have not found the cause of the failure yet.

screeley44 (Contributor, Author)

@k8s-bot test this issue: #24148

1 similar comment:
ghost commented Apr 12, 2016

@k8s-bot test this issue: #24148

k8s-bot commented Apr 12, 2016

GCE e2e build/test passed for commit 36970de.

lavalamp merged commit 498dc41 into kubernetes:master on Apr 13, 2016
Labels: lgtm, release-note, size/M