
Return more useful error information when a persistent volume fails to mount #23122

Merged (1 commit into kubernetes:master, Apr 13, 2016)
Conversation

screeley44 (Contributor)

Fixes #22992
Fixes #23048

Resolves the following UX issue: volume plugin errors are not surfaced properly to pod describe events when a failure occurs while building a PV/PVC volume. For example, the plugin builder fails if secrets or endpoints are missing (rbd or glusterfs), but the user only sees a generic error with an incorrect meaning: "unsupported volume type".

See issues #22992 and #23048 for more details.

k8s-bot commented Mar 17, 2016

Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist")

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

2 similar comments from k8s-bot followed on Mar 17, 2016.

k8s-github-robot
Labelling this PR as size/XS

k8s-github-robot added the size/XS label (changes 0-9 lines, ignoring generated files) on Mar 17, 2016
screeley44 (Contributor, Author)

@smarterclayton @pmorie @pweil- please review

@@ -34,7 +35,7 @@ import (
 	"k8s.io/kubernetes/pkg/volume"
 )
 
-var errUnsupportedVolumeType = fmt.Errorf("unsupported volume type")
+var errUnsupportedVolumeType = fmt.Errorf("failed volume mount, check previous pod describe events or logs for more details")
Review comment (Contributor):

@screeley44 @pmorie - I took a look at the usages of this error. It looks like some of the usages (NewWrapperBuilder, NewWrapperCleaner) will actually hide the underlying error from the New* call and some don't (mountExternalVolumes). Should we go ahead and standardize that too?

I'll defer to folks who are in this code more than me.

pmorie (Member) commented Mar 18, 2016

There are a few different problems to untangle here:

  1. NewWrapperBuilder and NewWrapperCleaner in volumes.go both swallow errors when making their respective result objects; we should instead be passing those back.
  2. newVolumeBuilderFromPlugins should return either nil and an error, or a builder and no error; not locating a volume plugin should be considered an error. Remove the check for a nil builder in mountExternalVolumes (see the sketch after this comment).
  3. "unsupported volume type" doesn't mean anything to anyone -- the error message for this case should be something like "Unable to find a volume plugin that can handle volume of type X". There's some overlap with the work @pweil- is doing with volumes and OpenShift SCCs (not yet ported upstream), which has code for getting user-facing labels from Volume specs that could provide a meaningful guess at the volume type, but in practice this error message should almost never happen.

I believe if we make these changes both the error message in the pod status and the message in the event will have the accurate root cause.

How does that sound @screeley44 @pweil-?

cc @saad-ali @thockin
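
A minimal sketch of suggestions 1 and 2 combined, using the names discussed above; the exact signatures and messages are illustrative assumptions, not the final patch:

func (kl *Kubelet) newVolumeBuilderFromPlugins(spec *volume.Spec, pod *api.Pod, opts volume.VolumeOptions) (volume.Builder, error) {
	plugin, err := kl.volumePluginMgr.FindPluginBySpec(spec)
	if err != nil {
		return nil, fmt.Errorf("can't use volume plugins for %s: %v", spec.Name(), err)
	}
	if plugin == nil {
		// Suggestion 2: not locating a plugin is an error, so callers like
		// mountExternalVolumes no longer need a nil-builder check.
		return nil, fmt.Errorf("unable to find a volume plugin that can handle volume %s", spec.Name())
	}
	builder, err := plugin.NewBuilder(spec, pod, opts)
	if err != nil {
		// Suggestion 1: pass the plugin's error back instead of swallowing it.
		return nil, fmt.Errorf("failed to instantiate volume builder for %s, plugin %s: %v", spec.Name(), plugin.Name(), err)
	}
	return builder, nil
}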

pweil- (Contributor) commented Mar 18, 2016

NewWrapperBuilder and NewWrapperCleaner in volumes.go both swallow errors when making their respective result objects; we should instead be passing those back

Yes, agree

newVolumeBuilderFromPlugins should return a nil and an error, or a builder and no error, and not locating a volume plugin should be considered an error. Remove the check for a nil builder in mountExternalVolumes

I like this better. I noticed in the review (#23122 (comment)) that we have to go through extra gyrations because methods can return nil, nil, which is "acceptable" but then just gets turned into an error anyway. I'd rather bubble up a "plugin not registered" (or similar) error and simplify the calling code.

saad-ali (Member)

👍 to @pmorie's suggestions

k8s-github-robot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Mar 26, 2016
k8s-github-robot
@screeley44 PR needs rebase

dchen1107 (Member)

+1 on @pmorie's comment: #23122 (comment)

@screeley44 Are you going to continue with this PR, following @pmorie's suggestions?

screeley44 (Contributor, Author)

@dchen1107 @pmorie - Yes, I'd like to continue. I've been off for spring break and Easter, but I'm back now and plan to pick this back up.

k8s-github-robot removed the needs-rebase label on Mar 29, 2016
k8s-github-robot
Labelling this PR as size/M

k8s-github-robot added the size/M label (changes 30-99 lines, ignoring generated files) and removed the size/XS label on Mar 30, 2016
screeley44 (Contributor, Author)

@pmorie @pweil- please review what I did. I believe this exposes all root-cause errors to the user in the describe event; I tried to test all conditions thoroughly, but I may have missed something or there may be a better approach.

Events:
  FirstSeen LastSeen    Count   From            SubobjectPath   Type        Reason      Message
  --------- --------    -----   ----            -------------   --------    ------      -------
  12s       12s     1   {default-scheduler }            Normal      Scheduled   Successfully assigned bb-gluster-pod2 to 127.0.0.1
  12s       1s      2   {kubelet 127.0.0.1}         Warning     FailedMount Unable to mount volumes for pod "bb-gluster-pod2_default(9fd3f3a2-f61a-11e5-b038-52540092b5fb)": failed to instantiate volume: glustervol for plugin: kubernetes.io/persistent-claim with root cause: failed to instantiate volume: pv-gluster for plugin: kubernetes.io/glusterfs with root cause: endpoints "glusterfs-cluster" not found

I also removed the generic errUnsupportedVolumeType error and added a method in its place. I tried to replicate error scenarios where it might get called but couldn't find any; maybe it doesn't hurt to keep it around just in case?

For this part:

func (kl *Kubelet) newVolumeMounterFromPlugins(spec *volume.Spec, pod *api.Pod, opts volume.VolumeOptions) (volume.Mounter, error) {
    plugin, err := kl.volumePluginMgr.FindPluginBySpec(spec)
    if err != nil {
        return nil, fmt.Errorf("can't use volume plugins for %s: %v", spec.Name(), err)
    }
    if plugin == nil {
        // Not found but not an error
        return nil, nil
    }

the if plugin == nil branch (I'm not sure how this would ever happen; we would either get an error or get the plugin, correct?), so I'm not sure if we should remove it, leave it as is, or do:

return nil, createVolumeTypeError(volName) 
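
For reference, that option in context would look like the following sketch; passing spec.Name() to createVolumeTypeError is an assumption, since the helper's signature isn't shown here:

	if plugin == nil {
		// Instead of silently returning nil, nil, fail loudly so the caller
		// never has to special-case a missing mounter.
		return nil, createVolumeTypeError(spec.Name())
	}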

screeley44 (Contributor, Author)

@dchen1107 - any thoughts on this so far?

erinboyd
@pmorie @saad-ali @pweil- @smarterclayton where are we with this? Can we please merge this before ANOTHER rebase?

smarterclayton (Contributor)

@kubernetes/sig-storage

@@ -34,8 +34,6 @@ import (
 	"k8s.io/kubernetes/pkg/volume"
 )
 
-var errUnsupportedVolumeType = fmt.Errorf("unsupported volume type")
-
Review comment:

Hmm... do we really not need this, or can we move it somewhere where we are sure that the volume is not supported?

Reply (Contributor, Author):

This is removed and replaced with a more generic error, createVolumeTypeError, which in theory should never be called from what I can tell. I was asking whether we can remove it altogether (errUnsupportedVolumeType was the original, and I removed it in this PR), but I replaced it with a function to eventually help generate a better error. As I wondered in the PR, I don't know if that error would ever get hit, since if the mounter fails in any way the error is now surfaced where before it wasn't. So: can we remove it altogether, or should we keep it around for some unforeseen scenario? @swagiaal

Reply:

Okay I am convinced it is not needed after reading the rest of the patch :)

Reply (Member):

It should not be needed, and there's no utility to keeping it around 'just in case'. YAGNI

// - an attacher if one exists
// - an error if no plugin was found for the volume
// or the attacher failed to instantiate
// - nil if there is no appropriate attacher for this volume
Review comment (Member):

This is kind of the exception to the whole sentence rule -- this is fine as is re: the specification of return values

screeley44 (Contributor, Author)

The goal of this PR was to surface plugin errors that were previously being eaten (i.e. errors that come from nfs.go, glusterfs.go, etc.) and have them show properly in 'describe pod' so users can see them without having to search the logs. This accomplishes that, and all other errors that traditionally showed in the 'describe' output are not affected. After this I plan to continue with additional PRs and issues to address other usability concerns around error handling and other types of errors being eaten.
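
The nested messages in the examples below come from each layer wrapping the error of the layer beneath it before returning. A minimal sketch of the pattern, inferred from the log output rather than copied from the patch:

	mounter, err := plugin.NewMounter(spec, pod, opts)
	if err != nil {
		// For a PVC, the persistent-claim plugin fails first and its error
		// already contains the underlying plugin's (e.g. glusterfs) root
		// cause, so the final message chains two "root cause" clauses.
		return nil, fmt.Errorf("failed to instantiate mounter for volume: %s using plugin: %s with a root cause: %v",
			spec.Name(), plugin.Name(), err)
	}
	return mounter, nil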

Below are a few example outputs from these changes: a plugin-generated error that is now surfaced with its true root cause, and an existing non-plugin error that has always been surfaced and is unchanged:

Missing endpoints from GlusterFS using PV (previously being eaten):

Events:
  FirstSeen LastSeen    Count   From            SubobjectPath   Type        Reason      Message
  --------- --------    -----   ----            -------------   --------    ------      -------
  15s       15s     1   {default-scheduler }            Normal      Scheduled   Successfully assigned bb-gluster-pod2 to 127.0.0.1
  15s       4s      2   {kubelet 127.0.0.1}         Warning     FailedMount Unable to mount volumes for pod "bb-gluster-pod2_default(72a91ba0-fccb-11e5-9f13-52540092b5fb)": failed to instantiate mounter for volume: glustervol using plugin: kubernetes.io/persistent-claim with a root cause: failed to instantiate mounter for volume: pv-gluster using plugin: kubernetes.io/glusterfs with a root cause: endpoints "glusterfs-cluster" not found

kubelet log output:
E0407 10:17:23.715015   17134 glusterfs.go:83] glusterfs: failed to get endpoints glusterfs-cluster[endpoints "glusterfs-cluster" not found]
E0407 10:17:23.715030   17134 persistent_claim.go:85] Error creating mounter for claim: glusterfs-claim
E0407 10:17:23.715037   17134 volumes.go:125] Could not create volume mounter for pod 72a91ba0-fccb-11e5-9f13-52540092b5fb: failed to instantiate mounter for volume: glustervol using plugin: kubernetes.io/persistent-claim with a root cause: failed to instantiate mounter for volume: pv-gluster using plugin: kubernetes.io/glusterfs with a root cause: endpoints "glusterfs-cluster" not found
E0407 10:17:23.715056   17134 kubelet.go:1780] Unable to mount volumes for pod "bb-gluster-pod2_default(72a91ba0-fccb-11e5-9f13-52540092b5fb)": failed to instantiate mounter for volume: glustervol using plugin: kubernetes.io/persistent-claim with a root cause: failed to instantiate mounter for volume: pv-gluster using plugin: kubernetes.io/glusterfs with a root cause: endpoints "glusterfs-cluster" not found; skipping pod
E0407 10:17:23.715062   17134 pod_workers.go:138] Error syncing pod 72a91ba0-fccb-11e5-9f13-52540092b5fb, skipping: failed to instantiate mounter for volume: glustervol using plugin: kubernetes.io/persistent-claim with a root cause: failed to instantiate mounter for volume: pv-gluster using plugin: kubernetes.io/glusterfs with a root cause: endpoints "glusterfs-cluster" not found
I0407 10:17:23.715099   17134 server.go:608] Event(api.ObjectReference{Kind:"Pod", Namespace:"default", Name:"bb-gluster-pod2", UID:"72a91ba0-fccb-11e5-9f13-52540092b5fb", APIVersion:"v1", ResourceVersion:"149", FieldPath:""}): type: 'Warning' reason: 'FailedMount' Unable to mount volumes for pod "bb-gluster-pod2_default(72a91ba0-fccb-11e5-9f13-52540092b5fb)": failed to instantiate mounter for volume: glustervol using plugin: kubernetes.io/persistent-claim with a root cause: failed to instantiate mounter for volume: pv-gluster using plugin: kubernetes.io/glusterfs with a root cause: endpoints "glusterfs-cluster" not found
I0407 10:17:23.715140   17134 server.go:608] Event(api.ObjectReference{Kind:"Pod", Namespace:"default", Name:"bb-gluster-pod2", UID:"72a91ba0-fccb-11e5-9f13-52540092b5fb", APIVersion:"v1", ResourceVersion:"149", FieldPath:""}): type: 'Warning' reason: 'FailedSync' Error syncing pod, skipping: failed to instantiate mounter for volume: glustervol using plugin: kubernetes.io/persistent-claim with a root cause: failed to instantiate mounter for volume: pv-gluster using plugin: kubernetes.io/glusterfs with a root cause: endpoints "glusterfs-cluster" not found

NFS Connection Timeout using PV (timeout error remains the same as this does not stem from plugins directly):

Events:
  FirstSeen LastSeen    Count   From            SubobjectPath   Type        Reason      Message
  --------- --------    -----   ----            -------------   --------    ------      -------
  4m        4m      1   {default-scheduler }            Normal      Scheduled   Successfully assigned nfs-bb-pod2-pvc to 127.0.0.1
  2m        8s      2   {kubelet 127.0.0.1}         Warning     FailedMount Unable to mount volumes for pod "nfs-bb-pod2-pvc_default(0304472c-fcca-11e5-9f13-52540092b5fb)": Mount failed: exit status 32
Mounting arguments: nfs1.rhs:/opt/data12 /var/lib/kubelet/pods/0304472c-fcca-11e5-9f13-52540092b5fb/volumes/kubernetes.io~nfs/pv-nfs nfs []
Output: mount.nfs: Connection timed out


kubelet log output:
I0407 10:07:06.898345   17134 kubelet.go:3242] Generating status for "nfs-bb-pod2-pvc_default(0304472c-fcca-11e5-9f13-52540092b5fb)"
I0407 10:07:06.911292   17134 manager.go:386] Status for pod "nfs-bb-pod2-pvc_default(0304472c-fcca-11e5-9f13-52540092b5fb)" updated successfully: {status:{Phase:Pending Conditions:[{Type:Ready Status:False LastProbeTime:{Time:{sec:0 nsec:0 loc:0x2e4da80}} LastTransitionTime:{Time:{sec:63595634826 nsec:0 loc:0x2e70c60}} Reason:ContainersNotReady Message:containers with unready status: [nfs-bb-pod2-pvc]}] Message: Reason: HostIP:127.0.0.1 PodIP: StartTime:0xc2089fde60 ContainerStatuses:[{Name:nfs-bb-pod2-pvc State:{Waiting:0xc2089fde00 Running:<nil> Terminated:<nil>} LastTerminationState:{Waiting:<nil> Running:<nil> Terminated:<nil>} Ready:false RestartCount:0 Image:busybox ImageID: ContainerID:}]} version:1 podName:nfs-bb-pod2-pvc podNamespace:default}
E0407 10:09:17.994381   17134 kubelet.go:1780] Unable to mount volumes for pod "nfs-bb-pod2-pvc_default(0304472c-fcca-11e5-9f13-52540092b5fb)": Mount failed: exit status 32
Mounting arguments: nfs1.rhs:/opt/data12 /var/lib/kubelet/pods/0304472c-fcca-11e5-9f13-52540092b5fb/volumes/kubernetes.io~nfs/pv-nfs nfs []
Output: mount.nfs: Connection timed out

; skipping pod
E0407 10:09:17.994399   17134 pod_workers.go:138] Error syncing pod 0304472c-fcca-11e5-9f13-52540092b5fb, skipping: Mount failed: exit status 32
Mounting arguments: nfs1.rhs:/opt/data12 /var/lib/kubelet/pods/0304472c-fcca-11e5-9f13-52540092b5fb/volumes/kubernetes.io~nfs/pv-nfs nfs []
Output: mount.nfs: Connection timed out

I0407 10:09:17.994429   17134 server.go:608] Event(api.ObjectReference{Kind:"Pod", Namespace:"default", Name:"nfs-bb-pod2-pvc", UID:"0304472c-fcca-11e5-9f13-52540092b5fb", APIVersion:"v1", ResourceVersion:"68", FieldPath:""}): type: 'Warning' reason: 'FailedMount' Unable to mount volumes for pod "nfs-bb-pod2-pvc_default(0304472c-fcca-11e5-9f13-52540092b5fb)": Mount failed: exit status 32
Mounting arguments: nfs1.rhs:/opt/data12 /var/lib/kubelet/pods/0304472c-fcca-11e5-9f13-52540092b5fb/volumes/kubernetes.io~nfs/pv-nfs nfs []
Output: mount.nfs: Connection timed out

I0407 10:09:17.994468   17134 server.go:608] Event(api.ObjectReference{Kind:"Pod", Namespace:"default", Name:"nfs-bb-pod2-pvc", UID:"0304472c-fcca-11e5-9f13-52540092b5fb", APIVersion:"v1", ResourceVersion:"68", FieldPath:""}): type: 'Warning' reason: 'FailedSync' Error syncing pod, skipping: Mount failed: exit status 32
Mounting arguments: nfs1.rhs:/opt/data12 /var/lib/kubelet/pods/0304472c-fcca-11e5-9f13-52540092b5fb/volumes/kubernetes.io~nfs/pv-nfs nfs []
Output: mount.nfs: Connection timed out

I0407 10:09:29.817521   17134 kubelet.go:3242] Generating status for "nfs-bb-pod2-pvc_default(0304472c-fcca-11e5-9f13-52540092b5fb)"
I0407 10:09:29.817652   17134 manager.go:277] Ignoring same status for pod "nfs-bb-pod2-pvc_default(0304472c-fcca-11e5-9f13-52540092b5fb)", status: {Phase:Pending Conditions:[{Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2016-04-07 10:07:06 -0400 EDT Reason:ContainersNotReady Message:containers with unready status: [nfs-bb-pod2-pvc]}] Message: Reason: HostIP:127.0.0.1 PodIP: StartTime:2016-04-07 10:07:06 -0400 EDT ContainerStatuses:[{Name:nfs-bb-pod2-pvc State:{Waiting:0xc208ab6740 Running:<nil> Terminated:<nil>} LastTerminationState:{Waiting:<nil> Running:<nil> Terminated:<nil>} Ready:false RestartCount:0 Image:busybox ImageID: ContainerID:}]}
E0407 10:11:40.906401   17134 kubelet.go:1780] Unable to mount volumes for pod "nfs-bb-pod2-pvc_default(0304472c-fcca-11e5-9f13-52540092b5fb)": Mount failed: exit status 32
Mounting arguments: nfs1.rhs:/opt/data12 /var/lib/kubelet/pods/0304472c-fcca-11e5-9f13-52540092b5fb/volumes/kubernetes.io~nfs/pv-nfs nfs []
Output: mount.nfs: Connection timed out

@pmorie

pmorie (Member) commented Apr 7, 2016

@screeley44 I would change the title of this PR (since it will become an entry in the release notes file) to be something like 'Return more useful error information when a persistent volume fails to mount'.

@saad-ali Do you have bandwidth to give this a review and then one of us can tag?

pmorie assigned pmorie and unassigned dchen1107 on Apr 7, 2016
saad-ali (Member) commented Apr 7, 2016

@saad-ali Do you have bandwidth to give this a review and then one of us can tag?

Will do.

screeley44 changed the title from "reword volume mount errUnsupportedVolume error and surface true error in describe event" to "Return more useful error information when a persistent volume fails to mount from volume type plugin generated error" on Apr 7, 2016
screeley44 changed the title to "Return more useful error information when a persistent volume fails to mount" on Apr 7, 2016
saad-ali assigned saad-ali and unassigned pmorie on Apr 7, 2016
saad-ali (Member) commented Apr 7, 2016

LGTM

saad-ali added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged) on Apr 7, 2016
pmorie (Member) commented Apr 7, 2016

Thanks for the patch @screeley44. I think this is one of the big usability issues w/ PVs today.

k8s-github-robot
@k8s-bot ok to test
@k8s-bot test this

pr builder appears to be missing, activating due to 'lgtm' label.

k8s-bot commented Apr 7, 2016

GCE e2e build/test passed for commit 36970de.

k8s-github-robot
The author of this PR is not in the whitelist for merge, can one of the admins add the 'ok-to-merge' label?

markturansky (Contributor)

Thanks for the patch @screeley44. I think this is one of the big usability issues w/ PVs today.

+1

k8s-github-robot
@k8s-bot test this

Tests are more than 48 hours old. Re-running tests.

k8s-bot commented Apr 9, 2016

GCE e2e build/test failed for commit 36970de.

Please reference the list of currently known flakes when examining this failure. If you request a re-test, you must reference the issue describing the flake.

screeley44 (Contributor, Author)

@pmorie - am I supposed to do something about the failed Jenkins run? I have not found the cause of the failure yet.

screeley44 (Contributor, Author)

@k8s-bot test this issue: #24148

1 similar comment:
ghost commented Apr 12, 2016

@k8s-bot test this issue: #24148

k8s-bot commented Apr 12, 2016

GCE e2e build/test passed for commit 36970de.

lavalamp merged commit 498dc41 into kubernetes:master on Apr 13, 2016
Labels: lgtm, release-note, size/M