[WIP] Add resolution tips for volume mount errors via describe events #26366

Closed

@screeley44 screeley44 commented May 26, 2016

This is based on UXP and the idea of offering users tips and hints on how to resolve common mounting errors. I've tossed around several implementation ideas for this, but I think this is the best approach because it centralizes the logic within volumes.mountExternalVolumes.

There is also some additional discussion of this in issue #23982.

This also depends (at least for GlusterFS) on PR #24808 merging.

Anyway, I wanted to start some discussion on this. Below are some examples of the output a user would see based on this PR:

Events:
  FirstSeen LastSeen    Count   From            SubobjectPath   Type        Reason      Message
  --------- --------    -----   ----            -------------   --------    ------      -------
  12s       12s     1   {default-scheduler }            Normal      Scheduled   Successfully assigned bb-gluster-pod1 to 127.0.0.1
  12s       2s      2   {kubelet 127.0.0.1}         Warning     FailedMount Unable to mount volumes for pod "bb-gluster-pod1_default(86253c16-2282-11e6-a845-525400495970)": failed to instantiate mounter for volume: glustervol using plugin: kubernetes.io/glusterfs with a root cause: endpoints "glusterfs-cluster" not found
Resolution hint: (glustervol) Make sure the above endpoint exists. To persist endpoints, they should be created as a service.
 14s    14s 1   {kubelet 127.0.0.1}     Warning FailedMount Unable to mount volumes for pod "bb-gluster-pod1_default(86253c16-2282-11e6-a845-525400495970)": glusterfs: mount failed: Mount failed: exit status 32
Mounting arguments: 192.168.122.222:myVol2 /var/lib/kubelet/pods/86253c16-2282-11e6-a845-525400495970/volumes/kubernetes.io~glusterfs/glustervol glusterfs [log-file=/var/lib/kubelet/plugins/kubernetes.io/glusterfs/glustervol/glusterfs.log]
Output: mount: unknown filesystem type 'glusterfs'
Resolution hint: (glustervol) Check and make sure the glusterfs-client package is installed (rpm -qa 'gluster*') on your nodes.
If not, install the client package on your nodes (i.e. yum install glusterfs-client -y).
  6s        5s      2   {kubelet 127.0.0.1}                 Warning     FailedMount Unable to mount volumes for pod "nfs-bb-pod1_default(de44ed68-21ea-11e6-be31-525400495970)": lstat /var/lib/kubelet/pods/de44ed68-21ea-11e6-be31-525400495970/volumes/kubernetes.io~nfs/nfsvol/..: permission denied
Resolution hint: (nfsvol) The pod is running, and the mount succeeded, however the mount is not accessible due to permissions.
Check the POSIX based permissions (owner, groups and others) on your mounted directory.
If needed, containers and pods can utilize and pass in a securityContext specifying runAsUser (uid/owner), or additional linux groups such as fsGroup (for block) or SupplementalGroups (for shared).
Work with the storage administrator to properly set up access.
Events:
  FirstSeen LastSeen    Count   From                            SubobjectPath   Type        Reason      Message
  --------- --------    -----   ----                            -------------   --------    ------      -------
  1m        1m      1   {default-scheduler }                            Normal      Scheduled   Successfully assigned aws-ebs-bb-pod2 to ip-172-30-0-215.us-west-2.compute.internal
  19s       19s     1   {kubelet ip-172-30-0-215.us-west-2.compute.internal}            Warning     FailedMount Unable to mount volumes for pod "aws-ebs-bb-pod2_default(fb68166a-1ea6-11e6-88cc-06155fc6b4db)": Could not attach EBS Disk "vol-5634f7f2": Error attaching EBS volume: InvalidVolume.NotFound: The volume 'vol-5634f7f2' does not exist.
        status code: 400, request id: 
Resolution hint: (ebsvol) Check AWS available volumes for the appropriate availability zone, and make sure the specified volumeID exists and is spelled correctly.
Events:
  FirstSeen LastSeen    Count   From                            SubobjectPath   Type        Reason      Message
  --------- --------    -----   ----                            -------------   --------    ------      -------
  1m        1m      1   {default-scheduler }                            Normal      Scheduled   Successfully assigned aws-ebs-bb-pod1 to ip-172-30-0-113.us-west-2.compute.internal
  52s       52s     1   {kubelet ip-172-30-0-113.us-west-2.compute.internal}            Warning     FailedMount Unable to mount volumes for pod "aws-ebs-bb-pod1_default(1b7c79a2-1ea7-11e6-802a-06cfea9d6949)": Could not attach EBS Disk "vol-b877020a": Error attaching EBS volume: VolumeInUse: vol-b877020a is already attached to an instance
        status code: 400, request id: 
Resolution hint: (ebsvol) The AWS volume is already attached to another instance and only one node per volume is allowed for EBS block devices (can not share across nodes). Another volume will need to be provisioned for use with this pod

If no match is found, the normal error is returned without any additional hints.

Alternative approaches could be:

  1. Add the resolution hint at each error point in the plugins (nfs.go, aws_ebs.go, etc.). I didn't take this approach because it would mean more code and more files touched, as opposed to catching the error in the centralized mountExternalVolumes.
  2. Rather than keep the logic in code, externalize it into a file or resource that admins could add to, edit, and customize. This seemed like overkill at this point, but it might be a good future direction.
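
As a rough illustration of the second alternative, the hints could live in an admin-editable JSON document that is loaded at startup. Everything here is an assumption made for the sketch: the JSON layout, the `hintEntry` type, and the `loadHints` and `lookupHint` helpers are hypothetical, and the sample data is embedded as a string where a real file would be read.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// hintEntry is one admin-editable rule. The field names and JSON layout
// are assumptions made for this sketch.
type hintEntry struct {
	Match string `json:"match"`
	Hint  string `json:"hint"`
}

// loadHints parses a JSON array of hint entries.
func loadHints(data []byte) ([]hintEntry, error) {
	var entries []hintEntry
	if err := json.Unmarshal(data, &entries); err != nil {
		return nil, fmt.Errorf("parsing hints: %w", err)
	}
	return entries, nil
}

// lookupHint returns the hint of the first entry whose Match substring
// occurs in msg. Entries with an empty Match are skipped.
func lookupHint(entries []hintEntry, msg string) (string, bool) {
	for _, e := range entries {
		if e.Match != "" && strings.Contains(msg, e.Match) {
			return e.Hint, true
		}
	}
	return "", false
}

// sampleHints stands in for the contents of a hypothetical hints file.
const sampleHints = `[
  {"match": "permission denied", "hint": "Check the POSIX permissions on the mounted directory."},
  {"match": "VolumeInUse", "hint": "Provision another volume for use with this pod."}
]`

func main() {
	entries, err := loadHints([]byte(sampleHints))
	if err != nil {
		fmt.Println(err)
		return
	}
	if hint, ok := lookupHint(entries, "lstat /some/volume/path/..: permission denied"); ok {
		fmt.Println("Resolution hint:", hint)
	}
}
```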

@pmorie @erinboyd

k8s-bot commented May 26, 2016

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.


pmorie commented May 26, 2016

@k8s-bot ok to test

pmorie commented May 26, 2016

cc @kubernetes/sig-storage

@k8s-github-robot k8s-github-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. release-note-label-needed labels May 26, 2016
@pmorie pmorie added the release-note-none Denotes a PR that doesn't merit a release note. label May 26, 2016
// AddMountErrorHint performs some basic analysis
// on the current mount error returned and will
// add a user hint or resolution tip for enhanced UXP
func (kl *Kubelet) AddMountErrorHint(volpath string, volname string, inerr error) error {

Not sure how I feel about this logic being centralized like this. It feels very much like a cross-cut, reading this logic for different volumes all in the same place. I don't think it's so bad to keep the logic at the call site it's relevant to.


For the record I can totally understand the desire to keep this orthogonal from the volume plugins, but I think it's a fine start (and likely to work better in the Kubelet we have today) to keep the logic about possible error causes at the sites where they occur. I think that is the simplest way to start, and maybe a better pattern will become evident.

@pmorie pmorie added area/volumes sig/storage Categorizes an issue or PR as relevant to SIG Storage. labels May 27, 2016
@pmorie pmorie assigned pmorie and unassigned dchen1107 May 27, 2016
pmorie commented May 27, 2016

I like the concept here but I think this can just be additional information at the sites where these errors occur to start with. See #26366 (comment)

@screeley44
Contributor Author

Based on the comments above, I'm going to create a second PR that implements the logic in each plugin.

@k8s-github-robot

@screeley44 PR needs rebase

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 3, 2016
k8s-bot commented Jun 14, 2016

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.


eparis commented Jun 24, 2016

ok to test

k8s-bot commented Jun 24, 2016

GCE e2e build/test passed for commit ff03d3f.

@k8s-github-robot

This PR hasn't been active in 30 days. It will be closed in 59 days (Sep 22, 2016).

cc @screeley44 @pmorie

You can add 'keep-open' label to prevent this from happening, or add a comment to keep it open another 90 days

@screeley44 screeley44 closed this Jul 25, 2016
Labels
needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note-none Denotes a PR that doesn't merit a release note. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
7 participants