UXP Expose volume lstat errors in describe event #24624

Closed
wants to merge 1 commit into from

Conversation

screeley44
Contributor

Fixes #23982

If a user passes in a bad endpoint (an invalid IP), the error exposed in the pod describe output is vague. By exposing the underlying error, the user does not have to go searching the logs for the issue; in the example below it tells the user that the endpoints file is not valid:

Current Behavior:

  7s        7s      1   {kubelet 127.0.0.1}         Warning     FailedMount Unable to mount volumes for pod "bb-gluster-pod2_default(a11bc084-fccc-11e5-9f13-52540092b5fb)": glusterfs: mount failed: Mount failed: exit status 1
Mounting arguments: 192.168.122.226:myVol2 /var/lib/kubelet/pods/a11bc084-fccc-11e5-9f13-52540092b5fb/volumes/kubernetes.io~glusterfs/pv-gluster glusterfs [log-file=/var/lib/kubelet/plugins/kubernetes.io/glusterfs/pv-gluster/glusterfs.log]
Output: Mount failed. Please check the log file for more details.

New Behavior:
Now volumes.go, when trying to read the disks, picks up on this error first, indicating that the volume cannot be read because it is invalid (no connection to the source volume). This gives the user a clue to check that particular volume's endpoints and make sure the IP is correct and the server/volume is actually reachable:

Events:
  FirstSeen LastSeen    Count   From            SubobjectPath   Type        Reason      Message
  --------- --------    -----   ----            -------------   --------    ------      -------
  2m        2m      1   {default-scheduler }            Normal      Scheduled   Successfully assigned bb-multi-pod1 to 127.0.0.1
  5s        5s      1   {kubelet 127.0.0.1}         Warning     FailedMount Failure: Could not read directory: 
  Path: /var/lib/kubelet/pods/db1ebf34-0801-11e6-b54b-52540092b5fb/volumes/kubernetes.io~glusterfs 
  Error: lstat /var/lib/kubelet/pods/db1ebf34-0801-11e6-b54b-52540092b5fb/volumes/kubernetes.io~glusterfs/pv-gluster2: transport endpoint is not connected 

  13s   13s 1   {kubelet 127.0.0.1}     Warning FailedMount Unable to mount volumes for pod "bb-gluster-pod2_default(34b18c6b-070d-11e6-8e95-52540092b5fb)": glusterfs: mount failed: Mount failed: exit status 1
Mounting arguments: 192.168.234.147:myVol2 /var/lib/kubelet/pods/34b18c6b-070d-11e6-8e95-52540092b5fb/volumes/kubernetes.io~glusterfs/pv-gluster glusterfs [log-file=/var/lib/kubelet/plugins/kubernetes.io/glusterfs/pv-gluster2/glusterfs.log]
Output: Mount failed. Please check the log file for more details.
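
For illustration, here is a minimal, simplified sketch of the idea: walk a pod's volume directory and surface the underlying lstat error instead of leaving it only in the logs. The function name podVolumeErrors and the example path are hypothetical; this is not the actual kubelet code.

```go
package main

import (
	"fmt"
	"io/ioutil"
	"os"
	"path/filepath"
)

// podVolumeErrors walks a pod's volume directory and returns the first
// lstat failure so the caller can attach it to the FailedMount event
// instead of only logging it. Simplified illustration, not kubelet code.
func podVolumeErrors(podVolDir string) error {
	entries, err := ioutil.ReadDir(podVolDir)
	if err != nil {
		return fmt.Errorf("error reading directory %q: %v", podVolDir, err)
	}
	for _, entry := range entries {
		volPath := filepath.Join(podVolDir, entry.Name())
		// For an unreachable gluster endpoint this typically fails with
		// "lstat ...: transport endpoint is not connected".
		if _, statErr := os.Lstat(volPath); statErr != nil {
			return fmt.Errorf("error reading directory %q: %v", volPath, statErr)
		}
	}
	return nil
}

func main() {
	// Hypothetical pod volume directory, for illustration only.
	if err := podVolumeErrors("/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~glusterfs"); err != nil {
		fmt.Println(err)
	}
}
```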

@screeley44
Contributor Author

@kubernetes/sig-storage

@k8s-bot

k8s-bot commented Apr 21, 2016

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in hack/jenkins/job-configs/kubernetes-jenkins-pull/ instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

@k8s-github-robot k8s-github-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. release-note-label-needed labels Apr 21, 2016

@screeley44
Contributor Author

@swagiaal @pmorie @erinboyd @jeffvance - any thoughts on this one?

@k8s-github-robot

@screeley44
You must link to the test flake issue which caused you to request this manual re-test.
Re-test requests should be in the form of: k8s-bot test this issue: #<number>
Here is the list of open test flakes.

glog.Errorf("Could not read directory %s: %v", podVolDir, volumeNameDirsStat[i])
// failed syscall type error, expose it
glog.Errorf("Failure: Could not read directory %s: %v", podVolDir, volumeNameDirsStat[i])
return volumes, fmt.Errorf("Failure: Could not read directory: \n Path: %s \n Error: %v \n", volumeKindPath, volumeNameDirsStat[i])
Member

remove "Failure"

error messages must start with a lowercase letter

remove the newlines ("error reading directory %q: %v", volumeKindPath, volumeNameDirsStat[i])
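
Applying those suggestions, the lines in question would presumably end up something like this (a sketch reusing the variable names from the snippet above, not the final committed code):

```go
// Log and return the syscall error: lowercase message, no "Failure" prefix,
// and no embedded newlines.
glog.Errorf("error reading directory %q: %v", podVolDir, volumeNameDirsStat[i])
return volumes, fmt.Errorf("error reading directory %q: %v", volumeKindPath, volumeNameDirsStat[i])
```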

@pmorie
Member

pmorie commented Apr 22, 2016

@screeley44 so, the motivation for this change is that the mount call doesn't return enough context to provide meaningful feedback?

@screeley44
Contributor Author

@pmorie - yes. In the test case I mention above, the syscall returns an error that I think is useful for the user: it says "transport endpoint is not connected", which is a good indication that your endpoints are wrong and you cannot communicate with the server. So I'm trying to expose as much as possible in the describe events, with the long-term goal of taking those errors and analyzing/massaging them into real meaningful messages and possible resolution tips for the user experience. But first, we should make sure we expose as much useful error information as possible.

@pmorie
Member

pmorie commented Apr 22, 2016

@screeley44 and mount has no info about this?

@screeley44
Contributor Author

@pmorie - mount does not expose the underlying syscall failure. That ends up in the logs, but it is not carried up through the code and exposed to the user in the describe event. If you look at the examples above (first comment), what eventually gets exposed with the old behavior is a generic error:

Output: Mount failed. Please check the log file for more details.

So I noticed that there was some more useful info in the logs from trying to read the disks, and I wanted to expose that as well. The new behavior would basically give this:

lstat /var/lib/kubelet/pods/db1ebf34-0801-11e6-b54b-52540092b5fb/volumes/kubernetes.io~glusterfs/pv-gluster2: transport endpoint is not connected 

Output: Mount failed. Please check the log file for more details.

if ok {
	ref, errGetRef := api.GetReference(pod)
	if errGetRef == nil && ref != nil {
		kl.recorder.Event(ref, api.EventTypeWarning, kubecontainer.FailedMountVolume, err.Error())
Member

This isn't the right place to create an event of this type

@ghost

ghost commented Apr 25, 2016

@screeley44 hmm... why is this not failing during the SetUp call of the glusterfs plugin? Why do we have to wait until the kubelet is inspecting pods' volumes?

@screeley44
Contributor Author

@swagiaal - yeah, that is what I was trying to figure out as well. It does show up in glusterfs.go, but the error is hidden in a log file for glusterfs. I spoke with Paul on Friday and we are going to put this one on the back burner for now; he wants me to try another approach where we pull the relevant error out of glusterfs.log and return that, instead of waiting for the kubelet to inspect the volumes.

@ghost

ghost commented Apr 25, 2016

maybe we can pull the relevant error out of the glusterfs.log and return that instead of waiting for the kubelet to inspect

+1 to not doing this in kubelet code, but is there nothing we can do to get the mount call to return a good error? Is there a -v option? Does the error also not get printed when you do this manually on the command line? Also, it looks like the glusterfs call is passing a log-file option. Could that be acting to suppress error output and send it to the logfile?

If parsing the logfile is our only option then so be it but I would like to avoid that if possible.

@screeley44
Contributor Author

@swagiaal - yep, agreed. I will look into it and post status here. Thanks for all the input.

@pmorie pmorie added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label May 19, 2016
@pmorie
Member

pmorie commented May 19, 2016

@screeley44 I think this turned into something else -- can we close this one?

@screeley44
Contributor Author

@pmorie - yes, closing this PR. see #24808

@screeley44 screeley44 closed this May 19, 2016
k8s-github-robot pushed a commit that referenced this pull request Jun 2, 2016
Automatic merge from submit-queue

read gluster log to surface glusterfs plugin errors properly in describe events

glusterfs.go does not properly expose errors, since all mount errors go to a log file. I propose we read the log file to expose the errors without asking users to 'go look at this log'.

This PR does the following:
1.  adds a gluster option for log-level=ERROR to remove all noise from log file
2.  changes the log file name and path based on PV + Pod name, so it is specific per PV and Pod
3.  creates a utility to read the last two lines of the log file when a failure occurs (a minimal sketch follows this list)
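
A minimal sketch of what such a utility could look like (the function name readLastLines, the example log path, and the surrounding main are illustrative assumptions, not the actual implementation in the merged PR):

```go
package main

import (
	"fmt"
	"io/ioutil"
	"strings"
)

// readLastLines reads a small glusterfs log file and returns its last n lines
// so they can be appended to the mount error shown in the describe event.
func readLastLines(path string, n int) (string, error) {
	data, err := ioutil.ReadFile(path)
	if err != nil {
		return "", fmt.Errorf("could not read log file %q: %v", path, err)
	}
	lines := strings.Split(strings.TrimSpace(string(data)), "\n")
	if len(lines) > n {
		lines = lines[len(lines)-n:]
	}
	return strings.Join(lines, "\n"), nil
}

func main() {
	// Hypothetical per-PV/per-pod log path as described in point 2 above.
	tail, err := readLastLines("/var/lib/kubelet/plugins/kubernetes.io/glusterfs/pv-gluster2/bb-multi-pod1-glusterfs.log", 2)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("the following error information was pulled from the log to help resolve this issue:\n%s\n", tail)
}
```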

old behavior:
```
  13s	13s	1	{kubelet 127.0.0.1}		Warning	FailedMount	Unable to mount volumes for pod "bb-gluster-pod2_default(34b18c6b-070d-11e6-8e95-52540092b5fb)": glusterfs: mount failed: Mount failed: exit status 1
Mounting arguments: 192.168.234.147:myVol2 /var/lib/kubelet/pods/34b18c6b-070d-11e6-8e95-52540092b5fb/volumes/kubernetes.io~glusterfs/pv-gluster glusterfs [log-file=/var/lib/kubelet/plugins/kubernetes.io/glusterfs/pv-gluster/glusterfs.log]
Output: Mount failed. Please check the log file for more details.
```

improved behavior: (updated after suggestions from community)
```
  34m		34m		1	{kubelet 127.0.0.1}			Warning		FailedMount	Unable to mount volumes for pod "bb-multi-pod1_default(e7d7f790-0d4b-11e6-a275-52540092b5fb)": glusterfs: mount failed: Mount failed: exit status 1
Mounting arguments: 192.168.123.222:myVol2 /var/lib/kubelet/pods/e7d7f790-0d4b-11e6-a275-52540092b5fb/volumes/kubernetes.io~glusterfs/pv-gluster2 glusterfs [log-level=ERROR log-file=/var/lib/kubelet/plugins/kubernetes.io/glusterfs/pv-gluster2/bb-multi-pod1-glusterfs.log]
Output: Mount failed. Please check the log file for more details.

 the following error information was pulled from the log to help resolve this issue: 
[2016-04-28 14:21:29.109697] E [socket.c:2332:socket_connect_finish] 0-glusterfs: connection to 192.168.123.222:24007 failed (Connection timed out)
[2016-04-28 14:21:29.109767] E [glusterfsd-mgmt.c:1819:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to connect with remote-host: 192.168.123.222 (Transport endpoint is not connected)

```

This PR is also an alternate approach to #24624.
openshift-publish-robot pushed a commit to openshift/kubernetes that referenced this pull request Mar 21, 2020
…delete-43

[4.3] Bug 1808548: UPSTREAM: 88146: remove duplicate pv delete commands

Origin-commit: 8dbb9dcd4b5dbb945a3f1c5192d4e0b7f637c7ec
Labels
sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
8 participants