UXP Expose volume lstat errors in describe event #24624

Closed
wants to merge 1 commit into from

Conversation

screeley44
Contributor

Fixes #23982

If a user passes in a bad endpoint (an invalid IP), the error exposed in the pod describe output is vague. By exposing the underlying error, the user does not have to go searching the logs for the issue; in the example below it tells the user that the endpoints file is not valid:

Current Behavior:

  7s        7s      1   {kubelet 127.0.0.1}         Warning     FailedMount Unable to mount volumes for pod "bb-gluster-pod2_default(a11bc084-fccc-11e5-9f13-52540092b5fb)": glusterfs: mount failed: Mount failed: exit status 1
Mounting arguments: 192.168.122.226:myVol2 /var/lib/kubelet/pods/a11bc084-fccc-11e5-9f13-52540092b5fb/volumes/kubernetes.io~glusterfs/pv-gluster glusterfs [log-file=/var/lib/kubelet/plugins/kubernetes.io/glusterfs/pv-gluster/glusterfs.log]
Output: Mount failed. Please check the log file for more details.

New Behavior:
Now volumes.go, when trying to read the disks, picks up on this error first, indicating that the volume cannot be read because it is invalid (no connection to the source volume). This gives the user a clue to check that particular volume's endpoints and make sure the IP is correct and the server/volume is actually reachable:

Events:
  FirstSeen LastSeen    Count   From            SubobjectPath   Type        Reason      Message
  --------- --------    -----   ----            -------------   --------    ------      -------
  2m        2m      1   {default-scheduler }            Normal      Scheduled   Successfully assigned bb-multi-pod1 to 127.0.0.1
  5s        5s      1   {kubelet 127.0.0.1}         Warning     FailedMount Failure: Could not read directory: 
  Path: /var/lib/kubelet/pods/db1ebf34-0801-11e6-b54b-52540092b5fb/volumes/kubernetes.io~glusterfs 
  Error: lstat /var/lib/kubelet/pods/db1ebf34-0801-11e6-b54b-52540092b5fb/volumes/kubernetes.io~glusterfs/pv-gluster2: transport endpoint is not connected 

  13s   13s 1   {kubelet 127.0.0.1}     Warning FailedMount Unable to mount volumes for pod "bb-gluster-pod2_default(34b18c6b-070d-11e6-8e95-52540092b5fb)": glusterfs: mount failed: Mount failed: exit status 1
Mounting arguments: 192.168.234.147:myVol2 /var/lib/kubelet/pods/34b18c6b-070d-11e6-8e95-52540092b5fb/volumes/kubernetes.io~glusterfs/pv-gluster glusterfs [log-file=/var/lib/kubelet/plugins/kubernetes.io/glusterfs/pv-gluster2/glusterfs.log]
Output: Mount failed. Please check the log file for more details.
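
For illustration, here is a minimal, simplified sketch of the idea: walk a pod's volume directory and surface the underlying lstat error instead of leaving it only in the logs. The function name podVolumeErrors and the example path are hypothetical; this is not the actual kubelet code.

```go
package main

import (
	"fmt"
	"io/ioutil"
	"os"
	"path/filepath"
)

// podVolumeErrors walks a pod's volume directory and returns the first
// lstat failure so the caller can attach it to the FailedMount event
// instead of only logging it. Simplified illustration, not kubelet code.
func podVolumeErrors(podVolDir string) error {
	entries, err := ioutil.ReadDir(podVolDir)
	if err != nil {
		return fmt.Errorf("error reading directory %q: %v", podVolDir, err)
	}
	for _, entry := range entries {
		volPath := filepath.Join(podVolDir, entry.Name())
		// For an unreachable gluster endpoint this typically fails with
		// "lstat ...: transport endpoint is not connected".
		if _, statErr := os.Lstat(volPath); statErr != nil {
			return fmt.Errorf("error reading directory %q: %v", volPath, statErr)
		}
	}
	return nil
}

func main() {
	// Hypothetical pod volume directory, for illustration only.
	if err := podVolumeErrors("/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~glusterfs"); err != nil {
		fmt.Println(err)
	}
}
```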

@screeley44
Contributor Author

@kubernetes/sig-storage

@k8s-bot

k8s-bot commented Apr 21, 2016

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in hack/jenkins/job-configs/kubernetes-jenkins-pull/ instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

@k8s-github-robot k8s-github-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. release-note-label-needed labels Apr 21, 2016

@screeley44
Contributor Author

@swagiaal @pmorie @erinboyd @jeffvance - any thoughts on this one?

@k8s-github-robot

@screeley44
You must link to the test flake issue which caused you to request this manual re-test.
Re-test requests should be in the form of: k8s-bot test this issue: #<number>
Here is the list of open test flakes.

glog.Errorf("Could not read directory %s: %v", podVolDir, volumeNameDirsStat[i])
// failed syscall type error, expose it
glog.Errorf("Failure: Could not read directory %s: %v", podVolDir, volumeNameDirsStat[i])
return volumes, fmt.Errorf("Failure: Could not read directory: \n Path: %s \n Error: %v \n", volumeKindPath, volumeNameDirsStat[i])
Member

remove "Failure"

error messages must start with a lowercase letter

remove the newlines ("error reading directory %q: %v", volumeKindPath, volumeNameDirsStat[i])
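
Applying those suggestions, the lines in question would presumably end up something like this (a sketch reusing the variable names from the snippet above, not the final committed code):

```go
// Log and return the syscall error: lowercase message, no "Failure" prefix,
// and no embedded newlines.
glog.Errorf("error reading directory %q: %v", podVolDir, volumeNameDirsStat[i])
return volumes, fmt.Errorf("error reading directory %q: %v", volumeKindPath, volumeNameDirsStat[i])
```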

@pmorie
Member

pmorie commented Apr 22, 2016

@screeley44 so, the motivation for this change is that the mount call doesn't return enough context to provide meaningful feedback?

@screeley44
Contributor Author

@pmorie - yes. In the test case I mention above, the syscall returns an error that I think is useful for the user: it says "transport endpoint is not connected", which is a good indication that your endpoints are wrong and you cannot communicate with the server. So I'm trying to expose as much as possible in the describe events, with the long-term goal of taking those errors and analyzing/massaging them into real meaningful messages and possible resolution tips for the user experience. But first, we should make sure we expose as much useful error information as possible.

@pmorie
Member

pmorie commented Apr 22, 2016

@screeley44 and mount has no info about this?

@screeley44
Contributor Author

@pmorie - mount does not expose the underlying syscall failure. That ends up in the logs, but it is not carried up through the code and exposed to the user in the describe event. If you look at the examples above (first comment), what eventually gets exposed with the old behavior is a generic error:

Output: Mount failed. Please check the log file for more details.

So I noticed that there was some more useful info in the logs from trying to read the disks, and I wanted to expose that as well. The new behavior would basically give this:

lstat /var/lib/kubelet/pods/db1ebf34-0801-11e6-b54b-52540092b5fb/volumes/kubernetes.io~glusterfs/pv-gluster2: transport endpoint is not connected 

Output: Mount failed. Please check the log file for more details.

if ok {
	ref, errGetRef := api.GetReference(pod)
	if errGetRef == nil && ref != nil {
		kl.recorder.Event(ref, api.EventTypeWarning, kubecontainer.FailedMountVolume, err.Error())
Member

This isn't the right place to create an event of this type

@ghost

ghost commented Apr 25, 2016

@screeley44 hmm... why is this not failing during the SetUp call of the glusterfs plugin? Why do we have to wait until the kubelet is inspecting pods' volumes?

@screeley44
Contributor Author

@swagiaal - yeah, that is what I was trying to figure out as well. It does show up in glusterfs.go, but the error is hidden in a log file for glusterfs. I spoke with Paul on Friday and we are going to put this one on the back burner for now; he wants me to try another approach where we pull the relevant error out of glusterfs.log and return that, instead of waiting for the kubelet to inspect the volumes.

@ghost

ghost commented Apr 25, 2016

maybe we can pull the relevant error out of the glusterfs.log and return that instead of waiting for the kubelet to inspect

+1 to not doing this in kubelet code, but is there nothing we can do to get the mount call to return a good error? Is there a -v option? Does the error also not get printed when you do this manually on the command line? Also, it looks like the glusterfs call is passing a log-file option. Could that be acting to suppress error output and send it to the logfile?

If parsing the logfile is our only option then so be it but I would like to avoid that if possible.

@screeley44
Contributor Author

@swagiaal - yep, agreed. I will look into it and post status here. Thanks for all the input.

@pmorie pmorie added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label May 19, 2016
@pmorie
Member

pmorie commented May 19, 2016

@screeley44 I think this turned into something else -- can we close this one?

@screeley44
Contributor Author

@pmorie - yes, closing this PR. see #24808

@screeley44 screeley44 closed this May 19, 2016
k8s-github-robot pushed a commit that referenced this pull request Jun 2, 2016
Automatic merge from submit-queue

read gluster log to surface glusterfs plugin errors properly in describe events

glusterfs.go does not properly expose errors, since all mount errors go to a log file. I propose we read the log file to expose the errors without asking users to 'go look at this log'.

This PR does the following:
1.  adds a gluster option for log-level=ERROR to remove all noise from log file
2.  changes the log file name and path based on PV + Pod name, so it is specific per PV and Pod
3.  creates a utility to read the last two lines of the log file when a failure occurs (a minimal sketch follows this list)
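
A minimal sketch of what such a utility could look like (the function name readLastLines, the example log path, and the surrounding main are illustrative assumptions, not the actual implementation in the merged PR):

```go
package main

import (
	"fmt"
	"io/ioutil"
	"strings"
)

// readLastLines reads a small glusterfs log file and returns its last n lines
// so they can be appended to the mount error shown in the describe event.
func readLastLines(path string, n int) (string, error) {
	data, err := ioutil.ReadFile(path)
	if err != nil {
		return "", fmt.Errorf("could not read log file %q: %v", path, err)
	}
	lines := strings.Split(strings.TrimSpace(string(data)), "\n")
	if len(lines) > n {
		lines = lines[len(lines)-n:]
	}
	return strings.Join(lines, "\n"), nil
}

func main() {
	// Hypothetical per-PV/per-pod log path as described in point 2 above.
	tail, err := readLastLines("/var/lib/kubelet/plugins/kubernetes.io/glusterfs/pv-gluster2/bb-multi-pod1-glusterfs.log", 2)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("the following error information was pulled from the log to help resolve this issue:\n%s\n", tail)
}
```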

old behavior:
```
  13s	13s	1	{kubelet 127.0.0.1}		Warning	FailedMount	Unable to mount volumes for pod "bb-gluster-pod2_default(34b18c6b-070d-11e6-8e95-52540092b5fb)": glusterfs: mount failed: Mount failed: exit status 1
Mounting arguments: 192.168.234.147:myVol2 /var/lib/kubelet/pods/34b18c6b-070d-11e6-8e95-52540092b5fb/volumes/kubernetes.io~glusterfs/pv-gluster glusterfs [log-file=/var/lib/kubelet/plugins/kubernetes.io/glusterfs/pv-gluster/glusterfs.log]
Output: Mount failed. Please check the log file for more details.
```

improved behavior: (updated after suggestions from community)
```
  34m		34m		1	{kubelet 127.0.0.1}			Warning		FailedMount	Unable to mount volumes for pod "bb-multi-pod1_default(e7d7f790-0d4b-11e6-a275-52540092b5fb)": glusterfs: mount failed: Mount failed: exit status 1
Mounting arguments: 192.168.123.222:myVol2 /var/lib/kubelet/pods/e7d7f790-0d4b-11e6-a275-52540092b5fb/volumes/kubernetes.io~glusterfs/pv-gluster2 glusterfs [log-level=ERROR log-file=/var/lib/kubelet/plugins/kubernetes.io/glusterfs/pv-gluster2/bb-multi-pod1-glusterfs.log]
Output: Mount failed. Please check the log file for more details.

 the following error information was pulled from the log to help resolve this issue: 
[2016-04-28 14:21:29.109697] E [socket.c:2332:socket_connect_finish] 0-glusterfs: connection to 192.168.123.222:24007 failed (Connection timed out)
[2016-04-28 14:21:29.109767] E [glusterfsd-mgmt.c:1819:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to connect with remote-host: 192.168.123.222 (Transport endpoint is not connected)

```

This PR is also an alternate approach to #24624.
openshift-publish-robot pushed a commit to openshift/kubernetes that referenced this pull request Mar 21, 2020
…delete-43

[4.3] Bug 1808548: UPSTREAM: 88146: remove duplicate pv delete commands

Origin-commit: 8dbb9dcd4b5dbb945a3f1c5192d4e0b7f637c7ec
Labels
sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
8 participants