[BACKPORT][v1.5.5][IMPROVEMENT] Improve logging in CSI plugin when mount fails. #8286

james-munson · 2024-03-31T22:12:38Z

backport #7931

Specifically, the improvement to CSI host namespace and CSI logging of environment when mount fails.

james-munson · 2024-03-31T22:14:45Z

Created this backport issue manually, since part of the previous work on the environment script was already backported.

longhorn-io-github-bot · 2024-03-31T22:17:31Z

Pre Ready-For-Testing Checklist

Where is the reproduce steps/test steps documented?
The reproduce steps/test steps are at: See base issue [IMPROVEMENT] Improve environment_check script for NFS protocol bug and the host system self diagnosis #7931
Is there a workaround for the issue? If so, where is it documented?
The workaround is at:
Does the PR include the explanation for the fix or the feature?

chriscchien · 2024-04-02T02:46:13Z

Hi @james-munson ,

I can reproduce the RWX pod mount failure on Ubuntu 5.15.0-94-generic. The RWX pod failed to mount, but the message "Failed to mount volume %v on node %s with kernel release" that you added in the PR cannot be found. Could you take a look at this? Thank you.

> uname -r
5.15.0-94-generic
>
> k get volume -A
NAMESPACE         NAME                                       STATE      ROBUSTNESS   SCHEDULED   SIZE         NODE   AGE
longhorn-system   pvc-d5ca45f2-7d92-429b-b7bd-14b084c2d48b   attached   healthy                  1073741824   cha    117s
>
> k -n longhorn-system get volume pvc-d5ca45f2-7d92-429b-b7bd-14b084c2d48b -o yaml | grep accessMode
  accessMode: rwx
> 
> k get pods 
NAME                                READY   STATUS              RESTARTS   AGE
longhorn-nfs-installation-t7pfk     1/1     Running             0          4m47s
longhorn-iscsi-installation-m7jb4   1/1     Running             0          4m47s
test-deployment-754dd9fc66-68qsp    0/1     ContainerCreating   0          2m20s
> 
> k describe pod test-deployment-754dd9fc66-68qsp | grep Event -A 20
Events:
  Type     Reason                  Age                 From                     Message
  ----     ------                  ----                ----                     -------
  Warning  FailedScheduling        2m37s               default-scheduler        0/1 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
  Warning  FailedScheduling        2m35s               default-scheduler        0/1 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
  Normal   Scheduled               2m33s               default-scheduler        Successfully assigned default/test-deployment-754dd9fc66-68qsp to cha
  Normal   SuccessfulAttachVolume  2m17s               attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-d5ca45f2-7d92-429b-b7bd-14b084c2d48b"
  Warning  FailedMount             8s (x9 over 2m17s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-d5ca45f2-7d92-429b-b7bd-14b084c2d48b" : rpc error: code = Internal desc = mount failed: exit status 32
Mounting command: /usr/local/sbin/nsmounter
Mounting arguments: mount -t nfs -o vers=4.1,noresvport,timeo=600,retrans=5,softerr 10.43.132.6:/pvc-d5ca45f2-7d92-429b-b7bd-14b084c2d48b /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/6381505fa73cc1887c25ca8b3079326a4843af3ac8af2f500e818253fefc3afb/globalmount
Output: mount.nfs: Protocol not supported

supportbundle_5e584417-1cc9-44ad-8d2f-6c8c3661f36f_2024-04-02T02-19-38Z.zip

In addition, with kernel 5.15.0-101-generic the RWX pod mounts successfully.
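The FailedMount event above packs the mount command, its arguments, and the mount.nfs output into one multi-line message. As a rough illustration of how that text could be split into fields when triaging, here is a minimal Python sketch. The helper names `parse_mount_failure` and `nfs_options` are hypothetical and not part of Longhorn or kubelet; the sample text is copied from the event in this comment.

```python
import shlex

# Sample text copied from the FailedMount event in this thread.
EVENT = (
    "mount failed: exit status 32\n"
    "Mounting command: /usr/local/sbin/nsmounter\n"
    "Mounting arguments: mount -t nfs -o vers=4.1,noresvport,timeo=600,retrans=5,softerr "
    "10.43.132.6:/pvc-d5ca45f2-7d92-429b-b7bd-14b084c2d48b "
    "/var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/"
    "6381505fa73cc1887c25ca8b3079326a4843af3ac8af2f500e818253fefc3afb/globalmount\n"
    "Output: mount.nfs: Protocol not supported\n"
)

# Labels that prefix the interesting lines of the message.
LABELS = ("Mounting command", "Mounting arguments", "Output")


def parse_mount_failure(text):
    """Return a dict mapping each known label to the text that follows it."""
    fields = {}
    for line in text.splitlines():
        for label in LABELS:
            prefix = label + ": "
            if line.startswith(prefix):
                fields[label] = line[len(prefix):].strip()
    return fields


def nfs_options(arguments):
    """Return the list of options that follow -o in a mount argument string."""
    tokens = shlex.split(arguments)
    if "-o" in tokens:
        idx = tokens.index("-o")
        if idx + 1 < len(tokens):
            return tokens[idx + 1].split(",")
    return []


if __name__ == "__main__":
    fields = parse_mount_failure(EVENT)
    print(fields["Output"])
    print(nfs_options(fields["Mounting arguments"]))
```

Note that nothing in this event text identifies the node's kernel or distro, which is the gap the added log message is meant to close.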

james-munson · 2024-04-02T17:27:08Z

I think I see the problem. Testing the fixup.

james-munson · 2024-04-03T20:28:25Z

This is strange. I'm having a hard time inducing the failure. I installed my test build of longhorn-manager for both daemonsets (longhorn-manager and longhorn-csi-plugin), changed the kernel on one of my Ubuntu 22.04 worker nodes to 5.15.0-94-generic, and made a new RWX deployment. The pods all mounted successfully, including the one with the problematic kernel.

The pod events:

k describe pod rwx-test-79dc5564d4-x6gzj | grep Event -A 20
Events:
  Type     Reason                  Age   From                     Message
  ----     ------                  ----  ----                     -------
  Warning  FailedScheduling        22m   default-scheduler        0/4 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling..
  Warning  FailedScheduling        22m   default-scheduler        0/4 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling..
  Normal   Scheduled               22m   default-scheduler        Successfully assigned default/rwx-test-79dc5564d4-x6gzj to jbm-u22-pool2-ca5a5aa4-5fcq7
  Normal   SuccessfulAttachVolume  22m   attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-8787aa8f-8496-47b6-8bee-d49a27e68356"
  Normal   Pulled                  22m   kubelet                  Container image "ubuntu:xenial" already present on machine
  Normal   Created                 22m   kubelet                  Created container rwx-test
  Normal   Started                 22m   kubelet                  Started container rwx-test
  Normal   Pulled                  22m   kubelet                  Container image "nginx:stable" already present on machine
  Normal   Created                 22m   kubelet                  Created container nginx
  Normal   Started                 22m   kubelet                  Started container nginx

On the node itself,

 uname -a
Linux jbm-u22-pool2-ca5a5aa4-5fcq7 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
root@jbm-u22-pool2-ca5a5aa4-5fcq7:~# mount -l | grep nfs
10.43.80.98:/pvc-8787aa8f-8496-47b6-8bee-d49a27e68356 on /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/8d63416da5e7a35390a0bb4456ec55a10c6dbb3ce0a42b3eab6c9fa221481328/globalmount type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,softerr,softreval,noresvport,proto=tcp,timeo=600,retrans=5,sec=sys,clientaddr=146.190.134.16,local_lock=none,addr=10.43.80.98)
10.43.80.98:/pvc-8787aa8f-8496-47b6-8bee-d49a27e68356 on /var/lib/kubelet/pods/c0ad2c30-c146-4bce-ba95-f9af4176547f/volumes/kubernetes.io~csi/pvc-8787aa8f-8496-47b6-8bee-d49a27e68356/mount type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,softerr,softreval,noresvport,proto=tcp,timeo=600,retrans=5,sec=sys,clientaddr=146.190.134.16,local_lock=none,addr=10.43.80.98)

And in the pod, everything is happy

uname -a
Linux rwx-test-79dc5564d4-x6gzj 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
root@rwx-test-79dc5564d4-x6gzj:/# mount -l | grep nfs
10.43.80.98:/pvc-8787aa8f-8496-47b6-8bee-d49a27e68356 on /data type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,softerr,softreval,noresvport,proto=tcp,timeo=600,retrans=5,sec=sys,clientaddr=146.190.134.16,local_lock=none,addr=10.43.80.98)

I'm not sure what's going on.

james-munson · 2024-04-04T15:49:08Z

Repeated the test with 5.15.0-97-generic which is also in the broken range of Ubuntu kernels, and got the same result. Mount did not fail.
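For comparing node kernels against the builds discussed here (5.15.0-94 and 5.15.0-97 described as in the broken range, 5.15.0-101 reported as working), a small helper can parse the `uname -r` string. This is a minimal Python sketch that assumes the usual Ubuntu `X.Y.Z-ABI-flavor` release format. The function names are hypothetical, and the bounds used in the example are only the values mentioned in this thread, not the full affected range.

```python
import re

# Assumed Ubuntu release layout: <major>.<minor>.<patch>-<abi>-<flavor>
_RELEASE = re.compile(r"^(\d+)\.(\d+)\.(\d+)-(\d+)-(\S+)$")


def parse_ubuntu_release(release):
    """Split a release such as '5.15.0-94-generic' into (version, abi, flavor)."""
    match = _RELEASE.match(release.strip())
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    major, minor, patch, abi = (int(g) for g in match.groups()[:4])
    return (major, minor, patch), abi, match.group(5)


def abi_in_range(release, base, first_abi, last_abi):
    """True if release has the given base version and an ABI within the bounds."""
    version, abi, _flavor = parse_ubuntu_release(release)
    return version == base and first_abi <= abi <= last_abi


if __name__ == "__main__":
    # Bounds taken only from builds named in this thread (an assumption).
    for rel in ("5.15.0-94-generic", "5.15.0-97-generic", "5.15.0-101-generic"):
        print(rel, abi_in_range(rel, (5, 15, 0), 94, 97))
```

As the results in this thread show, a kernel inside the range does not guarantee that the mount fails, so a check like this is only a hint for where to look.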

PhanLe1010 · 2024-04-04T19:01:08Z

@james-munson Is it reproducible if you use Longhorn 1.6.0 instead?

james-munson · 2024-04-05T17:11:34Z

Actually, that was with 1.6.0 for everything but my custom longhorn-manager and csi-plugin. Perhaps I'll give it a try with a 1.5.x release. Or compare test procedure with @chriscchien.

james-munson · 2024-04-06T18:52:46Z

So, since I can't repro the kernel-based Protocol not supported error, I tested using a different mount failure, but still exercising the code path that was missed. The logging from CSI-plugin is as hoped:

[longhorn-csi-plugin-k46f6 longhorn-csi-plugin] E0406 18:18:12.533095    5056 mount_linux.go:230] Mount failed: exit status 32
[longhorn-csi-plugin-k46f6 longhorn-csi-plugin] time="2024-04-06T18:18:12Z" level=warning msg="Failed to mount volume pvc-fdf504c7-1616-4309-a5b1-25365d74d02e on node jbm-u22-pool2-ca5a5aa4-5fcq7 with kernel release 5.15.0-94-generic, os distro ubuntu" func="csi.(*NodeServer).nodeStageSharedVolume" file="node_server.go:295" component=csi-node-server error="mount failed: exit status 32\nMounting command: /usr/local/sbin/nsmounter\nMounting arguments: mount -t nfs -o rw,vers=4.1,bunny,softerr,timeo=600,retrans=5,noresvport 10.43.163.213:/pvc-fdf504c7-1616-4309-a5b1-25365d74d02e /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/e61d4e8525a632f675b44d7b8ba3704bd179e4e6ae7dbbe5ba811af540ebfa86/globalmount\nOutput: mount.nfs: an incorrect mount option was specified\n" function=nodeStageSharedVolume

In particular,

msg="Failed to mount volume pvc-fdf504c7-1616-4309-a5b1-25365d74d02e on node jbm-u22-pool2-ca5a5aa4-5fcq7 with kernel release 5.15.0-94-generic, os distro ubuntu"

which shows what we wanted to capture in the logs.
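To pull the volume, node, kernel release, and distro out of such a line when scanning longhorn-csi-plugin logs, something like the following could be used. It is a minimal Python sketch that assumes the message wording shown above stays stable; the pattern and the function name `parse_failed_mount` are illustrative and not part of Longhorn.

```python
import re

# Pattern written against the message text quoted in this thread.
_FAILED_MOUNT = re.compile(
    r"Failed to mount volume (?P<volume>\S+) on node (?P<node>\S+) "
    r'with kernel release (?P<kernel>[^\s",]+), os distro (?P<distro>[^\s",]+)'
)


def parse_failed_mount(line):
    """Return the captured fields as a dict, or None if the line does not match."""
    match = _FAILED_MOUNT.search(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = (
        'msg="Failed to mount volume pvc-fdf504c7-1616-4309-a5b1-25365d74d02e '
        "on node jbm-u22-pool2-ca5a5aa4-5fcq7 with kernel release "
        '5.15.0-94-generic, os distro ubuntu"'
    )
    print(parse_failed_mount(sample))
```

The pattern uses `search` rather than `match`, so it also works on a full log line where the message is preceded by the pod prefix, timestamp, and level.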

james-munson · 2024-04-08T22:44:57Z

With longhorn/longhorn-manager#2724 committed, this should be testable again.

chriscchien · 2024-04-09T01:50:14Z

Verified pass on longhorn v1.5.x (longhorn-manager b160f0)

Deployed Longhorn v1.5.x on Ubuntu 5.15.0-94-generic, on which an RWX volume cannot mount successfully (ref), then created and attached an RWX volume. The related kernel information appears in longhorn-csi-plugin.log as below:

2024-04-09T01:32:45.954023014Z time="2024-04-09T01:32:45Z" level=warning msg="Failed to mount volume pvc-60fce72a-acab-4214-9144-8305cf99e932 on node ccc with kernel release 5.15.0-94-generic, os distro ubuntu" func="csi.(*NodeServer).nodeStageSharedVolume" file="node_server.go:284" component=csi-node-server error="mount failed: exit status 32\nMounting command: /usr/local/sbin/nsmounter\nMounting arguments: mount -t nfs -o vers=4.1,noresvport,timeo=600,retrans=5,softerr 10.43.183.167:/pvc-60fce72a-acab-4214-9144-8305cf99e932 /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/67d50915e9bd94523ce9dfb7240f1b491572361b8cad3537c5274f6184eb3cc0/globalmount\nOutput: mount.nfs: Protocol not supported\n"

james-munson added the kind/bug, require/qa-review-coverage, and require/backport labels Mar 31, 2024

james-munson mentioned this issue Mar 31, 2024

fix(csi): add node info to failed mount logging. (backport #2686) longhorn/longhorn-manager#2714

Merged

james-munson self-assigned this Mar 31, 2024

james-munson added this to the v1.5.5 milestone Mar 31, 2024

chriscchien self-assigned this Apr 1, 2024

james-munson mentioned this issue Apr 3, 2024

fix(csi): correct error found in test. longhorn/longhorn-manager#2717

Merged

This was referenced Apr 8, 2024

fix(csi): correct error found in test. (backport #2717) longhorn/longhorn-manager#2724

Merged

fix(csi): correct error found in test. (backport #2717) longhorn/longhorn-manager#2725

Merged

chriscchien closed this as completed Apr 9, 2024

derekbit added this to Longhorn Sprint Aug 3, 2024

derekbit moved this to Closed in Longhorn Sprint Aug 3, 2024
