[BUG] v2 volume workload gets stuck in ContainerCreating or Unknown state with FailedMount error #10111
Description
Describe the bug
While running the negative test case Reboot Node One By One While Workload Heavy Writing with RWX Volume Fast Failover enabled on Longhorn v1.8.0-rc2, after 2 ~ 3 rounds of node reboots, a v2 volume workload can get stuck in ContainerCreating or Unknown state with FailedMount-related errors.
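For reference, the RWX Volume Fast Failover precondition can be toggled through the corresponding Longhorn setting. A minimal sketch, assuming the setting name is rwx-volume-fast-failover and that the settings.longhorn.io resource accepts a direct value patch (verify the setting name against the cluster's settings list first):

# Enable the RWX Volume Fast Failover setting used by this test run
# (setting name assumed; check with `kubectl -n longhorn-system get settings.longhorn.io`)
kubectl -n longhorn-system patch settings.longhorn.io rwx-volume-fast-failover \
  --type=merge -p '{"value":"true"}'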
case 1:
https://ci.longhorn.io/job/public/job/v1.8.x/job/v1.8.x-longhorn-e2e-tests-sles-amd64/22
# kubectl get pods
NAME                                     READY   STATUS              RESTARTS     AGE
e2e-test-deployment-0-6456f6c484-k4q5x   1/1     Running             0            9h
e2e-test-deployment-1-7df9978646-pz4hx   1/1     Running             2 (9h ago)   10h
e2e-test-deployment-2-748bd66996-xgcrx   1/1     Running             0            9h
e2e-test-statefulset-0-0                 0/1     ContainerCreating   0            9h
# kubectl describe pod e2e-test-statefulset-0-0
Name: e2e-test-statefulset-0-0
Namespace: default
Priority: 0
Service Account: default
Node: ip-10-0-2-92/10.0.2.92
Start Time: Tue, 31 Dec 2024 19:58:38 +0000
Labels: app=e2e-test-statefulset-0
apps.kubernetes.io/pod-index=0
controller-revision-hash=e2e-test-statefulset-0-6854c66d5c
statefulset.kubernetes.io/pod-name=e2e-test-statefulset-0-0
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: StatefulSet/e2e-test-statefulset-0
Containers:
sleep:
Container ID:
Image: busybox:1.34.0
Image ID:
Port: <none>
Host Port: <none>
Args:
/bin/sh
-c
while true;do date;sleep 5; done
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/data from pod-data (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-th2fw (ro)
Conditions:
Type Status
PodReadyToStartContainers False
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
pod-data:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: pod-data-e2e-test-statefulset-0-0
ReadOnly: false
kube-api-access-th2fw:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type     Reason       Age                 From     Message
----     ------       ----                ----     -------
Warning  FailedMount  70s (x277 over 9h)  kubelet  MountVolume.MountDevice failed for volume "pvc-248b2618-cab6-4ab3-b4f3-202cfcd54c16" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/longhorn/pvc-248b2618-cab6-4ab3-b4f3-202cfcd54c16 but could not correct them: fsck from util-linux 2.39.3
/dev/longhorn/pvc-248b2618-cab6-4ab3-b4f3-202cfcd54c16 contains a file system with errors, check forced.
/dev/longhorn/pvc-248b2618-cab6-4ab3-b4f3-202cfcd54c16: Resize inode not valid.
/dev/longhorn/pvc-248b2618-cab6-4ab3-b4f3-202cfcd54c16: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(i.e., without -a or -p options)
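The case 1 event output itself asks for a manual fsck run. A possible manual repair sketch, not verified against this cluster: it assumes the volume can be detached from the workload and re-attached to a node in maintenance mode before touching the device, and the device path is taken from the event above.

# Scale the StatefulSet down so the volume can detach cleanly
kubectl scale statefulset e2e-test-statefulset-0 --replicas=0

# On the node where the volume is re-attached (e.g. in maintenance mode via
# the Longhorn UI), run fsck interactively as the error message suggests,
# i.e. without -a or -p
fsck -f /dev/longhorn/pvc-248b2618-cab6-4ab3-b4f3-202cfcd54c16

# Scale the workload back up once the filesystem check completes
kubectl scale statefulset e2e-test-statefulset-0 --replicas=1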
case 2:
https://ci.longhorn.io/job/public/job/v1.8.x/job/v1.8.x-longhorn-e2e-tests-sles-amd64/20/
# kubectl get pods
NAME                                     READY   STATUS              RESTARTS   AGE
e2e-test-deployment-0-6456f6c484-rsd27   0/1     ContainerCreating   0          21h
e2e-test-deployment-1-7df9978646-fd9pb   0/1     Unknown             0          22h
e2e-test-deployment-2-748bd66996-9gnkp   1/1     Running             0          22h
# kubectl describe pod e2e-test-deployment-1-7df9978646-fd9pb
Name: e2e-test-deployment-1-7df9978646-fd9pb
Namespace: default
Priority: 0
Service Account: default
Node: ip-10-0-2-197/10.0.2.197
Start Time: Tue, 31 Dec 2024 07:27:27 +0000
Labels: app=e2e-test-deployment-1
pod-template-hash=7df9978646
test.longhorn.io=e2e
Annotations: <none>
Status: Running
IP:
IPs: <none>
Controlled By: ReplicaSet/e2e-test-deployment-1-7df9978646
Containers:
sleep:
Container ID: containerd://9299079466c183fba2d4a4aa39c1f0376bd13d4a83f5e23bbb888a6bfb98360c
Image: busybox
Image ID: docker.io/library/busybox@sha256:2919d0172f7524b2d8df9e50066a682669e6d170ac0f6a49676d54358fe970b5
Port: <none>
Host Port: <none>
Args:
/bin/sh
-c
while true;do date;sleep 5; done
State: Terminated
Reason: Unknown
Exit Code: 255
Started: Tue, 31 Dec 2024 07:27:33 +0000
Finished: Tue, 31 Dec 2024 07:47:48 +0000
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/data from pod-data (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wxvr2 (ro)
Conditions:
Type Status
PodReadyToStartContainers False
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
pod-data:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: e2e-test-claim-1
ReadOnly: false
kube-api-access-wxvr2:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type     Reason       Age                    From     Message
----     ------       ----                   ----     -------
Warning  FailedMount  4m33s (x645 over 21h)  kubelet  MountVolume.MountDevice failed for volume "pvc-c90b8b13-d51a-4ce0-acba-e8037733a861" : rpc error: code = Aborted desc = volume pvc-c90b8b13-d51a-4ce0-acba-e8037733a861 share not yet available
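For case 2 the volume is RWX, so the "share not yet available" error points at the share-manager that re-exports the NFS share after failover. A few diagnostic commands as a sketch; the share-manager-<volume> pod naming and the sharemanagers.longhorn.io CRD are standard Longhorn components, but verify the exact resource names in this cluster:

# Check the ShareManager CR state for the affected volume
kubectl -n longhorn-system get sharemanagers.longhorn.io pvc-c90b8b13-d51a-4ce0-acba-e8037733a861 -o yaml

# Check whether the share-manager pod was recreated after the node reboot
kubectl -n longhorn-system get pods | grep share-manager

# Inspect its logs for export/failover errors
kubectl -n longhorn-system logs share-manager-pvc-c90b8b13-d51a-4ce0-acba-e8037733a861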
To Reproduce
Run the negative test case with options: -t "Reboot Node One By One While Workload Heavy Writing" -v LOOP_COUNT:5 -v RETRY_COUNT:259200 -v DATA_ENGINE:v2 -v RWX_VOLUME_FAST_FAILOVER:true
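The options above follow Robot Framework conventions (-t selects the test case by name, -v sets suite variables). A possible full invocation, assuming the negative test suite from longhorn/longhorn-tests is checked out; the suite path below is only illustrative:

# Run the single negative test case with the variables used in this report
robot -t "Reboot Node One By One While Workload Heavy Writing" \
      -v LOOP_COUNT:5 -v RETRY_COUNT:259200 \
      -v DATA_ENGINE:v2 -v RWX_VOLUME_FAST_FAILOVER:true \
      e2e/tests/negative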
Expected behavior
The workload pods should recover and return to the Running state after each round of node reboots.
Support bundle for troubleshooting
case 1:
supportbundle_af263b87-36c0-47b6-be2d-e489f42bc151_2025-01-01T05-48-37Z.zip
case 2:
supportbundle_671313ec-8671-4055-93f3-364db4bc3e44_2025-01-01T05-55-32Z.zip
Environment
- Longhorn version: v1.8.0-rc2
- Impacted volume (PV):
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.31.2+k3s1
- Number of control plane nodes in the cluster:
- Number of worker nodes in the cluster:
- Node config
- OS type and version: SLES 15 SP6
- Kernel version:
- CPU per node:
- Memory per node:
- Disk type (e.g. SSD/NVMe/HDD):
- Network bandwidth between the nodes (Gbps):
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
- Number of Longhorn volumes in the cluster:
Additional context
In v1.8.0-rc1, the same test hit #10033 instead of this issue:
https://ci.longhorn.io/job/private/job/longhorn-e2e-test/2257/
https://ci.longhorn.io/job/private/job/longhorn-e2e-test/2258/
Workaround and Mitigation