[BUG] v2 volume workload gets stuck in ContainerCreating or Unknown state with FailedMount error #10111

Open
@yangchiu

Description

Describe the bug

While running the negative test case Reboot Node One By One While Workload Heavy Writing with RWX Volume Fast Failover enabled on Longhorn v1.8.0-rc2, a v2 volume workload can get stuck in ContainerCreating or Unknown state with FailedMount-related errors after 2 ~ 3 rounds of node reboots.

case 1:

http://44.196.208.159:30000

https://ci.longhorn.io/job/public/job/v1.8.x/job/v1.8.x-longhorn-e2e-tests-sles-amd64/22

# kubectl get pods
NAME                                     READY   STATUS              RESTARTS      AGE
e2e-test-deployment-0-6456f6c484-k4q5x   1/1     Running             0             9h
e2e-test-deployment-1-7df9978646-pz4hx   1/1     Running             2 (9h ago)    10h
e2e-test-deployment-2-748bd66996-xgcrx   1/1     Running             0             9h
e2e-test-statefulset-0-0                 0/1     ContainerCreating   0             9h
# kubectl describe pod e2e-test-statefulset-0-0
Name:             e2e-test-statefulset-0-0
Namespace:        default
Priority:         0
Service Account:  default
Node:             ip-10-0-2-92/10.0.2.92
Start Time:       Tue, 31 Dec 2024 19:58:38 +0000
Labels:           app=e2e-test-statefulset-0
                  apps.kubernetes.io/pod-index=0
                  controller-revision-hash=e2e-test-statefulset-0-6854c66d5c
                  statefulset.kubernetes.io/pod-name=e2e-test-statefulset-0-0
Annotations:      <none>
Status:           Pending
IP:               
IPs:              <none>
Controlled By:    StatefulSet/e2e-test-statefulset-0
Containers:
  sleep:
    Container ID:  
    Image:         busybox:1.34.0
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Args:
      /bin/sh
      -c
      while true;do date;sleep 5; done
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /data from pod-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-th2fw (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   False 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  pod-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  pod-data-e2e-test-statefulset-0-0
    ReadOnly:   false
  kube-api-access-th2fw:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason       Age                 From     Message
  ----     ------       ----                ----     -------
  Warning  FailedMount  70s (x277 over 9h)  kubelet  MountVolume.MountDevice failed for volume "pvc-248b2618-cab6-4ab3-b4f3-202cfcd54c16" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/longhorn/pvc-248b2618-cab6-4ab3-b4f3-202cfcd54c16 but could not correct them: fsck from util-linux 2.39.3
/dev/longhorn/pvc-248b2618-cab6-4ab3-b4f3-202cfcd54c16 contains a file system with errors, check forced.
/dev/longhorn/pvc-248b2618-cab6-4ab3-b4f3-202cfcd54c16: Resize inode not valid.  

/dev/longhorn/pvc-248b2618-cab6-4ab3-b4f3-202cfcd54c16: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
  (i.e., without -a or -p options)
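
For reference, kubelet runs fsck in preen mode (the -a/-p options mentioned above), which refuses to repair this class of corruption. A manual repair, as the fsck output suggests, would look like the sketch below; the device path is taken from the error message, and running it assumes the volume is attached to a node (e.g. in maintenance mode) and not mounted:

# Hypothetical manual repair, run on the node the volume is attached to:
# -f forces a full check even if the filesystem looks clean, -y auto-answers yes to fix prompts
e2fsck -f -y /dev/longhorn/pvc-248b2618-cab6-4ab3-b4f3-202cfcd54c16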

case 2:

http://54.235.158.116:30000

https://ci.longhorn.io/job/public/job/v1.8.x/job/v1.8.x-longhorn-e2e-tests-sles-amd64/20/

# kubectl get pods
NAME                                     READY   STATUS              RESTARTS      AGE
e2e-test-deployment-0-6456f6c484-rsd27   0/1     ContainerCreating   0             21h
e2e-test-deployment-1-7df9978646-fd9pb   0/1     Unknown             0             22h
e2e-test-deployment-2-748bd66996-9gnkp   1/1     Running             0             22h
# kubectl describe pod e2e-test-deployment-1-7df9978646-fd9pb
Name:             e2e-test-deployment-1-7df9978646-fd9pb
Namespace:        default
Priority:         0
Service Account:  default
Node:             ip-10-0-2-197/10.0.2.197
Start Time:       Tue, 31 Dec 2024 07:27:27 +0000
Labels:           app=e2e-test-deployment-1
                  pod-template-hash=7df9978646
                  test.longhorn.io=e2e
Annotations:      <none>
Status:           Running
IP:               
IPs:              <none>
Controlled By:    ReplicaSet/e2e-test-deployment-1-7df9978646
Containers:
  sleep:
    Container ID:  containerd://9299079466c183fba2d4a4aa39c1f0376bd13d4a83f5e23bbb888a6bfb98360c
    Image:         busybox
    Image ID:      docker.io/library/busybox@sha256:2919d0172f7524b2d8df9e50066a682669e6d170ac0f6a49676d54358fe970b5
    Port:          <none>
    Host Port:     <none>
    Args:
      /bin/sh
      -c
      while true;do date;sleep 5; done
    State:          Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Tue, 31 Dec 2024 07:27:33 +0000
      Finished:     Tue, 31 Dec 2024 07:47:48 +0000
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /data from pod-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wxvr2 (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   False 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  pod-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  e2e-test-claim-1
    ReadOnly:   false
  kube-api-access-wxvr2:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason       Age                    From     Message
  ----     ------       ----                   ----     -------
  Warning  FailedMount  4m33s (x645 over 21h)  kubelet  MountVolume.MountDevice failed for volume "pvc-c90b8b13-d51a-4ce0-acba-e8037733a861" : rpc error: code = Aborted desc = volume pvc-c90b8b13-d51a-4ce0-acba-e8037733a861 share not yet available
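
The "share not yet available" error means the CSI plugin is still waiting for the volume's NFS export to come up. In Longhorn, each RWX volume is exported by a share-manager pod in the longhorn-system namespace, so a first diagnostic step (assuming the usual share-manager-<volume-name> pod naming) would be:

# Check the share-manager pod backing the RWX volume, then inspect its logs
kubectl -n longhorn-system get pod share-manager-pvc-c90b8b13-d51a-4ce0-acba-e8037733a861
kubectl -n longhorn-system logs share-manager-pvc-c90b8b13-d51a-4ce0-acba-e8037733a861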

To Reproduce

Run the negative test case with the following options: -t \"Reboot Node One By One While Workload Heavy Writing\" -v LOOP_COUNT:5 -v RETRY_COUNT:259200 -v DATA_ENGINE:v2 -v RWX_VOLUME_FAST_FAILOVER:true
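
For context, the RWX_VOLUME_FAST_FAILOVER:true option corresponds to Longhorn's rwx-volume-fast-failover setting. Outside the test harness, a sketch of enabling it manually (the setting name is assumed from the v1.8 settings reference) would be:

# Hypothetical: enable RWX volume fast failover directly on the Settings CR
kubectl -n longhorn-system patch settings.longhorn.io rwx-volume-fast-failover --type merge -p '{"value":"true"}'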

Expected behavior

The workload pods should recover and return to the Running state after each node reboot, rather than remaining stuck in ContainerCreating or Unknown with FailedMount errors.

Support bundle for troubleshooting

case 1:

supportbundle_af263b87-36c0-47b6-be2d-e489f42bc151_2025-01-01T05-48-37Z.zip

case 2:

supportbundle_671313ec-8671-4055-93f3-364db4bc3e44_2025-01-01T05-55-32Z.zip

Environment

  • Longhorn version: v1.8.0-rc2
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.31.2+k3s1
    • Number of control plane nodes in the cluster:
    • Number of worker nodes in the cluster:
  • Node config
    • OS type and version: SLES 15 SP6
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context

In v1.8.0-rc1, the same test hit #10033 instead of this issue:
https://ci.longhorn.io/job/private/job/longhorn-e2e-test/2257/
https://ci.longhorn.io/job/private/job/longhorn-e2e-test/2258/

Workaround and Mitigation

Labels

  • area/resilience: System or volume resilience
  • area/v2-data-engine: v2 data engine (SPDK)
  • area/volume-replica-rebuild: Volume replica rebuilding related
  • kind/bug
  • priority/0: Must be implemented or fixed in this release (managed by PO)
  • reproduce/always: 100% reproducible
  • require/backport: Require backport. Only used when the specific versions to backport have not been defined.
  • severity/2: Function working but has a major issue w/o workaround (a major incident with significant impact)
