[BUG] v2 volume workload gets stuck in ContainerCreating or Unknown state with FailedMount error #10111

Open
@yangchiu

Description

Describe the bug

While running the negative test case Reboot Node One By One While Workload Heavy Writing with RWX Volume Fast Failover enabled on Longhorn v1.8.0-rc2, a v2 volume workload can get stuck in ContainerCreating or Unknown state with FailedMount-related errors after 2 ~ 3 rounds of node reboots.

case 1:

http://44.196.208.159:30000

https://ci.longhorn.io/job/public/job/v1.8.x/job/v1.8.x-longhorn-e2e-tests-sles-amd64/22

# kubectl get pods
NAME                                     READY   STATUS              RESTARTS      AGE
e2e-test-deployment-0-6456f6c484-k4q5x   1/1     Running             0             9h
e2e-test-deployment-1-7df9978646-pz4hx   1/1     Running             2 (9h ago)    10h
e2e-test-deployment-2-748bd66996-xgcrx   1/1     Running             0             9h
e2e-test-statefulset-0-0                 0/1     ContainerCreating   0             9h
# kubectl describe pod e2e-test-statefulset-0-0
Name:             e2e-test-statefulset-0-0
Namespace:        default
Priority:         0
Service Account:  default
Node:             ip-10-0-2-92/10.0.2.92
Start Time:       Tue, 31 Dec 2024 19:58:38 +0000
Labels:           app=e2e-test-statefulset-0
                  apps.kubernetes.io/pod-index=0
                  controller-revision-hash=e2e-test-statefulset-0-6854c66d5c
                  statefulset.kubernetes.io/pod-name=e2e-test-statefulset-0-0
Annotations:      <none>
Status:           Pending
IP:               
IPs:              <none>
Controlled By:    StatefulSet/e2e-test-statefulset-0
Containers:
  sleep:
    Container ID:  
    Image:         busybox:1.34.0
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Args:
      /bin/sh
      -c
      while true;do date;sleep 5; done
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /data from pod-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-th2fw (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   False 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  pod-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  pod-data-e2e-test-statefulset-0-0
    ReadOnly:   false
  kube-api-access-th2fw:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason       Age                 From     Message
  ----     ------       ----                ----     -------
  Warning  FailedMount  70s (x277 over 9h)  kubelet  MountVolume.MountDevice failed for volume "pvc-248b2618-cab6-4ab3-b4f3-202cfcd54c16" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/longhorn/pvc-248b2618-cab6-4ab3-b4f3-202cfcd54c16 but could not correct them: fsck from util-linux 2.39.3
/dev/longhorn/pvc-248b2618-cab6-4ab3-b4f3-202cfcd54c16 contains a file system with errors, check forced.
/dev/longhorn/pvc-248b2618-cab6-4ab3-b4f3-202cfcd54c16: Resize inode not valid.  

/dev/longhorn/pvc-248b2618-cab6-4ab3-b4f3-202cfcd54c16: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
  (i.e., without -a or -p options)
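
For reference, kubelet runs fsck in preen mode (the -a/-p options mentioned above), which refuses to repair this class of corruption. A manual repair, as the fsck output suggests, would look like the sketch below; the device path is taken from the error message, and running it assumes the volume is attached to a node (e.g. in maintenance mode) and not mounted:

# Hypothetical manual repair, run on the node the volume is attached to:
# -f forces a full check even if the filesystem looks clean, -y auto-answers yes to fix prompts
e2fsck -f -y /dev/longhorn/pvc-248b2618-cab6-4ab3-b4f3-202cfcd54c16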

case 2:

http://54.235.158.116:30000

https://ci.longhorn.io/job/public/job/v1.8.x/job/v1.8.x-longhorn-e2e-tests-sles-amd64/20/

# kubectl get pods
NAME                                     READY   STATUS              RESTARTS      AGE
e2e-test-deployment-0-6456f6c484-rsd27   0/1     ContainerCreating   0             21h
e2e-test-deployment-1-7df9978646-fd9pb   0/1     Unknown             0             22h
e2e-test-deployment-2-748bd66996-9gnkp   1/1     Running             0             22h
# kubectl describe pod e2e-test-deployment-1-7df9978646-fd9pb
Name:             e2e-test-deployment-1-7df9978646-fd9pb
Namespace:        default
Priority:         0
Service Account:  default
Node:             ip-10-0-2-197/10.0.2.197
Start Time:       Tue, 31 Dec 2024 07:27:27 +0000
Labels:           app=e2e-test-deployment-1
                  pod-template-hash=7df9978646
                  test.longhorn.io=e2e
Annotations:      <none>
Status:           Running
IP:               
IPs:              <none>
Controlled By:    ReplicaSet/e2e-test-deployment-1-7df9978646
Containers:
  sleep:
    Container ID:  containerd://9299079466c183fba2d4a4aa39c1f0376bd13d4a83f5e23bbb888a6bfb98360c
    Image:         busybox
    Image ID:      docker.io/library/busybox@sha256:2919d0172f7524b2d8df9e50066a682669e6d170ac0f6a49676d54358fe970b5
    Port:          <none>
    Host Port:     <none>
    Args:
      /bin/sh
      -c
      while true;do date;sleep 5; done
    State:          Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Tue, 31 Dec 2024 07:27:33 +0000
      Finished:     Tue, 31 Dec 2024 07:47:48 +0000
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /data from pod-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wxvr2 (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   False 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  pod-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  e2e-test-claim-1
    ReadOnly:   false
  kube-api-access-wxvr2:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason       Age                    From     Message
  ----     ------       ----                   ----     -------
  Warning  FailedMount  4m33s (x645 over 21h)  kubelet  MountVolume.MountDevice failed for volume "pvc-c90b8b13-d51a-4ce0-acba-e8037733a861" : rpc error: code = Aborted desc = volume pvc-c90b8b13-d51a-4ce0-acba-e8037733a861 share not yet available
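
The "share not yet available" error means the CSI plugin is still waiting for the volume's NFS export to come up. In Longhorn, each RWX volume is exported by a share-manager pod in the longhorn-system namespace, so a first diagnostic step (assuming the usual share-manager-<volume-name> pod naming) would be:

# Check the share-manager pod backing the RWX volume, then inspect its logs
kubectl -n longhorn-system get pod share-manager-pvc-c90b8b13-d51a-4ce0-acba-e8037733a861
kubectl -n longhorn-system logs share-manager-pvc-c90b8b13-d51a-4ce0-acba-e8037733a861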

To Reproduce

Run the negative test case with the following options: -t \"Reboot Node One By One While Workload Heavy Writing\" -v LOOP_COUNT:5 -v RETRY_COUNT:259200 -v DATA_ENGINE:v2 -v RWX_VOLUME_FAST_FAILOVER:true
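
For context, the RWX_VOLUME_FAST_FAILOVER:true option corresponds to Longhorn's rwx-volume-fast-failover setting. Outside the test harness, a sketch of enabling it manually (the setting name is assumed from the v1.8 settings reference) would be:

# Hypothetical: enable RWX volume fast failover directly on the Settings CR
kubectl -n longhorn-system patch settings.longhorn.io rwx-volume-fast-failover --type merge -p '{"value":"true"}'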

Expected behavior

The workload pods should recover and return to the Running state after each node reboot, rather than remaining stuck in ContainerCreating or Unknown with FailedMount errors.

Support bundle for troubleshooting

case 1:

supportbundle_af263b87-36c0-47b6-be2d-e489f42bc151_2025-01-01T05-48-37Z.zip

case 2:

supportbundle_671313ec-8671-4055-93f3-364db4bc3e44_2025-01-01T05-55-32Z.zip

Environment

  • Longhorn version: v1.8.0-rc2
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.31.2+k3s1
    • Number of control plane nodes in the cluster:
    • Number of worker nodes in the cluster:
  • Node config
    • OS type and version: SLES 15 SP6
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context

In v1.8.0-rc1, the same test hit #10033 instead of this issue:
https://ci.longhorn.io/job/private/job/longhorn-e2e-test/2257/
https://ci.longhorn.io/job/private/job/longhorn-e2e-test/2258/

Workaround and Mitigation

Labels

  • area/resilience: System or volume resilience
  • area/v2-data-engine: v2 data engine (SPDK)
  • area/volume-replica-rebuild: Volume replica rebuilding related
  • kind/bug
  • priority/0: Must be implemented or fixed in this release (managed by PO)
  • reproduce/always: 100% reproducible
  • require/backport: Require backport. Only used when the specific versions to backport have not been defined.
  • severity/2: Function working but has a major issue w/o workaround (a major incident with significant impact)
