
[BUG] RWX workload gets stuck in ContainerCreating after cluster restart #6924

Closed
yangchiu opened this issue Oct 19, 2023 · 4 comments
Labels
area/resilience System or volume resilience area/volume-rwx Volume RWX related backport/1.4.4 backport/1.5.2 component/longhorn-share-manager Longhorn share manager (control plane for NFS server, RWX) kind/bug priority/0 Must be implemented or fixed in this release (managed by PO) require/backport Require backport. Only used when the specific versions to backport have not been defined. require/qa-review-coverage Require QA to review coverage

Comments

@yangchiu
Member

Describe the bug (🐛 if you encounter this issue)

While running the cluster restart negative test case on v1.4.4-rc1, an issue occurred.

After the cluster restart (rebooting all nodes, including the control plane), a deployment workload with an RWX volume gets stuck in ContainerCreating:

NAME                                                READY   STATUS              RESTARTS        AGE     IP           NODE            NOMINATED NODE   READINESS GATES
longhorn-test-minio                                 1/1     Running             2 (4h25m ago)   4h51m   10.42.2.55   ip-10-0-2-157   <none>           <none>
test-deployment-rwx-69c97d4ffb-778xp                0/1     ContainerCreating   0               4h25m   <none>       ip-10-0-2-157   <none>           <none>
longhorn-test-nfs                                   1/1     Running             5 (4h25m ago)   4h51m   10.42.3.45   ip-10-0-2-73    <none>           <none>
test-deployment-rwo-strict-local-65db96fb96-58sfp   1/1     Running             0               4h24m   10.42.3.64   ip-10-0-2-73    <none>           <none>
test-statefulset-rwo-strict-local-0                 1/1     Running             0               4h24m   10.42.1.65   ip-10-0-2-247   <none>           <none>
test-statefulset-rwo-0                              1/1     Running             0               4h24m   10.42.1.64   ip-10-0-2-247   <none>           <none>
test-deployment-rwo-7f4687c9b-z72tt                 1/1     Running             0               4h25m   10.42.2.60   ip-10-0-2-157   <none>           <none>
test-statefulset-rwx-0                              1/1     Running             0               4h23m   10.42.3.65   ip-10-0-2-73    <none>           <none>

But the corresponding volume is already healthy:

NAME                                       STATE      ROBUSTNESS   SCHEDULED   SIZE         NODE            AGE
pvc-3b57896f-63d9-4e8e-9ca0-3120d70bed3e   attached   healthy                  3221225472   ip-10-0-2-247   4h43m
pvc-5b968dee-cd11-4a4c-b971-59c90651d064   attached   healthy                  3221225472   ip-10-0-2-247   4h43m
pvc-6edb46b1-e4db-46a6-b561-9014d83a1402   attached   healthy                  3221225472   ip-10-0-2-157   4h44m
pvc-4961318a-2f48-451b-a890-226a098d614c   attached   healthy                  3221225472   ip-10-0-2-73    4h43m
pvc-e099192f-9e7b-4f95-951a-d5d0eccddcb0   attached   healthy                  3221225472   ip-10-0-2-247   4h43m
pvc-fba39a83-864c-491e-8bf6-5630d70b9f89   attached   healthy                  3221225472   ip-10-0-2-73    4h44m

There are mount errors in the events of pod test-deployment-rwx-69c97d4ffb-778xp:

Events:
  Type     Reason       Age                      From     Message
  ----     ------       ----                     ----     -------
  Warning  FailedMount  9m19s (x133 over 4h24m)  kubelet  MountVolume.MountDevice failed for volume "pvc-fba39a83-864c-491e-8bf6-5630d70b9f89" : rpc error: code = Internal desc = mount failed: exit status 32
Mounting command: /usr/local/sbin/nsmounter
Mounting arguments: mount -t nfs -o vers=4.1,noresvport,timeo=600,retrans=5,softerr 10.43.67.54:/pvc-fba39a83-864c-491e-8bf6-5630d70b9f89 /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/09dc47ada56192ced7cede241c6ff814faf18eb03e2ca3be6df358806d9e8d37/globalmount
Output: mount.nfs: mounting 10.43.67.54:/pvc-fba39a83-864c-491e-8bf6-5630d70b9f89 failed, reason given by server: No such file or directory
  Warning  FailedMount  4m11s (x115 over 4h22m)  kubelet  Unable to attach or mount volumes: unmounted volumes=[pod-data], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition

And there is no share-manager pod running on the same node as this pod, though I'm not sure whether that's required:

share-manager-pvc-fba39a83-864c-491e-8bf6-5630d70b9f89   1/1     Running   0               4h25m   10.42.3.58   ip-10-0-2-73    <none>           <none>
share-manager-pvc-e099192f-9e7b-4f95-951a-d5d0eccddcb0   1/1     Running   0               4h23m   10.42.1.66   ip-10-0-2-247   <none>        

To Reproduce

Run Restart Cluster While Workload Heavy Writing negative test case:

    Create deployment 0 with rwo volume
    Create deployment 1 with rwx volume
    Create deployment 2 with rwo and strict-local volume
    Create statefulset 0 with rwo volume
    Create statefulset 1 with rwx volume
    Create statefulset 2 with rwo and strict-local volume
    FOR    ${i}    IN RANGE    ${LOOP_COUNT}
        Keep writing data to deployment 0
        Keep writing data to deployment 1
        Keep writing data to deployment 2
        Keep writing data to statefulset 0
        Keep writing data to statefulset 1
        Keep writing data to statefulset 2
        Restart cluster
        Check deployment 0 works
        Check deployment 1 works
        Check deployment 2 works
        Check statefulset 0 works
        Check statefulset 1 works
        Check statefulset 2 works
    END

Expected behavior

After the cluster restart, the workload using the RWX volume should return to Running once its volume is reattached and healthy.

Support bundle for troubleshooting

supportbundle_0ed39870-8d1a-483c-a645-ae056e81ace5_2023-10-19T08-09-38Z.zip

Environment

  • Longhorn version: v1.4.4-rc1
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.27.1+k3s1
    • Number of management nodes in the cluster: 1
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: SLES 15 SP5
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type(e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:
  • Impacted Longhorn resources:
    • Volume names:

Additional context

@yangchiu yangchiu added kind/bug require/qa-review-coverage Require QA to review coverage require/backport Require backport. Only used when the specific versions to backport have not been defined. labels Oct 19, 2023
@derekbit derekbit self-assigned this Oct 19, 2023

longhorn-io-github-bot commented Oct 19, 2023

Pre Ready-For-Testing Checklist

  • Where are the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at:

Detach and reattach RWX volume

  • Does the PR include the explanation for the fix or the feature?

Originally, the volume export config was created when a share manager pod started running. However, after #6829 was introduced, a volume detachment could clean up the config while the share manager daemon was still waiting for the volume.

To fix the issue, the volume export config is now created when the volume is ready instead.

  • Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore, etc.) (including backport-needed/*)?
    The PR is at:

longhorn/longhorn-share-manager#89

  • Which areas/issues this PR might have potential impacts on?
    Area: RWX volume
    Issues: node reboot

@innobead
Member

This is the value of the negative testing we are working on. Well done. @yangchiu

cc @longhorn/qa

@innobead innobead added area/volume-rwx Volume RWX related area/resilience System or volume resilience component/longhorn-share-manager Longhorn share manager (control plane for NFS server, RWX) labels Oct 19, 2023
@innobead
Member

innobead commented Oct 19, 2023

This is a regression introduced by #6829 (@chriscchien verified it, so this is a good opportunity to work out how to improve verification of ready-for-testing issues in the future).

@yangchiu
Member Author

Verified passed on v1.4.x-head (longhorn-share-manager 4931b9f) by running test case Restart Cluster While Workload Heavy Writing.

Test result: https://ci.longhorn.io/job/private/job/longhorn-e2e-test/40/

@innobead innobead changed the title [BUG][v1.4.4-rc1] rwx workload gets stuck in ContainerCreating after cluster restart [BUG] RWX workload gets stuck in ContainerCreating after cluster restart Oct 29, 2023
@derekbit derekbit moved this to Closed in Longhorn Sprint Aug 3, 2024