
[BUG] RWX workload gets stuck in ContainerCreating after cluster restart #6924

Closed
yangchiu opened this issue Oct 19, 2023 · 4 comments
Labels
area/resilience System or volume resilience area/volume-rwx Volume RWX related backport/1.4.4 backport/1.5.2 component/longhorn-share-manager Longhorn share manager (control plane for NFS server, RWX) kind/bug priority/0 Must be implemented or fixed in this release (managed by PO) require/backport Require backport. Only used when the specific versions to backport have not been defined. require/qa-review-coverage Require QA to review coverage

Comments

@yangchiu
Member

Describe the bug (🐛 if you encounter this issue)

While running the cluster restart negative test case on v1.4.4-rc1, an issue occurred.

After the cluster restart (rebooting all nodes, including the control plane), a deployment workload with an RWX volume gets stuck in ContainerCreating:

NAME                                                READY   STATUS              RESTARTS        AGE     IP           NODE            NOMINATED NODE   READINESS GATES
longhorn-test-minio                                 1/1     Running             2 (4h25m ago)   4h51m   10.42.2.55   ip-10-0-2-157   <none>           <none>
test-deployment-rwx-69c97d4ffb-778xp                0/1     ContainerCreating   0               4h25m   <none>       ip-10-0-2-157   <none>           <none>
longhorn-test-nfs                                   1/1     Running             5 (4h25m ago)   4h51m   10.42.3.45   ip-10-0-2-73    <none>           <none>
test-deployment-rwo-strict-local-65db96fb96-58sfp   1/1     Running             0               4h24m   10.42.3.64   ip-10-0-2-73    <none>           <none>
test-statefulset-rwo-strict-local-0                 1/1     Running             0               4h24m   10.42.1.65   ip-10-0-2-247   <none>           <none>
test-statefulset-rwo-0                              1/1     Running             0               4h24m   10.42.1.64   ip-10-0-2-247   <none>           <none>
test-deployment-rwo-7f4687c9b-z72tt                 1/1     Running             0               4h25m   10.42.2.60   ip-10-0-2-157   <none>           <none>
test-statefulset-rwx-0                              1/1     Running             0               4h23m   10.42.3.65   ip-10-0-2-73    <none>           <none>

But the corresponding volume is already healthy:

NAME                                       STATE      ROBUSTNESS   SCHEDULED   SIZE         NODE            AGE
pvc-3b57896f-63d9-4e8e-9ca0-3120d70bed3e   attached   healthy                  3221225472   ip-10-0-2-247   4h43m
pvc-5b968dee-cd11-4a4c-b971-59c90651d064   attached   healthy                  3221225472   ip-10-0-2-247   4h43m
pvc-6edb46b1-e4db-46a6-b561-9014d83a1402   attached   healthy                  3221225472   ip-10-0-2-157   4h44m
pvc-4961318a-2f48-451b-a890-226a098d614c   attached   healthy                  3221225472   ip-10-0-2-73    4h43m
pvc-e099192f-9e7b-4f95-951a-d5d0eccddcb0   attached   healthy                  3221225472   ip-10-0-2-247   4h43m
pvc-fba39a83-864c-491e-8bf6-5630d70b9f89   attached   healthy                  3221225472   ip-10-0-2-73    4h44m

There are mount errors in the events of pod test-deployment-rwx-69c97d4ffb-778xp:

Events:
  Type     Reason       Age                      From     Message
  ----     ------       ----                     ----     -------
  Warning  FailedMount  9m19s (x133 over 4h24m)  kubelet  MountVolume.MountDevice failed for volume "pvc-fba39a83-864c-491e-8bf6-5630d70b9f89" : rpc error: code = Internal desc = mount failed: exit status 32
Mounting command: /usr/local/sbin/nsmounter
Mounting arguments: mount -t nfs -o vers=4.1,noresvport,timeo=600,retrans=5,softerr 10.43.67.54:/pvc-fba39a83-864c-491e-8bf6-5630d70b9f89 /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/09dc47ada56192ced7cede241c6ff814faf18eb03e2ca3be6df358806d9e8d37/globalmount
Output: mount.nfs: mounting 10.43.67.54:/pvc-fba39a83-864c-491e-8bf6-5630d70b9f89 failed, reason given by server: No such file or directory
  Warning  FailedMount  4m11s (x115 over 4h22m)  kubelet  Unable to attach or mount volumes: unmounted volumes=[pod-data], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition

And there is no share-manager pod running on the same node as this pod, though I'm not sure whether that's required:

share-manager-pvc-fba39a83-864c-491e-8bf6-5630d70b9f89   1/1     Running   0               4h25m   10.42.3.58   ip-10-0-2-73    <none>           <none>
share-manager-pvc-e099192f-9e7b-4f95-951a-d5d0eccddcb0   1/1     Running   0               4h23m   10.42.1.66   ip-10-0-2-247   <none>        

To Reproduce

Run Restart Cluster While Workload Heavy Writing negative test case:

    Create deployment 0 with rwo volume
    Create deployment 1 with rwx volume
    Create deployment 2 with rwo and strict-local volume
    Create statefulset 0 with rwo volume
    Create statefulset 1 with rwx volume
    Create statefulset 2 with rwo and strict-local volume
    FOR    ${i}    IN RANGE    ${LOOP_COUNT}
        Keep writing data to deployment 0
        Keep writing data to deployment 1
        Keep writing data to deployment 2
        Keep writing data to statefulset 0
        Keep writing data to statefulset 1
        Keep writing data to statefulset 2
        Restart cluster
        Check deployment 0 works
        Check deployment 1 works
        Check deployment 2 works
        Check statefulset 0 works
        Check statefulset 1 works
        Check statefulset 2 works
    END

Expected behavior

After the cluster restart, the workload using the RWX volume should return to Running once its volume is reattached and healthy.

Support bundle for troubleshooting

supportbundle_0ed39870-8d1a-483c-a645-ae056e81ace5_2023-10-19T08-09-38Z.zip

Environment

  • Longhorn version: v1.4.4-rc1
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.27.1+k3s1
    • Number of management nodes in the cluster: 1
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: SLES 15 SP5
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type(e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:
  • Impacted Longhorn resources:
    • Volume names:

Additional context

@yangchiu yangchiu added kind/bug require/qa-review-coverage Require QA to review coverage require/backport Require backport. Only used when the specific versions to backport have not been defined. labels Oct 19, 2023
@derekbit derekbit self-assigned this Oct 19, 2023

longhorn-io-github-bot commented Oct 19, 2023

Pre Ready-For-Testing Checklist

  • Where are the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at:

Detach and reattach RWX volume

  • Does the PR include the explanation for the fix or the feature?

Originally, the volume export config was created when a share manager pod started running. However, after #6829 was introduced, a volume detachment could clean up the config while the share manager daemon was still waiting for the volume.

To fix the issue, the volume export config is now created when the volume is ready instead.

  • Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore, etc.) (including backport-needed/*)?
    The PR is at:

longhorn/longhorn-share-manager#89

  • Which areas/issues this PR might have potential impacts on?
    Area: RWX volume
    Issues: node reboot

@innobead
Member

This is the value of the negative testing we are working on. Well done. @yangchiu

cc @longhorn/qa

@innobead innobead added area/volume-rwx Volume RWX related area/resilience System or volume resilience component/longhorn-share-manager Longhorn share manager (control plane for NFS server, RWX) labels Oct 19, 2023
@innobead
Member

innobead commented Oct 19, 2023

This is a regression introduced by #6829 (@chriscchien verified it, so this is a good opportunity to work out how to improve verification of ready-for-testing issues in the future).

@yangchiu
Member Author

Verified passed on v1.4.x-head (longhorn-share-manager 4931b9f) by running test case Restart Cluster While Workload Heavy Writing.

Test result: https://ci.longhorn.io/job/private/job/longhorn-e2e-test/40/

@innobead innobead changed the title [BUG][v1.4.4-rc1] rwx workload gets stuck in ContainerCreating after cluster restart [BUG] RWX workload gets stuck in ContainerCreating after cluster restart Oct 29, 2023
@derekbit derekbit moved this to Closed in Longhorn Sprint Aug 3, 2024