
Volume Attaching Timeout - Failed to attach #7629

Open
aljabertareq opened this issue Jan 11, 2024 · 11 comments
Labels
kind/bug, require/backport (Require backport. Only used when the specific versions to backport have not been defined.), require/qa-review-coverage (Require QA to review coverage), stale

Comments

@aljabertareq

aljabertareq commented Jan 11, 2024

Describe the bug

Volume attaching timed out. Kafka and MinIO instances failed to attach.

To Reproduce

Expected behavior

Support bundle for troubleshooting

https://drive.google.com/file/d/15Y3qSLjWtKADKL0NWE9HMkMCjaD8VO7u/view?usp=sharing

Environment

  • Longhorn version: 1.4.2
  • Impacted volume (PV): pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: k8s v1.23.5
    • Number of control plane nodes in the cluster: 2
    • Number of worker nodes in the cluster: 2
  • Node config
    • OS type and version: Ubuntu 20.04.6 LTS
    • Kernel version: 5.4.0-166-generic
    • CPU per node: 96 on one node, 128 on the other
    • Memory per node: 33 on one node, 40 on the other
    • Disk type (e.g. SSD/NVMe/HDD): One node is HDD and the other one is SSD
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): baremetal
  • Number of Longhorn volumes in the cluster: 17

Additional context


@aljabertareq aljabertareq added the kind/bug, require/backport, and require/qa-review-coverage labels Jan 11, 2024
@aljabertareq
Author

@derekbit Can you please help? It took us 3 hours to get this up and running.

@derekbit
Member

derekbit commented Jan 11, 2024

@aljabertareq
Would you be able to provide a support bundle for further investigation? Thank you.
Saw it. Thanks.

cc @james-munson for the community coordinator of this sprint

@derekbit derekbit moved this from New to Pending user response in Community Review Sprint Jan 11, 2024
@derekbit derekbit moved this from Pending user response to New in Community Review Sprint Jan 11, 2024
@james-munson
Contributor

Just to be clear, it did eventually attach, correct? The issue is about the time it took?
The volumes.yaml shows that volume as attached:

      currentNodeID: node1
      expansionRequired: false
      frontendDisabled: false
      isStandby: false
      kubernetesStatus:
        lastPVCRefAt: "null"
        lastPodRefAt: "null"
        namespace: avs
        pvName: pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf
        pvStatus: Bound
        pvcName: minio-pv-claim-cloned
        workloadsStatus:
          - podName: minio-6dd56bd767-qp67r
            podStatus: Running
            workloadName: minio-6dd56bd767
            workloadType: ReplicaSet
      lastBackup: "null"
      lastBackupAt: "null"
      lastDegradedAt: "null"
      ownerID: node1
      pendingNodeID: "null"
      remountRequestedAt: "2024-01-10T22:17:01Z"
      restoreInitiated: false
      restoreRequired: false
      robustness: healthy
      shareEndpoint: "null"
      shareState: "null"
      state: attached

And cluster/storage.k8s.io/v1/volumeattachments.yaml shows it attached as well:

  - apiVersion: storage.k8s.io/v1
    kind: VolumeAttachment
    metadata:
      annotations:
        csi.alpha.kubernetes.io/node-id: node1
      creationTimestamp: "2024-01-10T21:59:34Z"
      finalizers:
        - external-attacher/driver-longhorn-io
      managedFields:
...
      name: csi-baef1bd4fbd4f963e849e1345359e542eaf2bab7a7e33135f546f14a7d9a5949
      resourceVersion: "124700040"
      uid: 3a412db5-daaa-43bf-9f67-39302c839d46
    spec:
      attacher: driver.longhorn.io
      nodeName: node1
      source:
        persistentVolumeName: pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf
    status:
      attached: true

By the way, the volume also shows:

      numberOfReplicas: 1

which is a dangerous way to operate.
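
For reference, redundancy can be added to a single-replica volume by raising its replica count. A minimal sketch, assuming the Longhorn Volume CRD name from this issue and the two worker nodes in this cluster; verify the resource and field names against your Longhorn version:

# Raise the stuck volume from 1 replica to 2 (one per worker node in this cluster)
kubectl -n longhorn-system patch volumes.longhorn.io pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf \
  --type merge -p '{"spec":{"numberOfReplicas":2}}'

# Watch the new replica get scheduled and rebuilt
kubectl -n longhorn-system get replicas.longhorn.io | grep pvc-e9b4c5e1

New volumes can be given a higher default through the numberOfReplicas parameter of the Longhorn StorageClass.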

@james-munson
Contributor

It did have some trouble getting there though.

longhorn-manager-h82kr/longhorn-manager.log shows

2024-01-10T22:12:18.448219871Z time="2024-01-10T22:12:18Z" level=debug msg="Instance manager pod from instance-manager-r-1e8d2b6ac4bdab53558aa36fa56425b5 not exist in datastore"
...
2024-01-10T22:12:18.456416621Z time="2024-01-10T22:12:18Z" level=warning msg="Cannot find the instance manager for the running instance pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf-r-f8e1b060, will mark the instance as state ERROR"
...
2024-01-10T22:12:18.463570796Z time="2024-01-10T22:12:18Z" level=warning msg="Replica pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf-r-f8e1b060 that not in the engine mode map is marked as failed, current state error, engine name pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf-e-4fe0d576, active true" accessMode=rwo controller=longhorn-volume currentEngine=pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf-e-4fe0d576 frontend=blockdev migratable=false node=node1 owner=node1 state=attached volume=pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf
2024-01-10T22:12:18.463615530Z time="2024-01-10T22:12:18Z" level=info msg="All replicas are failed, set engine salvageRequested to true" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=node1 owner=node1 state=attached volume=pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf

and it seems to recover here:

2024-01-10T22:17:05.564872979Z time="2024-01-10T22:17:05Z" level=debug msg="Instance pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf-e-4fe0d576 starts running, Storage IP 10.233.90.134"
2024-01-10T22:17:05.564915308Z time="2024-01-10T22:17:05Z" level=debug msg="Instance pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf-e-4fe0d576 starts running, IP 10.233.90.134"
2024-01-10T22:17:05.564924465Z time="2024-01-10T22:17:05Z" level=debug msg="Instance pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf-e-4fe0d576 starts running, Port 10001"
2024-01-10T22:17:05.564931789Z time="2024-01-10T22:17:05Z" level=debug msg="Instance handler updated instance pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf-e-4fe0d576 state, old state starting, new state running"
2024-01-10T22:17:05.564992211Z time="2024-01-10T22:17:05Z" level=debug msg="Starting monitoring engine" controller=longhorn-engine engine=pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf-e-4fe0d576 node=node1
2024-01-10T22:17:05.572147609Z time="2024-01-10T22:17:05Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf\", UID:\"b0b1ea58-89b5-4e95-95f7-70bf420ea931\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"124707409\", FieldPath:\"\"}): type: 'Normal' reason: 'Attached' volume pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf has been attached to node1"
2024-01-10T22:17:18.325927988Z time="2024-01-10T22:17:18Z" level=info msg="Deleted pod minio-6dd56bd767-k2dsr so that Kubernetes will handle remounting volume pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf" controller=longhorn-kubernetes-pod node=node1

And instance-manager-r-1e8d2b6ac4bdab53558aa36fa56425b5/replica-manager.log logs this

2024-01-10T22:12:56.604493717Z [longhorn-instance-manager] time="2024-01-10T22:12:56Z" level=info msg="Process Manager: prepare to delete process pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf-r-f8e1b060"
2024-01-10T22:12:56.604515107Z [longhorn-instance-manager] time="2024-01-10T22:12:56Z" level=info msg="Process Manager: deleted process pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf-r-f8e1b060"
2024-01-10T22:12:56.604666276Z [longhorn-instance-manager] time="2024-01-10T22:12:56Z" level=info msg="Process Manager: trying to stop process pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf-r-f8e1b060"
2024-01-10T22:12:56.604672758Z [longhorn-instance-manager] time="2024-01-10T22:12:56Z" level=info msg="Wait for process pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf-r-f8e1b060 to shutdown"

and then thousands of these between 22:12 and 22:17,

2024-01-10T22:16:52.728401277Z [longhorn-instance-manager] time="2024-01-10T22:16:52Z" level=debug msg="Process Manager: wait for process pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf-r-f8e1b060 to shutdown before unregistering process"
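
When a volume is stuck in attaching like this, a few read-only checks against the Longhorn CRDs usually show whether an instance manager or replica is the blocker. A minimal sketch using the names from this bundle; adjust names and labels for your install:

# Are the instance-manager pods Running on every node?
kubectl -n longhorn-system get pods -o wide | grep instance-manager

# State of the stuck volume, its engine, and its replica as Longhorn sees them
kubectl -n longhorn-system get volumes.longhorn.io pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf -o yaml
kubectl -n longhorn-system get engines.longhorn.io | grep pvc-e9b4c5e1
kubectl -n longhorn-system get replicas.longhorn.io | grep pvc-e9b4c5e1

# Recent longhorn-manager logs often say what the attach is waiting on
# (app=longhorn-manager is the upstream pod label; adjust if your install differs)
kubectl -n longhorn-system logs -l app=longhorn-manager --tail=100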

@aljabertareq
Author

aljabertareq commented Jan 12, 2024

@james-munson not just trouble. We had to try so many things to make it attach; it took us around 4 hours. These are production environments that are supposed to run 24/7, so "eventually" is a problem. Any help replicating the issue and suggesting a solution would be much appreciated.

@4mohamedalaa please add the detailed steps we tried to resolve this issue.

@aljabertareq
Author

@james-munson here are the commands that they used:

"
If this happens again please follow this for both nodes:
sudo -s
systemctl status iscsi
and if it’s not working:

systemctl restart iscsi

"

@aljabertareq
Author

@james-munson @derekbit any thoughts on this?

@james-munson james-munson moved this from New to Team review required in Community Review Sprint Jan 16, 2024
@james-munson
Contributor

If this happens again, it would be interesting to see the full output from systemctl status iscsi, and also from systemctl status iscsid.
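
A minimal sketch of what to capture on both nodes the next time this happens, so the iSCSI side can be ruled in or out (assumes systemd with the open-iscsi package; on Ubuntu the relevant units are iscsid.service and open-iscsi.service, so service names may differ from the snippet above):

# Run as root on each node and attach the full output to the issue
systemctl status iscsid --no-pager -l
systemctl status iscsi --no-pager -l
journalctl -u iscsid --since "1 hour ago" --no-pager

# Longhorn attaches volumes over iSCSI, so a healthy iscsid is a prerequisite;
# the workaround used above was simply: systemctl restart iscsi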

@james-munson james-munson moved this from Team review required to Pending user response in Community Review Sprint Mar 7, 2024
@longhorn-io-github-bot longhorn-io-github-bot moved this to New Issues in Longhorn Sprint Aug 2, 2024
github-actions bot commented

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the stale label Nov 25, 2024
@aljabertareq
Author

@james-munson any findings here?

@longhorn-io-github-bot longhorn-io-github-bot moved this from Waiting For Reporter to In Progress in Community Review Sprint Nov 29, 2024
@github-actions github-actions bot removed the stale label Dec 2, 2024

github-actions bot commented Jan 2, 2025

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the stale label Jan 2, 2025