
Volume Attaching Timeout - Failed to attach #7629

Open
aljabertareq opened this issue Jan 11, 2024 · 11 comments
Labels
kind/bug, require/backport (Require backport. Only used when the specific versions to backport have not been defined.), require/qa-review-coverage (Require QA to review coverage), stale

Comments

@aljabertareq

aljabertareq commented Jan 11, 2024

Describe the bug

Volume attaching timed out. Kafka and MinIO instances failed to attach.

To Reproduce

Expected behavior

Support bundle for troubleshooting

https://drive.google.com/file/d/15Y3qSLjWtKADKL0NWE9HMkMCjaD8VO7u/view?usp=sharing

Environment

  • Longhorn version: 1.4.2
  • Impacted volume (PV): pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: k8s v1.23.5
    • Number of control plane nodes in the cluster: 2
    • Number of worker nodes in the cluster: 2
  • Node config
    • OS type and version: Ubuntu 20.04.6 LTS
    • Kernel version: 5.4.0-166-generic
    • CPU per node: 96 on one node, 128 on the other
    • Memory per node: 33 on one node, 40 on the other
    • Disk type (e.g. SSD/NVMe/HDD): One node is HDD and the other one is SSD
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): baremetal
  • Number of Longhorn volumes in the cluster: 17

Additional context


@aljabertareq aljabertareq added the kind/bug, require/backport, and require/qa-review-coverage labels Jan 11, 2024
@aljabertareq
Author

@derekbit Can you please help? It took us 3 hours to get this up and running.

@derekbit
Member

derekbit commented Jan 11, 2024

@aljabertareq
Would you be able to provide a support bundle for further investigation? Thank you.
Saw it. Thanks.

cc @james-munson for the community coordinator of this sprint

@derekbit derekbit moved this from New to Pending user response in Community Review Sprint Jan 11, 2024
@derekbit derekbit moved this from Pending user response to New in Community Review Sprint Jan 11, 2024
@james-munson
Contributor

Just to be clear, it did eventually attach, correct? The issue is about the time it took?
The volumes.yaml shows that volume as attached:

      currentNodeID: node1
      expansionRequired: false
      frontendDisabled: false
      isStandby: false
      kubernetesStatus:
        lastPVCRefAt: "null"
        lastPodRefAt: "null"
        namespace: avs
        pvName: pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf
        pvStatus: Bound
        pvcName: minio-pv-claim-cloned
        workloadsStatus:
          - podName: minio-6dd56bd767-qp67r
            podStatus: Running
            workloadName: minio-6dd56bd767
            workloadType: ReplicaSet
      lastBackup: "null"
      lastBackupAt: "null"
      lastDegradedAt: "null"
      ownerID: node1
      pendingNodeID: "null"
      remountRequestedAt: "2024-01-10T22:17:01Z"
      restoreInitiated: false
      restoreRequired: false
      robustness: healthy
      shareEndpoint: "null"
      shareState: "null"
      state: attached

And cluster/storage.k8s.io/v1/volumeattachments.yaml shows it attached as well:

  - apiVersion: storage.k8s.io/v1
    kind: VolumeAttachment
    metadata:
      annotations:
        csi.alpha.kubernetes.io/node-id: node1
      creationTimestamp: "2024-01-10T21:59:34Z"
      finalizers:
        - external-attacher/driver-longhorn-io
      managedFields:
...
      name: csi-baef1bd4fbd4f963e849e1345359e542eaf2bab7a7e33135f546f14a7d9a5949
      resourceVersion: "124700040"
      uid: 3a412db5-daaa-43bf-9f67-39302c839d46
    spec:
      attacher: driver.longhorn.io
      nodeName: node1
      source:
        persistentVolumeName: pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf
    status:
      attached: true

By the way, the volume also shows:

      numberOfReplicas: 1

which is a dangerous way to operate.
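
For reference, redundancy can be added to a single-replica volume by raising its replica count. A minimal sketch, assuming the Longhorn Volume CRD name from this issue and the two worker nodes in this cluster; verify the resource and field names against your Longhorn version:

# Raise the stuck volume from 1 replica to 2 (one per worker node in this cluster)
kubectl -n longhorn-system patch volumes.longhorn.io pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf \
  --type merge -p '{"spec":{"numberOfReplicas":2}}'

# Watch the new replica get scheduled and rebuilt
kubectl -n longhorn-system get replicas.longhorn.io | grep pvc-e9b4c5e1

New volumes can be given a higher default through the numberOfReplicas parameter of the Longhorn StorageClass.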

@james-munson
Contributor

It did have some trouble getting there though.

longhorn-manager-h82kr/longhorn-manager.log shows

2024-01-10T22:12:18.448219871Z time="2024-01-10T22:12:18Z" level=debug msg="Instance manager pod from instance-manager-r-1e8d2b6ac4bdab53558aa36fa56425b5 not exist in datastore"
...
2024-01-10T22:12:18.456416621Z time="2024-01-10T22:12:18Z" level=warning msg="Cannot find the instance manager for the running instance pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf-r-f8e1b060, will mark the instance as state ERROR"
...
2024-01-10T22:12:18.463570796Z time="2024-01-10T22:12:18Z" level=warning msg="Replica pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf-r-f8e1b060 that not in the engine mode map is marked as failed, current state error, engine name pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf-e-4fe0d576, active true" accessMode=rwo controller=longhorn-volume currentEngine=pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf-e-4fe0d576 frontend=blockdev migratable=false node=node1 owner=node1 state=attached volume=pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf
2024-01-10T22:12:18.463615530Z time="2024-01-10T22:12:18Z" level=info msg="All replicas are failed, set engine salvageRequested to true" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=node1 owner=node1 state=attached volume=pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf

and it seems to recover here:

2024-01-10T22:17:05.564872979Z time="2024-01-10T22:17:05Z" level=debug msg="Instance pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf-e-4fe0d576 starts running, Storage IP 10.233.90.134"
2024-01-10T22:17:05.564915308Z time="2024-01-10T22:17:05Z" level=debug msg="Instance pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf-e-4fe0d576 starts running, IP 10.233.90.134"
2024-01-10T22:17:05.564924465Z time="2024-01-10T22:17:05Z" level=debug msg="Instance pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf-e-4fe0d576 starts running, Port 10001"
2024-01-10T22:17:05.564931789Z time="2024-01-10T22:17:05Z" level=debug msg="Instance handler updated instance pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf-e-4fe0d576 state, old state starting, new state running"
2024-01-10T22:17:05.564992211Z time="2024-01-10T22:17:05Z" level=debug msg="Starting monitoring engine" controller=longhorn-engine engine=pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf-e-4fe0d576 node=node1
2024-01-10T22:17:05.572147609Z time="2024-01-10T22:17:05Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf\", UID:\"b0b1ea58-89b5-4e95-95f7-70bf420ea931\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"124707409\", FieldPath:\"\"}): type: 'Normal' reason: 'Attached' volume pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf has been attached to node1"
2024-01-10T22:17:18.325927988Z time="2024-01-10T22:17:18Z" level=info msg="Deleted pod minio-6dd56bd767-k2dsr so that Kubernetes will handle remounting volume pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf" controller=longhorn-kubernetes-pod node=node1

And instance-manager-r-1e8d2b6ac4bdab53558aa36fa56425b5/replica-manager.log logs this

2024-01-10T22:12:56.604493717Z [longhorn-instance-manager] time="2024-01-10T22:12:56Z" level=info msg="Process Manager: prepare to delete process pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf-r-f8e1b060"
2024-01-10T22:12:56.604515107Z [longhorn-instance-manager] time="2024-01-10T22:12:56Z" level=info msg="Process Manager: deleted process pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf-r-f8e1b060"
2024-01-10T22:12:56.604666276Z [longhorn-instance-manager] time="2024-01-10T22:12:56Z" level=info msg="Process Manager: trying to stop process pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf-r-f8e1b060"
2024-01-10T22:12:56.604672758Z [longhorn-instance-manager] time="2024-01-10T22:12:56Z" level=info msg="Wait for process pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf-r-f8e1b060 to shutdown"

and then thousands of these between 22:12 and 22:17,

2024-01-10T22:16:52.728401277Z [longhorn-instance-manager] time="2024-01-10T22:16:52Z" level=debug msg="Process Manager: wait for process pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf-r-f8e1b060 to shutdown before unregistering process"
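
When a volume is stuck in attaching like this, a few read-only checks against the Longhorn CRDs usually show whether an instance manager or replica is the blocker. A minimal sketch using the names from this bundle; adjust names and labels for your install:

# Are the instance-manager pods Running on every node?
kubectl -n longhorn-system get pods -o wide | grep instance-manager

# State of the stuck volume, its engine, and its replica as Longhorn sees them
kubectl -n longhorn-system get volumes.longhorn.io pvc-e9b4c5e1-1c5d-4428-bd0e-a02167136cbf -o yaml
kubectl -n longhorn-system get engines.longhorn.io | grep pvc-e9b4c5e1
kubectl -n longhorn-system get replicas.longhorn.io | grep pvc-e9b4c5e1

# Recent longhorn-manager logs often say what the attach is waiting on
# (app=longhorn-manager is the upstream pod label; adjust if your install differs)
kubectl -n longhorn-system logs -l app=longhorn-manager --tail=100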

@aljabertareq
Author

aljabertareq commented Jan 12, 2024

@james-munson not just trouble. We had to try so many things to make it attach; it took us around 4 hours. These are production environments that are supposed to run 24/7, so "eventually" is a problem. Any help replicating the issue and suggesting a solution would be much appreciated.

@4mohamedalaa please add the detailed steps we tried to resolve this issue.

@aljabertareq
Author

@james-munson here are the commands that they used:

"
If this happens again please follow this for both nodes:
sudo -s
systemctl status iscsi
and if it’s not working:

systemctl restart iscsi

"

@aljabertareq
Author

@james-munson @derekbit any thoughts on this?

@james-munson james-munson moved this from New to Team review required in Community Review Sprint Jan 16, 2024
@james-munson
Contributor

If this happens again, it would be interesting to see the full output from systemctl status iscsi, and also from systemctl status iscsid.
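
A minimal sketch of what to capture on both nodes the next time this happens, so the iSCSI side can be ruled in or out (assumes systemd with the open-iscsi package; on Ubuntu the relevant units are iscsid.service and open-iscsi.service, so service names may differ from the snippet above):

# Run as root on each node and attach the full output to the issue
systemctl status iscsid --no-pager -l
systemctl status iscsi --no-pager -l
journalctl -u iscsid --since "1 hour ago" --no-pager

# Longhorn attaches volumes over iSCSI, so a healthy iscsid is a prerequisite;
# the workaround used above was simply: systemctl restart iscsi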

@james-munson james-munson moved this from Team review required to Pending user response in Community Review Sprint Mar 7, 2024
@longhorn-io-github-bot longhorn-io-github-bot moved this to New Issues in Longhorn Sprint Aug 2, 2024
github-actions bot commented

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the stale label Nov 25, 2024
@aljabertareq
Author

@james-munson any findings here?

@longhorn-io-github-bot longhorn-io-github-bot moved this from Waiting For Reporter to In Progress in Community Review Sprint Nov 29, 2024
@github-actions github-actions bot removed the stale label Dec 2, 2024

github-actions bot commented Jan 2, 2025

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the stale label Jan 2, 2025