Skip to content

[BUG] Longhorn fails to recover from a node restart #8403

Open
@CC007

Description

Describe the bug

After #8091 was fixed, I updated to 1.6.1, but got the another error.

To Reproduce

What I did:

  • I uninstalled longhorn, along with ALL resources (making sure to not leave anything behind)
  • I installed longhorn v1.6.1 (through rancher's chart: 103.3.0+up1.6.1)
  • I set up the storage nodes (like before)
  • I reinstalled nexus
  • I checked if the replicas were all healthy (they were)
  • I restarted one of my cluster nodes (making sure that that one contained one of the replicas)

Expected behavior

I noticed the node becoming unhealthy (showing red or gray). This was expected, since it doesn't surprise me that it would first try to wait to see if the node comes back.

I expected that the node's replica would either rejoin the volume (after it updated to the latest state) or, if the replica was too out of sync, that a new replica would be created from one of the 2 remaining ones

Actual behavior

After a while, when the node comes back, the replica on that node disappears and the other two replicas become faulted, switching between attaching and detaching because of issues.

When the node comes back I see these messages:

Start rebuilding replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-51bf25ab with Address 10.42.2.195:10000 for normal engine pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-e-0 and volume pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490

I then get this error (probably because the node wasn't fully ready yet):

Failed rebuilding replica with Address 10.42.2.195:10000: proxyServer=10.42.0.156:8501 destination=10.42.0.156:10000: failed to add replica tcp://10.42.2.195:10000 for volume: rpc error: code = Unavailable desc = failed to get replica 10.42.2.195:10000: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.42.2.195:10000: i/o timeout"

Followed by a couple of:

Node pirack2-node4 is ready

And then

Engine image ei-ad735e4a (rancher/mirrored-longhornio-longhorn-engine:v1.6.1) is fully deployed on all ready nodes

Failed rebuilding replica with Address 10.42.2.213:10000: proxyServer=10.42.0.156:8501 destination=10.42.0.156:10000: failed to add replica tcp://10.42.2.213:10000 for volume: rpc error: code = Unknown desc = failed to sync files [{FromFileName:volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img ToFileName:volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img ActualSize:512} {FromFileName:volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img.meta ToFileName:volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img.meta ActualSize:0}] from tcp://10.42.7.226:10000: rpc error: code = Unknown desc = replica tcp://10.42.7.226:10000 failed to send file volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img to 10.42.2.213:10004: failed to send file volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img to 10.42.2.213:10004: rpc error: code = Unknown desc = open volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img: invalid argument

Start rebuilding replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-51bf25ab with Address 10.42.2.213:10000 for normal engine pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-e-0 and volume pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490

snapshot becomes ready to use

replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-31aac936 failed the snapshot purge: failed to open parent file volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img: open volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img: invalid argument

snapshot becomes not ready to use

snapshot is marked as removed

replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-b2ff7e06 failed the restore: Failed to get restoring status on tcp://10.42.7.226:10000: failed to get restore status: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.42.7.226:10002: connect: connection refused"

replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-b2ff7e06 failed the snapshot purge: failed to open parent file volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img: open volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img: invalid argument

Detected replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-31aac936 (10.42.1.194:10000) in error

Detected replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-b2ff7e06 (10.42.7.226:10000) in error

lost track of the corresponding snapshot info inside volume engine

Volume pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490 requested remount at 2024-04-21T01:01:15Z after automatically salvaging replicas

Replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-31aac936 of volume pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490 will be automatically salvaged

Replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-b2ff7e06 of volume pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490 will be automatically salvaged

snapshot is marked as removed

Stops pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-e-0

Start rebuilding replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-51bf25ab with Address 10.42.2.213:10000 for normal engine pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-e-0 and volume pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490

Detected rebuilding replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-51bf25ab (10.42.2.213:10000)

snapshot becomes ready to use

Failed rebuilding replica with Address 10.42.2.213:10000: proxyServer=10.42.3.188:8501 destination=10.42.3.188:10000: failed to add replica tcp://10.42.2.213:10000 for volume: rpc error: code = Unknown desc = failed to sync files [{FromFileName:volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img ToFileName:volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img ActualSize:2343993344} {FromFileName:volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img.meta ToFileName:volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img.meta ActualSize:0} {FromFileName:volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img ToFileName:volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img ActualSize:162422784} {FromFileName:volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img.meta ToFileName:volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img.meta ActualSize:0}] from tcp://10.42.1.194:10000: rpc error: code = Unknown desc = replica tcp://10.42.1.194:10000 failed to send file volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img to 10.42.2.213:10006: failed to send file volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img to 10.42.2.213:10006: rpc error: code = Unknown desc = open volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img: invalid argument

replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-31aac936 failed the snapshot purge: failed to open childFile: open volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img: invalid argument

replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-b2ff7e06 failed the snapshot purge: failed to open childFile: open volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img: invalid argument

snapshot becomes not ready to use

snapshot is marked as removed

volume pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490 became degraded

Detected replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-b2ff7e06 (10.42.7.226:10000) in error

Detected replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-31aac936 (10.42.1.194:10000) in error

...

Etc.

The one that stands out is the invalid argument

A day later: without me changing anything, it switches between showing only 1, 2 or all 3 (faulted) replicas in the Longhorn UI (showing only 1 most of the time).

Support bundle for troubleshooting

https://we.tl/t-Y482MSSwQH (valid for 7 days)

Environment

  • Longhorn version: v1.6.1
  • Impacted volume (PV): pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Rancher v2.8.3 (using chart longhorn:103.3.0+up1.6.1, helm version: v2.16.8-rancher2)
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K3s v1.26.10+k3s1 (managed by Rancher v2.8.3)
    • Number of control plane nodes in the cluster: 3
    • Number of worker nodes in the cluster: 5
  • Node config
    • OS type and version: Ubuntu 22.04.3 LTS
    • Kernel version: 5.15.0-1047-raspi aarch64
    • CPU per node: 4 cores
    • Memory per node: 8GB
    • Disk type (e.g. SSD/NVMe/HDD): NVMe SSD in a NAS, connected with a CIFS mount.
    • Network bandwidth between the nodes (Gbps): 1Gb
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): 8x Raspberry PI 4B (for nodes) and QNAP TBS-464 (for NAS)
  • Number of Longhorn volumes in the cluster: 2

Metadata

Assignees

No one assigned

    Labels

    investigation-neededIdentified the issue but require further investigation for resolution (won't be stale)kind/bugrequire/backportRequire backport. Only used when the specific versions to backport have not been definied.require/qa-review-coverageRequire QA to review coverage

    Type

    No type

    Projects

    • Status

      In Progress
    • Status

      Implement

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions