[BUG] Longhorn fails to recover from a node restart #8403
Description
Describe the bug
After #8091 was fixed, I updated to v1.6.1, but ran into another error.
To Reproduce
What I did:
- I uninstalled Longhorn along with ALL of its resources (making sure nothing was left behind)
- I installed Longhorn v1.6.1 (through Rancher's chart: 103.3.0+up1.6.1)
- I set up the storage nodes (like before)
- I reinstalled Nexus
- I checked that the replicas were all healthy (they were)
- I restarted one of my cluster nodes (making sure it was one hosting a replica)
Expected behavior
I saw the node become unhealthy (showing red or gray). This was expected; it makes sense that Longhorn would first wait to see whether the node comes back.
I then expected the node's replica to either rejoin the volume (after catching up to the latest state) or, if it was too far out of sync, for a new replica to be rebuilt from one of the 2 remaining healthy ones.
Actual behavior
After a while, once the node comes back, the replica on that node disappears and the other two replicas become faulted; the volume then keeps switching between attaching and detaching.
When the node comes back, I see these messages:
Start rebuilding replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-51bf25ab with Address 10.42.2.195:10000 for normal engine pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-e-0 and volume pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490
I then get this error (probably because the node wasn't fully ready yet):
Failed rebuilding replica with Address 10.42.2.195:10000: proxyServer=10.42.0.156:8501 destination=10.42.0.156:10000: failed to add replica tcp://10.42.2.195:10000 for volume: rpc error: code = Unavailable desc = failed to get replica 10.42.2.195:10000: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.42.2.195:10000: i/o timeout"
Followed by a couple of:
Node pirack2-node4 is ready
And then:
Engine image ei-ad735e4a (rancher/mirrored-longhornio-longhorn-engine:v1.6.1) is fully deployed on all ready nodes
Failed rebuilding replica with Address 10.42.2.213:10000: proxyServer=10.42.0.156:8501 destination=10.42.0.156:10000: failed to add replica tcp://10.42.2.213:10000 for volume: rpc error: code = Unknown desc = failed to sync files [{FromFileName:volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img ToFileName:volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img ActualSize:512} {FromFileName:volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img.meta ToFileName:volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img.meta ActualSize:0}] from tcp://10.42.7.226:10000: rpc error: code = Unknown desc = replica tcp://10.42.7.226:10000 failed to send file volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img to 10.42.2.213:10004: failed to send file volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img to 10.42.2.213:10004: rpc error: code = Unknown desc = open volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img: invalid argument
Start rebuilding replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-51bf25ab with Address 10.42.2.213:10000 for normal engine pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-e-0 and volume pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490
snapshot becomes ready to use
replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-31aac936 failed the snapshot purge: failed to open parent file volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img: open volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img: invalid argument
snapshot becomes not ready to use
snapshot is marked as removed
replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-b2ff7e06 failed the restore: Failed to get restoring status on tcp://10.42.7.226:10000: failed to get restore status: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.42.7.226:10002: connect: connection refused"
replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-b2ff7e06 failed the snapshot purge: failed to open parent file volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img: open volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img: invalid argument
Detected replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-31aac936 (10.42.1.194:10000) in error
Detected replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-b2ff7e06 (10.42.7.226:10000) in error
lost track of the corresponding snapshot info inside volume engine
Volume pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490 requested remount at 2024-04-21T01:01:15Z after automatically salvaging replicas
Replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-31aac936 of volume pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490 will be automatically salvaged
Replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-b2ff7e06 of volume pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490 will be automatically salvaged
snapshot is marked as removed
Stops pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-e-0
Start rebuilding replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-51bf25ab with Address 10.42.2.213:10000 for normal engine pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-e-0 and volume pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490
Detected rebuilding replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-51bf25ab (10.42.2.213:10000)
snapshot becomes ready to use
Failed rebuilding replica with Address 10.42.2.213:10000: proxyServer=10.42.3.188:8501 destination=10.42.3.188:10000: failed to add replica tcp://10.42.2.213:10000 for volume: rpc error: code = Unknown desc = failed to sync files [{FromFileName:volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img ToFileName:volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img ActualSize:2343993344} {FromFileName:volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img.meta ToFileName:volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img.meta ActualSize:0} {FromFileName:volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img ToFileName:volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img ActualSize:162422784} {FromFileName:volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img.meta ToFileName:volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img.meta ActualSize:0}] from tcp://10.42.1.194:10000: rpc error: code = Unknown desc = replica tcp://10.42.1.194:10000 failed to send file volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img to 10.42.2.213:10006: failed to send file volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img to 10.42.2.213:10006: rpc error: code = Unknown desc = open volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img: invalid argument
replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-31aac936 failed the snapshot purge: failed to open childFile: open volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img: invalid argument
replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-b2ff7e06 failed the snapshot purge: failed to open childFile: open volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img: invalid argument
snapshot becomes not ready to use
snapshot is marked as removed
volume pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490 became degraded
Detected replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-b2ff7e06 (10.42.7.226:10000) in error
Detected replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-31aac936 (10.42.1.194:10000) in error
...
Etc.
The error that stands out to me is the "invalid argument" one.
A day later, without me changing anything, the Longhorn UI alternates between showing only 1, 2, or all 3 (faulted) replicas, most of the time showing just 1.
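Since my replica disks live on a CIFS mount (see Environment below), I suspect the mount may not support the sparse-file behavior (or O_DIRECT opens) that Longhorn's snapshot .img files rely on, which could explain "invalid argument" on open. A minimal sketch of a check I can run, assuming DISK_PATH points at the actual replica disk mount (it is a placeholder here, defaulting to /tmp):

```shell
# Hypothetical sanity check: do sparse files work on the Longhorn disk path?
# DISK_PATH is a placeholder; point it at the real replica disk mount.
DISK_PATH="${DISK_PATH:-/tmp}"
f="$DISK_PATH/sparse-test.img"

truncate -s 1G "$f"            # create a 1 GiB sparse file (Longhorn snapshots are sparse)
apparent=$(stat -c %s "$f")    # apparent size in bytes
blocks=$(stat -c %b "$f")      # 512-byte blocks actually allocated
rm -f "$f"

echo "apparent=$apparent allocated_blocks=$blocks"
# On a filesystem with proper sparse-file support, allocated_blocks stays near 0;
# a filesystem that materializes the whole file (or errors out) is suspect.
```

If this errors or allocates the full 1 GiB on the CIFS mount, that would point at the backing filesystem rather than Longhorn itself.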
Support bundle for troubleshooting
https://we.tl/t-Y482MSSwQH (valid for 7 days)
Environment
- Longhorn version: v1.6.1
- Impacted volume (PV): pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Rancher v2.8.3 (using chart longhorn:103.3.0+up1.6.1, helm version: v2.16.8-rancher2)
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K3s v1.26.10+k3s1 (managed by Rancher v2.8.3)
- Number of control plane nodes in the cluster: 3
- Number of worker nodes in the cluster: 5
- Node config
- OS type and version: Ubuntu 22.04.3 LTS
- Kernel version: 5.15.0-1047-raspi aarch64
- CPU per node: 4 cores
- Memory per node: 8GB
- Disk type (e.g. SSD/NVMe/HDD): NVMe SSD in a NAS, connected with a CIFS mount.
- Network bandwidth between the nodes: 1 Gbps
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): 8x Raspberry PI 4B (for nodes) and QNAP TBS-464 (for NAS)
- Number of Longhorn volumes in the cluster: 2
Metadata
- Status: In Progress
- Project status: Implement