[BUG] Longhorn fails to recover from a node restart

## Describe the bug

After https://github.com/longhorn/longhorn/issues/8091 was fixed, I updated to 1.6.1, but got the another error. 


## To Reproduce

What I did:

- I uninstalled longhorn, along with ALL resources (making sure to not leave anything behind)
- I installed longhorn v1.6.1 (through rancher's chart: 103.3.0+up1.6.1)
- I set up the storage nodes (like before)
- I reinstalled nexus
- I checked if the replicas were all healthy (they were)
- I restarted one of my cluster nodes (making sure that that one contained one of the replicas)

## Expected behavior
I noticed the node becoming unhealthy (showing red or gray). This was expected, since it doesn't surprise me that it would first try to wait to see if the node comes back.

I expected that the node's replica would either rejoin the volume (after it updated to the latest state) or, if the replica was too out of sync, that a new replica would be created from one of the 2 remaining ones

## Actual behavior

After a while, when the node comes back, the replica on that node disappears and the other two replicas become faulted, switching between attaching and detaching because of issues.

When the node comes back I see these messages:
```
Start rebuilding replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-51bf25ab with Address 10.42.2.195:10000 for normal engine pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-e-0 and volume pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490
```
I then get this error (probably because the node wasn't fully ready yet):
```
Failed rebuilding replica with Address 10.42.2.195:10000: proxyServer=10.42.0.156:8501 destination=10.42.0.156:10000: failed to add replica tcp://10.42.2.195:10000 for volume: rpc error: code = Unavailable desc = failed to get replica 10.42.2.195:10000: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.42.2.195:10000: i/o timeout"
```
Followed by a couple of:
```
Node pirack2-node4 is ready
```
And then
```
Engine image ei-ad735e4a (rancher/mirrored-longhornio-longhorn-engine:v1.6.1) is fully deployed on all ready nodes

Failed rebuilding replica with Address 10.42.2.213:10000: proxyServer=10.42.0.156:8501 destination=10.42.0.156:10000: failed to add replica tcp://10.42.2.213:10000 for volume: rpc error: code = Unknown desc = failed to sync files [{FromFileName:volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img ToFileName:volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img ActualSize:512} {FromFileName:volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img.meta ToFileName:volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img.meta ActualSize:0}] from tcp://10.42.7.226:10000: rpc error: code = Unknown desc = replica tcp://10.42.7.226:10000 failed to send file volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img to 10.42.2.213:10004: failed to send file volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img to 10.42.2.213:10004: rpc error: code = Unknown desc = open volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img: invalid argument

Start rebuilding replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-51bf25ab with Address 10.42.2.213:10000 for normal engine pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-e-0 and volume pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490

snapshot becomes ready to use

replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-31aac936 failed the snapshot purge: failed to open parent file volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img: open volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img: invalid argument

snapshot becomes not ready to use

snapshot is marked as removed

replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-b2ff7e06 failed the restore: Failed to get restoring status on tcp://10.42.7.226:10000: failed to get restore status: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.42.7.226:10002: connect: connection refused"

replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-b2ff7e06 failed the snapshot purge: failed to open parent file volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img: open volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img: invalid argument

Detected replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-31aac936 (10.42.1.194:10000) in error

Detected replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-b2ff7e06 (10.42.7.226:10000) in error

lost track of the corresponding snapshot info inside volume engine

Volume pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490 requested remount at 2024-04-21T01:01:15Z after automatically salvaging replicas

Replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-31aac936 of volume pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490 will be automatically salvaged

Replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-b2ff7e06 of volume pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490 will be automatically salvaged

snapshot is marked as removed

Stops pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-e-0

Start rebuilding replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-51bf25ab with Address 10.42.2.213:10000 for normal engine pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-e-0 and volume pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490

Detected rebuilding replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-51bf25ab (10.42.2.213:10000)

snapshot becomes ready to use

Failed rebuilding replica with Address 10.42.2.213:10000: proxyServer=10.42.3.188:8501 destination=10.42.3.188:10000: failed to add replica tcp://10.42.2.213:10000 for volume: rpc error: code = Unknown desc = failed to sync files [{FromFileName:volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img ToFileName:volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img ActualSize:2343993344} {FromFileName:volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img.meta ToFileName:volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img.meta ActualSize:0} {FromFileName:volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img ToFileName:volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img ActualSize:162422784} {FromFileName:volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img.meta ToFileName:volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img.meta ActualSize:0}] from tcp://10.42.1.194:10000: rpc error: code = Unknown desc = replica tcp://10.42.1.194:10000 failed to send file volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img to 10.42.2.213:10006: failed to send file volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img to 10.42.2.213:10006: rpc error: code = Unknown desc = open volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img: invalid argument

replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-31aac936 failed the snapshot purge: failed to open childFile: open volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img: invalid argument

replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-b2ff7e06 failed the snapshot purge: failed to open childFile: open volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img: invalid argument

snapshot becomes not ready to use

snapshot is marked as removed

volume pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490 became degraded

Detected replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-b2ff7e06 (10.42.7.226:10000) in error

Detected replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-31aac936 (10.42.1.194:10000) in error

...
```
Etc.

The one that stands out is the `invalid argument`

**A day later:** without me changing anything, it switches between showing only 1, 2 or all 3 (faulted) replicas in the Longhorn UI (showing only 1 most of the time).

## Support bundle for troubleshooting


https://we.tl/t-Y482MSSwQH (valid for 7 days)

## Environment



 - Longhorn version: v1.6.1
 - Impacted volume (PV): pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490
 - Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Rancher v2.8.3 (using chart longhorn:103.3.0+up1.6.1, helm version: v2.16.8-rancher2)
 - Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K3s v1.26.10+k3s1 (managed by Rancher v2.8.3)
   - Number of control plane nodes in the cluster: 3
   - Number of worker nodes in the cluster: 5
 - Node config
   - OS type and version: Ubuntu 22.04.3 LTS
   - Kernel version: 5.15.0-1047-raspi aarch64
   - CPU per node: 4 cores
   - Memory per node: 8GB
   - Disk type (e.g. SSD/NVMe/HDD): NVMe SSD in a NAS, connected with a CIFS mount.
   - Network bandwidth between the nodes (Gbps): 1Gb
 - Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): 8x Raspberry PI 4B (for nodes) and QNAP TBS-464 (for NAS)
 - Number of Longhorn volumes in the cluster: 2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Longhorn fails to recover from a node restart #8403

Describe the bug

To Reproduce

Expected behavior

Actual behavior

Support bundle for troubleshooting

Environment

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development