
[BUG] Volume rebuilding never succeeds after the first rebuilding fails #7723

Closed
shuo-wu opened this issue Jan 18, 2024 · 2 comments
Labels
area/v2-data-engine v2 data engine (SPDK) area/volume-replica-rebuild Volume replica rebuilding related kind/bug

@shuo-wu
Contributor

shuo-wu commented Jan 18, 2024

Describe the bug

If the first rebuilding fails, the previous rebuilding replica's info cached in the source replica is not cleaned up. This stale cache then causes the source replica to reject all subsequent rebuilds.

To Reproduce

  1. Create a 2-replica v2 volume
  2. Attach it. Write some data and create some snapshots.
  3. Detach the volume and reattach it with maintenance mode
  4. Delete one replica to trigger the rebuilding
  5. When the rebuilding starts, crash the rebuilding replica. (To make sure the data copy has started, it is better to wait a while after the rebuilding replica launches; you should see the shallow copy error message below in the log.)
[instance-manager-60b2e79f024f99de6761d34a5522cf74] [longhorn-instance-manager] time="2024-01-18T12:52:26Z" level=error msg="Failed to rebuild replica vol-r-9669bef4 with address 10.42.0.126:20001 from src replica vol-r-cd67ae89 with address 10.42.1.251:20001, will mark the rebuilding replica mode as ERR" func="spdk.(*Engine).ReplicaShallowCopy.func1" file="engine.go:944" engineName=vol-e-0 error="failed to shallow copy snapshot b6f6890f-71a1-4ff1-b5bd-f8c5052fbdae from src replica vol-r-cd67ae89: rpc error: code = Unknown desc = error sending message, id 34389, method bdev_lvol_shallow_copy, params {8a518b32-946b-4556-948e-119467cd3ec7 vol-r-9669bef4n1}: {\"code\": -32602,\"message\": \"Input/output error\"}" frontend= ip=10.42.0.126 replicaAddressMap="map[vol-r-9669bef4:10.42.0.126:20001 vol-r-cd67ae89:10.42.1.251:20001]" volumeName=vol
  6. The volume loops: new rebuilding replicas start and then fail.

Expected behavior

After step 6, all subsequent rebuilding replicas fail with this log:

[instance-manager-60b2e79f024f99de6761d34a5522cf74] [longhorn-instance-manager] time="2024-01-18T12:57:22Z" level=error msg="Failed to rebuild replica vol-r-8356c8a1 with address 10.42.0.126:20001 from src replica vol-r-cd67ae89 with address 10.42.1.251:20001, will mark the rebuilding replica mode as ERR" func="spdk.(*Engine).ReplicaShallowCopy.func1" file="engine.go:944" engineName=vol-e-0 error="failed to start replica rebuilding src vol-r-cd67ae89 for rebuilding replica vol-r-8356c8a1: rpc error: code = Unknown desc = found mismatching between the required dst bdev nvme controller name vol-r-8356c8a1 and the expected dst controller name vol-r-9669bef4 for replica vol-r-cd67ae89 rebuilding src unexpose" frontend= ip=10.42.0.126 replicaAddressMap="map[vol-r-8356c8a1:10.42.0.126:20001 vol-r-cd67ae89:10.42.1.251:20001]" volumeName=vol

Support bundle for troubleshooting

Environment

  • Longhorn version: v1.6.0-rc1
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
    • Number of control plane nodes in the cluster:
    • Number of worker nodes in the cluster:
  • Node config
    • OS type and version:
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context

@longhorn-io-github-bot

longhorn-io-github-bot commented Jan 18, 2024

Pre Ready-For-Testing Checklist

@chriscchien
Contributor

Verified pass on longhorn master (longhorn-instance-manager 79f2c8) and longhorn v1.6.x (longhorn-instance-manager fe2323) with the test steps:

Crash a rebuilding replica of a maintenance-mode v2 volume; the replica rebuild eventually succeeds and the data is intact.
