
[BUG] Volume rebuilding never succeeds after the first rebuilding fails #7723

Closed
shuo-wu opened this issue Jan 18, 2024 · 2 comments
Labels
area/v2-data-engine v2 data engine (SPDK) area/volume-replica-rebuild Volume replica rebuilding related kind/bug

@shuo-wu
Contributor

shuo-wu commented Jan 18, 2024

Describe the bug

If the first rebuilding fails, the previous rebuilding replica's info cached in the source replica is not cleaned up. This stale cache then causes the source replica to reject all subsequent rebuilds.

To Reproduce

  1. Create a 2-replica v2 volume
  2. Attach it. Write some data and create some snapshots.
  3. Detach the volume and reattach it with maintenance mode
  4. Delete one replica to trigger the rebuilding
  5. When the rebuilding starts, crash the rebuilding replica. (To make sure the data copy has started, it is better to wait a while after the rebuilding replica launches; you should see the shallow copy error message below in the log.)
[instance-manager-60b2e79f024f99de6761d34a5522cf74] [longhorn-instance-manager] time="2024-01-18T12:52:26Z" level=error msg="Failed to rebuild replica vol-r-9669bef4 with address 10.42.0.126:20001 from src replica vol-r-cd67ae89 with address 10.42.1.251:20001, will mark the rebuilding replica mode as ERR" func="spdk.(*Engine).ReplicaShallowCopy.func1" file="engine.go:944" engineName=vol-e-0 error="failed to shallow copy snapshot b6f6890f-71a1-4ff1-b5bd-f8c5052fbdae from src replica vol-r-cd67ae89: rpc error: code = Unknown desc = error sending message, id 34389, method bdev_lvol_shallow_copy, params {8a518b32-946b-4556-948e-119467cd3ec7 vol-r-9669bef4n1}: {\"code\": -32602,\"message\": \"Input/output error\"}" frontend= ip=10.42.0.126 replicaAddressMap="map[vol-r-9669bef4:10.42.0.126:20001 vol-r-cd67ae89:10.42.1.251:20001]" volumeName=vol
  6. The volume loops: new rebuilding replicas start and then fail.

Expected behavior

After step 6, all subsequent rebuilding replicas fail with this log:

[instance-manager-60b2e79f024f99de6761d34a5522cf74] [longhorn-instance-manager] time="2024-01-18T12:57:22Z" level=error msg="Failed to rebuild replica vol-r-8356c8a1 with address 10.42.0.126:20001 from src replica vol-r-cd67ae89 with address 10.42.1.251:20001, will mark the rebuilding replica mode as ERR" func="spdk.(*Engine).ReplicaShallowCopy.func1" file="engine.go:944" engineName=vol-e-0 error="failed to start replica rebuilding src vol-r-cd67ae89 for rebuilding replica vol-r-8356c8a1: rpc error: code = Unknown desc = found mismatching between the required dst bdev nvme controller name vol-r-8356c8a1 and the expected dst controller name vol-r-9669bef4 for replica vol-r-cd67ae89 rebuilding src unexpose" frontend= ip=10.42.0.126 replicaAddressMap="map[vol-r-8356c8a1:10.42.0.126:20001 vol-r-cd67ae89:10.42.1.251:20001]" volumeName=vol

Support bundle for troubleshooting

Environment

  • Longhorn version: v1.6.0-rc1
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
    • Number of control plane nodes in the cluster:
    • Number of worker nodes in the cluster:
  • Node config
    • OS type and version:
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context

@longhorn-io-github-bot

longhorn-io-github-bot commented Jan 18, 2024

Pre Ready-For-Testing Checklist

@chriscchien
Contributor

Verified pass on longhorn master (longhorn-instance-manager 79f2c8) and longhorn v1.6.x (longhorn-instance-manager fe2323) with the test steps:

Crash a rebuilding replica of a maintenance-mode v2 volume; the replica rebuild eventually succeeds and the data is intact.
