
[BUG] v2 volume gets stuck in degraded state and continuously rebuilds/deletes replicas after a kubelet restart #10107

Open
yangchiu opened this issue Dec 31, 2024 · 2 comments
Labels: area/resilience, area/v2-data-engine (SPDK), area/volume-replica-rebuild, kind/bug, priority/0 (must be implemented or fixed in this release), reproduce/often (80–50% reproducible), require/backport, severity/2 (major issue without workaround)

yangchiu commented Dec 31, 2024

Describe the bug

While running the negative test case Stop Volume Node Kubelet For More Than Pod Eviction Timeout While Workload Heavy Writing, stopping kubelet on the volume-attached node for longer than the pod eviction timeout (6 minutes) and then restarting it occasionally leaves the v2 volume stuck in a degraded state indefinitely. Replica rebuilding keeps being triggered, but each rebuilt replica fails and is deleted, so the rebuild process restarts repeatedly.
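For reference, the stuck state can also be confirmed outside the UI by listing the Longhorn Volume CRs and checking that robustness never returns to healthy. Below is a minimal Go sketch using client-go's dynamic client; the longhorn-system namespace is the default install namespace, and the status.state/status.robustness field paths are assumptions based on the v1beta2 Volume CRD rather than code taken from this issue.

// observe_robustness.go: list Longhorn v1beta2 Volume CRs and print their
// state/robustness, to confirm the volume stays "degraded" instead of healing.
package main

import (
	"context"
	"flag"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfig := flag.String("kubeconfig", "", "path to kubeconfig (empty falls back to in-cluster config)")
	flag.Parse()

	cfg, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Longhorn volumes are namespaced CRs in longhorn-system.
	gvr := schema.GroupVersionResource{Group: "longhorn.io", Version: "v1beta2", Resource: "volumes"}
	vols, err := dyn.Resource(gvr).Namespace("longhorn-system").List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, v := range vols.Items {
		// status.robustness is expected to be "healthy", "degraded", or "faulted".
		robustness, _, _ := unstructured.NestedString(v.Object, "status", "robustness")
		state, _, _ := unstructured.NestedString(v.Object, "status", "state")
		fmt.Printf("%s\tstate=%s\trobustness=%s\n", v.GetName(), state, robustness)
	}
}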

case 1:

http://54.162.165.3:30000/#/volume/pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2

https://ci.longhorn.io/job/public/job/v1.8.x/job/v1.8.x-longhorn-e2e-tests-sles-amd64/14/

(screenshot: volume stuck in degraded state)

With error messages in longhorn-manager:

2024-12-31T05:01:07.794166660Z time="2024-12-31T05:01:07Z" level=warning msg="Failed to sync Longhorn snapshot" func=controller.handleReconcileErrorLogging file="utils.go:77" Snapshot=longhorn-system/rebuild-f0ff584c controller=longhorn-snapshot error="longhorn-snapshot: failed to sync snapshot longhorn-system/rebuild-f0ff584c: handleAttachmentTicketDeletion: failed to clean up attachment: Operation cannot be fulfilled on volumeattachments.longhorn.io \"pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2\": the object has been modified; please apply your changes to the latest version and try again" node=ip-10-0-2-222
2024-12-31T05:01:07.875211985Z time="2024-12-31T05:01:07Z" level=info msg="Event(v1.ObjectReference{Kind:\"Snapshot\", Namespace:\"longhorn-system\", Name:\"rebuild-76dbfa84\", UID:\"c92db4be-30a6-496c-9284-1474b2bcc135\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"93976\", FieldPath:\"\"}): type: 'Normal' reason: 'Update' snapshot becomes ready to use" func="record.(*eventBroadcasterImpl).StartLogging.func1" file="event.go:377"
2024-12-31T05:01:08.383478356Z time="2024-12-31T05:01:08Z" level=error msg="Failed to rebuild replica 10.42.3.33:24487" func="controller.(*EngineController).startRebuilding.func2" file="engine_controller.go:1853" controller=longhorn-engine engine=pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-e-0 error="unknown replica pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-a1a22fb0 for engine" node=ip-10-0-2-222 volume=pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2
2024-12-31T05:01:08.383504487Z time="2024-12-31T05:01:08Z" level=info msg="Removing failed rebuilding replica 10.42.3.33:24487" func="controller.(*EngineController).startRebuilding.func2" file="engine_controller.go:1860" controller=longhorn-engine engine=pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-e-0 node=ip-10-0-2-222 volume=pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2
2024-12-31T05:01:08.383691366Z time="2024-12-31T05:01:08Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-e-0\", UID:\"faf9d778-d83b-4bb7-bb3c-4aef83732423\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"93933\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedRebuilding' Failed rebuilding replica with Address 10.42.3.33:24487: unknown replica pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-a1a22fb0 for engine" func="record.(*eventBroadcasterImpl).StartLogging.func1" file="event.go:377"
2024-12-31T05:01:08.385339799Z time="2024-12-31T05:01:08Z" level=warning msg="Failed to remove rebuilding replica 10.42.3.33:24487" func="controller.(*EngineController).startRebuilding.func2" file="engine_controller.go:1862" controller=longhorn-engine engine=pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-e-0 error="proxyServer=10.42.1.24:8501 destination=10.42.1.24:20320: failed to remove replica tcp://10.42.3.33:24487 for volume: rpc error: code = Unknown desc = failed to delete replica  with address tcp://10.42.3.33:24487 to engine pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-e-0: rpc error: code = Unknown desc = cannot find replica name with address tcp://10.42.3.33:24487 for engine pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-e-0 replica delete" node=ip-10-0-2-222 volume=pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2
2024-12-31T05:01:08.385709943Z time="2024-12-31T05:01:08Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-e-0\", UID:\"faf9d778-d83b-4bb7-bb3c-4aef83732423\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"93933\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedDeleting' Failed to remove rebuilding replica pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-a1a22fb0 with address 10.42.3.33:24487 for engine pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-e-0 and volume pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2 due to rebuilding failure: proxyServer=10.42.1.24:8501 destination=10.42.1.24:20320: failed to remove replica tcp://10.42.3.33:24487 for volume: rpc error: code = Unknown desc = failed to delete replica  with address tcp://10.42.3.33:24487 to engine pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-e-0: rpc error: code = Unknown desc = cannot find replica name with address tcp://10.42.3.33:24487 for engine pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-e-0 replica delete" func="record.(*eventBroadcasterImpl).StartLogging.func1" file="event.go:377"
2024-12-31T05:01:08.391048901Z time="2024-12-31T05:01:08Z" level=error msg="Failed to update rebuild status information on replica pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-a1a22fb0" func="controller.(*EngineController).startRebuilding.func2" file="engine_controller.go:1875" controller=longhorn-engine engine=pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-e-0 error="Operation cannot be fulfilled on replicas.longhorn.io \"pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-a1a22fb0\": the object has been modified; please apply your changes to the latest version and try again" node=ip-10-0-2-222 volume=pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2
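The "the object has been modified; please apply your changes to the latest version and try again" messages above are ordinary Kubernetes optimistic-concurrency conflicts rather than the root failure, which is the "unknown replica ... for engine" error. For reference, the usual way controllers absorb such conflicts is client-go's retry.RetryOnConflict; the sketch below uses the dynamic client and a made-up status field purely for illustration, not Longhorn's actual controller code.

// conflict_retry.go: a minimal sketch of retrying a status update on a
// resource-version conflict. The "rebuildFailedReason" field is hypothetical.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/retry"
)

func updateReplicaRebuildStatus(dyn dynamic.Interface, name, reason string) error {
	gvr := schema.GroupVersionResource{Group: "longhorn.io", Version: "v1beta2", Resource: "replicas"}
	// RetryOnConflict re-reads the latest object and reapplies the mutation
	// whenever the API server rejects the update with a conflict error.
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		r, err := dyn.Resource(gvr).Namespace("longhorn-system").Get(context.Background(), name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		// Illustrative field only; not part of the real Replica CRD.
		if err := unstructured.SetNestedField(r.Object, reason, "status", "rebuildFailedReason"); err != nil {
			return err
		}
		_, err = dyn.Resource(gvr).Namespace("longhorn-system").UpdateStatus(context.Background(), r, metav1.UpdateOptions{})
		return err
	})
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "")
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	if err := updateReplicaRebuildStatus(dyn, "pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-a1a22fb0", "unknown replica for engine"); err != nil {
		fmt.Println("update failed:", err)
	}
}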

And error messages in instance-manager:

2024-12-31T05:49:56.771187701Z [longhorn-instance-manager] time="2024-12-31T05:49:56Z" level=error msg="Replica pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-2f697cb1 failed to register checksum for snapshot c683d7e3-e7f9-4575-b818-cd9c96abae8a: error sending message, id 192964, method bdev_lvol_register_snapshot_checksum, params {block-disk/pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-2f697cb1-snap-rebuild-b72a5f12}: {\"code\": -32603,\"message\": \"Device or resource busy\"}" func="spdk.(*Replica).Sync.func1" file="replica.go:249"
2024-12-31T05:49:59.680923939Z [2024-12-31 05:49:59.680766] vbdev_lvol_rpc.c:2076:rpc_bdev_lvol_get_snapshot_checksum: *ERROR*: Error -2 getting snapshot checksum
2024-12-31T05:49:59.689173544Z [2024-12-31 05:49:59.688988] lvol.c:2749:lvol_snapshot_checksum_cb: *ERROR*: could not exec snapshot checksum of lvol pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-2f697cb1-snap-rebuild-b72a5f12, error -16
2024-12-31T05:49:59.701040103Z [longhorn-instance-manager] time="2024-12-31T05:49:59Z" level=error msg="Replica pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-2f697cb1 failed to register checksum for snapshot c683d7e3-e7f9-4575-b818-cd9c96abae8a: error sending message, id 192974, method bdev_lvol_register_snapshot_checksum, params {block-disk/pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-2f697cb1-snap-rebuild-b72a5f12}: {\"code\": -32603,\"message\": \"Device or resource busy\"}" func="spdk.(*Replica).Sync.func1" file="replica.go:249"
2024-12-31T05:50:02.678250757Z [2024-12-31 05:50:02.678046] vbdev_lvol_rpc.c:2076:rpc_bdev_lvol_get_snapshot_checksum: *ERROR*: Error -2 getting snapshot checksum
2024-12-31T05:50:02.686435050Z [2024-12-31 05:50:02.686271] lvol.c:2749:lvol_snapshot_checksum_cb: *ERROR*: could not exec snapshot checksum of lvol pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-2f697cb1-snap-rebuild-b72a5f12, error -16
2024-12-31T05:50:02.701535411Z [longhorn-instance-manager] time="2024-12-31T05:50:02Z" level=error msg="Replica pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-2f697cb1 failed to register checksum for snapshot c683d7e3-e7f9-4575-b818-cd9c96abae8a: error sending message, id 192991, method bdev_lvol_register_snapshot_checksum, params {block-disk/pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-2f697cb1-snap-rebuild-b72a5f12}: {\"code\": -32603,\"message\": \"Device or resource busy\"}" func="spdk.(*Replica).Sync.func1" file="replica.go:249"
2024-12-31T05:50:02.906406913Z [2024-12-31 05:50:02.906246] vbdev_lvol_rpc.c:2076:rpc_bdev_lvol_get_snapshot_checksum: *ERROR*: Error -2 getting snapshot checksum
2024-12-31T05:50:02.934596968Z [longhorn-instance-manager] time="2024-12-31T05:50:02Z" level=info msg="Replica created snapshot rebuild-14acc9a7(block-disk/pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-2f697cb1-snap-rebuild-14acc9a7)" func="spdk.(*Replica).SnapshotCreate" file="replica.go:1062" lvsName=block-disk lvsUUID=10d0079d-3a20-4d99-bcb8-51c83cf8d38d replicaName=pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-2f697cb1
2024-12-31T05:50:05.703577981Z [2024-12-31 05:50:05.703411] vbdev_lvol_rpc.c:2076:rpc_bdev_lvol_get_snapshot_checksum: *ERROR*: Error -2 getting snapshot checksum
2024-12-31T05:50:27.738778225Z [2024-12-31 05:50:27.738564] vbdev_lvol_rpc.c:2076:rpc_bdev_lvol_get_snapshot_checksum: *ERROR*: Error -2 getting snapshot checksum
2024-12-31T05:50:29.678909911Z [2024-12-31 05:50:29.678720] vbdev_lvol_rpc.c:2076:rpc_bdev_lvol_get_snapshot_checksum: *ERROR*: Error -2 getting snapshot checksum
2024-12-31T05:50:32.745105537Z [2024-12-31 05:50:32.744948] vbdev_lvol_rpc.c:2076:rpc_bdev_lvol_get_snapshot_checksum: *ERROR*: Error -2 getting snapshot checksum
2024-12-31T05:50:32.753275334Z [2024-12-31 05:50:32.753163] lvol.c:2749:lvol_snapshot_checksum_cb: *ERROR*: could not exec snapshot checksum of lvol pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-2f697cb1-snap-rebuild-14acc9a7, error -16
2024-12-31T05:50:32.770728223Z [longhorn-instance-manager] time="2024-12-31T05:50:32Z" level=error msg="Replica pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-2f697cb1 failed to register checksum for snapshot 0df77fde-910c-44bc-b14e-789fba63d022: error sending message, id 193205, method bdev_lvol_register_snapshot_checksum, params {block-disk/pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-2f697cb1-snap-rebuild-14acc9a7}: {\"code\": -32603,\"message\": \"Device or resource busy\"}" func="spdk.(*Replica).Sync.func1" file="replica.go:249"
2024-12-31T05:50:35.679272946Z [2024-12-31 05:50:35.679068] vbdev_lvol_rpc.c:2076:rpc_bdev_lvol_get_snapshot_checksum: *ERROR*: Error -2 getting snapshot checksum
2024-12-31T05:50:35.687451498Z [2024-12-31 05:50:35.687289] lvol.c:2749:lvol_snapshot_checksum_cb: *ERROR*: could not exec snapshot checksum of lvol pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-2f697cb1-snap-rebuild-14acc9a7, error -16
2024-12-31T05:50:35.701521640Z [longhorn-instance-manager] time="2024-12-31T05:50:35Z" level=error msg="Replica pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-2f697cb1 failed to register checksum for snapshot 0df77fde-910c-44bc-b14e-789fba63d022: error sending message, id 193215, method bdev_lvol_register_snapshot_checksum, params {block-disk/pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-2f697cb1-snap-rebuild-14acc9a7}: {\"code\": -32603,\"message\": \"Device or resource busy\"}" func="spdk.(*Replica).Sync.func1" file="replica.go:249"
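The instance-manager keeps retrying bdev_lvol_register_snapshot_checksum and getting -32603 "Device or resource busy", while bdev_lvol_get_snapshot_checksum fails with -2, so the checksum for the rebuild snapshot never gets registered. For manual debugging, the same RPC can be sent to the SPDK JSON-RPC socket inside the instance-manager pod; in the hedged Go sketch below the socket path and the "name" parameter key are assumptions, while the method and lvol names are taken verbatim from the log above.

// spdk_checksum_probe.go: issue a single SPDK JSON-RPC request against the
// instance-manager's RPC socket to check whether the busy lvol ever frees up.
package main

import (
	"encoding/json"
	"fmt"
	"net"
)

type rpcRequest struct {
	Version string                 `json:"jsonrpc"`
	ID      int                    `json:"id"`
	Method  string                 `json:"method"`
	Params  map[string]interface{} `json:"params"`
}

func main() {
	// Assumed socket path; adjust to wherever the instance-manager exposes the SPDK RPC socket.
	conn, err := net.Dial("unix", "/var/tmp/spdk.sock")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	req := rpcRequest{
		Version: "2.0",
		ID:      1,
		// Method name as it appears in the instance-manager log above.
		Method: "bdev_lvol_get_snapshot_checksum",
		Params: map[string]interface{}{
			// Lvol alias taken from the log; the "name" key is an assumption.
			"name": "block-disk/pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-2f697cb1-snap-rebuild-b72a5f12",
		},
	}
	if err := json.NewEncoder(conn).Encode(req); err != nil {
		panic(err)
	}

	// A {"code": -32603, "message": "Device or resource busy"} error here mirrors
	// what the Sync loop in replica.go keeps hitting.
	var resp map[string]interface{}
	if err := json.NewDecoder(conn).Decode(&resp); err != nil {
		panic(err)
	}
	out, _ := json.MarshalIndent(resp, "", "  ")
	fmt.Println(string(out))
}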

case 2:

https://ci.longhorn.io/job/public/job/v1.8.x/job/v1.8.x-longhorn-e2e-tests-sles-amd64/12/

http://52.20.55.8:30000/#/volume/pvc-8c6c802d-1959-4c05-b97e-397072b901c5

(screenshot: volume stuck in degraded state)

With error messages in longhorn-manager:

2024-12-30T17:10:31.763741785Z time="2024-12-30T17:10:31Z" level=error msg="Failed to rebuild replica 10.42.1.24:20006" func="controller.(*EngineController).startRebuilding.func2" file="engine_controller.go:1853" controller=longhorn-engine engine=pvc-959b8f6b-0925-42b9-afc4-1969f306147e-e-0 error="unknown replica pvc-959b8f6b-0925-42b9-afc4-1969f306147e-r-4932f092 for engine" node=ip-10-0-2-15 volume=pvc-959b8f6b-0925-42b9-afc4-1969f306147e
2024-12-30T17:10:31.763802659Z time="2024-12-30T17:10:31Z" level=info msg="Removing failed rebuilding replica 10.42.1.24:20006" func="controller.(*EngineController).startRebuilding.func2" file="engine_controller.go:1860" controller=longhorn-engine engine=pvc-959b8f6b-0925-42b9-afc4-1969f306147e-e-0 node=ip-10-0-2-15 volume=pvc-959b8f6b-0925-42b9-afc4-1969f306147e
2024-12-30T17:10:31.764365617Z time="2024-12-30T17:10:31Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-959b8f6b-0925-42b9-afc4-1969f306147e-e-0\", UID:\"be514cf8-0d28-44c4-8295-bc1e8a48bb8f\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"5858\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedRebuilding' Failed rebuilding replica with Address 10.42.1.24:20006: unknown replica pvc-959b8f6b-0925-42b9-afc4-1969f306147e-r-4932f092 for engine" func="record.(*eventBroadcasterImpl).StartLogging.func1" file="event.go:377"
2024-12-30T17:10:31.766468102Z time="2024-12-30T17:10:31Z" level=warning msg="Failed to remove rebuilding replica 10.42.1.24:20006" func="controller.(*EngineController).startRebuilding.func2" file="engine_controller.go:1862" controller=longhorn-engine engine=pvc-959b8f6b-0925-42b9-afc4-1969f306147e-e-0 error="proxyServer=10.42.4.18:8501 destination=10.42.4.18:20016: failed to remove replica tcp://10.42.1.24:20006 for volume: rpc error: code = Unknown desc = failed to delete replica  with address tcp://10.42.1.24:20006 to engine pvc-959b8f6b-0925-42b9-afc4-1969f306147e-e-0: rpc error: code = Unknown desc = cannot find replica name with address tcp://10.42.1.24:20006 for engine pvc-959b8f6b-0925-42b9-afc4-1969f306147e-e-0 replica delete" node=ip-10-0-2-15 volume=pvc-959b8f6b-0925-42b9-afc4-1969f306147e

And error messages in instance-manager:

2024-12-31T06:01:20.004865665Z [2024-12-31 06:01:20.004816] vbdev_lvol.c: 864:vbdev_lvol_dump_info_json: *ERROR*: Cannot obtain snapshots name
2024-12-31T06:01:20.004900432Z [2024-12-31 06:01:20.004840] vbdev_lvol.c: 864:vbdev_lvol_dump_info_json: *ERROR*: Cannot obtain snapshots name
2024-12-31T06:01:20.004907862Z [2024-12-31 06:01:20.004865] vbdev_lvol.c: 864:vbdev_lvol_dump_info_json: *ERROR*: Cannot obtain snapshots name
2024-12-31T06:01:20.004979702Z [2024-12-31 06:01:20.004893] vbdev_lvol.c: 864:vbdev_lvol_dump_info_json: *ERROR*: Cannot obtain snapshots name
2024-12-31T06:01:20.004989427Z [2024-12-31 06:01:20.004924] vbdev_lvol.c: 864:vbdev_lvol_dump_info_json: *ERROR*: Cannot obtain snapshots name
2024-12-31T06:01:20.004993020Z [2024-12-31 06:01:20.004950] vbdev_lvol.c: 864:vbdev_lvol_dump_info_json: *ERROR*: Cannot obtain snapshots name
2024-12-31T06:01:20.005029075Z [2024-12-31 06:01:20.004973] vbdev_lvol.c: 864:vbdev_lvol_dump_info_json: *ERROR*: Cannot obtain snapshots name
2024-12-31T06:01:20.005047867Z [2024-12-31 06:01:20.004998] vbdev_lvol.c: 864:vbdev_lvol_dump_info_json: *ERROR*: Cannot obtain snapshots name
2024-12-31T06:01:20.005109812Z [2024-12-31 06:01:20.005028] vbdev_lvol.c: 864:vbdev_lvol_dump_info_json: *ERROR*: Cannot obtain snapshots name
2024-12-31T06:01:20.005115545Z [2024-12-31 06:01:20.005055] vbdev_lvol.c: 864:vbdev_lvol_dump_info_json: *ERROR*: Cannot obtain snapshots name
2024-12-31T06:01:20.005118871Z [2024-12-31 06:01:20.005078] vbdev_lvol.c: 864:vbdev_lvol_dump_info_json: *ERROR*: Cannot obtain snapshots name

To Reproduce

Run the negative test case with options: -t "Stop Volume Node Kubelet For More Than Pod Eviction Timeout While Workload Heavy Writing" -v DATA_ENGINE:v2 -v LOOP_COUNT:10 -v RETRY_COUNT:259200
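For context, the fault the test injects is roughly the following kubelet bounce on the volume-attached node (a hedged sketch only, not the actual Robot test code; on the k3s nodes used here the agent process would have to be stopped instead of a standalone kubelet unit, and the test additionally keeps a workload writing heavily to the volume throughout).

// kubelet_bounce.go: stop the node's kubelet for longer than the pod eviction
// timeout (6 minutes), then start it again.
package main

import (
	"log"
	"os/exec"
	"time"
)

// run executes a command and aborts on failure, printing its combined output.
func run(name string, args ...string) {
	out, err := exec.Command(name, args...).CombinedOutput()
	if err != nil {
		log.Fatalf("%s %v failed: %v\n%s", name, args, err, out)
	}
}

func main() {
	// "kubelet" is a placeholder systemd unit name; on k3s the agent unit
	// (e.g. k3s-agent) would be stopped instead.
	run("systemctl", "stop", "kubelet")
	time.Sleep(7 * time.Minute)
	run("systemctl", "start", "kubelet")
}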

Expected behavior

After kubelet is restarted, replica rebuilding should complete and the volume should return to a healthy state instead of remaining degraded.
Support bundle for troubleshooting

case 1:
supportbundle_894e3cbd-148a-43ba-a85f-98d07b22edda_2024-12-31T05-50-18Z.zip

case 2:
supportbundle_ff850278-d8c7-4518-8a32-5bdb85e44ee7_2024-12-31T06-00-59Z.zip

Environment

  • Longhorn version: v1.8.0-rc2
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.31.2+k3s1
    • Number of control plane nodes in the cluster:
    • Number of worker nodes in the cluster:
  • Node config
    • OS type and version: sles 15-sp6
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context

v1.8.0-rc1 doesn't have this issue:
https://ci.longhorn.io/job/private/job/longhorn-e2e-test/2237/
https://ci.longhorn.io/job/private/job/longhorn-e2e-test/2238/

Workaround and Mitigation

derekbit commented:

Hello @shuo-wu
Could you help check the issue? Thank you.

derekbit commented Jan 6, 2025

Moved to v1.9.0. Best effort in v1.8.0.
