
[BUG] v2 volume gets stuck in degraded state and continuously rebuilds/deletes replicas after a kubelet restart #10107

Open
yangchiu opened this issue Dec 31, 2024 · 2 comments
Labels: area/resilience, area/v2-data-engine (SPDK), area/volume-replica-rebuild, kind/bug, priority/0 (must be implemented or fixed in this release), reproduce/often (80–50% reproducible), require/backport, severity/2 (major issue without workaround)

yangchiu commented Dec 31, 2024

Describe the bug

While running the negative test case Stop Volume Node Kubelet For More Than Pod Eviction Timeout While Workload Heavy Writing, stopping kubelet on the volume-attached node for longer than the pod eviction timeout (6 minutes) and then restarting it occasionally leaves the v2 volume stuck in a degraded state indefinitely. Replica rebuilding keeps being triggered, but each rebuilt replica fails and is deleted, so the rebuild process restarts repeatedly.
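For reference, the stuck state can also be confirmed outside the UI by listing the Longhorn Volume CRs and checking that robustness never returns to healthy. Below is a minimal Go sketch using client-go's dynamic client; the longhorn-system namespace is the default install namespace, and the status.state/status.robustness field paths are assumptions based on the v1beta2 Volume CRD rather than code taken from this issue.

// observe_robustness.go: list Longhorn v1beta2 Volume CRs and print their
// state/robustness, to confirm the volume stays "degraded" instead of healing.
package main

import (
	"context"
	"flag"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfig := flag.String("kubeconfig", "", "path to kubeconfig (empty falls back to in-cluster config)")
	flag.Parse()

	cfg, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Longhorn volumes are namespaced CRs in longhorn-system.
	gvr := schema.GroupVersionResource{Group: "longhorn.io", Version: "v1beta2", Resource: "volumes"}
	vols, err := dyn.Resource(gvr).Namespace("longhorn-system").List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, v := range vols.Items {
		// status.robustness is expected to be "healthy", "degraded", or "faulted".
		robustness, _, _ := unstructured.NestedString(v.Object, "status", "robustness")
		state, _, _ := unstructured.NestedString(v.Object, "status", "state")
		fmt.Printf("%s\tstate=%s\trobustness=%s\n", v.GetName(), state, robustness)
	}
}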

case 1:

http://54.162.165.3:30000/#/volume/pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2

https://ci.longhorn.io/job/public/job/v1.8.x/job/v1.8.x-longhorn-e2e-tests-sles-amd64/14/

(screenshot: volume stuck in degraded state)

With error messages in longhorn-manager:

2024-12-31T05:01:07.794166660Z time="2024-12-31T05:01:07Z" level=warning msg="Failed to sync Longhorn snapshot" func=controller.handleReconcileErrorLogging file="utils.go:77" Snapshot=longhorn-system/rebuild-f0ff584c controller=longhorn-snapshot error="longhorn-snapshot: failed to sync snapshot longhorn-system/rebuild-f0ff584c: handleAttachmentTicketDeletion: failed to clean up attachment: Operation cannot be fulfilled on volumeattachments.longhorn.io \"pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2\": the object has been modified; please apply your changes to the latest version and try again" node=ip-10-0-2-222
2024-12-31T05:01:07.875211985Z time="2024-12-31T05:01:07Z" level=info msg="Event(v1.ObjectReference{Kind:\"Snapshot\", Namespace:\"longhorn-system\", Name:\"rebuild-76dbfa84\", UID:\"c92db4be-30a6-496c-9284-1474b2bcc135\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"93976\", FieldPath:\"\"}): type: 'Normal' reason: 'Update' snapshot becomes ready to use" func="record.(*eventBroadcasterImpl).StartLogging.func1" file="event.go:377"
2024-12-31T05:01:08.383478356Z time="2024-12-31T05:01:08Z" level=error msg="Failed to rebuild replica 10.42.3.33:24487" func="controller.(*EngineController).startRebuilding.func2" file="engine_controller.go:1853" controller=longhorn-engine engine=pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-e-0 error="unknown replica pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-a1a22fb0 for engine" node=ip-10-0-2-222 volume=pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2
2024-12-31T05:01:08.383504487Z time="2024-12-31T05:01:08Z" level=info msg="Removing failed rebuilding replica 10.42.3.33:24487" func="controller.(*EngineController).startRebuilding.func2" file="engine_controller.go:1860" controller=longhorn-engine engine=pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-e-0 node=ip-10-0-2-222 volume=pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2
2024-12-31T05:01:08.383691366Z time="2024-12-31T05:01:08Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-e-0\", UID:\"faf9d778-d83b-4bb7-bb3c-4aef83732423\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"93933\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedRebuilding' Failed rebuilding replica with Address 10.42.3.33:24487: unknown replica pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-a1a22fb0 for engine" func="record.(*eventBroadcasterImpl).StartLogging.func1" file="event.go:377"
2024-12-31T05:01:08.385339799Z time="2024-12-31T05:01:08Z" level=warning msg="Failed to remove rebuilding replica 10.42.3.33:24487" func="controller.(*EngineController).startRebuilding.func2" file="engine_controller.go:1862" controller=longhorn-engine engine=pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-e-0 error="proxyServer=10.42.1.24:8501 destination=10.42.1.24:20320: failed to remove replica tcp://10.42.3.33:24487 for volume: rpc error: code = Unknown desc = failed to delete replica  with address tcp://10.42.3.33:24487 to engine pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-e-0: rpc error: code = Unknown desc = cannot find replica name with address tcp://10.42.3.33:24487 for engine pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-e-0 replica delete" node=ip-10-0-2-222 volume=pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2
2024-12-31T05:01:08.385709943Z time="2024-12-31T05:01:08Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-e-0\", UID:\"faf9d778-d83b-4bb7-bb3c-4aef83732423\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"93933\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedDeleting' Failed to remove rebuilding replica pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-a1a22fb0 with address 10.42.3.33:24487 for engine pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-e-0 and volume pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2 due to rebuilding failure: proxyServer=10.42.1.24:8501 destination=10.42.1.24:20320: failed to remove replica tcp://10.42.3.33:24487 for volume: rpc error: code = Unknown desc = failed to delete replica  with address tcp://10.42.3.33:24487 to engine pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-e-0: rpc error: code = Unknown desc = cannot find replica name with address tcp://10.42.3.33:24487 for engine pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-e-0 replica delete" func="record.(*eventBroadcasterImpl).StartLogging.func1" file="event.go:377"
2024-12-31T05:01:08.391048901Z time="2024-12-31T05:01:08Z" level=error msg="Failed to update rebuild status information on replica pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-a1a22fb0" func="controller.(*EngineController).startRebuilding.func2" file="engine_controller.go:1875" controller=longhorn-engine engine=pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-e-0 error="Operation cannot be fulfilled on replicas.longhorn.io \"pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-a1a22fb0\": the object has been modified; please apply your changes to the latest version and try again" node=ip-10-0-2-222 volume=pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2
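The "the object has been modified; please apply your changes to the latest version and try again" messages above are ordinary Kubernetes optimistic-concurrency conflicts rather than the root failure, which is the "unknown replica ... for engine" error. For reference, the usual way controllers absorb such conflicts is client-go's retry.RetryOnConflict; the sketch below uses the dynamic client and a made-up status field purely for illustration, not Longhorn's actual controller code.

// conflict_retry.go: a minimal sketch of retrying a status update on a
// resource-version conflict. The "rebuildFailedReason" field is hypothetical.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/retry"
)

func updateReplicaRebuildStatus(dyn dynamic.Interface, name, reason string) error {
	gvr := schema.GroupVersionResource{Group: "longhorn.io", Version: "v1beta2", Resource: "replicas"}
	// RetryOnConflict re-reads the latest object and reapplies the mutation
	// whenever the API server rejects the update with a conflict error.
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		r, err := dyn.Resource(gvr).Namespace("longhorn-system").Get(context.Background(), name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		// Illustrative field only; not part of the real Replica CRD.
		if err := unstructured.SetNestedField(r.Object, reason, "status", "rebuildFailedReason"); err != nil {
			return err
		}
		_, err = dyn.Resource(gvr).Namespace("longhorn-system").UpdateStatus(context.Background(), r, metav1.UpdateOptions{})
		return err
	})
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "")
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	if err := updateReplicaRebuildStatus(dyn, "pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-a1a22fb0", "unknown replica for engine"); err != nil {
		fmt.Println("update failed:", err)
	}
}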

And error messages in instance-manager:

2024-12-31T05:49:56.771187701Z [longhorn-instance-manager] time="2024-12-31T05:49:56Z" level=error msg="Replica pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-2f697cb1 failed to register checksum for snapshot c683d7e3-e7f9-4575-b818-cd9c96abae8a: error sending message, id 192964, method bdev_lvol_register_snapshot_checksum, params {block-disk/pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-2f697cb1-snap-rebuild-b72a5f12}: {\"code\": -32603,\"message\": \"Device or resource busy\"}" func="spdk.(*Replica).Sync.func1" file="replica.go:249"
2024-12-31T05:49:59.680923939Z [2024-12-31 05:49:59.680766] vbdev_lvol_rpc.c:2076:rpc_bdev_lvol_get_snapshot_checksum: *ERROR*: Error -2 getting snapshot checksum
2024-12-31T05:49:59.689173544Z [2024-12-31 05:49:59.688988] lvol.c:2749:lvol_snapshot_checksum_cb: *ERROR*: could not exec snapshot checksum of lvol pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-2f697cb1-snap-rebuild-b72a5f12, error -16
2024-12-31T05:49:59.701040103Z [longhorn-instance-manager] time="2024-12-31T05:49:59Z" level=error msg="Replica pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-2f697cb1 failed to register checksum for snapshot c683d7e3-e7f9-4575-b818-cd9c96abae8a: error sending message, id 192974, method bdev_lvol_register_snapshot_checksum, params {block-disk/pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-2f697cb1-snap-rebuild-b72a5f12}: {\"code\": -32603,\"message\": \"Device or resource busy\"}" func="spdk.(*Replica).Sync.func1" file="replica.go:249"
2024-12-31T05:50:02.678250757Z [2024-12-31 05:50:02.678046] vbdev_lvol_rpc.c:2076:rpc_bdev_lvol_get_snapshot_checksum: *ERROR*: Error -2 getting snapshot checksum
2024-12-31T05:50:02.686435050Z [2024-12-31 05:50:02.686271] lvol.c:2749:lvol_snapshot_checksum_cb: *ERROR*: could not exec snapshot checksum of lvol pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-2f697cb1-snap-rebuild-b72a5f12, error -16
2024-12-31T05:50:02.701535411Z [longhorn-instance-manager] time="2024-12-31T05:50:02Z" level=error msg="Replica pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-2f697cb1 failed to register checksum for snapshot c683d7e3-e7f9-4575-b818-cd9c96abae8a: error sending message, id 192991, method bdev_lvol_register_snapshot_checksum, params {block-disk/pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-2f697cb1-snap-rebuild-b72a5f12}: {\"code\": -32603,\"message\": \"Device or resource busy\"}" func="spdk.(*Replica).Sync.func1" file="replica.go:249"
2024-12-31T05:50:02.906406913Z [2024-12-31 05:50:02.906246] vbdev_lvol_rpc.c:2076:rpc_bdev_lvol_get_snapshot_checksum: *ERROR*: Error -2 getting snapshot checksum
2024-12-31T05:50:02.934596968Z [longhorn-instance-manager] time="2024-12-31T05:50:02Z" level=info msg="Replica created snapshot rebuild-14acc9a7(block-disk/pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-2f697cb1-snap-rebuild-14acc9a7)" func="spdk.(*Replica).SnapshotCreate" file="replica.go:1062" lvsName=block-disk lvsUUID=10d0079d-3a20-4d99-bcb8-51c83cf8d38d replicaName=pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-2f697cb1
2024-12-31T05:50:05.703577981Z [2024-12-31 05:50:05.703411] vbdev_lvol_rpc.c:2076:rpc_bdev_lvol_get_snapshot_checksum: *ERROR*: Error -2 getting snapshot checksum
2024-12-31T05:50:27.738778225Z [2024-12-31 05:50:27.738564] vbdev_lvol_rpc.c:2076:rpc_bdev_lvol_get_snapshot_checksum: *ERROR*: Error -2 getting snapshot checksum
2024-12-31T05:50:29.678909911Z [2024-12-31 05:50:29.678720] vbdev_lvol_rpc.c:2076:rpc_bdev_lvol_get_snapshot_checksum: *ERROR*: Error -2 getting snapshot checksum
2024-12-31T05:50:32.745105537Z [2024-12-31 05:50:32.744948] vbdev_lvol_rpc.c:2076:rpc_bdev_lvol_get_snapshot_checksum: *ERROR*: Error -2 getting snapshot checksum
2024-12-31T05:50:32.753275334Z [2024-12-31 05:50:32.753163] lvol.c:2749:lvol_snapshot_checksum_cb: *ERROR*: could not exec snapshot checksum of lvol pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-2f697cb1-snap-rebuild-14acc9a7, error -16
2024-12-31T05:50:32.770728223Z [longhorn-instance-manager] time="2024-12-31T05:50:32Z" level=error msg="Replica pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-2f697cb1 failed to register checksum for snapshot 0df77fde-910c-44bc-b14e-789fba63d022: error sending message, id 193205, method bdev_lvol_register_snapshot_checksum, params {block-disk/pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-2f697cb1-snap-rebuild-14acc9a7}: {\"code\": -32603,\"message\": \"Device or resource busy\"}" func="spdk.(*Replica).Sync.func1" file="replica.go:249"
2024-12-31T05:50:35.679272946Z [2024-12-31 05:50:35.679068] vbdev_lvol_rpc.c:2076:rpc_bdev_lvol_get_snapshot_checksum: *ERROR*: Error -2 getting snapshot checksum
2024-12-31T05:50:35.687451498Z [2024-12-31 05:50:35.687289] lvol.c:2749:lvol_snapshot_checksum_cb: *ERROR*: could not exec snapshot checksum of lvol pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-2f697cb1-snap-rebuild-14acc9a7, error -16
2024-12-31T05:50:35.701521640Z [longhorn-instance-manager] time="2024-12-31T05:50:35Z" level=error msg="Replica pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-2f697cb1 failed to register checksum for snapshot 0df77fde-910c-44bc-b14e-789fba63d022: error sending message, id 193215, method bdev_lvol_register_snapshot_checksum, params {block-disk/pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-2f697cb1-snap-rebuild-14acc9a7}: {\"code\": -32603,\"message\": \"Device or resource busy\"}" func="spdk.(*Replica).Sync.func1" file="replica.go:249"
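The instance-manager keeps retrying bdev_lvol_register_snapshot_checksum and getting -32603 "Device or resource busy", while bdev_lvol_get_snapshot_checksum fails with -2, so the checksum for the rebuild snapshot never gets registered. For manual debugging, the same RPC can be sent to the SPDK JSON-RPC socket inside the instance-manager pod; in the hedged Go sketch below the socket path and the "name" parameter key are assumptions, while the method and lvol names are taken verbatim from the log above.

// spdk_checksum_probe.go: issue a single SPDK JSON-RPC request against the
// instance-manager's RPC socket to check whether the busy lvol ever frees up.
package main

import (
	"encoding/json"
	"fmt"
	"net"
)

type rpcRequest struct {
	Version string                 `json:"jsonrpc"`
	ID      int                    `json:"id"`
	Method  string                 `json:"method"`
	Params  map[string]interface{} `json:"params"`
}

func main() {
	// Assumed socket path; adjust to wherever the instance-manager exposes the SPDK RPC socket.
	conn, err := net.Dial("unix", "/var/tmp/spdk.sock")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	req := rpcRequest{
		Version: "2.0",
		ID:      1,
		// Method name as it appears in the instance-manager log above.
		Method: "bdev_lvol_get_snapshot_checksum",
		Params: map[string]interface{}{
			// Lvol alias taken from the log; the "name" key is an assumption.
			"name": "block-disk/pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2-r-2f697cb1-snap-rebuild-b72a5f12",
		},
	}
	if err := json.NewEncoder(conn).Encode(req); err != nil {
		panic(err)
	}

	// A {"code": -32603, "message": "Device or resource busy"} error here mirrors
	// what the Sync loop in replica.go keeps hitting.
	var resp map[string]interface{}
	if err := json.NewDecoder(conn).Decode(&resp); err != nil {
		panic(err)
	}
	out, _ := json.MarshalIndent(resp, "", "  ")
	fmt.Println(string(out))
}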

case 2:

https://ci.longhorn.io/job/public/job/v1.8.x/job/v1.8.x-longhorn-e2e-tests-sles-amd64/12/

http://52.20.55.8:30000/#/volume/pvc-8c6c802d-1959-4c05-b97e-397072b901c5

(screenshot: volume stuck in degraded state)

With error messages in longhorn-manager:

2024-12-30T17:10:31.763741785Z time="2024-12-30T17:10:31Z" level=error msg="Failed to rebuild replica 10.42.1.24:20006" func="controller.(*EngineController).startRebuilding.func2" file="engine_controller.go:1853" controller=longhorn-engine engine=pvc-959b8f6b-0925-42b9-afc4-1969f306147e-e-0 error="unknown replica pvc-959b8f6b-0925-42b9-afc4-1969f306147e-r-4932f092 for engine" node=ip-10-0-2-15 volume=pvc-959b8f6b-0925-42b9-afc4-1969f306147e
2024-12-30T17:10:31.763802659Z time="2024-12-30T17:10:31Z" level=info msg="Removing failed rebuilding replica 10.42.1.24:20006" func="controller.(*EngineController).startRebuilding.func2" file="engine_controller.go:1860" controller=longhorn-engine engine=pvc-959b8f6b-0925-42b9-afc4-1969f306147e-e-0 node=ip-10-0-2-15 volume=pvc-959b8f6b-0925-42b9-afc4-1969f306147e
2024-12-30T17:10:31.764365617Z time="2024-12-30T17:10:31Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-959b8f6b-0925-42b9-afc4-1969f306147e-e-0\", UID:\"be514cf8-0d28-44c4-8295-bc1e8a48bb8f\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"5858\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedRebuilding' Failed rebuilding replica with Address 10.42.1.24:20006: unknown replica pvc-959b8f6b-0925-42b9-afc4-1969f306147e-r-4932f092 for engine" func="record.(*eventBroadcasterImpl).StartLogging.func1" file="event.go:377"
2024-12-30T17:10:31.766468102Z time="2024-12-30T17:10:31Z" level=warning msg="Failed to remove rebuilding replica 10.42.1.24:20006" func="controller.(*EngineController).startRebuilding.func2" file="engine_controller.go:1862" controller=longhorn-engine engine=pvc-959b8f6b-0925-42b9-afc4-1969f306147e-e-0 error="proxyServer=10.42.4.18:8501 destination=10.42.4.18:20016: failed to remove replica tcp://10.42.1.24:20006 for volume: rpc error: code = Unknown desc = failed to delete replica  with address tcp://10.42.1.24:20006 to engine pvc-959b8f6b-0925-42b9-afc4-1969f306147e-e-0: rpc error: code = Unknown desc = cannot find replica name with address tcp://10.42.1.24:20006 for engine pvc-959b8f6b-0925-42b9-afc4-1969f306147e-e-0 replica delete" node=ip-10-0-2-15 volume=pvc-959b8f6b-0925-42b9-afc4-1969f306147e

And error messages in instance-manager:

2024-12-31T06:01:20.004865665Z [2024-12-31 06:01:20.004816] vbdev_lvol.c: 864:vbdev_lvol_dump_info_json: *ERROR*: Cannot obtain snapshots name
2024-12-31T06:01:20.004900432Z [2024-12-31 06:01:20.004840] vbdev_lvol.c: 864:vbdev_lvol_dump_info_json: *ERROR*: Cannot obtain snapshots name
2024-12-31T06:01:20.004907862Z [2024-12-31 06:01:20.004865] vbdev_lvol.c: 864:vbdev_lvol_dump_info_json: *ERROR*: Cannot obtain snapshots name
2024-12-31T06:01:20.004979702Z [2024-12-31 06:01:20.004893] vbdev_lvol.c: 864:vbdev_lvol_dump_info_json: *ERROR*: Cannot obtain snapshots name
2024-12-31T06:01:20.004989427Z [2024-12-31 06:01:20.004924] vbdev_lvol.c: 864:vbdev_lvol_dump_info_json: *ERROR*: Cannot obtain snapshots name
2024-12-31T06:01:20.004993020Z [2024-12-31 06:01:20.004950] vbdev_lvol.c: 864:vbdev_lvol_dump_info_json: *ERROR*: Cannot obtain snapshots name
2024-12-31T06:01:20.005029075Z [2024-12-31 06:01:20.004973] vbdev_lvol.c: 864:vbdev_lvol_dump_info_json: *ERROR*: Cannot obtain snapshots name
2024-12-31T06:01:20.005047867Z [2024-12-31 06:01:20.004998] vbdev_lvol.c: 864:vbdev_lvol_dump_info_json: *ERROR*: Cannot obtain snapshots name
2024-12-31T06:01:20.005109812Z [2024-12-31 06:01:20.005028] vbdev_lvol.c: 864:vbdev_lvol_dump_info_json: *ERROR*: Cannot obtain snapshots name
2024-12-31T06:01:20.005115545Z [2024-12-31 06:01:20.005055] vbdev_lvol.c: 864:vbdev_lvol_dump_info_json: *ERROR*: Cannot obtain snapshots name
2024-12-31T06:01:20.005118871Z [2024-12-31 06:01:20.005078] vbdev_lvol.c: 864:vbdev_lvol_dump_info_json: *ERROR*: Cannot obtain snapshots name

To Reproduce

Run the negative test case with options: -t "Stop Volume Node Kubelet For More Than Pod Eviction Timeout While Workload Heavy Writing" -v DATA_ENGINE:v2 -v LOOP_COUNT:10 -v RETRY_COUNT:259200
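For context, the fault the test injects is roughly the following kubelet bounce on the volume-attached node (a hedged sketch only, not the actual Robot test code; on the k3s nodes used here the agent process would have to be stopped instead of a standalone kubelet unit, and the test additionally keeps a workload writing heavily to the volume throughout).

// kubelet_bounce.go: stop the node's kubelet for longer than the pod eviction
// timeout (6 minutes), then start it again.
package main

import (
	"log"
	"os/exec"
	"time"
)

// run executes a command and aborts on failure, printing its combined output.
func run(name string, args ...string) {
	out, err := exec.Command(name, args...).CombinedOutput()
	if err != nil {
		log.Fatalf("%s %v failed: %v\n%s", name, args, err, out)
	}
}

func main() {
	// "kubelet" is a placeholder systemd unit name; on k3s the agent unit
	// (e.g. k3s-agent) would be stopped instead.
	run("systemctl", "stop", "kubelet")
	time.Sleep(7 * time.Minute)
	run("systemctl", "start", "kubelet")
}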

Expected behavior

After kubelet is restarted, replica rebuilding should complete and the volume should return to a healthy state instead of remaining degraded.
Support bundle for troubleshooting

case 1:
supportbundle_894e3cbd-148a-43ba-a85f-98d07b22edda_2024-12-31T05-50-18Z.zip

case 2:
supportbundle_ff850278-d8c7-4518-8a32-5bdb85e44ee7_2024-12-31T06-00-59Z.zip

Environment

  • Longhorn version: v1.8.0-rc2
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.31.2+k3s1
    • Number of control plane nodes in the cluster:
    • Number of worker nodes in the cluster:
  • Node config
    • OS type and version: sles 15-sp6
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context

v1.8.0-rc1 doesn't have this issue:
https://ci.longhorn.io/job/private/job/longhorn-e2e-test/2237/
https://ci.longhorn.io/job/private/job/longhorn-e2e-test/2238/

Workaround and Mitigation

derekbit commented:

Hello @shuo-wu
Could you help check the issue? Thank you.

derekbit commented Jan 6, 2025

Moved to v1.9.0. Best effort in v1.8.0.
