[BUG] v2 volume gets stuck in degraded state and continuously rebuilds/deletes replicas after a kubelet restart #10107
Labels
area/resilience
System or volume resilience
area/v2-data-engine
v2 data engine (SPDK)
area/volume-replica-rebuild
Volume replica rebuilding related
kind/bug
priority/0
Must be implemented or fixed in this release (managed by PO)
reproduce/often
80 - 50% reproducible
require/backport
Require backport. Only used when the specific versions to backport have not been defined.
severity/2
Function working but has a major issue w/o workaround (a major incident with significant impact)
Milestone
Describe the bug
While running the negative test case "Stop Volume Node Kubelet For More Than Pod Eviction Timeout While Workload Heavy Writing", kubelet was stopped on the node the volume was attached to for longer than the Pod Eviction Timeout (6 minutes) and then restarted. After the restart, the v2 volume sometimes gets stuck in the degraded state indefinitely. Replica rebuilding keeps being triggered, but each rebuilt replica becomes failed and is deleted, so the rebuild process restarts repeatedly.
case 1:
http://54.162.165.3:30000/#/volume/pvc-6c0e9f4b-b031-4bc0-9545-5dc5c53231f2
https://ci.longhorn.io/job/public/job/v1.8.x/job/v1.8.x-longhorn-e2e-tests-sles-amd64/14/
With error messages in longhorn-manager:
And error messages in instance-manager:
case 2:
https://ci.longhorn.io/job/public/job/v1.8.x/job/v1.8.x-longhorn-e2e-tests-sles-amd64/12/
http://52.20.55.8:30000/#/volume/pvc-8c6c802d-1959-4c05-b97e-397072b901c5
With error messages in longhorn-manager:
And error message in instance-manager:
To Reproduce
Run the negative test case with options:
-t "Stop Volume Node Kubelet For More Than Pod Eviction Timeout While Workload Heavy Writing" -v DATA_ENGINE:v2 -v LOOP_COUNT:10 -v RETRY_COUNT:259200
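To tell the rebuild/delete loop apart from a single slow rebuild while the test runs, it may help to sample the volume's replica names periodically and count how many have appeared and then vanished. The following is a minimal sketch, not part of the test suite: the helper names are hypothetical, and it assumes that kubectl is on the PATH, that Longhorn runs in the longhorn-system namespace, and that each replicas.longhorn.io object carries the volume name in spec.volumeName.

```python
import json
import subprocess
from typing import Iterable, Set


def count_replica_churn(snapshots: Iterable[Set[str]]) -> int:
    """Count replicas that were observed in one sample and missing in a later one.

    Each snapshot is the set of replica names seen for the volume at one point
    in time. A count that keeps growing suggests replicas are repeatedly
    rebuilt and then deleted.
    """
    seen: Set[str] = set()
    removed: Set[str] = set()
    for current in snapshots:
        removed |= seen - current  # previously seen, now gone
        seen |= current
    return len(removed)


def list_replica_names(volume: str, namespace: str = "longhorn-system") -> Set[str]:
    """Return the replica names of one volume (hypothetical helper).

    Assumption: the replica custom resource exposes spec.volumeName.
    """
    out = subprocess.run(
        ["kubectl", "-n", namespace, "get", "replicas.longhorn.io", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    items = json.loads(out)["items"]
    return {
        item["metadata"]["name"]
        for item in items
        if item.get("spec", {}).get("volumeName") == volume
    }
```

Calling list_replica_names every few seconds with the affected volume name and passing the collected sets to count_replica_churn gives a rough churn count. The counting logic is kept separate from the kubectl call so it can be checked against recorded samples without a cluster.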
Expected behavior
Support bundle for troubleshooting
case 1:
supportbundle_894e3cbd-148a-43ba-a85f-98d07b22edda_2024-12-31T05-50-18Z.zip
case 2:
supportbundle_ff850278-d8c7-4518-8a32-5bdb85e44ee7_2024-12-31T06-00-59Z.zip
Environment
Additional context
v1.8.0-rc1 doesn't have this issue:
https://ci.longhorn.io/job/private/job/longhorn-e2e-test/2237/
https://ci.longhorn.io/job/private/job/longhorn-e2e-test/2238/
Workaround and Mitigation