[BUG] Rebuilding stuck for DR volume if the node was powered down while restoring #2747

Open
khushboo-rancher opened this issue Jun 30, 2021 · 7 comments
Assignees
Labels
area/resilience (System or volume resilience), area/v1-data-engine (v1 data engine (iSCSI tgt)), kind/bug, priority/0 (Must be implemented or fixed in this release, managed by PO), reproduce/rare (< 50% reproducible), require/qa-reproduce (Require QA to reproduce, especially for issues reported from community), severity/1 (Function broken: a critical incident with very high impact, e.g. data corruption, failed upgrade)
Milestone

Comments

@khushboo-rancher
Contributor

khushboo-rancher commented Jun 30, 2021

Describe the bug
Rebuilding gets stuck for a DR volume if the node was powered down while restoring.

To Reproduce
Steps to reproduce the behavior (a command sketch for the first two steps follows the screenshot below):

  1. Create a volume and attach it to a pod.
  2. Write data into it using the command dd if=/dev/urandom of=file1.txt count=100 bs=1M
  3. Take a backup.
  4. Create a DR volume from the backup.
  5. While the rebuilding is in progress, power down the node.
  6. The DR volume will get attached to another node, and the replica on the powered-down node becomes faulted.
  7. While the restore is in progress, power down the attached node once again.
  8. The rebuilding gets stuck.

(Screenshot: Screen Shot 2021-06-29 at 8 53 47 PM)
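
For convenience, here is a minimal sketch of the first two steps as commands. The names test-pvc/test-pod, the longhorn StorageClass, the 2Gi size, and the /data mount path are assumptions for illustration only; the backup (step 3) and the DR volume (step 4) can be created from the Longhorn UI.

```bash
# Step 1 (sketch): create a PVC on the Longhorn StorageClass and attach it to a pod.
# Names, size, and image below are illustrative assumptions, not from the report.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn
  resources:
    requests:
      storage: 2Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
  - name: app
    image: ubuntu:20.04
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: test-pvc
EOF

# Step 2: write data into the mounted volume (dd command from the report).
kubectl exec test-pod -- dd if=/dev/urandom of=/data/file1.txt count=100 bs=1M
```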

Expected behavior
The rebuilding should not get stuck.

Log
longhorn-support-bundle_6cc5d2f7-e73d-44aa-a5f0-8369d974103f_2021-06-30T03-55-01Z.zip
Timestamp: around 2021-06-30 03:30:00
DR volume name: dr-1
Backup name: pvc-6a3248c3-a23c-4c30-83b0-c25615aa2831
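
To narrow the support bundle down to this window, something like the following can be used (a sketch only; the directory layout inside the bundle varies between Longhorn versions):

```bash
# Extract the bundle and grep the collected logs for the DR volume around the reported time.
unzip longhorn-support-bundle_6cc5d2f7-e73d-44aa-a5f0-8369d974103f_2021-06-30T03-55-01Z.zip -d bundle
grep -rn "dr-1" bundle | grep "2021-06-30T03:3"
```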

Environment:

  • Longhorn version: v1.1.2-rc1
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE v1.20.7
    • Number of management nodes in the cluster: 1
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: Ubuntu 20.04
    • CPU per node: 4 vCPUs
    • Memory per node: 8 GiB
    • Disk type (e.g. SSD/NVMe): SSD
    • Network bandwidth between the nodes: up to 5 Gi
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): DO
  • Number of Longhorn volumes in the cluster: 2

Additional context
This is a setup upgraded from v1.1.1 to v1.1.2-rc1.

@khushboo-rancher khushboo-rancher added the kind/bug and severity/1 (Function broken: a critical incident with very high impact, e.g. data corruption, failed upgrade) labels Jun 30, 2021
@yasker yasker added this to the v1.1.2 milestone Jun 30, 2021
@khushboo-rancher khushboo-rancher added the kind/regression (Regression which has worked before) and reproduce/always (100% reproducible) labels Jun 30, 2021
@innobead innobead added the area/v1-data-engine (v1 data engine (iSCSI tgt)) label Jun 30, 2021
@khushboo-rancher
Contributor Author

khushboo-rancher commented Jun 30, 2021

On further analysis, I found that the problem occurs when there is only one healthy replica running.
I've updated the reproduce steps. This is the same as #2753.

Checked with v1.1.1 and v1.1.2-rc1; the rebuilding got stuck with both versions.
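
A quick way to confirm the single-healthy-replica situation while reproducing is to watch the Longhorn CRs. This is a sketch: it assumes the default longhorn-system namespace and relies on replica names embedding the volume name.

```bash
# Volume CR for the DR volume; status.state / status.robustness show its health.
kubectl -n longhorn-system get volumes.longhorn.io dr-1 -o yaml | grep -E "state:|robustness:"

# Replica CRs belonging to dr-1; names look like dr-1-r-<suffix>.
kubectl -n longhorn-system get replicas.longhorn.io | grep dr-1
```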

@khushboo-rancher khushboo-rancher removed the kind/regression (Regression which has worked before) label Jun 30, 2021
@yasker yasker modified the milestones: v1.1.2, v1.2.0 Jun 30, 2021
@yasker yasker added the priority/1 (Highly recommended to implement or fix in this release, managed by PO) label Jun 30, 2021
@shuo-wu
Contributor

shuo-wu commented Jul 1, 2021

@khushboo-rancher This issue is different from #2753. Here, there is still one RW replica running besides the rebuilding replica. The error message is different as well.
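
To tell the rebuilding (WO) replica apart from the remaining RW replica, the engine CR can be inspected. This is a sketch: it assumes the engine status exposes a replicaModeMap, and <engine-name> is a placeholder.

```bash
# Engine CRs for the volume; names look like dr-1-e-<suffix>.
kubectl -n longhorn-system get engines.longhorn.io | grep dr-1

# Replica modes as the engine sees them: RW = healthy, WO = rebuilding.
kubectl -n longhorn-system get engines.longhorn.io <engine-name> -o yaml | grep -A5 "replicaModeMap"
```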

@khushboo-rancher
Contributor Author

Also, I have seen this issue a few times; it is not consistent.

@khushboo-rancher khushboo-rancher added the reproduce/rare (< 50% reproducible) label and removed the reproduce/always (100% reproducible) label Jul 2, 2021
@innobead innobead modified the milestones: v1.2.0, v1.3.0 Aug 2, 2021
@innobead
Member

Hey team! Please add your planning poker estimate with ZenHub @jenting @joshimoo @PhanLe1010 @shuo-wu

@joshimoo
Contributor

@khushboo-rancher Have we seen this issue again? Do we know if it's still present in master after v1.2?

@shuo-wu
Contributor

shuo-wu commented Oct 26, 2021

IIRC, this one was not reproducible for me.

@khushboo-rancher
Contributor Author

I have not seen this after the v1.2.2 release, but since the behavior is inconsistent, I'm not sure.

@innobead innobead modified the milestones: v1.3.0, v1.4.0 Mar 31, 2022
@innobead innobead assigned c3y1huang and unassigned shuo-wu Jun 24, 2022
@innobead innobead modified the milestones: v1.4.0, v1.5.0 Nov 30, 2022
@innobead innobead added the area/resilience (System or volume resilience) and priority/0 (Must be implemented or fixed in this release, managed by PO) labels and removed the priority/1 (Highly recommended to implement or fix in this release, managed by PO) label Mar 29, 2023
@innobead innobead assigned ejweber and unassigned c3y1huang Apr 6, 2023
@innobead innobead modified the milestones: v1.5.0, v1.6.0 Apr 13, 2023
@innobead innobead modified the milestones: v1.6.0, v1.7.0 Nov 29, 2023
@innobead innobead added the require/qa-reproduce (Require QA to reproduce, especially for issues reported from community) label Nov 29, 2023
@derekbit derekbit modified the milestones: v1.7.0, v1.8.0 Jun 17, 2024
@derekbit derekbit moved this to Backlog in Longhorn Sprint Aug 3, 2024
@innobead innobead moved this to New Issues in Longhorn Sprint Sep 10, 2024
@derekbit derekbit modified the milestones: v1.8.0, v1.9.0 Sep 15, 2024
@derekbit derekbit assigned COLDTURNIP and unassigned ejweber Dec 20, 2024
Projects
Status: New Issues
Development

No branches or pull requests

9 participants