[BUG] Rebuilding stuck for DR volume if the node was powered down while restoring #2747

Open
khushboo-rancher opened this issue Jun 30, 2021 · 7 comments
Assignees
Labels
area/resilience (System or volume resilience), area/v1-data-engine (v1 data engine (iSCSI tgt)), kind/bug, priority/0 (Must be implemented or fixed in this release, managed by PO), reproduce/rare (< 50% reproducible), require/qa-reproduce (Require QA to reproduce, especially for issues reported from community), severity/1 (Function broken: a critical incident with very high impact, e.g. data corruption, failed upgrade)
Milestone

Comments

@khushboo-rancher
Contributor

khushboo-rancher commented Jun 30, 2021

Describe the bug
Rebuilding gets stuck for a DR volume if the node was powered down while restoring.

To Reproduce
Steps to reproduce the behavior (a command sketch for the first two steps follows the screenshot below):

  1. Create a volume and attach it to a pod.
  2. Write data into it using the command dd if=/dev/urandom of=file1.txt count=100 bs=1M
  3. Take a backup.
  4. Create a DR volume from the backup.
  5. While the rebuilding is in progress, power down the node.
  6. The DR volume will get attached to another node, and the replica on the powered-down node becomes faulted.
  7. While the restore is in progress, power down the attached node once again.
  8. The rebuilding gets stuck.

(Screenshot: Screen Shot 2021-06-29 at 8 53 47 PM)
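
For convenience, here is a minimal sketch of the first two steps as commands. The names test-pvc/test-pod, the longhorn StorageClass, the 2Gi size, and the /data mount path are assumptions for illustration only; the backup (step 3) and the DR volume (step 4) can be created from the Longhorn UI.

```bash
# Step 1 (sketch): create a PVC on the Longhorn StorageClass and attach it to a pod.
# Names, size, and image below are illustrative assumptions, not from the report.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn
  resources:
    requests:
      storage: 2Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
  - name: app
    image: ubuntu:20.04
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: test-pvc
EOF

# Step 2: write data into the mounted volume (dd command from the report).
kubectl exec test-pod -- dd if=/dev/urandom of=/data/file1.txt count=100 bs=1M
```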

Expected behavior
The rebuilding should not get stuck.

Log
longhorn-support-bundle_6cc5d2f7-e73d-44aa-a5f0-8369d974103f_2021-06-30T03-55-01Z.zip
Timestamp: around 2021-06-30 03:30:00
DR volume name: dr-1
Backup name: pvc-6a3248c3-a23c-4c30-83b0-c25615aa2831
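
To narrow the support bundle down to this window, something like the following can be used (a sketch only; the directory layout inside the bundle varies between Longhorn versions):

```bash
# Extract the bundle and grep the collected logs for the DR volume around the reported time.
unzip longhorn-support-bundle_6cc5d2f7-e73d-44aa-a5f0-8369d974103f_2021-06-30T03-55-01Z.zip -d bundle
grep -rn "dr-1" bundle | grep "2021-06-30T03:3"
```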

Environment:

  • Longhorn version: v1.1.2-rc1
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE v1.20.7
    • Number of management nodes in the cluster: 1
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: Ubuntu 20.04
    • CPU per node: 4 vCPUs
    • Memory per node: 8 GiB
    • Disk type (e.g. SSD/NVMe): SSD
    • Network bandwidth between the nodes: up to 5 Gi
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): DO
  • Number of Longhorn volumes in the cluster: 2

Additional context
This is a setup upgraded from v1.1.1 to v1.1.2-rc1.

@khushboo-rancher khushboo-rancher added the kind/bug and severity/1 (Function broken: a critical incident with very high impact, e.g. data corruption, failed upgrade) labels Jun 30, 2021
@yasker yasker added this to the v1.1.2 milestone Jun 30, 2021
@khushboo-rancher khushboo-rancher added the kind/regression (Regression which has worked before) and reproduce/always (100% reproducible) labels Jun 30, 2021
@innobead innobead added the area/v1-data-engine (v1 data engine (iSCSI tgt)) label Jun 30, 2021
@khushboo-rancher
Contributor Author

khushboo-rancher commented Jun 30, 2021

On further analysis, I found that the problem occurs when there is only one healthy replica running.
I've updated the reproduce steps. This is the same as #2753.

Checked with v1.1.1 and v1.1.2-rc1; the rebuilding got stuck with both versions.
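
A quick way to confirm the single-healthy-replica situation while reproducing is to watch the Longhorn CRs. This is a sketch: it assumes the default longhorn-system namespace and relies on replica names embedding the volume name.

```bash
# Volume CR for the DR volume; status.state / status.robustness show its health.
kubectl -n longhorn-system get volumes.longhorn.io dr-1 -o yaml | grep -E "state:|robustness:"

# Replica CRs belonging to dr-1; names look like dr-1-r-<suffix>.
kubectl -n longhorn-system get replicas.longhorn.io | grep dr-1
```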

@khushboo-rancher khushboo-rancher removed the kind/regression (Regression which has worked before) label Jun 30, 2021
@yasker yasker modified the milestones: v1.1.2, v1.2.0 Jun 30, 2021
@yasker yasker added the priority/1 (Highly recommended to implement or fix in this release, managed by PO) label Jun 30, 2021
@shuo-wu
Contributor

shuo-wu commented Jul 1, 2021

@khushboo-rancher This issue is different from #2753. Here, there is still one RW replica running besides the rebuilding replica. The error message is different as well.
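
To tell the rebuilding (WO) replica apart from the remaining RW replica, the engine CR can be inspected. This is a sketch: it assumes the engine status exposes a replicaModeMap, and <engine-name> is a placeholder.

```bash
# Engine CRs for the volume; names look like dr-1-e-<suffix>.
kubectl -n longhorn-system get engines.longhorn.io | grep dr-1

# Replica modes as the engine sees them: RW = healthy, WO = rebuilding.
kubectl -n longhorn-system get engines.longhorn.io <engine-name> -o yaml | grep -A5 "replicaModeMap"
```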

@khushboo-rancher
Contributor Author

Also, I have seen this issue a few times; it is not consistent.

@khushboo-rancher khushboo-rancher added the reproduce/rare (< 50% reproducible) label and removed the reproduce/always (100% reproducible) label Jul 2, 2021
@innobead innobead modified the milestones: v1.2.0, v1.3.0 Aug 2, 2021
@innobead
Member

Hey team! Please add your planning poker estimate with ZenHub @jenting @joshimoo @PhanLe1010 @shuo-wu

@joshimoo
Contributor

@khushboo-rancher Have we seen this issue again? Do we know if it's still present in master after v1.2?

@shuo-wu
Contributor

shuo-wu commented Oct 26, 2021

IIRC, this one was not reproducible for me.

@khushboo-rancher
Contributor Author

I have not seen this after the v1.2.2 release, but since the behavior is inconsistent, I'm not sure.

@innobead innobead modified the milestones: v1.3.0, v1.4.0 Mar 31, 2022
@innobead innobead assigned c3y1huang and unassigned shuo-wu Jun 24, 2022
@innobead innobead modified the milestones: v1.4.0, v1.5.0 Nov 30, 2022
@innobead innobead added the area/resilience (System or volume resilience) and priority/0 (Must be implemented or fixed in this release, managed by PO) labels and removed the priority/1 (Highly recommended to implement or fix in this release, managed by PO) label Mar 29, 2023
@innobead innobead assigned ejweber and unassigned c3y1huang Apr 6, 2023
@innobead innobead modified the milestones: v1.5.0, v1.6.0 Apr 13, 2023
@innobead innobead modified the milestones: v1.6.0, v1.7.0 Nov 29, 2023
@innobead innobead added the require/qa-reproduce (Require QA to reproduce, especially for issues reported from community) label Nov 29, 2023
@derekbit derekbit modified the milestones: v1.7.0, v1.8.0 Jun 17, 2024
@derekbit derekbit moved this to Backlog in Longhorn Sprint Aug 3, 2024
@innobead innobead moved this to New Issues in Longhorn Sprint Sep 10, 2024
@derekbit derekbit modified the milestones: v1.8.0, v1.9.0 Sep 15, 2024
@derekbit derekbit assigned COLDTURNIP and unassigned ejweber Dec 20, 2024
Projects
Status: New Issues
Development

No branches or pull requests

9 participants