[BUG] DR volume failed when synchronizing the incremental backup #6750

Closed
mantissahz opened this issue Sep 21, 2023 · 3 comments
Labels: area/stability, area/volume-backup-restore, area/volume-disaster-recovery, backport/1.4.4, backport/1.5.2, kind/bug, priority/0, require/backport, require/qa-review-coverage
Milestone: v1.6.0

Comments

@mantissahz
Contributor

Describe the bug (🐛 if you encounter this issue)

When a Longhorn recurring job on the source cluster completes an incremental backup and starts removing the oldest backup to honor the retain count, the DR volume on the target cluster begins synchronizing data from the newest backup, and the synchronization fails because the oldest backup is being removed on the source cluster at the same time.

To Reproduce

  1. Create source and target clusters and deploy Longhorn 1.5.1 on both.
  2. Set up the NFS backup server on both sides. I set the backup poll interval to 90 seconds on the target cluster.
  3. Create a 50G volume A on the source cluster and write an amount of data to it.
  4. Create a recurring backup job on volume A with a retain count of 3 and a cron schedule of every 10 minutes.
  5. After the recurring job has completed 3 backups, disable it, create a DR volume on the target cluster, and wait for the restoration to complete.
  6. Write 10G of new data into source volume A and re-enable the recurring job so it keeps taking backups.
  7. After a new incremental backup completes, the DR volume fails while synchronizing the new data from that backup.

Expected behavior

The DR volume on the target cluster should be able to synchronize the data from the newest backup while the source cluster successfully removes the unnecessary backup to honor the retain count.

Support bundle for troubleshooting

The longhorn-manager logs on the target cluster show the failure:

2023-09-07T15:01:34.096007083+02:00 time="2023-09-07T13:01:34Z" level=warning msg="pvc-9e0658ab-76e7-424d-857a-eef250ee2376-r-8497380d: time=\"2023-09-07T13:01:33Z\" level=info msg=\"backupstore volume pvc-9e0658ab-76e7-424d-857a-eef250ee2376 contains locks [{ volume: , name: lock-8f167adf3a5148b2, type: 1, acquired: false, serverTime: 2023-09-07 13:01:31.908941276 +0000 UTC } { volume: , name: lock-90bf6be9d983417b, type: 1, acquired: false, serverTime: 2023-09-07 13:01:31.934941661 +0000 UTC } { volume: , name: lock-9ff510b7080540f1, type: 1, acquired: false, serverTime: 2023-09-07 13:01:31.908941276 +0000 UTC } { volume: , name: lock-a7bee2e561a24f21, type: 2, acquired: true, serverTime: 2023-09-07 13:01:28.741894372 +0000 UTC }]\" pkg=backupstore"
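
The log lists several locks that are not acquired and one lock that is currently held. Purely for illustration, here is a minimal Go sketch of the kind of compatibility check that would make a restore attempt bail out while a conflicting lock is still held; the numeric lock-type values, names, and the rule itself are assumptions for this sketch, not the actual longhorn/backupstore implementation:

package main

import "fmt"

// Hypothetical lock model, loosely mirroring the fields visible in the log
// above (name, type, acquired). The type values below are assumptions.
type lockType int

const (
	lockTypeRestore  lockType = 1 // assumed: restore-style lock
	lockTypeDeletion lockType = 2 // assumed: deletion-style lock
)

type lock struct {
	name     string
	kind     lockType
	acquired bool
}

// canStartRestore returns false while any deletion-style lock is held,
// which matches the reported situation: the source cluster is still
// removing the oldest backup when the incremental restore starts.
func canStartRestore(locks []lock) bool {
	for _, l := range locks {
		if l.kind == lockTypeDeletion && l.acquired {
			return false
		}
	}
	return true
}

func main() {
	locks := []lock{
		{name: "lock-8f167adf3a5148b2", kind: lockTypeRestore, acquired: false},
		{name: "lock-a7bee2e561a24f21", kind: lockTypeDeletion, acquired: true},
	}
	fmt.Println("restore can start:", canStartRestore(locks)) // prints: restore can start: false
}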

Environment

  • Longhorn version: both 1.5.1
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K3s, v1.27.2+k3s1
    • Number of management node in the cluster: 1
    • Number of worker node in the cluster: 3
  • Node config
    • OS type and version: SLES 15.5
    • Kernel version:
    • CPU per node: 2
    • Memory per node: 8G
    • Disk type(e.g. SSD/NVMe/HDD): SSD
    • Network bandwidth between the nodes: 1G
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS
  • Number of Longhorn volumes in the cluster: 1
  • Impacted Longhorn resources:
    • Volume names:

Additional context

@mantissahz added the kind/bug, require/qa-review-coverage, and require/backport labels on Sep 21, 2023
@innobead added this to the v1.6.0 milestone on Sep 21, 2023
@innobead added the priority/0, area/stability, area/volume-disaster-recovery, and area/volume-backup-restore labels on Sep 21, 2023
@mantissahz
Contributor Author

Related: #3055
We only have a backoff mechanism for the full restoration; the incremental restoration needs to retry with the same backoff as the full restoration does.
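
As a rough illustration of the suggestion above, a minimal Go sketch of a retry-with-exponential-backoff loop around the incremental restore; restoreIncrementally, errBackupLocked, and the backoff parameters are placeholders for this sketch, not Longhorn's actual code:

package main

import (
	"errors"
	"fmt"
	"time"
)

// errBackupLocked stands in for the error returned when the backupstore
// still holds a conflicting (e.g. deletion) lock; the name is hypothetical.
var errBackupLocked = errors.New("backup volume is locked")

// restoreIncrementally is a placeholder for the incremental restore call
// that currently fails outright when the lock cannot be acquired.
func restoreIncrementally(backupURL string) error {
	return errBackupLocked
}

// restoreWithBackoff retries the incremental restore with exponential
// backoff, mirroring what the comment says is already done for the full
// restoration. The retry count and delays are illustrative.
func restoreWithBackoff(backupURL string) error {
	delay := 1 * time.Second
	const maxRetries = 5
	var err error
	for i := 0; i < maxRetries; i++ {
		if err = restoreIncrementally(backupURL); err == nil {
			return nil
		}
		if !errors.Is(err, errBackupLocked) {
			return err // only retry lock-related failures
		}
		fmt.Printf("restore attempt %d failed (%v), retrying in %s\n", i+1, err, delay)
		time.Sleep(delay)
		delay *= 2
	}
	return fmt.Errorf("incremental restore still failing after %d attempts: %w", maxRetries, err)
}

func main() {
	if err := restoreWithBackoff("s3://backupstore@us-east-1/?volume=vol-a"); err != nil {
		fmt.Println(err)
	}
}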

@longhorn-io-github-bot

longhorn-io-github-bot commented Sep 28, 2023

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:
    [BUG] DR volume failed when synchronizing the incremental backup #6750 (comment)

  • Has the backend code (Manager, Engine, Instance Manager, BackupStore, etc.) been merged (including backport-needed/*)?
    The PR is at
    fix(restore): dr volume failed by delete lock longhorn-manager#2183

  • Which areas/issues this PR might have potential impacts on?
    Area
    Issues

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    The automation skeleton PR is at
    The automation test case PR is at
    The issue of automation test case implementation is at (please create by the template)

@chriscchien
Contributor

Verified passed on Longhorn master (longhorn-manager 396605):

  1. On Longhorn master, the DR volume successfully synchronizes the incremental backup.
  2. After the DR volume is activated, it contains the latest data.
