[BUG] DR volume failed when synchronizing the incremental backup #6750

Closed
mantissahz opened this issue Sep 21, 2023 · 3 comments
Labels: area/stability, area/volume-backup-restore, area/volume-disaster-recovery, backport/1.4.4, backport/1.5.2, kind/bug, priority/0, require/backport, require/qa-review-coverage
Milestone: v1.6.0

Comments

@mantissahz
Contributor

Describe the bug (🐛 if you encounter this issue)

When a Longhorn recurring job on the source cluster completes an incremental backup and starts removing the oldest backup to honor the retain count, the DR volume on the target cluster begins synchronizing data from the newest backup, and the synchronization fails because the oldest backup is being removed on the source cluster at the same time.

To Reproduce

  1. Create source and target clusters and deploy Longhorn 1.5.1 on both.
  2. Set up the NFS backup server on both sides. I set the backup poll interval to 90 seconds on the target cluster.
  3. Create a 50G volume A on the source cluster and write an amount of data to it.
  4. Create a recurring backup job on volume A with a retain count of 3 and a cron schedule of every 10 minutes.
  5. After the recurring job has completed 3 backups, disable it, create a DR volume on the target cluster, and wait for the restoration to complete.
  6. Write 10G of new data into source volume A and re-enable the recurring job so it keeps taking backups.
  7. After a new incremental backup completes, the DR volume fails while synchronizing the new data from that backup.

Expected behavior

The DR volume on the target cluster should be able to synchronize the data from the newest backup while the source cluster successfully removes the unnecessary backup to honor the retain count.

Support bundle for troubleshooting

The longhorn-manager logs on the target cluster show the failure:

2023-09-07T15:01:34.096007083+02:00 time="2023-09-07T13:01:34Z" level=warning msg="pvc-9e0658ab-76e7-424d-857a-eef250ee2376-r-8497380d: time=\"2023-09-07T13:01:33Z\" level=info msg=\"backupstore volume pvc-9e0658ab-76e7-424d-857a-eef250ee2376 contains locks [{ volume: , name: lock-8f167adf3a5148b2, type: 1, acquired: false, serverTime: 2023-09-07 13:01:31.908941276 +0000 UTC } { volume: , name: lock-90bf6be9d983417b, type: 1, acquired: false, serverTime: 2023-09-07 13:01:31.934941661 +0000 UTC } { volume: , name: lock-9ff510b7080540f1, type: 1, acquired: false, serverTime: 2023-09-07 13:01:31.908941276 +0000 UTC } { volume: , name: lock-a7bee2e561a24f21, type: 2, acquired: true, serverTime: 2023-09-07 13:01:28.741894372 +0000 UTC }]\" pkg=backupstore"
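
The log lists several locks that are not acquired and one lock that is currently held. Purely for illustration, here is a minimal Go sketch of the kind of compatibility check that would make a restore attempt bail out while a conflicting lock is still held; the numeric lock-type values, names, and the rule itself are assumptions for this sketch, not the actual longhorn/backupstore implementation:

package main

import "fmt"

// Hypothetical lock model, loosely mirroring the fields visible in the log
// above (name, type, acquired). The type values below are assumptions.
type lockType int

const (
	lockTypeRestore  lockType = 1 // assumed: restore-style lock
	lockTypeDeletion lockType = 2 // assumed: deletion-style lock
)

type lock struct {
	name     string
	kind     lockType
	acquired bool
}

// canStartRestore returns false while any deletion-style lock is held,
// which matches the reported situation: the source cluster is still
// removing the oldest backup when the incremental restore starts.
func canStartRestore(locks []lock) bool {
	for _, l := range locks {
		if l.kind == lockTypeDeletion && l.acquired {
			return false
		}
	}
	return true
}

func main() {
	locks := []lock{
		{name: "lock-8f167adf3a5148b2", kind: lockTypeRestore, acquired: false},
		{name: "lock-a7bee2e561a24f21", kind: lockTypeDeletion, acquired: true},
	}
	fmt.Println("restore can start:", canStartRestore(locks)) // prints: restore can start: false
}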

Environment

  • Longhorn version: both 1.5.1
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K3s, v1.27.2+k3s1
    • Number of management node in the cluster: 1
    • Number of worker node in the cluster: 3
  • Node config
    • OS type and version: SLES 15.5
    • Kernel version:
    • CPU per node: 2
    • Memory per node: 8G
    • Disk type(e.g. SSD/NVMe/HDD): SSD
    • Network bandwidth between the nodes: 1G
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS
  • Number of Longhorn volumes in the cluster: 1
  • Impacted Longhorn resources:
    • Volume names:

Additional context

@mantissahz added the kind/bug, require/qa-review-coverage, and require/backport labels on Sep 21, 2023
@innobead added this to the v1.6.0 milestone on Sep 21, 2023
@innobead added the priority/0, area/stability, area/volume-disaster-recovery, and area/volume-backup-restore labels on Sep 21, 2023
@mantissahz
Contributor Author

Related: #3055
We only have a backoff mechanism for the full restoration; the incremental restoration needs to retry with the same backoff as the full restoration does.
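
As a rough illustration of the suggestion above, a minimal Go sketch of a retry-with-exponential-backoff loop around the incremental restore; restoreIncrementally, errBackupLocked, and the backoff parameters are placeholders for this sketch, not Longhorn's actual code:

package main

import (
	"errors"
	"fmt"
	"time"
)

// errBackupLocked stands in for the error returned when the backupstore
// still holds a conflicting (e.g. deletion) lock; the name is hypothetical.
var errBackupLocked = errors.New("backup volume is locked")

// restoreIncrementally is a placeholder for the incremental restore call
// that currently fails outright when the lock cannot be acquired.
func restoreIncrementally(backupURL string) error {
	return errBackupLocked
}

// restoreWithBackoff retries the incremental restore with exponential
// backoff, mirroring what the comment says is already done for the full
// restoration. The retry count and delays are illustrative.
func restoreWithBackoff(backupURL string) error {
	delay := 1 * time.Second
	const maxRetries = 5
	var err error
	for i := 0; i < maxRetries; i++ {
		if err = restoreIncrementally(backupURL); err == nil {
			return nil
		}
		if !errors.Is(err, errBackupLocked) {
			return err // only retry lock-related failures
		}
		fmt.Printf("restore attempt %d failed (%v), retrying in %s\n", i+1, err, delay)
		time.Sleep(delay)
		delay *= 2
	}
	return fmt.Errorf("incremental restore still failing after %d attempts: %w", maxRetries, err)
}

func main() {
	if err := restoreWithBackoff("s3://backupstore@us-east-1/?volume=vol-a"); err != nil {
		fmt.Println(err)
	}
}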

@longhorn-io-github-bot

longhorn-io-github-bot commented Sep 28, 2023

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:
    [BUG] DR volume failed when synchronizing the incremental backup #6750 (comment)

  • Has the backend code (Manager, Engine, Instance Manager, BackupStore, etc.) been merged (including backport-needed/*)?
    The PR is at
    fix(restore): dr volume failed by delete lock longhorn-manager#2183

  • Which areas/issues this PR might have potential impacts on?
    Area
    Issues

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    The automation skeleton PR is at
    The automation test case PR is at
    The issue of automation test case implementation is at (please create by the template)

@chriscchien
Contributor

Verified passed on Longhorn master (longhorn-manager 396605):

  1. On Longhorn master, the DR volume successfully synchronizes the incremental backup.
  2. After the DR volume is activated, it contains the latest data.
