
[FEATURE] v2 volumes supports DR volume #6613

Closed
derekbit opened this issue Aug 30, 2023 · 15 comments
Assignees
c3y1huang, roger-ryao
Labels
  area/backup-store: Remote backup store related
  area/v2-data-engine: v2 data engine (SPDK)
  area/volume-disaster-recovery: Volume DR
  highlight: Important feature/issue to highlight
  kind/feature: Feature request, new feature
  priority/0: Must be implemented or fixed in this release (managed by PO)
  require/auto-e2e-test: Require adding/updating auto e2e test cases if they can be automated
  require/doc: Require updating the longhorn.io documentation
Milestone
v1.8.0

Comments

@derekbit
Member

Is your feature request related to a problem? Please describe (👍 if you like this request)

v2 volumes should support DR volumes, implemented via incremental restoration.

Describe the solution you'd like

Describe alternatives you've considered

Additional context
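
For context, a DR (standby) volume is restored from an existing backup and then kept in sync by incrementally restoring each newer backup. A minimal sketch of what creating one could look like through the Volume CR once the v2 data engine supports it (the backup URL and names are placeholders, and the field usage is assumed to follow the existing v1 DR-volume workflow):

    # Hypothetical sketch: create a v2 DR (standby) volume from a backup.
    kubectl apply -f - <<'EOF'
    apiVersion: longhorn.io/v1beta2
    kind: Volume
    metadata:
      name: dr-volume-example
      namespace: longhorn-system
    spec:
      dataEngine: v2      # SPDK-based v2 data engine
      standby: true       # DR volume: keeps restoring newer backups incrementally
      frontend: ""        # DR volumes expose no frontend until activated
      fromBackup: "s3://backupbucket@us-east-1/?backup=backup-example&volume=src-volume"
      numberOfReplicas: 3
    EOF

Activating the DR volume later stops the incremental restoration and enables the frontend.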

@derekbit derekbit added kind/feature Feature request, new feature require/auto-e2e-test Require adding/updating auto e2e test cases if they can be automated require/doc Require updating the longhorn.io documentation require/lep Require adding/updating enhancement proposal area/v2-data-engine v2 data engine (SPDK) area/volume-disaster-recovery Volume DR area/backup-store Remote backup store related labels Aug 30, 2023
@github-actions github-actions bot added the stale label Jan 7, 2024
@innobead innobead added this to the v1.7.0 milestone Jan 7, 2024
@innobead innobead removed the stale label Jan 7, 2024
@derekbit derekbit modified the milestones: v1.7.0, v1.8.0 May 17, 2024
@derekbit derekbit added the highlight Important feature/issue to highlight label Jul 28, 2024
@derekbit derekbit assigned DamiaSan and c3y1huang and unassigned DamiaSan Jul 28, 2024
@derekbit derekbit moved this to Implement in Longhorn Sprint Aug 3, 2024
@innobead innobead added the priority/0 Must be implemented or fixed in this release (managed by PO) label Aug 5, 2024
@longhorn longhorn deleted a comment from github-actions bot Aug 5, 2024
@c3y1huang
Contributor

c3y1huang commented Aug 6, 2024

I am currently encountering some weird behavior and am still investigating.

  • When testing manually, the DR seems to work fine. However, when testing with the robot test case, the results are quite unstable. The checksum sometimes differs during volume 2 checks, other times during volume 3 checks, and occasionally, the test passes.
  • Additionally, when I use block-type disks on the nodes, the tests generally pass. However, when I use external block storage (the same setup as the pipeline), the test case fails most of the time.
  • When the test fails during volume 2 checks, I noticed that the md5sum keeps changing continuously AFTER activation. There seems to be no activity from Longhorn during these md5sum changes (nothing is logged during this time).
  • However, when I activate the DR volume 3, the md5sum remains the same as the source volume (volume 0).
  • Flow:
     volume 0 (src) --> write data --> backup ---> DR volume 2 restored
                                              ---> DR volume 3 restored
     ======================================================
     ---> activate DR volume 2 ---> md5sum changes continuously
     ---> activate DR volume 3 ---> md5sum remains same as volume 0
    
     # volume 0 (src)
     ip-10-0-2-232 > md5sum /dev/longhorn/e2e-test-volume-0
     163b6d5bed0a42d21d05be80bb9e7df2  /dev/longhorn/e2e-test-volume-0
     ip-10-0-2-232 > md5sum /dev/longhorn/e2e-test-volume-0
     163b6d5bed0a42d21d05be80bb9e7df2  /dev/longhorn/e2e-test-volume-0
    
     # volume 2 (activated DR volume)
     ip-10-0-2-232 > md5sum /dev/longhorn/e2e-test-volume-2
     eee1eaf44981e9bdaf5ff305d511b398  /dev/longhorn/e2e-test-volume-2
     ip-10-0-2-232 > md5sum /dev/longhorn/e2e-test-volume-2
     b5325256e38f89a7baec8caa667776e8  /dev/longhorn/e2e-test-volume-2
    
     # volume 3 (activated DR volume)
     ip-10-0-2-232 > md5sum /dev/longhorn/e2e-test-volume-3
     163b6d5bed0a42d21d05be80bb9e7df2  /dev/longhorn/e2e-test-volume-3
     ip-10-0-2-232 > md5sum /dev/longhorn/e2e-test-volume-3
     163b6d5bed0a42d21d05be80bb9e7df2  /dev/longhorn/e2e-test-volume-3
    
  • The checksum changes persist after reattachment via the UI, but seem to stop when the volume is attached through the workload.
  • Looking for some insight from @derekbit @shuo-wu @DamiaSan.
    • After discussing with @derekbit and @shuo-wu, the following investigation TODOs were completed:
      • If you write some data, for example, only one 100 MiB file, can the issue be reproduced as well? Yes:
        • Test 1 checksum mismatched and changing:
          • e2e-test-volume-2
        • Test 2 checksum mismatched and changing:
          • e2e-test-volume-2
          • e2e-test-volume-3
      • Does only one activated DR volume's data keep changing after each re-attachment? No, multiple volumes were observed changing:
        • Test 1 checksum mismatched and changing:
          • e2e-test-volume-2
        • Test 2 checksum mismatched and changing:
          • e2e-test-volume-2
          • e2e-test-volume-3
      • Can you verify the data in the filesystem rather than the block device? I'm curious about the files' data integrity inside the filesystem.
        • The data files' checksums are consistent, and the volume device checksum seems stable.
  • No IO activity was detected on the volume block device while the checksum was changing:
     > md5sum /dev/longhorn/e2e-test-volume-2 
     6601923db52973b0b0aea996958accfe  /dev/longhorn/e2e-test-volume-2
     > blktrace -d /dev/longhorn/e2e-test-volume-2 -w 30 -o trace.out
     === dm-2 ===
       CPU  0:                    0 events,        0 KiB data
       CPU  1:                    1 events,        1 KiB data
       CPU  2:                    0 events,        0 KiB data
       CPU  3:                    0 events,        0 KiB data
       Total:                     1 events (dropped 0),        1 KiB data
     > blkparse -i trace.out
     Input file trace.out.blktrace.1 added
    
     Throughput (R/W): 0KiB/s / 0KiB/s
     Events (trace.out): 0 entries
     Skips: 0 forward (0 -   0.0%)
     > md5sum /dev/longhorn/e2e-test-volume-2 
     82fa23558987970ae9b154bdd5d165fb  /dev/longhorn/e2e-test-volume-2
    
  • Test Incremental Restore When Writing Data To Filesystem:
  • Discuss findings in sprint meeting; Action TODOs (WIP):
    • Check the behavior when mounting the attached volume manually.
      • Able to mount manually, and the checksum stabilized after running mkfs.ext4 (the timestamped checksums below were sampled repeatedly; a sketch of such a watch loop appears at the end of this comment):
         _volume_name="e2e-test-volume-2"
         _source_device="/dev/longhorn/${_volume_name}"
         _target_path="/mnt/${_volume_name}"
        
         # Remove any leftover mount point from a previous run.
         if [ -d "${_target_path}" ]; then
           rm -rf "${_target_path}"
           echo -e "[INFO][$(date)]: Removed existing directory ${_target_path}\n"
         fi
        
         mkdir "${_target_path}" && \
           echo -e "[INFO][$(date)]: Created directory ${_target_path}\n"
        
         mkfs.ext4 "${_source_device}" && \
           echo -e "[INFO][$(date)]: Created filesystem: ${_source_device}\n"
        
         # mount "${_source_device}" "${_target_path}" && \
         #   echo -e "[INFO][$(date)]: Mounted ${_source_device} to ${_target_path}\n"
        
         # echo -e "[INFO][$(date)]: Print ${_target_path}"
         # ls "${_target_path}"
         Wed Aug 14 06:36:48 UTC 2024: 9d92da5bd9b3a37a71bfee05c44fa0fe  /dev/longhorn/e2e-test-volume-2
         Wed Aug 14 06:37:04 UTC 2024: 6ecec3fe33d6bd7dac0b8386108240c3  /dev/longhorn/e2e-test-volume-2
         Wed Aug 14 06:37:37 UTC 2024: 0bfdf475eb2726074465ec39a563c3d6  /dev/longhorn/e2e-test-volume-2
         Wed Aug 14 06:38:11 UTC 2024: 16392090c78dfe9c223dfc48746d6c5e  /dev/longhorn/e2e-test-volume-2
         Wed Aug 14 06:38:45 UTC 2024: 3b6f2d758984aa37227bec65247977cf  /dev/longhorn/e2e-test-volume-2
         Wed Aug 14 06:39:18 UTC 2024: 055fb131a564476dc3365a41e99ee39a  /dev/longhorn/e2e-test-volume-2
         Wed Aug 14 06:39:52 UTC 2024: fe9abfa766d056b1903adeaf30685ec7  /dev/longhorn/e2e-test-volume-2
         Wed Aug 14 06:40:14 UTC 2024: b5c6194164d787d47048f85b6c580336  /dev/longhorn/e2e-test-volume-2
         Wed Aug 14 06:40:25 UTC 2024: c8ae12d49ed04bdc99a439bde311041f  /dev/longhorn/e2e-test-volume-2
         Wed Aug 14 06:40:35 UTC 2024: 62cb93c60ff5a766cada40b77baa6032  /dev/longhorn/e2e-test-volume-2
         Wed Aug 14 06:40:49 UTC 2024: f4fbc23000c6cbad6fd3585257011fd7  /dev/longhorn/e2e-test-volume-2
         Wed Aug 14 06:41:00 UTC 2024: f4fbc23000c6cbad6fd3585257011fd7  /dev/longhorn/e2e-test-volume-2
         Wed Aug 14 06:41:10 UTC 2024: f4fbc23000c6cbad6fd3585257011fd7  /dev/longhorn/e2e-test-volume-2
         Wed Aug 14 06:41:21 UTC 2024: f4fbc23000c6cbad6fd3585257011fd7  /dev/longhorn/e2e-test-volume-2
         Wed Aug 14 06:41:33 UTC 2024: f4fbc23000c6cbad6fd3585257011fd7  /dev/longhorn/e2e-test-volume-2
         Wed Aug 14 06:42:07 UTC 2024: f4fbc23000c6cbad6fd3585257011fd7  /dev/longhorn/e2e-test-volume-2
        
         [INFO][Wed Aug 14 06:40:43 UTC 2024]: Created directory /mnt/e2e-test-volume-2
        
         mke2fs 1.46.4 (18-Aug-2021)
         Discarding device blocks: done                            
         Creating filesystem with 524288 4k blocks and 131072 inodes
         Filesystem UUID: ae5ec712-e883-452f-aa39-b2d03a907c94
         Superblock backups stored on blocks: 
         	32768, 98304, 163840, 229376, 294912
        
         Allocating group tables: done                            
         Writing inode tables: done                            
         Creating journal (16384 blocks): done
         Writing superblocks and filesystem accounting information: done 
        
         [INFO][Wed Aug 14 06:40:46 UTC 2024]: Created filesystem: /dev/longhorn/e2e-test-volume-2
        
    • Check the fragmap behavior using go-spdk-helper lvol get-fragmap.
      • Input from Derek: the holes remain unchanged, which means the changes were made where the data resides (a sketch for decoding the fragmap appears at the end of this comment):
         Wed Aug 14 04:34:49 UTC 2024: e66b1e598a6d57085b2c0ec0f3fa55d9  /dev/longhorn/e2e-test-volume-2
         Wed Aug 14 04:35:00 UTC 2024: d557904471550e4f8412c47a0cd72ffd  /dev/longhorn/e2e-test-volume-2
        
         instance-manager-7076fff9d8118ad15a7ccf11d3c70cf0:/ # go-spdk-helper lvol get
         [
         	{
         		"name": "2f2e0499-5bff-43b7-ab36-1ad3ec3a6062",
         		"aliases": [
         			"block-disk/e2e-test-volume-2-r-f89dfe8d"
         		],
         		"product_name": "Logical Volume",
         		"block_size": 4096,
         		"num_blocks": 524288,
         		"uuid": "2f2e0499-5bff-43b7-ab36-1ad3ec3a6062",
         		"creation_time": "2024-08-14T04:02:05Z",
         		"assigned_rate_limits": {
         			"rw_ios_per_sec": 0,
         			"rw_mbytes_per_sec": 0,
         			"r_mbytes_per_sec": 0,
         			"w_mbytes_per_sec": 0
         		},
         		"claimed": true,
         		"claim_type": "exclusive_write",
         		"zoned": false,
         		"supported_io_types": {
         			"read": true,
         			"write": true,
         			"unmap": true,
         			"write_zeroes": true,
         			"flush": false,
         			"reset": true,
         			"compare": false,
         			"compare_and_write": false,
         			"abort": false,
         			"nvme_admin": false,
         			"nvme_io": false
         		},
         		"driver_specific": {
         			"lvol": {
         				"lvol_store_uuid": "8d9dc6e2-8b91-4a51-ba40-e5cf1c055dfc",
         				"base_bdev": "block-disk",
         				"base_snapshot": "e2e-test-volume-2-r-f89dfe8d-snap-8446f4ac-a28f-498f-b4e6-7d53853adf73",
         				"thin_provision": true,
         				"num_allocated_clusters": 1790,
         				"snapshot": false,
         				"clone": true,
         				"xattrs": {
         					"user_created": "true"
         				}
         			}
         		}
         	},
         	{
         		"name": "9798fc1f-8ba7-4389-b89d-bba3f2382dd3",
         		"aliases": [
         			"block-disk/e2e-test-volume-2-r-f89dfe8d-snap-8446f4ac-a28f-498f-b4e6-7d53853adf73"
         		],
         		"product_name": "Logical Volume",
         		"block_size": 4096,
         		"num_blocks": 524288,
         		"uuid": "9798fc1f-8ba7-4389-b89d-bba3f2382dd3",
         		"creation_time": "2024-08-14T04:02:47Z",
         		"assigned_rate_limits": {
         			"rw_ios_per_sec": 0,
         			"rw_mbytes_per_sec": 0,
         			"r_mbytes_per_sec": 0,
         			"w_mbytes_per_sec": 0
         		},
         		"claimed": false,
         		"zoned": false,
         		"supported_io_types": {
         			"read": true,
         			"write": false,
         			"unmap": false,
         			"write_zeroes": false,
         			"flush": false,
         			"reset": true,
         			"compare": false,
         			"compare_and_write": false,
         			"abort": false,
         			"nvme_admin": false,
         			"nvme_io": false
         		},
         		"driver_specific": {
         			"lvol": {
         				"lvol_store_uuid": "8d9dc6e2-8b91-4a51-ba40-e5cf1c055dfc",
         				"base_bdev": "block-disk",
         				"thin_provision": true,
         				"num_allocated_clusters": 2048,
         				"snapshot": true,
         				"clone": false,
         				"clones": [
         					"e2e-test-volume-2-r-f89dfe8d"
         				],
         				"xattrs": {
         					"snapshot_timestamp": "2024-08-14T04:02:47Z",
         					"user_created": "false"
         				}
         			}
         		}
         	}
         ]
        
         instance-manager-7076fff9d8118ad15a7ccf11d3c70cf0:/ # go-spdk-helper lvol get-fragmap --uuid 2f2e0499-5bff-43b7-ab36-1ad3ec3a6062
         {
         	"cluster_size": 1048576,
         	"num_clusters": 2048,
         	"num_allocated_clusters": 1790,
         	"fragmap": "/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////x8AAAAAAAAAAAAAAAAAAAAgAAAAAAAAAAAAAAAAAAAAAA=="
         }
         instance-manager-7076fff9d8118ad15a7ccf11d3c70cf0:/ # go-spdk-helper lvol get-fragmap --uuid 2f2e0499-5bff-43b7-ab36-1ad3ec3a6062
         {
         	"cluster_size": 1048576,
         	"num_clusters": 2048,
         	"num_allocated_clusters": 1790,
         	"fragmap": "/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////x8AAAAAAAAAAAAAAAAAAAAgAAAAAAAAAAAAAAAAAAAAAA=="
         }
         instance-manager-7076fff9d8118ad15a7ccf11d3c70cf0:/ # go-spdk-helper lvol get-fragmap --uuid 9798fc1f-8ba7-4389-b89d-bba3f2382dd3
         {
         	"cluster_size": 1048576,
         	"num_clusters": 2048,
         	"num_allocated_clusters": 2048,
         	"fragmap": "/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////w=="
         }
         instance-manager-7076fff9d8118ad15a7ccf11d3c70cf0:/ # go-spdk-helper lvol get-fragmap --uuid 9798fc1f-8ba7-4389-b89d-bba3f2382dd3
         {
         	"cluster_size": 1048576,
         	"num_clusters": 2048,
         	"num_allocated_clusters": 2048,
         	"fragmap": "/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////w=="
         }
        
  • Pausing for a few seconds before activating the DR volume seems to increase the test's pass rate:
    diff --git a/e2e/tests/regression/test_backup.robot b/e2e/tests/regression/test_backup.robot
    index b329305c6..2eaa737c1 100644
    --- a/e2e/tests/regression/test_backup.robot
    +++ b/e2e/tests/regression/test_backup.robot
    @@ -85,12 +85,17 @@ Test Incremental Restore
         And Wait for volume 1 healthy
         Then Check volume 1 data is backup 0
     
    -
         When Write data 1 to volume 0
         And Create backup 1 for volume 0
         # Wait for DR volume 2 incremental restoration completed
         Then Wait for volume 2 restoration from backup 1 completed
    +
    +    Log    "Pausing the test for 10 seconds"    console=True
    +    Sleep  10  # Pauses the test for 10 seconds
    +    Log    "Resume the test"    console=True
    +
         And Activate DR volume 2
    +
         And Attach volume 2
         And Wait for volume 2 healthy
         And Check volume 2 data is backup 1
    @@ -99,6 +104,11 @@ Test Incremental Restore
         And Create backup 2 for volume 0
         # Wait for DR volume 3 incremental restoration completed
         Then Wait for volume 3 restoration from backup 2 completed
    +
    +    Log    "Pausing the test for 10 seconds"    console=True
    +    Sleep  10  # Pauses the test for 10 seconds
    +    Log    "Resume the test"    console=True
    +
         And Activate DR volume 3
         And Attach volume 3
         And Wait for volume 3 healthy
    
     • Pausing for 10 seconds: 2/5 runs passed
     • Pausing for 20 seconds: 4/5 runs passed
     • Pausing for 30 seconds: 4/5 runs passed
  • Is this reproducible if the filesystem is flushed before updating the restore progress to 100?
    • Not able to reproduce.
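  • For reference, a minimal sketch of the kind of watch loop used to sample the device checksum over time (an assumption of how the timestamped checksums above were collected; the 10-second interval is arbitrary):
     # Print a timestamped md5sum of the raw device every 10 seconds,
     # to see whether the content keeps mutating with no attached workload.
     _dev="/dev/longhorn/e2e-test-volume-2"
     while true; do
       echo "$(date): $(md5sum "${_dev}")"
       sleep 10
     done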
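  • A companion sketch for decoding the base64 fragmap above into its raw allocation bitmap, one bit per 1 MiB cluster (jq and xxd are assumed to be available in the instance-manager image):
     # A zero bit is a hole (unallocated cluster). Comparing two decoded dumps
     # confirms Derek's observation that the holes stay unchanged.
     go-spdk-helper lvol get-fragmap --uuid 2f2e0499-5bff-43b7-ab36-1ad3ec3a6062 \
       | jq -r '.fragmap' \
       | base64 -d \
       | xxd -b -c 8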

@DamiaSan
Contributor

DamiaSan commented Aug 19, 2024

Could you try using hexdump on /dev/longhorn/e2e-test-volume-2 to understand what is happening? Maybe write a fixed stream of bytes to a single part of the volume first ...
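
One illustrative way to do that (a sketch, not from the thread): capture the raw device twice and diff, then inspect a differing region with hexdump:

    # List the byte offsets that changed between two reads taken 10 s apart.
    dd if=/dev/longhorn/e2e-test-volume-2 of=/tmp/snap-a.img bs=1M status=none
    sleep 10
    dd if=/dev/longhorn/e2e-test-volume-2 of=/tmp/snap-b.img bs=1M status=none
    cmp -l /tmp/snap-a.img /tmp/snap-b.img | head

    # Dump 256 bytes around a differing offset (the offset here is a placeholder).
    hexdump -C -s 1048576 -n 256 /dev/longhorn/e2e-test-volume-2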

@DamiaSan
Contributor

Hi @c3y1huang , thanks for the explanation. Just a question: do you want to proceed with the PRs containing only the race condition fix, or wait to find other possible causes of the issue, if any?

@c3y1huang
Contributor

> Hi @c3y1huang , thanks for the explanation. Just a question: do you want to proceed with the PRs containing only the race condition fix, or wait to find other possible causes of the issue, if any?

After discussing with @shuo-wu and @derekbit, we found that this feature is still missing snapshotting. I've moved the PRs to draft status to implement it. Once they are ready for review again, I would like to proceed with the race condition fix, as we believe it will address the checksum mismatch issue. However, we still want to understand the cause of the fluctuating checksum. If we can't pinpoint the culprit before merging the PRs, I plan to create a new issue with a reproducer PR (reverting the fix) to help us track the problem. WDYT?

@DamiaSan
Contributor

Good plan, thanks @c3y1huang .

@c3y1huang c3y1huang moved this from Implement to Review in Longhorn Sprint Aug 23, 2024
@c3y1huang c3y1huang moved this from Review to Ready For Testing in Longhorn Sprint Aug 28, 2024
@c3y1huang c3y1huang removed the require/lep Require adding/updating enhancement proposal label Aug 28, 2024
@roger-ryao roger-ryao self-assigned this Oct 21, 2024
@c3y1huang c3y1huang added kind/feature Feature request, new feature require/auto-e2e-test Require adding/updating auto e2e test cases if they can be automated require/doc Require updating the longhorn.io documentation area/v2-data-engine v2 data engine (SPDK) area/volume-disaster-recovery Volume DR area/backup-store Remote backup store related and removed kind/feature Feature request, new feature require/auto-e2e-test Require adding/updating auto e2e test cases if they can be automated require/doc Require updating the longhorn.io documentation area/v2-data-engine v2 data engine (SPDK) area/volume-disaster-recovery Volume DR area/backup-store Remote backup store related labels Oct 23, 2024
@roger-ryao

Verified on v1.8.0-dev-20241020 20231023

  • longhorn v1.8.0-dev-20241020 c023511

Result: Passed
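
For anyone re-verifying, the covering regression case can presumably be run from the e2e suite referenced in the diff above (assuming a working Robot Framework environment):

    # Run only the incremental-restore regression case by its test name.
    robot --test "Test Incremental Restore" e2e/tests/regression/test_backup.robot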

@github-project-automation github-project-automation bot moved this from Testing to Closed in Longhorn Sprint Oct 23, 2024