
[FEATURE] v2 volumes supports DR volume #6613

Closed
derekbit opened this issue Aug 30, 2023 · 15 comments
Assignees
c3y1huang, roger-ryao
Labels
  area/backup-store: Remote backup store related
  area/v2-data-engine: v2 data engine (SPDK)
  area/volume-disaster-recovery: Volume DR
  highlight: Important feature/issue to highlight
  kind/feature: Feature request, new feature
  priority/0: Must be implemented or fixed in this release (managed by PO)
  require/auto-e2e-test: Require adding/updating auto e2e test cases if they can be automated
  require/doc: Require updating the longhorn.io documentation
Milestone
v1.8.0

Comments

@derekbit
Member

Is your feature request related to a problem? Please describe (👍 if you like this request)

v2 volumes should support DR volumes, implemented via incremental restoration.

Describe the solution you'd like

Describe alternatives you've considered

Additional context
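
For context, a DR (standby) volume is restored from an existing backup and then kept in sync by incrementally restoring each newer backup. A minimal sketch of what creating one could look like through the Volume CR once the v2 data engine supports it (the backup URL and names are placeholders, and the field usage is assumed to follow the existing v1 DR-volume workflow):

    # Hypothetical sketch: create a v2 DR (standby) volume from a backup.
    kubectl apply -f - <<'EOF'
    apiVersion: longhorn.io/v1beta2
    kind: Volume
    metadata:
      name: dr-volume-example
      namespace: longhorn-system
    spec:
      dataEngine: v2      # SPDK-based v2 data engine
      standby: true       # DR volume: keeps restoring newer backups incrementally
      frontend: ""        # DR volumes expose no frontend until activated
      fromBackup: "s3://backupbucket@us-east-1/?backup=backup-example&volume=src-volume"
      numberOfReplicas: 3
    EOF

Activating the DR volume later stops the incremental restoration and enables the frontend.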

@derekbit derekbit added kind/feature Feature request, new feature require/auto-e2e-test Require adding/updating auto e2e test cases if they can be automated require/doc Require updating the longhorn.io documentation require/lep Require adding/updating enhancement proposal area/v2-data-engine v2 data engine (SPDK) area/volume-disaster-recovery Volume DR area/backup-store Remote backup store related labels Aug 30, 2023
@github-actions github-actions bot added the stale label Jan 7, 2024
@innobead innobead added this to the v1.7.0 milestone Jan 7, 2024
@innobead innobead removed the stale label Jan 7, 2024
@derekbit derekbit modified the milestones: v1.7.0, v1.8.0 May 17, 2024
@derekbit derekbit added the highlight Important feature/issue to highlight label Jul 28, 2024
@derekbit derekbit assigned DamiaSan and c3y1huang and unassigned DamiaSan Jul 28, 2024
@derekbit derekbit moved this to Implement in Longhorn Sprint Aug 3, 2024
@innobead innobead added the priority/0 Must be implemented or fixed in this release (managed by PO) label Aug 5, 2024
@longhorn longhorn deleted a comment from github-actions bot Aug 5, 2024
@c3y1huang
Contributor

c3y1huang commented Aug 6, 2024

I am currently encountering some weird behavior and am still investigating.

  • When testing manually, the DR seems to work fine. However, when testing with the robot test case, the results are quite unstable. The checksum sometimes differs during volume 2 checks, other times during volume 3 checks, and occasionally, the test passes.
  • Additionally, when I use block-type disks on the nodes, the tests generally pass. However, when I use external block storage (the same setup as the pipeline), the test case fails most of the time.
  • When the test fails during volume 2 checks, I noticed that the md5sum keeps changing continuously AFTER activation. There seems to be no activity from Longhorn during these md5sum changes (nothing is logged during this time).
  • However, when I activate the DR volume 3, the md5sum remains the same as the source volume (volume 0).
  • Flow:
     volume 0 (src) --> write data --> backup ---> DR volume 2 restored
                                              ---> DR volume 3 restored
     ======================================================
     ---> activate DR volume 2 ---> md5sum changes continuously
     ---> activate DR volume 3 ---> md5sum remains same as volume 0
    
     # volume 0 (src)
     ip-10-0-2-232 > md5sum /dev/longhorn/e2e-test-volume-0
     163b6d5bed0a42d21d05be80bb9e7df2  /dev/longhorn/e2e-test-volume-0
     ip-10-0-2-232 > md5sum /dev/longhorn/e2e-test-volume-0
     163b6d5bed0a42d21d05be80bb9e7df2  /dev/longhorn/e2e-test-volume-0
    
     # volume 2 (activated DR volume)
     ip-10-0-2-232 > md5sum /dev/longhorn/e2e-test-volume-2
     eee1eaf44981e9bdaf5ff305d511b398  /dev/longhorn/e2e-test-volume-2
     ip-10-0-2-232 > md5sum /dev/longhorn/e2e-test-volume-2
     b5325256e38f89a7baec8caa667776e8  /dev/longhorn/e2e-test-volume-2
    
     # volume 3 (activated DR volume)
     ip-10-0-2-232 > md5sum /dev/longhorn/e2e-test-volume-3
     163b6d5bed0a42d21d05be80bb9e7df2  /dev/longhorn/e2e-test-volume-3
     ip-10-0-2-232 > md5sum /dev/longhorn/e2e-test-volume-3
     163b6d5bed0a42d21d05be80bb9e7df2  /dev/longhorn/e2e-test-volume-3
    
  • The checksum changes persist after reattachment via the UI, but seem to stop when the volume is attached through the workload.
  • Looking for some insight from @derekbit @shuo-wu @DamiaSan.
    • After discussing with @derekbit and @shuo-wu, the following investigation TODOs were completed:
      • If you write some data, for example, only one 100 MiB file, can the issue be reproduced as well? Yes:
        • Test 1 checksum mismatched and changing:
          • e2e-test-volume-2
        • Test 2 checksum mismatched and changing:
          • e2e-test-volume-2
          • e2e-test-volume-3
      • Does only one activated DR volume's data keep changing after each re-attachment? No, multiple volumes were observed changing:
        • Test 1 checksum mismatched and changing:
          • e2e-test-volume-2
        • Test 2 checksum mismatched and changing:
          • e2e-test-volume-2
          • e2e-test-volume-3
      • Can you verify the data in the filesystem rather than the block device? I'm curious about the files' data integrity inside the filesystem.
        • The data files' checksums are consistent, and the volume device checksum seems stable.
  • No IO activity was detected on the volume block device while the checksum was changing:
     > md5sum /dev/longhorn/e2e-test-volume-2 
     6601923db52973b0b0aea996958accfe  /dev/longhorn/e2e-test-volume-2
     > blktrace -d /dev/longhorn/e2e-test-volume-2 -w 30 -o trace.out
     === dm-2 ===
       CPU  0:                    0 events,        0 KiB data
       CPU  1:                    1 events,        1 KiB data
       CPU  2:                    0 events,        0 KiB data
       CPU  3:                    0 events,        0 KiB data
       Total:                     1 events (dropped 0),        1 KiB data
     > blkparse -i trace.out
     Input file trace.out.blktrace.1 added
    
     Throughput (R/W): 0KiB/s / 0KiB/s
     Events (trace.out): 0 entries
     Skips: 0 forward (0 -   0.0%)
     > md5sum /dev/longhorn/e2e-test-volume-2 
     82fa23558987970ae9b154bdd5d165fb  /dev/longhorn/e2e-test-volume-2
    
  • Test Incremental Restore When Writing Data To Filesystem:
  • Discuss findings in sprint meeting; Action TODOs (WIP):
    • Check the behavior when mounting the attached volume manually.
      • Able to mount manually, and the checksum stabilized after running mkfs.ext4 (the timestamped checksums below were sampled repeatedly; a sketch of such a watch loop appears at the end of this comment):
         _volume_name="e2e-test-volume-2"
         _source_device="/dev/longhorn/${_volume_name}"
         _target_path="/mnt/${_volume_name}"
        
         # Remove any leftover mount point from a previous run.
         if [ -d "${_target_path}" ]; then
           rm -rf "${_target_path}"
           echo -e "[INFO][$(date)]: Removed existing directory ${_target_path}\n"
         fi
        
         mkdir "${_target_path}" && \
           echo -e "[INFO][$(date)]: Created directory ${_target_path}\n"
        
         mkfs.ext4 "${_source_device}" && \
           echo -e "[INFO][$(date)]: Created filesystem: ${_source_device}\n"
        
         # mount "${_source_device}" "${_target_path}" && \
         #   echo -e "[INFO][$(date)]: Mounted ${_source_device} to ${_target_path}\n"
        
         # echo -e "[INFO][$(date)]: Print ${_target_path}"
         # ls "${_target_path}"
         Wed Aug 14 06:36:48 UTC 2024: 9d92da5bd9b3a37a71bfee05c44fa0fe  /dev/longhorn/e2e-test-volume-2
         Wed Aug 14 06:37:04 UTC 2024: 6ecec3fe33d6bd7dac0b8386108240c3  /dev/longhorn/e2e-test-volume-2
         Wed Aug 14 06:37:37 UTC 2024: 0bfdf475eb2726074465ec39a563c3d6  /dev/longhorn/e2e-test-volume-2
         Wed Aug 14 06:38:11 UTC 2024: 16392090c78dfe9c223dfc48746d6c5e  /dev/longhorn/e2e-test-volume-2
         Wed Aug 14 06:38:45 UTC 2024: 3b6f2d758984aa37227bec65247977cf  /dev/longhorn/e2e-test-volume-2
         Wed Aug 14 06:39:18 UTC 2024: 055fb131a564476dc3365a41e99ee39a  /dev/longhorn/e2e-test-volume-2
         Wed Aug 14 06:39:52 UTC 2024: fe9abfa766d056b1903adeaf30685ec7  /dev/longhorn/e2e-test-volume-2
         Wed Aug 14 06:40:14 UTC 2024: b5c6194164d787d47048f85b6c580336  /dev/longhorn/e2e-test-volume-2
         Wed Aug 14 06:40:25 UTC 2024: c8ae12d49ed04bdc99a439bde311041f  /dev/longhorn/e2e-test-volume-2
         Wed Aug 14 06:40:35 UTC 2024: 62cb93c60ff5a766cada40b77baa6032  /dev/longhorn/e2e-test-volume-2
         Wed Aug 14 06:40:49 UTC 2024: f4fbc23000c6cbad6fd3585257011fd7  /dev/longhorn/e2e-test-volume-2
         Wed Aug 14 06:41:00 UTC 2024: f4fbc23000c6cbad6fd3585257011fd7  /dev/longhorn/e2e-test-volume-2
         Wed Aug 14 06:41:10 UTC 2024: f4fbc23000c6cbad6fd3585257011fd7  /dev/longhorn/e2e-test-volume-2
         Wed Aug 14 06:41:21 UTC 2024: f4fbc23000c6cbad6fd3585257011fd7  /dev/longhorn/e2e-test-volume-2
         Wed Aug 14 06:41:33 UTC 2024: f4fbc23000c6cbad6fd3585257011fd7  /dev/longhorn/e2e-test-volume-2
         Wed Aug 14 06:42:07 UTC 2024: f4fbc23000c6cbad6fd3585257011fd7  /dev/longhorn/e2e-test-volume-2
        
         [INFO][Wed Aug 14 06:40:43 UTC 2024]: Created directory /mnt/e2e-test-volume-2
        
         mke2fs 1.46.4 (18-Aug-2021)
         Discarding device blocks: done                            
         Creating filesystem with 524288 4k blocks and 131072 inodes
         Filesystem UUID: ae5ec712-e883-452f-aa39-b2d03a907c94
         Superblock backups stored on blocks: 
         	32768, 98304, 163840, 229376, 294912
        
         Allocating group tables: done                            
         Writing inode tables: done                            
         Creating journal (16384 blocks): done
         Writing superblocks and filesystem accounting information: done 
        
         [INFO][Wed Aug 14 06:40:46 UTC 2024]: Created filesystem: /dev/longhorn/e2e-test-volume-2
        
    • Check the fragmap behavior using go-spdk-helper lvol get-fragmap.
      • Input from Derek: the holes remain unchanged, which means the changes were made where the data resides (a sketch for decoding the fragmap appears at the end of this comment):
         Wed Aug 14 04:34:49 UTC 2024: e66b1e598a6d57085b2c0ec0f3fa55d9  /dev/longhorn/e2e-test-volume-2
         Wed Aug 14 04:35:00 UTC 2024: d557904471550e4f8412c47a0cd72ffd  /dev/longhorn/e2e-test-volume-2
        
         instance-manager-7076fff9d8118ad15a7ccf11d3c70cf0:/ # go-spdk-helper lvol get
         [
         	{
         		"name": "2f2e0499-5bff-43b7-ab36-1ad3ec3a6062",
         		"aliases": [
         			"block-disk/e2e-test-volume-2-r-f89dfe8d"
         		],
         		"product_name": "Logical Volume",
         		"block_size": 4096,
         		"num_blocks": 524288,
         		"uuid": "2f2e0499-5bff-43b7-ab36-1ad3ec3a6062",
         		"creation_time": "2024-08-14T04:02:05Z",
         		"assigned_rate_limits": {
         			"rw_ios_per_sec": 0,
         			"rw_mbytes_per_sec": 0,
         			"r_mbytes_per_sec": 0,
         			"w_mbytes_per_sec": 0
         		},
         		"claimed": true,
         		"claim_type": "exclusive_write",
         		"zoned": false,
         		"supported_io_types": {
         			"read": true,
         			"write": true,
         			"unmap": true,
         			"write_zeroes": true,
         			"flush": false,
         			"reset": true,
         			"compare": false,
         			"compare_and_write": false,
         			"abort": false,
         			"nvme_admin": false,
         			"nvme_io": false
         		},
         		"driver_specific": {
         			"lvol": {
         				"lvol_store_uuid": "8d9dc6e2-8b91-4a51-ba40-e5cf1c055dfc",
         				"base_bdev": "block-disk",
         				"base_snapshot": "e2e-test-volume-2-r-f89dfe8d-snap-8446f4ac-a28f-498f-b4e6-7d53853adf73",
         				"thin_provision": true,
         				"num_allocated_clusters": 1790,
         				"snapshot": false,
         				"clone": true,
         				"xattrs": {
         					"user_created": "true"
         				}
         			}
         		}
         	},
         	{
         		"name": "9798fc1f-8ba7-4389-b89d-bba3f2382dd3",
         		"aliases": [
         			"block-disk/e2e-test-volume-2-r-f89dfe8d-snap-8446f4ac-a28f-498f-b4e6-7d53853adf73"
         		],
         		"product_name": "Logical Volume",
         		"block_size": 4096,
         		"num_blocks": 524288,
         		"uuid": "9798fc1f-8ba7-4389-b89d-bba3f2382dd3",
         		"creation_time": "2024-08-14T04:02:47Z",
         		"assigned_rate_limits": {
         			"rw_ios_per_sec": 0,
         			"rw_mbytes_per_sec": 0,
         			"r_mbytes_per_sec": 0,
         			"w_mbytes_per_sec": 0
         		},
         		"claimed": false,
         		"zoned": false,
         		"supported_io_types": {
         			"read": true,
         			"write": false,
         			"unmap": false,
         			"write_zeroes": false,
         			"flush": false,
         			"reset": true,
         			"compare": false,
         			"compare_and_write": false,
         			"abort": false,
         			"nvme_admin": false,
         			"nvme_io": false
         		},
         		"driver_specific": {
         			"lvol": {
         				"lvol_store_uuid": "8d9dc6e2-8b91-4a51-ba40-e5cf1c055dfc",
         				"base_bdev": "block-disk",
         				"thin_provision": true,
         				"num_allocated_clusters": 2048,
         				"snapshot": true,
         				"clone": false,
         				"clones": [
         					"e2e-test-volume-2-r-f89dfe8d"
         				],
         				"xattrs": {
         					"snapshot_timestamp": "2024-08-14T04:02:47Z",
         					"user_created": "false"
         				}
         			}
         		}
         	}
         ]
        
         instance-manager-7076fff9d8118ad15a7ccf11d3c70cf0:/ # go-spdk-helper lvol get-fragmap --uuid 2f2e0499-5bff-43b7-ab36-1ad3ec3a6062
         {
         	"cluster_size": 1048576,
         	"num_clusters": 2048,
         	"num_allocated_clusters": 1790,
         	"fragmap": "/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////x8AAAAAAAAAAAAAAAAAAAAgAAAAAAAAAAAAAAAAAAAAAA=="
         }
         instance-manager-7076fff9d8118ad15a7ccf11d3c70cf0:/ # go-spdk-helper lvol get-fragmap --uuid 2f2e0499-5bff-43b7-ab36-1ad3ec3a6062
         {
         	"cluster_size": 1048576,
         	"num_clusters": 2048,
         	"num_allocated_clusters": 1790,
         	"fragmap": "/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////x8AAAAAAAAAAAAAAAAAAAAgAAAAAAAAAAAAAAAAAAAAAA=="
         }
         instance-manager-7076fff9d8118ad15a7ccf11d3c70cf0:/ # go-spdk-helper lvol get-fragmap --uuid 9798fc1f-8ba7-4389-b89d-bba3f2382dd3
         {
         	"cluster_size": 1048576,
         	"num_clusters": 2048,
         	"num_allocated_clusters": 2048,
         	"fragmap": "/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////w=="
         }
         instance-manager-7076fff9d8118ad15a7ccf11d3c70cf0:/ # go-spdk-helper lvol get-fragmap --uuid 9798fc1f-8ba7-4389-b89d-bba3f2382dd3
         {
         	"cluster_size": 1048576,
         	"num_clusters": 2048,
         	"num_allocated_clusters": 2048,
         	"fragmap": "/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////w=="
         }
        
  • Pausing for a few seconds before activating the DR volume seems to increase the test's pass rate:
    diff --git a/e2e/tests/regression/test_backup.robot b/e2e/tests/regression/test_backup.robot
    index b329305c6..2eaa737c1 100644
    --- a/e2e/tests/regression/test_backup.robot
    +++ b/e2e/tests/regression/test_backup.robot
    @@ -85,12 +85,17 @@ Test Incremental Restore
         And Wait for volume 1 healthy
         Then Check volume 1 data is backup 0
     
    -
         When Write data 1 to volume 0
         And Create backup 1 for volume 0
         # Wait for DR volume 2 incremental restoration completed
         Then Wait for volume 2 restoration from backup 1 completed
    +
    +    Log    "Pausing the test for 10 seconds"    console=True
    +    Sleep  10  # Pauses the test for 10 seconds
    +    Log    "Resume the test"    console=True
    +
         And Activate DR volume 2
    +
         And Attach volume 2
         And Wait for volume 2 healthy
         And Check volume 2 data is backup 1
    @@ -99,6 +104,11 @@ Test Incremental Restore
         And Create backup 2 for volume 0
         # Wait for DR volume 3 incremental restoration completed
         Then Wait for volume 3 restoration from backup 2 completed
    +
    +    Log    "Pausing the test for 10 seconds"    console=True
    +    Sleep  10  # Pauses the test for 10 seconds
    +    Log    "Resume the test"    console=True
    +
         And Activate DR volume 3
         And Attach volume 3
         And Wait for volume 3 healthy
    
     • Pausing for 10 seconds: 2/5 runs passed
     • Pausing for 20 seconds: 4/5 runs passed
     • Pausing for 30 seconds: 4/5 runs passed
  • Is this reproducible if the filesystem is flushed before updating the restore progress to 100?
    • Not able to reproduce.
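  • For reference, a minimal sketch of the kind of watch loop used to sample the device checksum over time (an assumption of how the timestamped checksums above were collected; the 10-second interval is arbitrary):
     # Print a timestamped md5sum of the raw device every 10 seconds,
     # to see whether the content keeps mutating with no attached workload.
     _dev="/dev/longhorn/e2e-test-volume-2"
     while true; do
       echo "$(date): $(md5sum "${_dev}")"
       sleep 10
     done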
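  • A companion sketch for decoding the base64 fragmap above into its raw allocation bitmap, one bit per 1 MiB cluster (jq and xxd are assumed to be available in the instance-manager image):
     # A zero bit is a hole (unallocated cluster). Comparing two decoded dumps
     # confirms Derek's observation that the holes stay unchanged.
     go-spdk-helper lvol get-fragmap --uuid 2f2e0499-5bff-43b7-ab36-1ad3ec3a6062 \
       | jq -r '.fragmap' \
       | base64 -d \
       | xxd -b -c 8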

@DamiaSan
Contributor

DamiaSan commented Aug 19, 2024

Could you try using hexdump on /dev/longhorn/e2e-test-volume-2 to understand what is happening? Maybe write a fixed stream of bytes to a single part of the volume first ...
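
One illustrative way to do that (a sketch, not from the thread): capture the raw device twice and diff, then inspect a differing region with hexdump:

    # List the byte offsets that changed between two reads taken 10 s apart.
    dd if=/dev/longhorn/e2e-test-volume-2 of=/tmp/snap-a.img bs=1M status=none
    sleep 10
    dd if=/dev/longhorn/e2e-test-volume-2 of=/tmp/snap-b.img bs=1M status=none
    cmp -l /tmp/snap-a.img /tmp/snap-b.img | head

    # Dump 256 bytes around a differing offset (the offset here is a placeholder).
    hexdump -C -s 1048576 -n 256 /dev/longhorn/e2e-test-volume-2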

@DamiaSan
Contributor

Hi @c3y1huang , thanks for the explanation. Just a question: do you want to proceed with the PRs containing only the race condition fix, or wait to find other possible causes of the issue, if any?

@c3y1huang
Contributor

> Hi @c3y1huang , thanks for the explanation. Just a question: do you want to proceed with the PRs containing only the race condition fix, or wait to find other possible causes of the issue, if any?

After discussing with @shuo-wu and @derekbit, we found that this feature is still missing snapshotting. I've moved the PRs to draft status to implement it. Once they are ready for review again, I would like to proceed with the race condition fix, as we believe it will address the checksum mismatch issue. However, we still want to understand the cause of the fluctuating checksum. If we can't pinpoint the culprit before merging the PRs, I plan to create a new issue with a reproducer PR (reverting the fix) to help us track the problem. WDYT?

@DamiaSan
Contributor

Good plan, thanks @c3y1huang .

@c3y1huang c3y1huang moved this from Implement to Review in Longhorn Sprint Aug 23, 2024
@c3y1huang c3y1huang moved this from Review to Ready For Testing in Longhorn Sprint Aug 28, 2024
@c3y1huang c3y1huang removed the require/lep Require adding/updating enhancement proposal label Aug 28, 2024
@roger-ryao roger-ryao self-assigned this Oct 21, 2024
@c3y1huang c3y1huang added kind/feature Feature request, new feature require/auto-e2e-test Require adding/updating auto e2e test cases if they can be automated require/doc Require updating the longhorn.io documentation area/v2-data-engine v2 data engine (SPDK) area/volume-disaster-recovery Volume DR area/backup-store Remote backup store related and removed kind/feature Feature request, new feature require/auto-e2e-test Require adding/updating auto e2e test cases if they can be automated require/doc Require updating the longhorn.io documentation area/v2-data-engine v2 data engine (SPDK) area/volume-disaster-recovery Volume DR area/backup-store Remote backup store related labels Oct 23, 2024
@roger-ryao

Verified on v1.8.0-dev-20241020 20231023

  • longhorn v1.8.0-dev-20241020 c023511

Result: Passed
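
For anyone re-verifying, the covering regression case can presumably be run from the e2e suite referenced in the diff above (assuming a working Robot Framework environment):

    # Run only the incremental-restore regression case by its test name.
    robot --test "Test Incremental Restore" e2e/tests/regression/test_backup.robot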

@github-project-automation github-project-automation bot moved this from Testing to Closed in Longhorn Sprint Oct 23, 2024