
feat(lep): implement volume delta copy inside SPDK #7031

Open · wants to merge 1 commit into master from lep_spdk_delta_copy

Conversation

@DamiaSan (Contributor) commented Nov 2, 2023

There are a couple of options for the delta copy implementation in SPDK. I described both so we can discuss them together and choose the best one for Longhorn.

@innobead (Member) previously approved these changes on Nov 14, 2023 and left a comment:

In general, LGTM. Just one question that needs clarification.

Review comment on enhancements/20231030-spdk-raid-delta-copy.md (outdated, resolved)
@DamiaSan force-pushed the lep_spdk_delta_copy branch 2 times, most recently from eed6a5a to a2a4b06 on November 14, 2023 16:50
@shuo-wu (Contributor) commented Nov 15, 2023

The concern with the current design is: if the node is suddenly powered off while a replica/lvol keeps handling write requests, is the last successful write (returned to the caller) really flushed to disk rather than cached in SPDK/the OS? This determines whether we can use the write bitmap recorded during the replica's offline period to apply the delta rebuilding.

Regardless of the above concern, I would like to raise an alternative for the special case (snapshot creation during the replica offline period) mentioned in LEP option 1. I will explain it with the following example:

  1. Let's say a volume has 2 replicas:
     replica 1: existing snapshot lvols -> valid head lvol
     replica 2: existing snapshot lvols -> valid head lvol
  2. Suddenly, replica 2 goes offline, and 3 new snapshots are created during the offline period:
     replica 1 (healthy): existing snapshot lvols -> snapshot 1 -> snapshot 2 -> snapshot 3 -> valid head lvol
     replica 2 (offline): existing snapshot lvols -> invalid head lvol
  3. After replica 2 comes back, there will of course be a mismatch between replica 2's invalid head/rebuilding lvol and replica 1's snapshot 1. In this case, we can rely on the bitmap recorded during replica 2's offline period to do delta rebuilding.
     replica 1 (healthy): existing snapshot lvols -> snapshot 1 -> snapshot 2 -> snapshot 3 -> valid head lvol
     replica 2 (offline): existing snapshot lvols -> rebuilding lvol
  4. After finishing the rebuilding for snapshot 1, we turn the rebuilding lvol into snapshot 1 for replica 2. Then a new rebuilding lvol can be created for the snapshot 2 rebuilding. At this point there is no need to use the bitmap anymore: we can directly copy all the data of snapshot 2 to the new rebuilding lvol.
     replica 1 (healthy): existing snapshot lvols -> snapshot 1 -> snapshot 2 -> snapshot 3 -> valid head lvol
     replica 2 (offline): existing snapshot lvols -> snapshot 1 -> rebuilding lvol
  5. Similar to step 4, we can do a full copy when rebuilding snapshot 3.

In this case, the bitmap is necessary for the snapshot 1 rebuilding only. In other words, once the first snapshot of the offline period is created, we can stop modifying the bitmap.
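
To make the sequencing concrete, here is a minimal Go sketch of the rebuild loop described above. All helper names (`copyClusters`, `allClusters`, `promoteToSnapshot`, `newRebuildingLvol`) and types are illustrative assumptions, not actual Longhorn or SPDK APIs; only the first offline snapshot consumes the delta bitmap, the later ones are plain full copies.

```go
package rebuild

// Cluster identifies one lvolstore cluster.
type Cluster uint64

// DeltaBitmap exposes the clusters written while the replica was offline.
type DeltaBitmap interface {
	SetClusters() []Cluster
}

// Lvol is a stand-in for a logical volume (snapshot or rebuilding lvol).
type Lvol struct {
	Name string
}

// The helpers below are hypothetical placeholders for the real copy/snapshot operations.
func copyClusters(src, dst *Lvol, clusters []Cluster) error { return nil }
func allClusters(src *Lvol) []Cluster                       { return nil }
func promoteToSnapshot(l *Lvol, name string) error          { return nil }
func newRebuildingLvol(afterSnapshot string) *Lvol          { return &Lvol{Name: "rebuilding-" + afterSnapshot} }

// RebuildOfflineSnapshots rebuilds the snapshots created during the offline
// period: only the first one needs the delta bitmap, the rest are full copies.
func RebuildOfflineSnapshots(healthySnaps []*Lvol, rebuilding *Lvol, delta DeltaBitmap) error {
	for i, snap := range healthySnaps {
		var clusters []Cluster
		if i == 0 {
			clusters = delta.SetClusters() // delta copy for the first offline snapshot only
		} else {
			clusters = allClusters(snap) // full copy afterwards, no bitmap needed
		}
		if err := copyClusters(snap, rebuilding, clusters); err != nil {
			return err
		}
		// Freeze the rebuilding lvol as this snapshot on the rebuilt replica,
		// then start a fresh rebuilding lvol for the next snapshot in the chain.
		if err := promoteToSnapshot(rebuilding, snap.Name); err != nil {
			return err
		}
		rebuilding = newRebuildingLvol(snap.Name)
	}
	return nil
}
```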

@DamiaSan (Contributor, Author) commented Nov 15, 2023

Thanks @shuo-wu for these considerations.

> The concern with the current design is: if the node is suddenly powered off while a replica/lvol keeps handling write requests, is the last successful write (returned to the caller) really flushed to disk rather than cached in SPDK/the OS? This determines whether we can use the write bitmap recorded during the replica's offline period to apply the delta rebuilding.

Inside the SPDK RAID and lvolstore layers, data is not cached. The SPDK NVMe driver provides a zero-copy data transfer path (using huge pages), so data is not cached in the NVMe layer of SPDK either.
So:

  • if we use the SPDK NVMe driver to attach the lvolstore directly to the disk, or
  • if we open the block device connected via NVMe-oF to the RAID1 bdev with O_DIRECT,

we can assume that data has really been sent to the disk when a write operation returns to the caller.

Instead, if we rely on a Linux block device through the SPDK AIO bdev to provide the backing device for the lvolstore (both for NVMe disks and for older disk technologies), this is not true: IIUC, Linux block devices provide buffered access to hardware devices.
What do you think @l-mb?

I think that, if we don't rely on the SPDK NVMe driver, before starting the rebuild process we should compute a checksum of the snapshot to be rebuilt, maybe over all the clusters that are not present in the bitmap.
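
As an illustration of that checksum idea, here is a minimal Go sketch that hashes only the clusters whose bit is not set in the delta bitmap. The cluster size and the `readCluster` helper are assumptions for the example, not real Longhorn or SPDK interfaces.

```go
package verify

import (
	"crypto/sha256"
	"encoding/hex"
)

// Assumed lvolstore cluster size for the example (4 MiB).
const clusterSize = 4 * 1024 * 1024

// readCluster is a hypothetical helper that reads one cluster of a snapshot lvol.
func readCluster(snapshot string, idx uint64) ([]byte, error) {
	return make([]byte, clusterSize), nil
}

// ChecksumUntouchedClusters hashes only the clusters whose bit is NOT set in the
// delta bitmap, i.e. the clusters that should still be identical on both replicas.
func ChecksumUntouchedClusters(snapshot string, totalClusters uint64, isSet func(uint64) bool) (string, error) {
	h := sha256.New()
	for idx := uint64(0); idx < totalClusters; idx++ {
		if isSet(idx) {
			continue // this cluster was rewritten while the replica was offline
		}
		data, err := readCluster(snapshot, idx)
		if err != nil {
			return "", err
		}
		h.Write(data)
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}
```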

> In this case, the bitmap is necessary for the snapshot 1 rebuilding only. In other words, once the first snapshot of the offline period is created, we can stop modifying the bitmap.

Ok, so when the bitmap is retrieved by the caller, it can be deleted from the RAID.
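
For illustration only, a hedged sketch of that retrieve-then-delete flow from the caller's side, using a hypothetical RPC client interface; the actual SPDK RPC names are the ones defined in the LEP, not these.

```go
package deltamap

import "context"

// Client abstracts the (hypothetical) SPDK JSON-RPC calls used by the caller.
type Client interface {
	GetRaidDeltaMap(ctx context.Context, raidName string) ([]byte, error)
	ClearRaidDeltaMap(ctx context.Context, raidName string) error
}

// FetchAndRelease retrieves the delta map once and immediately tells the RAID
// bdev to drop it, since the bitmap is only needed for the first offline snapshot.
func FetchAndRelease(ctx context.Context, c Client, raidName string) ([]byte, error) {
	bitmap, err := c.GetRaidDeltaMap(ctx, raidName)
	if err != nil {
		return nil, err
	}
	if err := c.ClearRaidDeltaMap(ctx, raidName); err != nil {
		return nil, err
	}
	return bitmap, nil
}
```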

@DamiaSan force-pushed the lep_spdk_delta_copy branch from a2a4b06 to f2defe2 on November 16, 2023 07:12
@shuo-wu (Contributor) commented Nov 22, 2023

In today's discussion, we confirmed that the SPDK AIO bdev opens the device with the O_DIRECT flag, hence we don't need to worry about the cache issue.
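
For reference, a tiny Go sketch of what opening the backing device with O_DIRECT means in practice: the page cache is bypassed, so a completed write has actually reached the device. This is illustrative only, glosses over the buffer alignment requirements that O_DIRECT imposes, and is not the SPDK AIO bdev code itself.

```go
package directio

import (
	"os"
	"syscall"
)

// OpenDirect opens a block device for reading and writing with O_DIRECT,
// bypassing the Linux page cache (Linux-only; syscall.O_DIRECT is not
// available on all platforms).
func OpenDirect(path string) (*os.File, error) {
	return os.OpenFile(path, os.O_RDWR|syscall.O_DIRECT, 0)
}
```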


But another scenario was raised: if the node/spdk_tgt the RAID resides on crashes, how do we handle the failed replica on this node after the RAID comes back? There are a few issues here:

  1. When the RAID crashes, some base bdevs/replicas finish the in-flight writes while others do not. In other words, there may still be inconsistency among the RW replicas. Currently, Longhorn SPDK volumes ignore this issue and consider all healthy replicas as available, since there is no inconsistency-detection mechanism (like the revision counter for v1 volumes) to protect them. ==> In the upcoming release v1.6 we cannot introduce such a mechanism, hence we will keep the current behavior for SPDK volumes.
  2. When the RAID crashes and is restarted, nothing of the bitmap is left. The question then is: what can we do for the failed replica?
  3. If we need to rebuild the replica, which healthy replica should we use when there is no revision counter to figure out the latest one? Derek said that we can pick any of them, since neither Longhorn nor users should expect the in-flight IO data to already be in any healthy replica. Typically, even if we could pick the latest replica, it may contain a crashed filesystem, and after the fsck repair the partially written data may get removed. (And after introducing an appropriate inconsistency-detection mechanism, we can pick a random healthy replica and then sync up all other healthy replicas.) The next question is how we rebuild the failed replica: we can do a full rebuilding or use the v1 volume approach.
  4. Or we don't need to rebuild the failed replica for the engine crash case at all. If the in-flight writes are not taken into consideration, there is no difference between the failed replica and the healthy replicas on other nodes. But the control plane needs much more effort to figure out whether the failed replica and the engine crashed at the same time; the rebuilding can be skipped in that case only.

Besides, we discussed the bitmap update flow. The basic idea is to set the bit when a write comes in, and to unset it when all base bdevs are in RW mode and have finished that write. Without this bit clearing, all bits may already be 1 even if the failed replica was only down for a few seconds, which would mean a full rebuilding.
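
A rough Go sketch of that set/clear flow, with a per-cluster in-flight counter standing in for the real RAID-internal bookkeeping; all names here are illustrative assumptions, not the actual SPDK implementation. A write that missed a degraded base bdev keeps the bit set so the replica can later be delta-rebuilt.

```go
package deltabitmap

import "sync"

type clusterState struct {
	inflight int  // writes to this cluster not yet completed
	missed   bool // some completed write did not reach every base bdev
}

// Bitmap tracks dirty clusters; presence in the map means the bit is set.
type Bitmap struct {
	mu       sync.Mutex
	clusters map[uint64]*clusterState
}

func New() *Bitmap {
	return &Bitmap{clusters: map[uint64]*clusterState{}}
}

// OnWriteSubmitted marks the cluster dirty before the write is issued to the base bdevs.
func (b *Bitmap) OnWriteSubmitted(cluster uint64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	s := b.clusters[cluster]
	if s == nil {
		s = &clusterState{}
		b.clusters[cluster] = s
	}
	s.inflight++
}

// OnWriteCompleted clears the bit only if this and every other write to the
// cluster landed on all RW base bdevs; otherwise the bit stays set.
func (b *Bitmap) OnWriteCompleted(cluster uint64, allBaseBdevsRW bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	s := b.clusters[cluster]
	if s == nil {
		return
	}
	s.inflight--
	if !allBaseBdevsRW {
		s.missed = true
	}
	if s.inflight == 0 && !s.missed {
		delete(b.clusters, cluster)
	}
}

// DirtyClusters returns the clusters whose bit is currently set.
func (b *Bitmap) DirtyClusters() []uint64 {
	b.mu.Lock()
	defer b.mu.Unlock()
	out := make([]uint64, 0, len(b.clusters))
	for c := range b.clusters {
		out = append(out, c)
	}
	return out
}
```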

@innobead (Member) commented

I believe we will update this soon, following up on the previous discussion?

@DamiaSan force-pushed the lep_spdk_delta_copy branch 2 times, most recently from 6d4015d to 43326ce on April 9, 2024 06:44
@DamiaSan (Contributor, Author) commented Apr 9, 2024

I have just updated the proposal with all the feedback and suggestions received from @shuo-wu, @derekbit, @innobead and the SPDK core maintainers.

@DamiaSan requested a review from innobead on April 9, 2024 06:46
3 review comments on enhancements/20231030-spdk-raid-delta-copy.md (outdated, resolved)
@innobead (Member) commented

@DamiaSan is this refined as per the recent implementation?

@DamiaSan (Contributor, Author) commented

> @DamiaSan is this refined as per the recent implementation?

Yes, the general concepts don't change; I am only changing the internal implementation of the delta copy handling inside SPDK.

@DamiaSan force-pushed the lep_spdk_delta_copy branch 2 times, most recently from 1e21eb8 to e8f1316 on June 21, 2024 15:37
@DamiaSan (Contributor, Author) commented

Just pushed a new revision with the new APIs for the delta map handling and the decision about the calculation of the snapshot checksum.

@DamiaSan force-pushed the lep_spdk_delta_copy branch 2 times, most recently from 03027ad to b58a7a3 on July 15, 2024 10:01

Signed-off-by: Damiano Cipriani <damiano.cipriani@suse.com>