
feat(lep): implement volume delta copy inside SPDK #7031

Open · wants to merge 1 commit into master from lep_spdk_delta_copy

Conversation

@DamiaSan (Contributor) commented Nov 2, 2023

There are a couple of options for the delta copy implementation in SPDK. I described both so we can discuss them together and choose the best one for Longhorn.

@innobead (Member) previously approved these changes on Nov 14, 2023 and left a comment:

In general, LGTM. Just one question that needs clarification.

Review comment on enhancements/20231030-spdk-raid-delta-copy.md (outdated, resolved)
@DamiaSan force-pushed the lep_spdk_delta_copy branch 2 times, most recently from eed6a5a to a2a4b06 on November 14, 2023 16:50
@shuo-wu (Contributor) commented Nov 15, 2023

The concern with the current design is: if the node is suddenly powered off while a replica/lvol keeps handling write requests, is the last successful write (returned to the caller) really flushed to disk rather than cached in SPDK/the OS? This determines whether we can use the write bitmap recorded during the replica's offline period to apply the delta rebuilding.

Regardless of the above concern, I would like to raise an alternative for the special case (snapshot creation during the replica offline period) mentioned in LEP option 1. I will explain it with the following example:

  1. Let's say a volume has 2 replicas:
     replica 1: existing snapshot lvols -> valid head lvol
     replica 2: existing snapshot lvols -> valid head lvol
  2. Suddenly, replica 2 goes offline, and 3 new snapshots are created during the offline period:
     replica 1 (healthy): existing snapshot lvols -> snapshot 1 -> snapshot 2 -> snapshot 3 -> valid head lvol
     replica 2 (offline): existing snapshot lvols -> invalid head lvol
  3. After replica 2 comes back, there will of course be a mismatch between replica 2's invalid head/rebuilding lvol and replica 1's snapshot 1. In this case, we can rely on the bitmap recorded during replica 2's offline period to do delta rebuilding.
     replica 1 (healthy): existing snapshot lvols -> snapshot 1 -> snapshot 2 -> snapshot 3 -> valid head lvol
     replica 2 (offline): existing snapshot lvols -> rebuilding lvol
  4. After finishing the rebuilding for snapshot 1, we turn the rebuilding lvol into snapshot 1 for replica 2. Then a new rebuilding lvol can be created for the snapshot 2 rebuilding. At this point there is no need to use the bitmap anymore: we can directly copy all the data of snapshot 2 to the new rebuilding lvol.
     replica 1 (healthy): existing snapshot lvols -> snapshot 1 -> snapshot 2 -> snapshot 3 -> valid head lvol
     replica 2 (offline): existing snapshot lvols -> snapshot 1 -> rebuilding lvol
  5. Similar to step 4, we can do a full copy when rebuilding snapshot 3.

In this case, the bitmap is necessary for the snapshot 1 rebuilding only. In other words, once the first snapshot of the offline period is created, we can stop modifying the bitmap.
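
To make the sequencing concrete, here is a minimal Go sketch of the rebuild loop described above. All helper names (`copyClusters`, `allClusters`, `promoteToSnapshot`, `newRebuildingLvol`) and types are illustrative assumptions, not actual Longhorn or SPDK APIs; only the first offline snapshot consumes the delta bitmap, the later ones are plain full copies.

```go
package rebuild

// Cluster identifies one lvolstore cluster.
type Cluster uint64

// DeltaBitmap exposes the clusters written while the replica was offline.
type DeltaBitmap interface {
	SetClusters() []Cluster
}

// Lvol is a stand-in for a logical volume (snapshot or rebuilding lvol).
type Lvol struct {
	Name string
}

// The helpers below are hypothetical placeholders for the real copy/snapshot operations.
func copyClusters(src, dst *Lvol, clusters []Cluster) error { return nil }
func allClusters(src *Lvol) []Cluster                       { return nil }
func promoteToSnapshot(l *Lvol, name string) error          { return nil }
func newRebuildingLvol(afterSnapshot string) *Lvol          { return &Lvol{Name: "rebuilding-" + afterSnapshot} }

// RebuildOfflineSnapshots rebuilds the snapshots created during the offline
// period: only the first one needs the delta bitmap, the rest are full copies.
func RebuildOfflineSnapshots(healthySnaps []*Lvol, rebuilding *Lvol, delta DeltaBitmap) error {
	for i, snap := range healthySnaps {
		var clusters []Cluster
		if i == 0 {
			clusters = delta.SetClusters() // delta copy for the first offline snapshot only
		} else {
			clusters = allClusters(snap) // full copy afterwards, no bitmap needed
		}
		if err := copyClusters(snap, rebuilding, clusters); err != nil {
			return err
		}
		// Freeze the rebuilding lvol as this snapshot on the rebuilt replica,
		// then start a fresh rebuilding lvol for the next snapshot in the chain.
		if err := promoteToSnapshot(rebuilding, snap.Name); err != nil {
			return err
		}
		rebuilding = newRebuildingLvol(snap.Name)
	}
	return nil
}
```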

@DamiaSan (Contributor, Author) commented Nov 15, 2023

Thanks @shuo-wu for these considerations.

> The concern with the current design is: if the node is suddenly powered off while a replica/lvol keeps handling write requests, is the last successful write (returned to the caller) really flushed to disk rather than cached in SPDK/the OS? This determines whether we can use the write bitmap recorded during the replica's offline period to apply the delta rebuilding.

Inside the SPDK RAID and lvolstore layers, data is not cached. The SPDK NVMe driver provides a zero-copy data transfer path (using huge pages), so data is not cached in the NVMe layer of SPDK either.
So:

  • if we use the SPDK NVMe driver to attach the lvolstore directly to the disk, or
  • if we open the block device connected via NVMe-oF to the RAID1 bdev with O_DIRECT,

we can assume that data has really been sent to the disk when a write operation returns to the caller.

Instead, if we rely on a Linux block device through the SPDK AIO bdev to provide the backing device for the lvolstore (both for NVMe disks and for older disk technologies), this is not true: IIUC, Linux block devices provide buffered access to hardware devices.
What do you think @l-mb?

I think that, if we don't rely on the SPDK NVMe driver, before starting the rebuild process we should compute a checksum of the snapshot to be rebuilt, maybe over all the clusters that are not present in the bitmap.
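
As an illustration of that checksum idea, here is a minimal Go sketch that hashes only the clusters whose bit is not set in the delta bitmap. The cluster size and the `readCluster` helper are assumptions for the example, not real Longhorn or SPDK interfaces.

```go
package verify

import (
	"crypto/sha256"
	"encoding/hex"
)

// Assumed lvolstore cluster size for the example (4 MiB).
const clusterSize = 4 * 1024 * 1024

// readCluster is a hypothetical helper that reads one cluster of a snapshot lvol.
func readCluster(snapshot string, idx uint64) ([]byte, error) {
	return make([]byte, clusterSize), nil
}

// ChecksumUntouchedClusters hashes only the clusters whose bit is NOT set in the
// delta bitmap, i.e. the clusters that should still be identical on both replicas.
func ChecksumUntouchedClusters(snapshot string, totalClusters uint64, isSet func(uint64) bool) (string, error) {
	h := sha256.New()
	for idx := uint64(0); idx < totalClusters; idx++ {
		if isSet(idx) {
			continue // this cluster was rewritten while the replica was offline
		}
		data, err := readCluster(snapshot, idx)
		if err != nil {
			return "", err
		}
		h.Write(data)
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}
```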

> In this case, the bitmap is necessary for the snapshot 1 rebuilding only. In other words, once the first snapshot of the offline period is created, we can stop modifying the bitmap.

Ok, so when the bitmap is retrieved by the caller, it can be deleted from the RAID.
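
For illustration only, a hedged sketch of that retrieve-then-delete flow from the caller's side, using a hypothetical RPC client interface; the actual SPDK RPC names are the ones defined in the LEP, not these.

```go
package deltamap

import "context"

// Client abstracts the (hypothetical) SPDK JSON-RPC calls used by the caller.
type Client interface {
	GetRaidDeltaMap(ctx context.Context, raidName string) ([]byte, error)
	ClearRaidDeltaMap(ctx context.Context, raidName string) error
}

// FetchAndRelease retrieves the delta map once and immediately tells the RAID
// bdev to drop it, since the bitmap is only needed for the first offline snapshot.
func FetchAndRelease(ctx context.Context, c Client, raidName string) ([]byte, error) {
	bitmap, err := c.GetRaidDeltaMap(ctx, raidName)
	if err != nil {
		return nil, err
	}
	if err := c.ClearRaidDeltaMap(ctx, raidName); err != nil {
		return nil, err
	}
	return bitmap, nil
}
```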

@DamiaSan force-pushed the lep_spdk_delta_copy branch from a2a4b06 to f2defe2 on November 16, 2023 07:12
@shuo-wu (Contributor) commented Nov 22, 2023

In today's discussion, we confirmed that the SPDK AIO bdev opens the device with the O_DIRECT flag, hence we don't need to worry about the cache issue.
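
For reference, a tiny Go sketch of what opening the backing device with O_DIRECT means in practice: the page cache is bypassed, so a completed write has actually reached the device. This is illustrative only, glosses over the buffer alignment requirements that O_DIRECT imposes, and is not the SPDK AIO bdev code itself.

```go
package directio

import (
	"os"
	"syscall"
)

// OpenDirect opens a block device for reading and writing with O_DIRECT,
// bypassing the Linux page cache (Linux-only; syscall.O_DIRECT is not
// available on all platforms).
func OpenDirect(path string) (*os.File, error) {
	return os.OpenFile(path, os.O_RDWR|syscall.O_DIRECT, 0)
}
```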


But another scenario was raised: if the node/spdk_tgt the RAID resides on crashes, how do we handle the failed replica on this node after the RAID comes back? There are a few issues here:

  1. When the RAID crashes, some base bdevs/replicas finish the in-flight writes while others do not. In other words, there may still be inconsistency among the RW replicas. Currently, Longhorn SPDK volumes ignore this issue and consider all healthy replicas as available, since there is no inconsistency-detection mechanism (like the revision counter for v1 volumes) to protect them. ==> In the upcoming release v1.6 we cannot introduce such a mechanism, hence we will keep the current behavior for SPDK volumes.
  2. When the RAID crashes and is restarted, nothing of the bitmap is left. The question then is: what can we do for the failed replica?
  3. If we need to rebuild the replica, which healthy replica should we use when there is no revision counter to figure out the latest one? Derek said that we can pick any of them, since neither Longhorn nor users should expect the in-flight IO data to already be in any healthy replica. Typically, even if we could pick the latest replica, it may contain a crashed filesystem, and after the fsck repair the partially written data may get removed. (And after introducing an appropriate inconsistency-detection mechanism, we can pick a random healthy replica and then sync up all other healthy replicas.) The next question is how we rebuild the failed replica: we can do a full rebuilding or use the v1 volume approach.
  4. Or we don't need to rebuild the failed replica for the engine crash case at all. If the in-flight writes are not taken into consideration, there is no difference between the failed replica and the healthy replicas on other nodes. But the control plane needs much more effort to figure out whether the failed replica and the engine crashed at the same time; the rebuilding can be skipped in that case only.

Besides, we discussed the bitmap update flow. The basic idea is to set the bit when a write comes in, and to unset it when all base bdevs are in RW mode and have finished that write. Without this bit clearing, all bits may already be 1 even if the failed replica was only down for a few seconds, which would mean a full rebuilding.
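
A rough Go sketch of that set/clear flow, with a per-cluster in-flight counter standing in for the real RAID-internal bookkeeping; all names here are illustrative assumptions, not the actual SPDK implementation. A write that missed a degraded base bdev keeps the bit set so the replica can later be delta-rebuilt.

```go
package deltabitmap

import "sync"

type clusterState struct {
	inflight int  // writes to this cluster not yet completed
	missed   bool // some completed write did not reach every base bdev
}

// Bitmap tracks dirty clusters; presence in the map means the bit is set.
type Bitmap struct {
	mu       sync.Mutex
	clusters map[uint64]*clusterState
}

func New() *Bitmap {
	return &Bitmap{clusters: map[uint64]*clusterState{}}
}

// OnWriteSubmitted marks the cluster dirty before the write is issued to the base bdevs.
func (b *Bitmap) OnWriteSubmitted(cluster uint64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	s := b.clusters[cluster]
	if s == nil {
		s = &clusterState{}
		b.clusters[cluster] = s
	}
	s.inflight++
}

// OnWriteCompleted clears the bit only if this and every other write to the
// cluster landed on all RW base bdevs; otherwise the bit stays set.
func (b *Bitmap) OnWriteCompleted(cluster uint64, allBaseBdevsRW bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	s := b.clusters[cluster]
	if s == nil {
		return
	}
	s.inflight--
	if !allBaseBdevsRW {
		s.missed = true
	}
	if s.inflight == 0 && !s.missed {
		delete(b.clusters, cluster)
	}
}

// DirtyClusters returns the clusters whose bit is currently set.
func (b *Bitmap) DirtyClusters() []uint64 {
	b.mu.Lock()
	defer b.mu.Unlock()
	out := make([]uint64, 0, len(b.clusters))
	for c := range b.clusters {
		out = append(out, c)
	}
	return out
}
```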

@innobead (Member) commented

I believe we will update this soon, following up on the previous discussion?

@DamiaSan force-pushed the lep_spdk_delta_copy branch 2 times, most recently from 6d4015d to 43326ce on April 9, 2024 06:44
@DamiaSan (Contributor, Author) commented Apr 9, 2024

I have just updated the proposal with all the feedback and suggestions received from @shuo-wu, @derekbit, @innobead and the SPDK core maintainers.

@DamiaSan requested a review from innobead on April 9, 2024 06:46
3 review comments on enhancements/20231030-spdk-raid-delta-copy.md (outdated, resolved)
@innobead (Member) commented

@DamiaSan is this refined as per the recent implementation?

@DamiaSan (Contributor, Author) commented

> @DamiaSan is this refined as per the recent implementation?

Yes, the general concepts don't change; I am only changing the internal implementation of the delta copy handling inside SPDK.

@DamiaSan force-pushed the lep_spdk_delta_copy branch 2 times, most recently from 1e21eb8 to e8f1316 on June 21, 2024 15:37
@DamiaSan (Contributor, Author) commented

Just pushed a new revision with the new APIs for the delta map handling and the decision about the calculation of the snapshot checksum.

@DamiaSan force-pushed the lep_spdk_delta_copy branch 2 times, most recently from 03027ad to b58a7a3 on July 15, 2024 10:01

Signed-off-by: Damiano Cipriani <damiano.cipriani@suse.com>