# Implement a logical volume delta copy in RAID bdev module

## Summary

To implement the different replicas of a Longhorn volume, inside SPDK we use a RAID1 bdev made of logical volumes stored on different nodes.
Currently, with the shallow copy and set parent operations, we can perform a full rebuild of a logical volume. But we need another
rebuild functionality, called delta copy, that will be used to copy only a subset of the data a logical volume is
made of. For example, we will use delta copy in case of a temporary network disconnection or a node restart, issues that can
lead to data loss in a replica of a Longhorn volume.
We must also deal with the failure of the node where the RAID1 bdev resides, ensuring data consistency between the different replicas.

## Motivation

### Goals

Perform a fast rebuild of replicas that aren't fully aligned with the other replicas of the same volume, and ensure data consistency
for every type of node failure.

## Proposal for delta copy

The basic idea of delta copy is to maintain in memory, for the time a replica is unavailable, a bitmap of the "regions" over which write operations have been performed. We will discuss the size of these regions later; for now, keep in mind that a region could be a cluster (typically 4 MiB) or a wider region on the order of gigabytes.
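
The real implementation would live in SPDK's RAID module (C code); the following Go sketch, with hypothetical names, only illustrates how such a region bitmap could mark the regions touched by a write:

```go
package deltabitmap

// RegionBitmap tracks which fixed-size regions of a volume have been
// written while a replica is unavailable. The region size could be one
// blobstore cluster (e.g. 4 MiB) or a larger area (e.g. 1 GiB).
type RegionBitmap struct {
	regionSize uint64 // region size in bytes
	bits       []bool // one flag per region; a real implementation would pack bits
}

// NewRegionBitmap creates a bitmap covering volumeSize bytes.
func NewRegionBitmap(volumeSize, regionSize uint64) *RegionBitmap {
	n := (volumeSize + regionSize - 1) / regionSize
	return &RegionBitmap{regionSize: regionSize, bits: make([]bool, n)}
}

// MarkWrite flags every region overlapped by a write of `length` bytes
// starting at byte `offset`.
func (b *RegionBitmap) MarkWrite(offset, length uint64) {
	if length == 0 {
		return
	}
	first := offset / b.regionSize
	last := (offset + length - 1) / b.regionSize
	for r := first; r <= last && r < uint64(len(b.bits)); r++ {
		b.bits[r] = true
	}
}

// DirtyRegions returns the indexes of all regions written since the bitmap
// was created; these are the only regions the delta copy must transfer.
func (b *RegionBitmap) DirtyRegions() []uint64 {
	var dirty []uint64
	for i, set := range b.bits {
		if set {
			dirty = append(dirty, uint64(i))
		}
	}
	return dirty
}
```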

Currently inside SPDK, when a base bdev of a RAID1 becomes unavailable, the following process takes place:
* base_bdev_N becomes unavailable.
* NVMe-oF tries to reconnect to the source of base_bdev_N.
* during the reconnection time, write operations over the block device connected via NVMe-oF to the RAID1 bdev remain stuck.
* after a configurable time, base_bdev_N is deleted by the NVMe-oF layer.
* the deletion event arrives at the RAID1, which removes base_bdev_N as a base bdev.
* write operations over the block device resume without errors.

The proposal is:
* when the deletion event arrives at the RAID1 and base_bdev_N is removed from the RAID, a new bitmap for this bdev is created (see the lifecycle sketch after this list).
* every write operation updates all the regions of the bitmap it touches.
* if the replica node from which base_bdev_N originates doesn't become available again within a certain amount of time, base_bdev_N's
bitmap is deleted. When the replica node becomes available again, the replica volume will need a full rebuild.
* if the replica node from which base_bdev_N originates becomes available within a certain amount of time, we have the 2 options described below.
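
Building on the `RegionBitmap` sketch above, a hypothetical lifecycle for these bitmaps (create on base bdev removal, update on writes, drop after a timeout so that a full rebuild is required) could look like this; again, this is only an illustration, not SPDK code:

```go
package deltabitmap

import (
	"sync"
	"time"
)

// Tracker keeps one RegionBitmap per removed base bdev and drops it if
// the replica does not come back within `expiry` (full rebuild needed then).
type Tracker struct {
	mu      sync.Mutex
	expiry  time.Duration
	bitmaps map[string]*RegionBitmap
	timers  map[string]*time.Timer
}

func NewTracker(expiry time.Duration) *Tracker {
	return &Tracker{
		expiry:  expiry,
		bitmaps: make(map[string]*RegionBitmap),
		timers:  make(map[string]*time.Timer),
	}
}

// OnBaseBdevRemoved is called when the RAID1 removes base_bdev_N:
// start tracking writes for it and arm the expiration timer.
func (t *Tracker) OnBaseBdevRemoved(name string, volumeSize, regionSize uint64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.bitmaps[name] = NewRegionBitmap(volumeSize, regionSize)
	t.timers[name] = time.AfterFunc(t.expiry, func() { t.drop(name) })
}

// OnWrite is called for every write completed by the RAID1 while one or
// more base bdevs are missing.
func (t *Tracker) OnWrite(offset, length uint64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for _, b := range t.bitmaps {
		b.MarkWrite(offset, length)
	}
}

// OnBaseBdevReturned is called when the replica reconnects in time; it hands
// back the bitmap for the delta copy, or nil if a full rebuild is needed.
func (t *Tracker) OnBaseBdevReturned(name string) *RegionBitmap {
	t.mu.Lock()
	defer t.mu.Unlock()
	b := t.bitmaps[name]
	if tm := t.timers[name]; tm != nil {
		tm.Stop()
	}
	delete(t.bitmaps, name)
	delete(t.timers, name)
	return b
}

func (t *Tracker) drop(name string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.bitmaps, name)
	delete(t.timers, name)
}
```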

### Option 1

Perform a rebuild in a way similar to how the full rebuild operates.
Suppose we have 3 nodes: node1 with lvol1, node2 with lvol2 and node0 with the RAID bdev; the RAID is composed of the bdevs created by attaching to lvol1 and lvol2 exported via NVMe-oF. We will call these 2 base bdevs replica1n1 and replica2n1.
Node1 goes offline for a while and then comes back online again:
* pause I/O.
* flush I/O.
* retrieve from the RAID1, with a new RPC, the bitmap of replica1n1.
* perform a snapshot of lvol2 called snap2.
* export snap2 via NVMe-oF and attach to it on node1, creating the bdev esnap2n1.
* on node1, create a clone of esnap2n1 called lvol1_new.
* export lvol1_new via NVMe-oF and attach to it on node0, creating the bdev replica1n1.
* grow the RAID adding replica1n1 as a base bdev.
* resume I/O.
* on node0 (but we could perform this operation on any node), connect to lvol1 and to snap2 with `nvme connect`.
* copy over lvol1 all the clusters contained in the bitmap, reading the data from snap2 (see the sketch after this list).
* make a snapshot of lvol1, snap1.
* delete lvol1.
* pause I/O.
* rename lvol1_new to lvol1.
* set snap1 as the parent of lvol1.
* resume I/O.
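
As an illustration of the copy step, here is a minimal Go sketch, assuming the new RPC returns the bitmap as a list of dirty cluster indexes and that the two `nvme connect` commands exposed snap2 and lvol1 as local block devices (all paths and names below are hypothetical):

```go
package deltacopy

import (
	"fmt"
	"os"
)

// copyDirtyClusters copies only the clusters flagged in the bitmap from the
// snapshot block device (snap2) to the stale replica (lvol1), both attached
// locally via `nvme connect`. clusterSize is the lvolstore cluster size in
// bytes (e.g. 4 MiB).
func copyDirtyClusters(srcPath, dstPath string, dirtyClusters []uint64, clusterSize uint64) error {
	src, err := os.Open(srcPath)
	if err != nil {
		return err
	}
	defer src.Close()

	dst, err := os.OpenFile(dstPath, os.O_WRONLY, 0)
	if err != nil {
		return err
	}
	defer dst.Close()

	buf := make([]byte, clusterSize)
	for _, c := range dirtyClusters {
		off := int64(c * clusterSize)
		if _, err := src.ReadAt(buf, off); err != nil {
			return fmt.Errorf("read cluster %d: %w", c, err)
		}
		if _, err := dst.WriteAt(buf, off); err != nil {
			return fmt.Errorf("write cluster %d: %w", c, err)
		}
	}
	// Make sure the copied data reaches the replica before snapshotting lvol1.
	return dst.Sync()
}
```

For example, `copyDirtyClusters("/dev/nvme2n1", "/dev/nvme3n1", dirty, 4<<20)` would copy the dirty 4 MiB clusters from the snap2 device to the lvol1 device, with the device paths taken from the output of the two `nvme connect` commands.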

Special case: if during the offline period one or more snapshots are made over the volume, we can stop updating the bitmap after the creation
of the first snapshot. When the replica comes back again, the bitmap will be used, as above, to update lvol1. For all the following snapshots that have been made, we can perform a copy of all their data: to know which clusters a source snapshot contains, i.e. which clusters to copy, we can retrieve the fragmap of that snapshot.

Advantages:
* this solution is very similar to the full rebuild; the only difference is that we read/write directly from/to the block devices connected to the volumes via NVMe-oF instead of calling shallow copy.
* this solution is quite similar to the delta copy of the v1 engine.

Disadvantages:
* we need to implement a new RPC to retrieve the bitmap.
* the "driver" of the process is outside SPDK, so it has to export volumes, get the bitmap, read and write blocks, align the snapshot chain, and so on.

### Option 2

Leverage the RAID rebuild feature: when a new base bdev is added to a RAID1, a rebuild process that copies all the data over the new base bdev can start automatically. We could modify this behaviour so that only the regions inside the bitmap are rebuilt.
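
Conceptually, the modified rebuild would iterate over the bitmap instead of the whole bdev, quiescing one region at a time. The real change would be in SPDK's C RAID module; the Go sketch below, with made-up interfaces, only shows the intended control flow:

```go
package raiddelta

// raidRebuilder stands in for the RAID module's rebuild machinery.
// Quiesce/Unquiesce represent range-locking of the region being rebuilt:
// writes to a quiesced range wait until the copy of that region completes.
type raidRebuilder interface {
	Quiesce(offset, length uint64)
	Unquiesce(offset, length uint64)
	CopyRegion(offset, length uint64) error // copy from a healthy base bdev to the re-added one
}

// rebuildDelta copies only the regions flagged in the bitmap instead of
// the whole bdev, releasing each range as soon as its copy is done.
func rebuildDelta(r raidRebuilder, dirtyRegions []uint64, regionSize uint64) error {
	for _, region := range dirtyRegions {
		off := region * regionSize
		r.Quiesce(off, regionSize)
		err := r.CopyRegion(off, regionSize)
		r.Unquiesce(off, regionSize)
		if err != nil {
			return err
		}
	}
	return nil
}
```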

Advantages:
* the rebuild is done entirely inside the RAID module, without any operation performed outside SPDK. The only task to be done outside is to connect again to the replica and add it again to the RAID.
* we don't need to make a snapshot over all the replicas before starting the rebuild, because the rebuild process works directly over the live data.

Disadvantages:
* during the rebuild process, the user can't make snapshots over the volume. This is because the RAID has no knowledge of what its base bdevs are made of, so with logical volumes the RAID layer can work only over live data.
* to work over live data, the rebuild process must quiesce the ranges over which it has to operate, which in our case are the regions contained in the bitmap. This means that, during the rebuild, if the user writes data over these regions, the write operations can remain stuck until the rebuild of that region has finished.

## Proposal for data consistency handling

We must also deal with the crash of the node where the RAID1 resides. The point is: before creating the RAID again, we must ensure
that all the replicas have the same data, so we must elect a healthy replica and align all the others to it.
To do this, we have different possible solutions.

### Revision Counter
In the Longhorn v1 engine we have a revision counter, i.e. a count of the block write operations received by a replica. The replica with
the greatest revision counter contains the newest data and so can be elected as the healthy replica.
But this operation is too costly, because it means an additional write for every write operation. Moreover, the write of the data and the update of the revision counter aren't atomic, so an inconsistency can still happen.

### Write to one replica first
We could achieve the same result as the revision counter by always writing to one replica first, and then to the others once the write to the first replica has completed. This has the same latency as adding an extra metadata write (the revision counter), but it doesn't need any metadata.
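
A minimal sketch of this write ordering, assuming each replica is exposed as an `io.WriterAt` (all names here are illustrative):

```go
package writeorder

import (
	"errors"
	"io"
	"sync"
)

// writeOrdered writes to a designated "first" replica and only fans the
// write out to the remaining replicas once that first write has completed,
// so after a crash the first replica is known to hold the newest
// acknowledged data.
func writeOrdered(replicas []io.WriterAt, p []byte, off int64) error {
	if len(replicas) == 0 {
		return errors.New("no replicas")
	}
	if _, err := replicas[0].WriteAt(p, off); err != nil {
		return err
	}
	var wg sync.WaitGroup
	errs := make([]error, len(replicas))
	for i := 1; i < len(replicas); i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			_, errs[i] = replicas[i].WriteAt(p, off)
		}(i)
	}
	wg.Wait()
	return errors.Join(errs...)
}
```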

### Pick any replica
If we use neither the revision counter nor the write-to-one-replica-first approach, which replica should we elect as the healthy one? We can pick any of them, since neither Longhorn nor the users should expect that in-flight I/O data has already reached any healthy replica.

### Align the replicas
Once the healthy replica is chosen, it may contain a crashed filesystem. So we have to mount it and, after the fsck repair, the partially written data may get removed. Now we can do the sync-up for all the other healthy replicas, rebuilding the entire live volume of the faulty replicas. To do this, we must retrieve the fragmap of the volumes, calculate a checksum of every allocated cluster for all the replicas, compare them and finally copy from the healthy replica the clusters whose checksums differ.
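
A sketch of the comparison step, assuming each replica's fragmap has already been retrieved and is represented as a set of allocated cluster indexes (all names here are illustrative):

```go
package syncup

import (
	"crypto/sha256"
	"os"
)

// clusterChecksum reads one cluster from a replica's block device and
// returns its SHA-256 digest.
func clusterChecksum(dev *os.File, cluster, clusterSize uint64) ([32]byte, error) {
	buf := make([]byte, clusterSize)
	if _, err := dev.ReadAt(buf, int64(cluster*clusterSize)); err != nil {
		return [32]byte{}, err
	}
	return sha256.Sum256(buf), nil
}

// clustersToSync compares every cluster allocated in the healthy replica and
// returns the ones the stale replica is missing or whose checksums differ;
// those clusters must be copied from the healthy replica to the stale one.
// (Clusters allocated only in the stale replica would additionally need to
// be zeroed or unallocated; that is omitted here.)
func clustersToSync(healthy, stale *os.File, healthyFragmap, staleFragmap map[uint64]bool, clusterSize uint64) ([]uint64, error) {
	var out []uint64
	for c := range healthyFragmap {
		if !staleFragmap[c] {
			out = append(out, c)
			continue
		}
		h, err := clusterChecksum(healthy, c, clusterSize)
		if err != nil {
			return nil, err
		}
		s, err := clusterChecksum(stale, c, clusterSize)
		if err != nil {
			return nil, err
		}
		if h != s {
			out = append(out, c)
		}
	}
	return out, nil
}
```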

### Boost the sync-up?
If we could persist the delta copy bitmap, for example in the RAID superblock, then we could make this process faster: we would have to rebuild only the regions not aligned with those of the healthy replica. But to do this, we would have to store the bitmap on disk for every write operation (possibly atomically), and this is too costly, as we have seen for the revision counter.

There is an optimal solution for this: storing some metadata in the LBA of the disk, for example a revision counter for every block. This doesn't need an additional write, because the block data and the metadata are written in a single operation; it also makes the sync-up faster, because we would have to align only the blocks with a different revision counter.
We could couple the storing of block metadata with the storing of the bitmap on disk, because in that case we could write the bitmap not on every write operation but, for example, every millisecond.
The drawback is that block metadata support isn't always available:
* not every NVMe disk supports metadata
* not every bdev inside SPDK supports metadata; for example, the AIO bdev doesn't
* if we handle SATA disks with the AIO bdev, we can't have metadata with this kind of drive

So this is an optimal solution but it is not always available.

## Bitmap regions
If we use the bitmap only in memory, then we can use the blob cluster as the region to be tracked in the bitmap.
But if we decide to store the bitmap on disk, and therefore to write the bitmap with every write operation (or every millisecond), we will probably have to track larger regions, on the order of gigabytes, because the bandwidth needed to write such a bitmap would otherwise be pretty large.
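
As a rough illustration: tracking a 1 TiB volume at 4 MiB (cluster) granularity needs a bitmap of 1 TiB / 4 MiB = 262,144 bits (32 KiB), which would have to be rewritten alongside (nearly) every write, while 1 GiB regions shrink the bitmap to 1,024 bits (128 bytes).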

## Conclusions
Regarding the delta copy, I think the first option is better, because it is more similar to the full rebuild and to what Longhorn currently does with the v1 engine. Moreover, it doesn't have the limitation that snapshots cannot be taken during the rebuild, even if that process should be quite fast.
Regarding data consistency, probably the best solution is to pick any replica as the healthy one and then rebuild the entire live
volume of the other replicas, because the other solutions either don't bring big advantages or aren't always available.


