Right now we depend on the VSR recovery protocol to recover a node that crashes & comes back. This has a major downside: if a majority of the cluster goes down, the recovery protocol can't safely bring it back, which means you can't actually bring a whole cluster down & back up!
To replace this, we will implement recovery on top of the superblock instead.
The LSM stores all tables & metadata in 64KB "blocks". Right now we have no way to repair these blocks if they are found to be corrupt. But since storage is (supposed to be) deterministic, a replica can request the exact block from another replica to repair it.
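A minimal sketch of that repair flow, assuming deterministic storage (every replica holds byte-identical content for the same block address). All names here (`BlockId`, `read_block`, SHA-256 as the checksum) are illustrative stand-ins, not the actual on-disk format or API:

```python
# Hypothetical sketch: repairing a corrupt block from a peer replica.
# Because storage is deterministic, any peer holding the same
# (address, checksum) pair has byte-identical content.

import hashlib
from dataclasses import dataclass

BLOCK_SIZE = 64 * 1024  # 64KB blocks, as in the LSM described above.

@dataclass(frozen=True)
class BlockId:
    address: int      # where the block lives on disk
    checksum: bytes   # expected content hash (stand-in for the real checksum)

def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def read_block(storage: dict, block_id: BlockId, peers: list) -> bytes:
    """Read a block; on corruption, repair it from a peer."""
    data = storage.get(block_id.address)
    if data is not None and checksum(data) == block_id.checksum:
        return data
    # Corrupt or missing locally: ask peers for this exact block.
    for peer in peers:
        candidate = peer.get(block_id.address)
        if candidate is not None and checksum(candidate) == block_id.checksum:
            storage[block_id.address] = candidate  # repair in place
            return candidate
    raise LookupError("block unavailable everywhere: fall back to state transfer")
```

The checksum check on the peer's response matters: repair must never accept bytes that don't match the expected checksum, or corruption would spread instead of heal.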
If a replica falls so far behind that the blocks it needs to repair are no longer available, we need "state transfer": slower and more comprehensive, but able to bring the replica back up to speed no matter how far it has fallen behind the cluster.
Right now storage isn't actually deterministic due to an issue with how compaction concurrency is implemented.
Many improvements & optimizations for compaction are planned to guard against latency spikes.
Rework the superblock's client table so that we can support many more clients efficiently. (Right now, additional clients add too much memory overhead and checkpoint latency, so this needs to be rearchitected.)
Checkpoint early, but asynchronously.
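One common shape for asynchronous checkpointing is to snapshot the in-memory state up front, then write it out on a background thread so the main loop keeps serving requests. This is a generic sketch of that pattern, not TigerBeetle's actual checkpoint code; `checkpoint_async` and the JSON format are invented for illustration:

```python
# Hypothetical sketch of "checkpoint early, but asynchronously": copy the
# state synchronously, write it in the background.

import json
import os
import tempfile
import threading

def checkpoint_async(state: dict, path: str) -> threading.Thread:
    snapshot = dict(state)  # copy first, so later mutations aren't captured
    def write():
        # Write to a temp file, then atomically rename: a crash mid-write
        # never clobbers the previous checkpoint.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump(snapshot, f)
        os.replace(tmp, path)
    t = threading.Thread(target=write)
    t.start()
    return t
```

The write-then-rename step is what makes checkpointing "early" safe: until the rename lands, the previous checkpoint remains intact on disk.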
Diff trailers to avoid rewriting identical blocks.
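Trailer diffing can be sketched as: record a checksum per trailer block at each checkpoint, and on the next checkpoint skip any block whose checksum is unchanged. Everything below (`write_trailer`, the dict-as-disk, SHA-256) is a hypothetical illustration of the idea, not the actual implementation:

```python
# Hypothetical sketch of trailer diffing at checkpoint: only blocks whose
# contents changed since the last checkpoint are rewritten.

import hashlib

def _digest(block: bytes) -> bytes:
    return hashlib.sha256(block).digest()

def write_trailer(blocks, prev_checksums, disk):
    """Write trailer blocks, skipping those identical to the last checkpoint.

    blocks: the trailer split into fixed-size blocks (list of bytes).
    prev_checksums: per-block checksums recorded at the previous checkpoint.
    disk: dict simulating block storage, keyed by block index.
    Returns (new checksum list, number of blocks actually written).
    """
    writes = 0
    new_checksums = []
    for i, block in enumerate(blocks):
        d = _digest(block)
        new_checksums.append(d)
        if i < len(prev_checksums) and prev_checksums[i] == d:
            continue  # unchanged since last checkpoint: skip the write
        disk[i] = block
        writes += 1
    return new_checksums, writes
```

Since most of a trailer is typically unchanged between checkpoints, skipping identical blocks turns a full rewrite into a handful of writes, which also helps the checkpoint-latency concern above.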