Right now we depend on the VSR recovery protocol to recover a node that crashes & comes back. This has a major downside: if a majority of the cluster goes down, the recovery protocol can't safely bring it back, which means you can't actually bring a whole cluster down & back up!
To replace this, we will implement recovery on top of the superblock instead.
The LSM stores all tables & metadata in 64KB "blocks". Right now we have no way to repair these blocks if they are found to be corrupt. But since storage is (supposed to be) deterministic, a replica can request the exact block from another replica to repair it.
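A minimal sketch of that repair flow, assuming deterministic storage (every replica holds byte-identical content for the same block address). All names here (`BlockId`, `read_block`, SHA-256 as the checksum) are illustrative stand-ins, not the actual on-disk format or API:

```python
# Hypothetical sketch: repairing a corrupt block from a peer replica.
# Because storage is deterministic, any peer holding the same
# (address, checksum) pair has byte-identical content.

import hashlib
from dataclasses import dataclass

BLOCK_SIZE = 64 * 1024  # 64KB blocks, as in the LSM described above.

@dataclass(frozen=True)
class BlockId:
    address: int      # where the block lives on disk
    checksum: bytes   # expected content hash (stand-in for the real checksum)

def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def read_block(storage: dict, block_id: BlockId, peers: list) -> bytes:
    """Read a block; on corruption, repair it from a peer."""
    data = storage.get(block_id.address)
    if data is not None and checksum(data) == block_id.checksum:
        return data
    # Corrupt or missing locally: ask peers for this exact block.
    for peer in peers:
        candidate = peer.get(block_id.address)
        if candidate is not None and checksum(candidate) == block_id.checksum:
            storage[block_id.address] = candidate  # repair in place
            return candidate
    raise LookupError("block unavailable everywhere: fall back to state transfer")
```

The checksum check on the peer's response matters: repair must never accept bytes that don't match the expected checksum, or corruption would spread instead of heal.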
If a replica falls so far behind that the blocks it needs to repair are no longer available, we need "state transfer": slower and more comprehensive, but able to bring the replica back up to speed no matter how far it has fallen behind the cluster.
Right now storage isn't actually deterministic due to an issue with how compaction concurrency is implemented.
Many improvements & optimizations for compaction are planned to guard against latency spikes.
Rework the superblock's client table so that we can support many more clients efficiently. (Right now, additional clients add too much memory overhead and checkpoint latency, so this needs to be rearchitected.)
Checkpoint early, but asynchronously.
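One common shape for asynchronous checkpointing is to snapshot the in-memory state up front, then write it out on a background thread so the main loop keeps serving requests. This is a generic sketch of that pattern, not TigerBeetle's actual checkpoint code; `checkpoint_async` and the JSON format are invented for illustration:

```python
# Hypothetical sketch of "checkpoint early, but asynchronously": copy the
# state synchronously, write it in the background.

import json
import os
import tempfile
import threading

def checkpoint_async(state: dict, path: str) -> threading.Thread:
    snapshot = dict(state)  # copy first, so later mutations aren't captured
    def write():
        # Write to a temp file, then atomically rename: a crash mid-write
        # never clobbers the previous checkpoint.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump(snapshot, f)
        os.replace(tmp, path)
    t = threading.Thread(target=write)
    t.start()
    return t
```

The write-then-rename step is what makes checkpointing "early" safe: until the rename lands, the previous checkpoint remains intact on disk.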
Diff trailers to avoid rewriting identical blocks.
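Trailer diffing can be sketched as: record a checksum per trailer block at each checkpoint, and on the next checkpoint skip any block whose checksum is unchanged. Everything below (`write_trailer`, the dict-as-disk, SHA-256) is a hypothetical illustration of the idea, not the actual implementation:

```python
# Hypothetical sketch of trailer diffing at checkpoint: only blocks whose
# contents changed since the last checkpoint are rewritten.

import hashlib

def _digest(block: bytes) -> bytes:
    return hashlib.sha256(block).digest()

def write_trailer(blocks, prev_checksums, disk):
    """Write trailer blocks, skipping those identical to the last checkpoint.

    blocks: the trailer split into fixed-size blocks (list of bytes).
    prev_checksums: per-block checksums recorded at the previous checkpoint.
    disk: dict simulating block storage, keyed by block index.
    Returns (new checksum list, number of blocks actually written).
    """
    writes = 0
    new_checksums = []
    for i, block in enumerate(blocks):
        d = _digest(block)
        new_checksums.append(d)
        if i < len(prev_checksums) and prev_checksums[i] == d:
            continue  # unchanged since last checkpoint: skip the write
        disk[i] = block
        writes += 1
    return new_checksums, writes
```

Since most of a trailer is typically unchanged between checkpoints, skipping identical blocks turns a full rewrite into a handful of writes, which also helps the checkpoint-latency concern above.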