Recovery readiness roadmap #212

Closed
8 tasks done
sentientwaffle opened this issue Oct 20, 2022 · 1 comment
sentientwaffle commented Oct 20, 2022

  • Right now we depend on the VSR recovery protocol to recover a node that crashes & comes back. This has a huge downside: if a majority of the cluster goes down, you can't (safely) use the recovery protocol to bring it back. This means you can't actually bring a whole cluster down & back up!
  • The LSM stores all tables & metadata in 64KB "blocks". Right now we don't have any way to recover these blocks if they are found to be corrupt. But since storage is (supposed to be) deterministic, we will be able to request the exact block from another replica to repair it.
  • If a replica falls so far behind that the blocks it needs to repair are no longer available, we need "state transfer", which is slower and more comprehensive, but can bring the replica back up to speed no matter how far behind the cluster it is.
  • Right now storage isn't actually deterministic due to an issue with how compaction concurrency is implemented.
  • Many improvements & optimizations for compaction are planned to guard against latency spikes.
    • Rework the superblock's client table in a way that will allow us to support many more clients efficiently. (Right now additional clients would add too much memory overhead + checkpoint latency, so this needs to be rearchitected.)
    • Checkpoint early, but asynchronously.
    • Diff trailers to avoid rewriting identical blocks.
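
To make the block repair and state transfer items above a bit more concrete, here is a rough sketch of the decision a replica might make when it reads a block. This is not the actual implementation: the `Replica` class, `repair_block`, the use of SHA-256 as the checksum, and the idea that the expected checksum comes from whatever metadata references the block are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Optional

BLOCK_SIZE = 64 * 1024  # 64KB blocks, as described in the roadmap


def checksum(data: bytes) -> str:
    # Stand-in checksum (SHA-256). The real checksum function is an assumption here.
    return hashlib.sha256(data).hexdigest()


@dataclass
class Replica:
    # Hypothetical toy model of a replica's storage: block address -> block bytes.
    blocks: Dict[int, bytes] = field(default_factory=dict)

    def read_verified(self, address: int, expected: str) -> Optional[bytes]:
        # Return the block only if it exists and matches the expected checksum.
        data = self.blocks.get(address)
        if data is None or checksum(data) != expected:
            return None
        return data


def repair_block(local: Replica, peers: List[Replica], address: int, expected: str) -> str:
    """Return "ok", "repaired", or "state_transfer" (all names hypothetical)."""
    if local.read_verified(address, expected) is not None:
        return "ok"
    # Deterministic storage means any peer holding the block should have
    # byte-identical contents, so the first verified copy is good enough.
    for peer in peers:
        data = peer.read_verified(address, expected)
        if data is not None:
            local.blocks[address] = data
            return "repaired"
    # No peer still has the block: fall back to the slower state transfer.
    return "state_transfer"


if __name__ == "__main__":
    good = bytes(BLOCK_SIZE)
    expected = checksum(good)
    healthy = Replica({7: good})
    corrupt = Replica({7: b"\x01" + good[1:]})  # first byte flipped
    print(repair_block(corrupt, [healthy], 7, expected))
```

The sketch only works because of the determinism assumption: if replicas could lay out the same data differently, a peer's copy of the block at the same address would not match the expected checksum, and every repair would degrade into state transfer.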
@sentientwaffle sentientwaffle self-assigned this Oct 20, 2022
@sentientwaffle sentientwaffle changed the title Recovery feature roadmap Recovery readiness roadmap Oct 20, 2022
@sentientwaffle
Member Author

All done!
