
llmq|init|test: Implement DKG data recovery / quorum verification vector sync #3964

Merged: 14 commits into dashpay:develop on Feb 1, 2021

Conversation

@xdustinface commented Jan 22, 2021

This PR implements DKG data recovery in 4d54892, which allows a member of LLMQx that loses its required DKG data (quorum verification vector and secret key share), e.g. due to unexpected loss/corruption of the database, to recover itself with the help of the other members of LLMQx and the P2P messages QGETDATA/QDATA. Before this PR such an invalid member would just stay invalid for the remaining "active time" of the LLMQ.

To meet a requirement of platform/tenderdash it also introduces quorum verification vector sync in eef2691, which makes it possible to configure a masternode to request the quorum verification vectors of all existing/new LLMQs with the type llmqType if the masternode itself is a member of any other active LLMQ with the type llmqType.

New command line parameters (hidden?):

  • -llmq-data-recovery - Enable (1) or disable (0) automated data recovery/qvvec requests.
  • -llmq-qvvec-sync - Defines from which LLMQ type the masternode should sync quorum verification vectors (can be used multiple times with different LLMQ types); see the example below.
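For illustration, a minimal example of how the two options could be combined on a masternode used by platform (dash.conf style; llmq_100_67 is just the quorum type from the example further down):

```
# enable automated DKG data recovery (1 = enable, 0 = disable)
llmq-data-recovery=1
# sync the quorum verification vectors of all llmq_100_67 quorums;
# the option can be repeated for additional LLMQ types
llmq-qvvec-sync=llmq_100_67
```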

New thread

  • dash-q-recovery - This new thread can exist exactly once for each quorum (llmqType, quorumHash) and its job is to query the missing DKG data from the related quorum members. This is done in a loop: calculate the next masternode to try, connect to it, send the QGETDATA request and wait up to 10 seconds for the QDATA message to come in before trying the next member. This continues until the required data has been received successfully or until all members of the quorum have been asked without success (see the sketch below).
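For illustration, a minimal sketch of that loop. The types and the RequestDKGData helper are hypothetical stand-ins, not the actual CQuorumManager code:

```cpp
#include <chrono>
#include <string>
#include <vector>

struct Member { std::string proTxHash; };

// Hypothetical stand-in: in the real code this connects to the member,
// sends QGETDATA and waits for the matching QDATA message.
static bool RequestDKGData(const Member& member, std::chrono::seconds timeout)
{
    (void)member; (void)timeout;
    return false; // sketch only
}

// One dash-q-recovery thread per (llmqType, quorumHash): ask the members
// one after another until the missing DKG data has been received.
static bool RecoverQuorumData(const std::vector<Member>& sortedMembers, size_t startOffset)
{
    const auto timeout = std::chrono::seconds(10); // wait up to 10s for QDATA
    for (size_t i = 0; i < sortedMembers.size(); ++i) {
        const Member& member = sortedMembers[(startOffset + i) % sortedMembers.size()];
        if (RequestDKGData(member, timeout)) {
            return true; // required data received, thread exits
        }
        // Timeout or invalid response: try the next member.
    }
    return false; // all members asked without success
}
```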

The recovery threads are started by CQuorumManager::TriggerQuorumDataRecoveryThreads() if required, which is called with every block update in CQuorumManager::UpdatedBlockTip().

The masternode decides, based on CQuorum::GetQuorumRecoveryStartOffset() and the sorted protx hashes of all members of the quorum, which member of the quorum it tries to connect to first. It also sleeps offset * 100ms before it tries to connect to the next member. I implemented those two as an attempt to balance the incoming requests equally across the members of the target quorum. This means that if -llmq-qvvec-sync=llmq_100_67 is used on all masternodes by platform and a new llmq_100_67 quorum has just formed, we will in the worst case have ~2400 nodes asking ~100 members for their quorum verification vectors, so in the best case each member gets ~24 requests and then all nodes should have the verification vector. Should be just good or am I missing something?
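To illustrate the balancing idea only (this is not the actual formula behind CQuorum::GetQuorumRecoveryStartOffset(), just a hypothetical way a per-node start offset could be derived): if ~2400 requesters pick ~100 start offsets roughly uniformly, each member sees about 24 initial requests.

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <vector>

// Hypothetical offset derivation: map the requesting node's proTxHash onto
// one of the slots of the sorted member list, so that different requesters
// start with different members (and additionally sleep offset * 100ms before
// moving on to the next one).
static size_t GetRecoveryStartOffset(const std::string& ourProTxHash,
                                     std::vector<std::string> memberProTxHashes)
{
    if (memberProTxHashes.empty()) return 0;
    // Sort so every node derives the same member ordering.
    std::sort(memberProTxHashes.begin(), memberProTxHashes.end());
    return std::hash<std::string>{}(ourProTxHash) % memberProTxHashes.size();
}
```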

Let me know your thoughts!

Based on #3953

@xdustinface added this to the 17 milestone Jan 22, 2021
@xdustinface force-pushed the pr-llmq-data-recovery branch from 936045a to f2e1dd5 on January 28, 2021 22:35
@xdustinface marked this pull request as ready for review January 28, 2021 22:36
@xdustinface force-pushed the pr-llmq-data-recovery branch from f2e1dd5 to 872a6b5 on January 28, 2021 23:47
@xdustinface (Author) commented:

Just added 744c8d0 with the last force push. This fixes an issue coming from #3953, where recover=True for wait_for_quorum_data wasn't even used/needed. The recover parameter shouldn't have existed in that PR; that was obviously a mistake I made when, during development, I added a generalized version of wait_for_quorum_data to test_framework.py instead of having a specific version in p2p_quorum_data.py and feature_llmq_data_recovery.py. I guess we can just fix it in here?

@xdustinface force-pushed the pr-llmq-data-recovery branch from 872a6b5 to 92127d2 on January 29, 2021 02:49
@xdustinface (Author) commented:

And one more force push:

Resolved review threads (outdated): src/llmq/quorums.cpp, src/llmq/quorums_utils.h, src/llmq/quorums_utils.cpp
@UdjinM6 left a comment:

See below or maybe UdjinM6@16a59e0 instead of suggestions in src/llmq/quorums.cpp

Resolved review threads (outdated): src/llmq/quorums.cpp (×2), test/functional/feature_llmq_data_recovery.py
@@ -182,6 +335,66 @@ CQuorumManager::CQuorumManager(CEvoDB& _evoDb, CBLSWorker& _blsWorker, CDKGSessi
    CLLMQUtils::InitQuorumsCache(scanQuorumsCache);
}

void CQuorumManager::TriggerQuorumDataRecoveryThreads(const CBlockIndex* pIndex) const
{
    if (!fMasternodeMode || !CLLMQUtils::QuorumDataRecoveryEnabled() || pIndex == nullptr) {
Member commented:

argument checks are probably the most expensive (and most likely to be okay) so imo should be last thing to check

Suggested change:
-    if (!fMasternodeMode || !CLLMQUtils::QuorumDataRecoveryEnabled() || pIndex == nullptr) {
+    if (!fMasternodeMode || pIndex == nullptr || !CLLMQUtils::QuorumDataRecoveryEnabled()) {

Author replied:

Hm actually, I guess since pIndex can't even be nullptr in this case (because UpdatedBlockTip won't ever be called with nullptr and I don't yet see a reason to trigger it from somewhere else) it should still be better how it is right now, no? That at least was my reason to put them in this order. Basically, with the suggested change you would always do a (trivial, but) redundant pIndex == nullptr check if QuorumDataRecoveryEnabled is false, and if it's true it doesn't matter, if I'm not missing something?
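For reference, a tiny standalone example of the short-circuit behaviour this reasoning relies on (plain C++, not Dash code):

```cpp
#include <iostream>

static bool FirstCondition()  { std::cout << "first evaluated\n";  return true;  }
static bool SecondCondition() { std::cout << "second evaluated\n"; return false; }

int main()
{
    // operator|| evaluates left to right and stops at the first operand
    // that is true, so SecondCondition() never runs here. The ordering of
    // the early-return conditions therefore only matters for the cases
    // where the earlier conditions evaluate to false.
    if (FirstCondition() || SecondCondition()) {
        std::cout << "early return would be taken\n";
    }
    return 0;
}
```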

xdustinface and others added 2 commits January 31, 2021 02:43
@xdustinface (Author) commented:

See below or maybe UdjinM6@16a59e0 instead of suggestions in src/llmq/quorums.cpp

I like 16a59e0 👍 I partially picked it in 00434a9, though not the cache thread related stuff because that seemed unrelated to this PR. Imo it still makes sense as a separate PR, but if you both feel we should just push it in here, I can, let me know :D

UdjinM6 previously approved these changes Jan 31, 2021
@UdjinM6 left a comment:

utACK

@UdjinM6 left a comment:

utACK

@PastaPastaPasta (Member) commented:

if the masternode itself is a member of any other active LLMQ with the type llmqType.

This should probably be accessible to nodes connecting via qwatch, no?

Additionally, could this be a DoS vector? What happens if I repeatedly request from the same node?

@xdustinface (Author) commented Feb 1, 2021:

if the masternode itself is a member of any other active LLMQ with the type llmqType.

This should probably be accessible to nodes connecting via qwatch, no?

Can you explain a bit more what you mean here?

Additionally, could this be a DoS vector? What happens if I repeatedly request from the same node?

I'm not sure where your concerns are coming from, since this PR doesn't change any behaviour of the underlying P2P messages, it just uses them (meaning the same rules as in #3953 still apply). Can you explain your thoughts?

@PastaPastaPasta left a comment:

I think I misunderstood what was going on :)

Now that I thought a bit more and drank more coffee, I think everything makes sense.

utACK

@xdustinface merged commit d6c6174 into dashpay:develop Feb 1, 2021
gades pushed a commit to cosanta/cosanta-core that referenced this pull request Mar 12, 2022
…tor sync (dashpay#3964)

* llmq: Implement automated DKG recovery threads

* llmq: Implement quorum verification vector sync

* init: Validate quorum data recovery related command line parameter

* test: Add quorum_data_request_timeout_seconds in DashTestFramework

* test: Test quorum data recovery in feature_llmq_data_recovery.py

* test: Add feature_llmq_data_recovery.py to BASE_SCRIPTS

* test: Fix quorum_data_request_expiration_timeout in wait_for_quorum_data

* test: Always test the existence of secretKeyShare in test_mn_quorum_data

With this change it also validates that "secretKeyShare" is not in `quorum_info` if it's not expected to be in there. Before, this was basically just not tested.

* llmq|test: Use bool as argument type for -llmq-data-recovery

* llmq: Always set nTimeLastSuccess to 0

* test: Set -llmq-data-recovery=0 in p2p_quorum_data.py

* test: Simplify test_mns

Co-authored-by: UdjinM6 <UdjinM6@users.noreply.github.com>

* refactor: pass CQuorumCPtr to StartQuorumDataRecoveryThread

* test: Fix thread name in comment

Co-authored-by: UdjinM6 <UdjinM6@users.noreply.github.com>