llmq|init|test: Implement DKG data recovery / quorum verification vector sync #3964
Conversation
Force-pushed from 936045a to f2e1dd5
Force-pushed from f2e1dd5 to 872a6b5
Just added 744c8d0 with the last force push. This addresses an issue which comes from #3953:
With this change it also validates that "secretKeyShare" is not in `quorum_info` if it's not expected to be in there. Before, this was basically just not tested.
Force-pushed from 872a6b5 to 92127d2
And one more force push.
See below or maybe UdjinM6@16a59e0 instead of suggestions in src/llmq/quorums.cpp
```diff
@@ -182,6 +335,66 @@ CQuorumManager::CQuorumManager(CEvoDB& _evoDb, CBLSWorker& _blsWorker, CDKGSessi
     CLLMQUtils::InitQuorumsCache(scanQuorumsCache);
 }
+
+void CQuorumManager::TriggerQuorumDataRecoveryThreads(const CBlockIndex* pIndex) const
+{
+    if (!fMasternodeMode || !CLLMQUtils::QuorumDataRecoveryEnabled() || pIndex == nullptr) {
```
Argument checks are probably the most expensive (and the most likely to be okay), so imo they should be the last thing checked:
```diff
-    if (!fMasternodeMode || !CLLMQUtils::QuorumDataRecoveryEnabled() || pIndex == nullptr) {
+    if (!fMasternodeMode || pIndex == nullptr || !CLLMQUtils::QuorumDataRecoveryEnabled()) {
```
Hm actually, I guess since `pIndex` can't even be `nullptr` in this case (because `UpdatedBlockTip` won't ever be called with `nullptr`, and I don't yet see a reason to trigger it from somewhere else), it should still be better how it is right now, no? That at least was my reason to put them in this order. Basically, with the suggested change you would always do a (trivial, but) redundant `pIndex == nullptr` check if `QuorumDataRecoveryEnabled` is `false`, and if it's `true` it doesn't matter, unless I'm missing something?
I like 16a59e0 👍 Partially picked it in 00434a9, not the cache thread related stuff though, because that seemed unrelated to this PR. But imo it still makes sense in a separate PR, unless you both feel we should just push it in here; I can, let me know :D
utACK
utACK
This should probably be accessible to nodes connecting via qwatch, no? Additionally, could this be a DoS vector? What happens if I repeatedly request from the same node?
Can you explain a bit more what you mean here? I'm not sure where your concerns are coming from, since this PR doesn't change any behaviour of the underlying P2P messages, it just uses them (meaning the same rules as in #3953 still apply). Can you explain your thoughts?
I think I misunderstood what was going on :)
Now that I thought a bit more and drank more coffee, I think everything makes sense.
utACK
…tor sync (dashpay#3964)

* llmq: Implement automated DKG recovery threads
* llmq: Implement quorum verification vector sync
* init: Validiate quorum data recovery related command line parameter
* test: Add quorum_data_request_timeout_seconds in DashTestFramework
* test: Test quorum data recovery in feature_llmq_data_recovery.py
* test: Add feature_llmq_data_recovery.py to BASE_SCRIPTS
* test: Fix quorum_data_request_expiration_timeout in wait_for_quorum_data
* test: Always test the existence of secretKeyShare in test_mn_quorum_data

  With this change it also validates that "secretKeyShare" is not in `quorum_info` if its not expected to be in there. Before this was basically just not tested.

* llmq|test: Use bool as argument type for -llmq-data-recovery
* llmq: Always set nTimeLastSuccess to 0
* test: Set -llmq-data-recovery=0 in p2p_quorum_data.py
* test: Simplify test_mns

  Co-authored-by: UdjinM6 <UdjinM6@users.noreply.github.com>

* refactor: pass CQuorumCPtr to StartQuorumDataRecoveryThread
* test: Fix thread name in comment

Co-authored-by: UdjinM6 <UdjinM6@users.noreply.github.com>
This PR implements DKG data recovery in 4d54892, which allows a member of `LLMQx` which loses its required DKG data (quorum verification vector and secret key share), e.g. due to unexpected loss/corruption of the database, to recover itself with the help of the other members of `LLMQx` and the P2P messages `QGETDATA`/`QDATA`. Before this PR such an invalid member would just stay invalid for the remaining "active time" of the LLMQ.

To solve a requirement of Platform/Tenderdash it also introduces quorum verification vector sync in eef2691, which makes it possible to configure a masternode to request the quorum verification vectors of all existing/new LLMQs of type `llmqType` if the masternode itself is a member of any other active LLMQ of type `llmqType`.
New command line parameters (hidden?):

- `-llmq-data-recovery` - Enable (`1`) / disable (`0`) automated data recovery / qvvec requests.
- `-llmq-qvvec-sync` - Defines from which LLMQ type the masternode should sync quorum verification vectors (can be used multiple times with different LLMQ types). See the config sketch after this list.
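For illustration only, a minimal `dash.conf` sketch of how a masternode might combine these options; the option names and the `llmq_100_67` value come from this description, everything else is assumed:

```
# Hedged example config, assuming a node already running in masternode mode.
llmq-data-recovery=1        # enable automated DKG data recovery threads
llmq-qvvec-sync=llmq_100_67 # also sync qvvecs of all llmq_100_67 quorums
```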
New thread `dash-q-recovery`

This new thread can exist exactly one time for each quorum (`llmqType`, `quorumHash`) and its job is to query the missing DKG data from the related quorum members. This is done in a loop: calculate the next masternode to try, connect to it, send the `QGETDATA` request and wait until the `QDATA` message comes in, with a timeout of 10 seconds before we try the next member. This continues until the required data has been received successfully or until all members of the quorum were asked without success. A hedged sketch of this loop is shown below.
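A minimal sketch of that loop, for illustration only; the helper functions declared here (`ConnectAndSendQGETDATA`, `WaitForQDATA`, `HaveRequiredDkgData`) are hypothetical placeholders, not the PR's actual API:

```cpp
#include <chrono>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helpers standing in for the real connection/message handling.
bool ConnectAndSendQGETDATA(const std::string& proTxHash);
bool WaitForQDATA(const std::string& proTxHash, std::chrono::seconds timeout);
bool HaveRequiredDkgData();

// members: quorum members sorted by ProTx hash; nStartOffset: this node's
// start offset (see the offset sketch further below).
void RecoverQuorumDataSketch(const std::vector<std::string>& members, size_t nStartOffset)
{
    for (size_t i = 0; i < members.size() && !HaveRequiredDkgData(); ++i) {
        const std::string& member = members[(nStartOffset + i) % members.size()];
        if (!ConnectAndSendQGETDATA(member)) {
            continue; // could not connect, move on to the next member
        }
        // Wait up to 10 seconds for the matching QDATA before trying the next member.
        if (WaitForQDATA(member, std::chrono::seconds(10))) {
            break; // required DKG data received and stored
        }
    }
}
```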
The recovery threads are started by `CQuorumManager::TriggerQuorumDataRecoveryThreads()` if required, which gets called with every block update in `CQuorumManager::UpdatedBlockTip()`.
The masternode decides, based on `CQuorum::GetQuorumRecoveryStartOffset()` and the sorted ProTx hashes of all members of the quorum, which member of the quorum it tries to connect to first. It also sleeps `offset * 100ms` before it tries to connect to the next member. I implemented those two as an attempt to balance the incoming requests equally across the members of the target quorum. This means if `-llmq-qvvec-sync=llmq_100_67` is used on all masternodes by Platform and a new `llmq_100_67` quorum has just formed, we will in the worst case have ~2400 nodes asking ~100 members for their quorum verification vectors, which means each member will then in the best case get 24 requests, and then all nodes should have the verification vector. Should be just good, or do I miss something? Let me know your thoughts! (A small sketch of the offset idea follows below.)
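For illustration of the balancing idea only, a hedged sketch (not the actual `CQuorum::GetQuorumRecoveryStartOffset()` implementation): derive a per-node start offset from this node's position among the sorted ProTx hashes, so the ~2400 requesting nodes spread roughly evenly over the ~100 quorum members (best case ~24 requests per member):

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hedged sketch: pick a start offset from this node's position in the sorted
// ProTx hash list. Combined with sleeping offset * 100ms between attempts,
// this staggers both the first target and the retry timing across nodes.
size_t GetQuorumRecoveryStartOffsetSketch(const std::string& myProTxHash,
                                          std::vector<std::string> allProTxHashes,
                                          size_t nQuorumMemberCount)
{
    std::sort(allProTxHashes.begin(), allProTxHashes.end());
    const auto it = std::find(allProTxHashes.begin(), allProTxHashes.end(), myProTxHash);
    const size_t nMyIndex = static_cast<size_t>(std::distance(allProTxHashes.begin(), it));
    return nMyIndex % nQuorumMemberCount; // e.g. ~2400 nodes over 100 members -> ~24 each
}
```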
Based on #3953