-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[consensus] split round timeout msg out of vote msg #14433
Conversation
⏱️ 15h 19m total CI duration on this PR
🚨 1 job on the last run was significantly faster/slower than expected
|
1bccc3d
to
141cab9
Compare
timeout: TwoChainTimeout, | ||
author: Author, | ||
/// LedgerInfo of a block that is going to be committed in case this vote gathers QC. | ||
ledger_info: LedgerInfo, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, why the timeout carries a ledger info?
consensus/src/network.rs
Outdated
response_sender: callback, | ||
}) | ||
}, | ||
ConsensusMsg::RandGenMessage(req) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, is this block just formatting change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added RoundTimeoutMsg
handling, and then the formatter took over.
consensus/src/pending_votes.rs
Outdated
} | ||
} | ||
|
||
/// Insert a vote and if the vote is valid, return a QuorumCertificate preferentially over a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the comment seems out-of-date, this function only aggregates tc?
VoteReceptionResult::NewQuorumCertificate(qc) => { | ||
unreachable!() | ||
}, | ||
VoteReceptionResult::New2ChainTimeoutCertificate(tc) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we probably can split the vote from timeout in the pending_vote too, so the result can be separate (after we disable the current vote msg)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will add a TODO to do this after disabling the current vote handling.
consensus/src/round_manager.rs
Outdated
@@ -1545,6 +1671,9 @@ impl RoundManager { | |||
VerifiedEvent::VoteMsg(vote_msg) => { | |||
monitor!("process_vote", self.process_vote_msg(*vote_msg).await) | |||
} | |||
VerifiedEvent::RoundTimeoutV2Msg(vote_msg) => { | |||
monitor!("process_vote", self.process_round_timeout_msg(*vote_msg).await) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/process_vote/process_timeout
main: ConsensusConfigV1, | ||
quorum_store_enabled: bool, | ||
order_vote_enabled: bool, | ||
round_timeout_msg_enabled: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason this needs to be on-chain config? we should be able to support both modes and transition in two releases?
5672524
to
c37d436
Compare
c37d436
to
1daf468
Compare
1daf468
to
94ec984
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
94ec984
to
bb52454
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
a594bc8
to
9948aac
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
3218147
to
a1c10ef
Compare
a1c10ef
to
4e1780e
Compare
810b8ea
to
0f4ff74
Compare
@@ -225,7 +228,7 @@ impl RoundState { | |||
|
|||
/// Return if already voted for timeout | |||
pub fn is_vote_timeout(&self) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the name is a bit confusing, this should be already_vote_timeout
or something?
consensus/src/network.rs
Outdated
@@ -405,6 +406,12 @@ impl NetworkSender { | |||
self.broadcast(msg).await | |||
} | |||
|
|||
pub async fn broadcast_round_timeout(&self, round_timeout: RoundTimeoutMsg) { | |||
fail_point!("consensus::send::vote", |_| ()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/vote/timeout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reused the name to preserve any existing usages but I don't see any usages for these failpoints.
@@ -99,6 +100,83 @@ impl PendingVotes { | |||
} | |||
} | |||
|
|||
/// Insert a RoundTimeout and return a TimeoutCertificate if it can be formed | |||
pub fn insert_round_timeout( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should re-use the code for insert_vote as well? at least majority piece
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean checking if there was an existing timeout vote?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no I mean code wise it's duplicate of the piece in insert_vote?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The return result changes between the two implementations. In the old code, we return vote voting power if we don't have enough to aggregate regardless, but in new code, we return the tc voting power. If we want to keep existing logic then we have to keep them separate.
consensus/src/round_manager.rs
Outdated
@@ -1197,6 +1253,9 @@ impl RoundManager { | |||
state = vote.vote_data().proposed().executed_state_id(), | |||
is_timeout = vote.is_timeout(), | |||
); | |||
if self.local_config.enable_round_timeout_msg { | |||
error!("Received Vote with Timeout with Round Timeout Message enabled."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this holds, during rollout, some nodes will have the flag enabled while others don't. the flag can only control sender side but not receiver
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is meant to be a debugging statement, but since we don't send debug logs to humio, I made them errors. 🤦🏽. I just record the log and continue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but it'll happen all the time during rollout (if there's timeout)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is what my expectation is. If it can be useful to debug, why not? Don't you think it's useful? I can downgrade to warn as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we probably don't want noise during the rollout, this doesn't give much information as well since it's expected during the release
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed it. Done.
consensus/src/round_manager.rs
Outdated
epoch = timeout.epoch(), | ||
round = timeout.round(), | ||
); | ||
if !self.local_config.enable_round_timeout_msg { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this too, the first release needs to have nodes enabling receiving, then the following release enabling the flag for sending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above. just logging for debugging and send to humio. I plan to ignore it from alerts. I could also make this a warn instead.
8d78e14
to
410c341
Compare
410c341
to
26d5580
Compare
); | ||
ensure!( | ||
self.round_timeout.two_chain_timeout().hqc_round() | ||
<= self.sync_info.highest_certified_round(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need <= here?
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
✅ Forge suite
|
✅ Forge suite
|
✅ Forge suite
|
Description
This PR splits out the timeout data from the
Vote
struct into newRoundTimeout
struct. This means validators do not need to form and vote on a nil block when timing out a round. Also, the purpose of the nil block is satisfied by the order vote mechanism, which ensures the parent gets a QC even if the next round proposal fails/timeout.The change is behind a local config, which can be flipped after a full binary rollout.
Test Plan
Compat test flipping config flag false to true: https://github.com/aptos-labs/aptos-core/actions/runs/10603758720
This PR adds a Smoke test to flip the flag as well