Description
The consensus reactor only disconnects from peers in three cases:
- msg fails to decode
- msg fails basic validity checks
- bad VoteSetMaj23Message
While we log other kinds of errors in processing (invalid blocks, proposals, votes), we never disconnect from peers due to them, which open us up to being spammed with bad consensus messages.
We also don't enforce that our peers ever send us anything useful. While we mark them as good if they send us 10,000 unique block parts or votes, we never mark them bad if they just fail to send us anything. This could result in us being connected to lots of useless peers.
We need to address both of these things:
- peers sending us bad messages
- peers not sending us any good messages
Bad Messages
Let's enumerate all messages and see when they indicate a bad peer (beyond failing their ValidateBasic() check).
In general we could probably add some more rules around all of these to prevent eg. receiving the same information multiple times from the same peer, but we'd need to ensure on the sending side that it never happens too.
NewRoundStepMessage, NewValidBlockMessage, HasVoteMessage
These are just informational so we know what to send the peer.
HasVote may contain an index beyond the size of the validator set for the given height, which should be a sign of malice, but currently we just ignore it.
VoteSetMaj23Message
We already stop peers for errors here.
ProposalHeartbeatMessage
Comes with a signature, and could be checked, but not actually helpful and should probably be eliminated outright (#2626).
ProposalMessage
Handled in the receiveRoutine - calls setProposal.
Any error here should result in disconnect from peers (as noted in #2158 (comment))
ProposalPOLMessage
Don't think there's much we can do here
BlockPartMessage
Handled in the receiveRoutine - calls addProposalBlockPart.
Some errors here (eg. in AddPart) should cause us to disconnect from the peer, but others (eg. error in unmarshaling) are not the particular peers fault, but the original proposer, and we don't know which peer corresponds to the actual proposer, so there's nothing we can do here (ultimately, we'd like to publish evidence about this so the app can punish the proposer).
VoteMessage
Handled in the receiveRoutine - calls tryAddVote. This can result in many kinds of errors - some resulting in disconnecting, and others not, hence we should use sentinels (#1327)
tryAddVote calls addVote, which may return ErrVoteHeightMismatch
, for which we shouldn't disconnect.
We then call cs.LastCommit.AddVote. Some errors here could come from honest peers, so we shouldn't disconnect:
- ErrVoteUnexpectedStep
- ErrVoteNonDeterministicSignature
- NewConflictingVoteError
The rest are from bad peers, and we should disconnect:
- ErrVoteInvalidValidatorIndex
- ErrVoteInvalidValidatorAddress
- vote.Verify errors (needs sentinel)
VoteSetBitsMessage
Don't think there's much we can do here
Useless Peers
A peer could just never send us any consensus messages and we wouldn't disconnect from it.
We need to enforce some cadence on our peers. While this is partially mitigated by preferentially connecting to at least some peers marked good via the PEX, we would probably benefit from having additional protections within the reactor, though we can leave that to a separate issue.
Summary
For now we should do the following, as described above:
- Stop peers for bad proposalsStop peers for bad block partsStop peers for bad votesRemove the heartbeat message
Activity
srmo commentedon Nov 17, 2018
@ebuchman I'd love trying myself on removal of
ProposalHeartbeatMessage
though I struggle with understanding why it was required in the first place. As a heavy user of the "no-empty-blocks" option I'm interested in any feature that helps with stability. So why is it OK to remove, i.e. why was it needed in the first place?Anyway, when I start working on it (be it with just removing the signing in the first step) should such a thing be done in a separate issue or just a banch referencing this issue here?
srmo commentedon Nov 17, 2018
I've done https://github.com/srmo/tendermint/tree/2871-remove-proposal-hearbeat
It shows that proposalHeartbeat signing (?) is mentioned in ADR-24.
Do we need an additional ADR for this issue here?
I'd really like to discuss the ramifications of removing a heartbeat mechanism in "no-empty-blocks" scenarios, where it is likely to have peers not sending anything meaningful for extended periods of time. How will this be covered?
ebuchman commentedon Nov 17, 2018
It was never introduced for any purpose other than the thought that "maybe it will be useful for debugging some time".
While it might be true, so far in debugging issues with this feature we've always been able to resort to the logs of each instance, so the extra message wasn't useful. In a case where you don't have access to the proposer's logs or otherwise lose them, we'd be losing that information.
Note that this message isn't gossipped, only sent to directly connected peers, so, in deployments where you want to hide the validator, eg. behind sentries, only the sentries would receive this message.
If there's a strong desire to keep this message, we would have to think securing it, and it's not clear it's worth it.
srmo commentedon Nov 17, 2018
OK. Understood.
I just wanted to make sure that it isn't some kind of keep alive mechanism in no-empty-blocks world.
So please, feel free to have a look at #2874
21 remaining items