-
Notifications
You must be signed in to change notification settings - Fork 36.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Overhaul transaction request logic #19184
Conversation
b4cf759
to
0b836ea
Compare
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
Concept ACK. I was reading the description and this part is a bit unclear:
Why invalid transactions are re-requested? Maybe I can figure out after reading the code, but right now this is confusing. |
It requires invalid witnesses. When you request a txid T from peer A, and get a response with a transaction with invalid witness, but txid T, then the announcement (T,A) gets marked as COMPLETED. |
Concept ACK |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concept ACK
There's a lot to review here. It'll take a while to work through it.
I've left a few small style comments from a quick read through the new header file.
0b836ea
to
37c2614
Compare
a3dac75
to
1edf640
Compare
Concept ACK Very nice fuzzing harness! |
9beff75
to
38e2ff4
Compare
38e2ff4
to
57bdd00
Compare
I've made a few changes:
As there are still other changes I want to make to the behavior, I've marked this as Draft for now. |
57bdd00
to
1309d85
Compare
1309d85
to
4ffeaa9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approach ACK, mostly asking for specification/behavior clarifications. I see better how boost::multi_index_container fits well for the job here, not a concern to me anymore.
Overall, great work!
* - Entries are called delayed if they have a reqtime different from std::chrono::microseconds::min(). This is the | ||
* case for entries from inbound non-whitelisted nodes. The priority for delayed entries is always higher than | ||
* non-delayed ones, so non-delayed ones are always preferred. | ||
* - Within the set of delayed peers and within the set of non-delayed peers: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It wasn't clear at first read if m_first
was applying for each set or the union of them. Why not restrain m_first
to non-delayed peer only to favor fastest among more trusted peers ? We always favor non-delayed ones over delayed ones, so in PromoteCandidateNew
, an entry from the former is likely to be favored but we sorting them uniformly randomly obviously don't favor the fastest one. Or do you assume we receive and fetch in same threading sequence (ThreadMessageHandler
) and reqtime
initialized to now ensure this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "first" concept works differently now. Have a look at #19988 to see if your comment still applies.
* | ||
* === High level behavior === | ||
* | ||
* Transaction downloads are attempted in random order, first choosing among non-delayed peers, then delayed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a bit confusing to talk about delayed/non-delayed peers. Transaction entries in function of their announcing peers types but we may process those uniformly in other context. Also random order minus bias for first announcer ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #19988. The "delayed"/"nondelayed" distinction doesn't exist anymore. There is an explicit separate boolean to ReceivedInv
now for preferred/nonpreferred, and a way to specify reqtime explicitly. (in particular, because txid peers get a delay, even though they're not preferred like outbounds are - so these concepts needed to be split).
* ones whose delay has passed. When a NOTFOUND is received, an invalid transaction is received, or a download | ||
* times out (after a configurable delay), the next candidate is immediately scheduled according to the same rules | ||
* above. The same transaction is never re-requested from the same peer, unless the transaction was forgotten about | ||
* in the mean time. This happens whenever no candidates remain, or when a transaction is successfully received. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean exactly here ? If a transaction has been successfully received (AlreadyHave() == true) we won't re-request anyway, unless g_recently_confirmed
rolls over ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is just clarifying when a transaction is forgotten about, which can happen when either no candidates for a tx remain, or when the transaction is successfully received. Only the first of those two matters to re-requesting, but it's still important to specify that transactions are forgotten about when they're successfully received. See if it's clearer now.
* above. The same transaction is never re-requested from the same peer, unless the transaction was forgotten about | ||
* in the mean time. This happens whenever no candidates remain, or when a transaction is successfully received. | ||
* | ||
* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you might laid out more high-level properties of transaction request logic here, among others:
"Transaction downloads never guarantee to succeed, namely receiving transaction if you hear about its identifier. Success should be define as a best-effort, guaranteeing we advance towards entry removal for any peer who announced a txid"
"To optimize bandwidth-saving, we limit parallel-fetching and favor sequential fetching (there MUST be only one REQUESTED entry at anytime for any given txid)"
"We don't guarantee fetching again a transaction for which validity/standardness may have change between re-announcements, until next block announcement (a AlreadyHave() triggers all txid entries removal)"
"Assuming a fast-inbound attacker-controlled peer and low outbound not-controlled ones, a transaction propagation can be delayed at most for one GETDATA_TX_INTERVAL period"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a lot of comments explaining the rationale and some analysis to txrequest.h in #19989
"Transaction downloads never guarantee to succeed, namely receiving transaction if you hear about its identifier. Success should be define as a best-effort, guaranteeing we advance towards entry removal for any peer who announced a txid"
I have difficulty parsing what you're trying to say here, but I've added some related comments. Please have a look to see if it's better addressed now.
"To optimize bandwidth-saving, we limit parallel-fetching and favor sequential fetching (there MUST be only one REQUESTED entry at anytime for any given txid)"
Good idea, done.
"We don't guarantee fetching again a transaction for which validity/standardness may have change between re-announcements, until next block announcement (a AlreadyHave() triggers all txid entries removal)"
I don't understand this. New block announcements don't affect TxRequestTracker (except for transactions that are in the block, and we don't need anymore).
"Assuming a fast-inbound attacker-controlled peer and low outbound not-controlled ones, a transaction propagation can be delayed at most for one GETDATA_TX_INTERVAL period"
I added a much more accurate formula.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have difficulty parsing what you're trying to say here, but I've added some related comments. Please have a look to see if it's better addressed now.
IIRC I meaned the following. "Learning about a transaction identifier doesn't guarantee the node will successfully receive the transaction, no more take special steps to ensure it as sending non-solicited GETDATAs. An entry is guarantee to never stuck, assuming the local clock is moving forward"
The first point is a minor. The second underscores an important property of new TxRequestTracker design, its state machine always progress towards a terminal state and should never halt. Though it's so obvious it's worthy to comment.
I don't understand this. New block announcements don't affect TxRequestTracker (except for transactions that are in the block, and we don't need anymore).
IIRC I meaned the following scenario:
- a peer announces and sends Tx 1, it is rejected due to its unconfirmed parent being invalid
- all other pending requests for Tx 1 are deleted (ForgetTxHash)
- the unconfirmed parent is re-announced, sent and accepted under a different wtxid
- a peer announces Tx 1, the TxRequestTracker will never learn about it due to AlreadyHave()==true
- a block is received, the rejection filter is received
- a peer announces Tx 1, the TxRequestTracker learns and requests it
That said, not reevaluating validity of child transactions when the parent one has changed is already an existing issue. Maybe RequestTx comment could be updated to underscore that a received INV isn't unconditionally process by TxRequestTracker but depends of existing transaction relay state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Learning about a transaction identifier doesn't guarantee the node will successfully receive the transaction, no more take special steps to ensure it as sending non-solicited GETDATAs.
I still don't understand what you're trying to say.
The second underscores an important property of new TxRequestTracker design, its state machine always progress towards a terminal state and should never halt.
I just today realized that there is some nuance here. As long as you have at least one non-attacker peer that announces a transaction, I believe it is probabilistically true (over time the probability of not reaching a final state, where the transaction is forgotten about, tends to 0), but not more than that. An attacker can disconnect and reconnect, and with (arbitrary) amounts of luck, an attacker can be chosen every time, making the transaction stay arbitrarily long.
I'm not sure it's worth going that deep in comments.
That said, not reevaluating validity of child transactions when the parent one has changed is already an existing issue. Maybe RequestTx comment could be updated to underscore that a received INV isn't unconditionally process by TxRequestTracker but depends of existing transaction relay state.
I don't think this has anything to do with TxRequestTracker - it just does as it's told, and the current code in net_processing, will not immediately retry for entries in recentRejects. We can document this, but it's a far more high level thing, not specifying the behavior of the TxRequestTracker. Maybe more something for the wiki?
* - ReceivedInv(txid, peer) adds a new CANDIDATE entry, unless one already exists for that (txid, peer) combination | ||
* (whether it's CANDIDATE, REQUESTED, or COMPLETED). | ||
* | ||
* - DeletedPeer(peer) deletes all entries for a given peer. It should be called when a peer goes offline. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should underscore state transition to CANDIDATE_BEST for the highest-scored CANDIDATE_READY. If we reintroduce in-flight rate-limiting we should avoid that a malicious A disconnecting can slowdown fetching from B, unique peer also announcing the malevolent set of txn max-in-flight-sized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the description here is about the specification, where the distinction between the various CANDIDATE classes doesn't exist; that's an implementation detail.
There are many caveats to having in-flight limits... I couldn't come up with a way to do it basically, and then realized it's not desirable either (if a peer is overloaded, but still the only one to announce a tx, we still want to fetch it!). There are many ways by which overfull peers could be deprioritized safely though without introducing a hard cap.
|
||
void TxRequestTracker::ReceivedResponse(uint64_t peer, const uint256& txid) | ||
{ | ||
// We need to search the ByPeer index for both (peer, false, txid) and (peer, true, txid). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ByPeer index second member as true signals a state equal to CANDIDATE_BEST but I'm not sure if you can effectively hit here. An entry state is _BEST only between GetRequestable
and either RequestedTx/AlreadyHaveTx
, both of them respectively transitioning to REQUESTED/entry removal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As explained elsewhere, no, BEST exists for much longer than that, and I'm 99% sure this can be hit.
@@ -2599,7 +2482,7 @@ void ProcessMessage( | |||
pfrom.fDisconnect = true; | |||
return; | |||
} else if (!fAlreadyHave && !chainman.ActiveChainstate().IsInitialBlockDownload()) { | |||
RequestTx(State(pfrom.GetId()), inv.hash, current_time); | |||
RequestTx(pfrom, inv.hash, current_time); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If fAlreadyHave==true
call AlreadyHaveTx
to clean transaction entries early ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that's necessary - we just have to call ForgetTx
whenever a transaction becomes alreadyhave, and prevent adding announcements for alreadyhave ones. If that's done correctly, an addition ForgetTx
here would always be a no-op.
@ariard Thanks for the thorough review. I'll go over your comments soon, but know there are a few imminent changes first (namely the suggestion #19184 (comment), and adding support for wtxid relay tracking from #18044). If you want to help working on the tests, it's probably best to wait until those are complete. |
Concept ACK, on first read-through the changes look good, will review when this comes out of draft status. |
concept ACK |
fd9a006 Report and verify expirations (Pieter Wuille) 86f50ed Delete limitedmap as it is unused now (Pieter Wuille) cc16fff Make txid delay penalty also apply to fetches of orphan's parents (Pieter Wuille) 173a1d2 Expedite removal of tx requests that are no longer needed (Pieter Wuille) de11b0a Reduce MAX_PEER_TX_ANNOUNCEMENTS for non-PF_RELAY peers (Pieter Wuille) 242d164 Change transaction request logic to use txrequest (Pieter Wuille) 5b03121 Add txrequest fuzz tests (Pieter Wuille) 3c7fe0e Add txrequest unit tests (Pieter Wuille) da3b8fd Add txrequest module (Pieter Wuille) Pull request description: This replaces the transaction request logic with an encapsulated class that maintains all the state surrounding it. By keeping it stand alone, it can be easily tested (using included unit tests and fuzz tests). The major changes are: * Announcements from outbound (and whitelisted) peers are now always preferred over those from inbound peers. This used to be the case for the first request (by delaying the first request from inbound peers), and a bias afters. The 2s delay for requests from inbound peers still exists, but after that, if viable outbound peers remain for any given transaction, they will always be tried first. * No more hard cap of 100 in flight transactions per peer, as there is less need for it (memory usage is linear in the number of announcements, but independent from the number in flight, and CPU usage isn't affected by it). Furthermore, if only one peer announces a transaction, and it has over 100 in flight already, we still want to request it from them. The cap is replaced with a rule that announcements from such overloaded peers get an additional 2s delay (possibly combined with the existing 2s delays for inbound connections, and for txid peers when wtxid peers are available). * The limit of 100000 tracked announcements is reduced to 5000; this was excessive. This can be bypassed using the PF_RELAY permission (to accommodate locally dumping a batch of many transactions). This replaces #19184, rebased on #18044 and with many small changes. ACKs for top commit: ariard: Code Review ACK fd9a006. I've reviewed the new TxRequestTracker, its integration in net_processing, unit/functional/fuzzing test coverage. I looked more for soundness of new specification rather than functional consistency with old transaction request logic. MarcoFalke: Approach ACK fd9a006 🏹 naumenkogs: Code Review ACK fd9a006. I've reviewed everything, mostly to see how this stuff works at the lower level (less documentation-wise, more implementation-wise), and to try breaking it with unexpected sequences of events. jnewbery: utACK fd9a006 jonatack: WIP light ACK fd9a006 have read the code, verified that each commit is hygienic, e.g. debug build clean and tests green, and have been running a node on and off with this branch and grepping the net debug log. Am still unpacking the discussion hidden by GitHub by fetching it via the API and connecting the dots, storing notes and suggestions in a local branch; at this point none are blockers. ryanofsky: Light code review ACK fd9a006, looking at txrequest implementation, unit test implementation, and net_processing integration, just trying to understand how it works and looking for anything potentially confusing in the implementation. Didn't look at functional tests or catch up on review discussion. Just a sanity check review focused on: Tree-SHA512: ea7b52710371498b59d9c9cfb5230dd544fe9c6cb699e69178dea641646104f38a0b5ec7f5f0dbf1eb579b7ec25a31ea420593eff3b7556433daf92d4b0f0dd7
…cific MN p2p logic overhault fd9a006 Report and verify expirations (Pieter Wuille) 86f50ed Delete limitedmap as it is unused now (Pieter Wuille) cc16fff Make txid delay penalty also apply to fetches of orphan's parents (Pieter Wuille) 173a1d2 Expedite removal of tx requests that are no longer needed (Pieter Wuille) de11b0a Reduce MAX_PEER_TX_ANNOUNCEMENTS for non-PF_RELAY peers (Pieter Wuille) 242d164 Change transaction request logic to use txrequest (Pieter Wuille) 5b03121 Add txrequest fuzz tests (Pieter Wuille) 3c7fe0e Add txrequest unit tests (Pieter Wuille) da3b8fd Add txrequest module (Pieter Wuille) Pull request description: This replaces the transaction request logic with an encapsulated class that maintains all the state surrounding it. By keeping it stand alone, it can be easily tested (using included unit tests and fuzz tests). The major changes are: * Announcements from outbound (and whitelisted) peers are now always preferred over those from inbound peers. This used to be the case for the first request (by delaying the first request from inbound peers), and a bias afters. The 2s delay for requests from inbound peers still exists, but after that, if viable outbound peers remain for any given transaction, they will always be tried first. * No more hard cap of 100 in flight transactions per peer, as there is less need for it (memory usage is linear in the number of announcements, but independent from the number in flight, and CPU usage isn't affected by it). Furthermore, if only one peer announces a transaction, and it has over 100 in flight already, we still want to request it from them. The cap is replaced with a rule that announcements from such overloaded peers get an additional 2s delay (possibly combined with the existing 2s delays for inbound connections, and for txid peers when wtxid peers are available). * The limit of 100000 tracked announcements is reduced to 5000; this was excessive. This can be bypassed using the PF_RELAY permission (to accommodate locally dumping a batch of many transactions). This replaces bitcoin#19184, rebased on bitcoin#18044 and with many small changes. ACKs for top commit: ariard: Code Review ACK fd9a006. I've reviewed the new TxRequestTracker, its integration in net_processing, unit/functional/fuzzing test coverage. I looked more for soundness of new specification rather than functional consistency with old transaction request logic. MarcoFalke: Approach ACK fd9a006 🏹 naumenkogs: Code Review ACK fd9a006. I've reviewed everything, mostly to see how this stuff works at the lower level (less documentation-wise, more implementation-wise), and to try breaking it with unexpected sequences of events. jnewbery: utACK fd9a006 jonatack: WIP light ACK fd9a006 have read the code, verified that each commit is hygienic, e.g. debug build clean and tests green, and have been running a node on and off with this branch and grepping the net debug log. Am still unpacking the discussion hidden by GitHub by fetching it via the API and connecting the dots, storing notes and suggestions in a local branch; at this point none are blockers. ryanofsky: Light code review ACK fd9a006, looking at txrequest implementation, unit test implementation, and net_processing integration, just trying to understand how it works and looking for anything potentially confusing in the implementation. Didn't look at functional tests or catch up on review discussion. Just a sanity check review focused on: Tree-SHA512: ea7b52710371498b59d9c9cfb5230dd544fe9c6cb699e69178dea641646104f38a0b5ec7f5f0dbf1eb579b7ec25a31ea420593eff3b7556433daf92d4b0f0dd7
This replaces the transaction request logic with an encapsulated class that maintains all the state surrounding it. By keeping it stand alone, it can be easily tested (using included unit tests and fuzz tests).
The new logic is close in spirit to the current one, but much easier to specify I think:
One significant change is the removal of the in-flight limit of 100 txids. I think this limit is not exactly desirable (e.g. when a txid is only announced by one peer, we should request it, regardless of how many other txids we're already waiting for), and with the changes to handle NOTFOUND immediately, there is less problems with high limits. This implementation's performance is also only determined by the number of total transaction/peer pairs tracked (O(log n)), not by the number in-flight, removing another reason for having that limit.
A few further improvements could be investigated and relatively easily added if desirable:
While the implementation is non-trivial due to performance and resource usage considerations, it has a significantly simpler exact specification. This specification is documented in src/txrequest.h, and reimplemented naively in the src/test/fuzz/txrequest.cpp fuzz test. It may be a good idea to review those first to understand the desired semantics before digging into the optimized implementation.