-
Notifications
You must be signed in to change notification settings - Fork 792
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
net/libp2p: Enforce outbound request-response timeout limits #7222
Conversation
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, looks good to me, one thing I would do is burn it in on the kusama validator over the weekend, just to see if this triggers in real work conditions, our assumption it doesn't but we might be surprised.
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks alright to me.
@@ -1086,10 +1164,10 @@ mod tests { | |||
}; | |||
use std::{iter, time::Duration}; | |||
|
|||
struct TokioExecutor(tokio::runtime::Runtime); | |||
struct TokioExecutor; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why we need this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a tiny change to not pass the runtime when building the swarm and let tokio::spawn
below do the job
failure Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
…nv/enforce-timeouts
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would change the collect & remove code path to a simple retain.
.iter() | ||
.filter_map(|(id, req)| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not retain? Then we don't need collect & remove.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a tiny borrow-checker issue coming from the sender consuming self: retain will give us &mut req
and the req.response
oneshot::Sender consumes self
when we send out the send(Err(RequestFailure::Network(OutboundFailure::Timeout)))
.
We could still avoid the extra alloc by adding an Option over the sender, will have a look if it complicates the code, thanks
.collect(); | ||
|
||
if !timedout_requests.is_empty() { | ||
log::warn!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this could only be triggered by a bug in libp2p/litep2p?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep this could be triggered only by a bug in libp2p (libp2p/rust-libp2p#5429). I expected that the latest libp2p version would solve this issue, however, I could reproduce it on our libp2p kusama validator (and locally):
2025-01-21 05:43:14.885 WARN tokio-runtime-worker sub-libp2p: Requests [(ProtocolRequestId { protocol: OnHeap("/b0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe/sync/2"), request_id: OutboundRequestId(53207) }, Some(20.479579184s))] force detected as timeout.
2025-01-21 05:43:18.178 WARN tokio-runtime-worker sub-libp2p: Received `RequestResponseEvent::OutboundFailure` with unexpected request id OutboundRequestId(53207) error Timeout from PeerId("12D3KooWJZuPc7KPtJcbYurX7brn3booPwbK53ucJhhPmuc1zZPt")
In this case libp2p produces a timeout event ~4s after we have detected the request as timed out. I'll also have a look at our poll implementation, maybe there's something there
Co-authored-by: Bastian Köcher <git@kchr.de>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
timeouts Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
All GitHub workflows were cancelled due to failure one of the required jobs. |
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Created backport PR for
Please cherry-pick the changes locally and resolve any conflicts. git fetch origin backport-7222-to-stable2407
git worktree add --checkout .worktree/backport-7222-to-stable2407 backport-7222-to-stable2407
cd .worktree/backport-7222-to-stable2407
git reset --hard HEAD^
git cherry-pick -x fd64a1e7768ba6e8676cbbf25c4e821a901c0a7f
git push --force-with-lease |
Created backport PR for
Please cherry-pick the changes locally and resolve any conflicts. git fetch origin backport-7222-to-stable2409
git worktree add --checkout .worktree/backport-7222-to-stable2409 backport-7222-to-stable2409
cd .worktree/backport-7222-to-stable2409
git reset --hard HEAD^
git cherry-pick -x fd64a1e7768ba6e8676cbbf25c4e821a901c0a7f
git push --force-with-lease |
Created backport PR for
Please cherry-pick the changes locally and resolve any conflicts. git fetch origin backport-7222-to-stable2412
git worktree add --checkout .worktree/backport-7222-to-stable2412 backport-7222-to-stable2412
cd .worktree/backport-7222-to-stable2412
git reset --hard HEAD^
git cherry-pick -x fd64a1e7768ba6e8676cbbf25c4e821a901c0a7f
git push --force-with-lease |
…imeout limits (#7302) Backport #7222 into `stable2412` from lexnv. See the [documentation](https://github.com/paritytech/polkadot-sdk/blob/master/docs/BACKPORT.md) on how to use this bot. <!-- # To be used by other automation, do not modify: original-pr-number: #${pull_number} --> --------- Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io> Co-authored-by: Alexandru Vasile <60601340+lexnv@users.noreply.github.com> Co-authored-by: Alexandru Vasile <alexandru.vasile@parity.io>
This PR enforces that outbound requests are finished within the specified protocol timeout.
The stable2412 version running libp2p 0.52.4 contains a bug which does not track request timeouts properly:
The issue has been detected while submitting libp2p -> litep2p requests in kusama. This aims to check that pending outbound requests have not timedout. Although the issue has been fixed in libp2p, there might be other cases where this may happen. For example:
For more context see: #7076 (comment)
RequestFinished { error: .. }
. That event is only used internally by substrate to increment metrics. However, we don't have the peer information available to propagate the event properly when we force-timeout the request. Considering this should most likely not happen in production (origin/master) and that we'll be able to extract information by warnings, I would say this is a good tradeoff for code simplicity:polkadot-sdk/substrate/client/network/src/service.rs
Line 1543 in 06e3b5c
Testing
Added a new test to ensure the timeout is reached properly, even if libp2p does not produce a response in due time.
I've also transitioned the tests to using
tokio::test
due to a limitation of CIcc @paritytech/networking