Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net/libp2p: Enforce outbound request-response timeout limits #7222

Merged
merged 24 commits into from
Jan 22, 2025

Conversation

lexnv
Copy link
Contributor

@lexnv lexnv commented Jan 17, 2025

This PR enforces that outbound requests are finished within the specified protocol timeout.

The stable2412 version running libp2p 0.52.4 contains a bug which does not track request timeouts properly:

The issue has been detected while submitting libp2p -> litep2p requests in kusama. This aims to check that pending outbound requests have not timedout. Although the issue has been fixed in libp2p, there might be other cases where this may happen. For example:

For more context see: #7076 (comment)

  1. Ideally, the force-timeout mechanism in this PR should never be triggered in production. However, origin/stable2412 occasionally encounters this issue. When this happens, 2 warnings may be generated:
  • one warning introduced by this PR wrt force timeout terminating the request
  • possible one warning when the libp2p decides (if at all) to provide the response back to substrate (as mentioned by @alexggh here and here
  1. This implementation does not propagate to the substrate service the RequestFinished { error: .. }. That event is only used internally by substrate to increment metrics. However, we don't have the peer information available to propagate the event properly when we force-timeout the request. Considering this should most likely not happen in production (origin/master) and that we'll be able to extract information by warnings, I would say this is a good tradeoff for code simplicity:

RequestFailure::Network(OutboundFailure::Timeout) => "timeout",

Testing

Added a new test to ensure the timeout is reached properly, even if libp2p does not produce a response in due time.

I've also transitioned the tests to using tokio::test due to a limitation of CI

--- TRY 1 STDERR:        sc-network request_responses::tests::max_response_size_exceeded ---
thread 'request_responses::tests::max_response_size_exceeded' panicked at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/time/interval.rs:139:26:
there is no reactor running, must be called from the context of a Tokio 1.x runtime

cc @paritytech/networking

lexnv added 2 commits January 17, 2025 11:57
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
@lexnv lexnv added T0-node This PR/Issue is related to the topic “node”. I2-bug The node fails to follow expected behavior. labels Jan 17, 2025
@lexnv lexnv self-assigned this Jan 17, 2025
@lexnv lexnv changed the title net/libp2p: Enfore outbound request-response timeout limits net/libp2p: Enforce outbound request-response timeout limits Jan 17, 2025
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
@lexnv lexnv added the A4-needs-backport Pull request must be backported to all maintained releases. label Jan 17, 2025
Copy link
Contributor

@alexggh alexggh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looks good to me, one thing I would do is burn it in on the kusama validator over the weekend, just to see if this triggers in real work conditions, our assumption it doesn't but we might be surprised.

lexnv added 3 commits January 17, 2025 17:14
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
lexnv and others added 8 commits January 20, 2025 12:40
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Copy link
Contributor

@alexggh alexggh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks alright to me.

@@ -1086,10 +1164,10 @@ mod tests {
};
use std::{iter, time::Duration};

struct TokioExecutor(tokio::runtime::Runtime);
struct TokioExecutor;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we need this change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a tiny change to not pass the runtime when building the swarm and let tokio::spawn below do the job

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Copy link
Member

@bkchr bkchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would change the collect & remove code path to a simple retain.

substrate/client/network/src/request_responses.rs Outdated Show resolved Hide resolved
Comment on lines 627 to 628
.iter()
.filter_map(|(id, req)| {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not retain? Then we don't need collect & remove.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a tiny borrow-checker issue coming from the sender consuming self: retain will give us &mut req and the req.response oneshot::Sender consumes self when we send out the send(Err(RequestFailure::Network(OutboundFailure::Timeout))).

We could still avoid the extra alloc by adding an Option over the sender, will have a look if it complicates the code, thanks

.collect();

if !timedout_requests.is_empty() {
log::warn!(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this could only be triggered by a bug in libp2p/litep2p?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep this could be triggered only by a bug in libp2p (libp2p/rust-libp2p#5429). I expected that the latest libp2p version would solve this issue, however, I could reproduce it on our libp2p kusama validator (and locally):

2025-01-21 05:43:14.885  WARN tokio-runtime-worker sub-libp2p: Requests [(ProtocolRequestId { protocol: OnHeap("/b0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe/sync/2"), request_id: OutboundRequestId(53207) }, Some(20.479579184s))] force detected as timeout.

2025-01-21 05:43:18.178  WARN tokio-runtime-worker sub-libp2p: Received `RequestResponseEvent::OutboundFailure` with unexpected request id OutboundRequestId(53207) error Timeout from PeerId("12D3KooWJZuPc7KPtJcbYurX7brn3booPwbK53ucJhhPmuc1zZPt")

In this case libp2p produces a timeout event ~4s after we have detected the request as timed out. I'll also have a look at our poll implementation, maybe there's something there

lexnv and others added 2 commits January 21, 2025 11:49
Co-authored-by: Bastian Köcher <git@kchr.de>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
timeouts

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
@paritytech-workflow-stopper
Copy link

All GitHub workflows were cancelled due to failure one of the required jobs.
Failed workflow url: https://github.com/paritytech/polkadot-sdk/actions/runs/12907455229
Failed job name: cargo-clippy

lexnv added 2 commits January 22, 2025 14:22
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
@lexnv lexnv enabled auto-merge January 22, 2025 15:39
@lexnv lexnv added this pull request to the merge queue Jan 22, 2025
Merged via the queue into master with commit fd64a1e Jan 22, 2025
203 of 204 checks passed
@lexnv lexnv deleted the lexnv/enforce-timeouts branch January 22, 2025 17:26
@paritytech-cmd-bot-polkadot-sdk

Created backport PR for stable2407:

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin backport-7222-to-stable2407
git worktree add --checkout .worktree/backport-7222-to-stable2407 backport-7222-to-stable2407
cd .worktree/backport-7222-to-stable2407
git reset --hard HEAD^
git cherry-pick -x fd64a1e7768ba6e8676cbbf25c4e821a901c0a7f
git push --force-with-lease

@paritytech-cmd-bot-polkadot-sdk

Created backport PR for stable2409:

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin backport-7222-to-stable2409
git worktree add --checkout .worktree/backport-7222-to-stable2409 backport-7222-to-stable2409
cd .worktree/backport-7222-to-stable2409
git reset --hard HEAD^
git cherry-pick -x fd64a1e7768ba6e8676cbbf25c4e821a901c0a7f
git push --force-with-lease

@paritytech-cmd-bot-polkadot-sdk

Created backport PR for stable2412:

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin backport-7222-to-stable2412
git worktree add --checkout .worktree/backport-7222-to-stable2412 backport-7222-to-stable2412
cd .worktree/backport-7222-to-stable2412
git reset --hard HEAD^
git cherry-pick -x fd64a1e7768ba6e8676cbbf25c4e821a901c0a7f
git push --force-with-lease

EgorPopelyaev pushed a commit that referenced this pull request Jan 23, 2025
…imeout limits (#7302)

Backport #7222 into `stable2412` from lexnv.

See the
[documentation](https://github.com/paritytech/polkadot-sdk/blob/master/docs/BACKPORT.md)
on how to use this bot.

<!--
  # To be used by other automation, do not modify:
  original-pr-number: #${pull_number}
-->

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Co-authored-by: Alexandru Vasile <60601340+lexnv@users.noreply.github.com>
Co-authored-by: Alexandru Vasile <alexandru.vasile@parity.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A4-needs-backport Pull request must be backported to all maintained releases. I2-bug The node fails to follow expected behavior. T0-node This PR/Issue is related to the topic “node”.
Projects
Status: Blocked ⛔️
Development

Successfully merging this pull request may close these issues.

4 participants