
network: Allow short-lived connections to query /status endpoint when at full capacity #6009

Merged

Conversation

hsoerensen
Contributor

@hsoerensen hsoerensen commented May 31, 2024

Summary

This PR allows a small set of connections (10 reserved in this proposal) to be made to the websocket network even when it is at full connection capacity. This addresses an existing operational-excellence risk: the public /status endpoint used for health service checks fails to respond when a server is operating at maximum peer capacity. That in turn creates a risk that a server may be rebooted, or that the on-call rotation receives a paging alarm, neither of which should happen merely because a server has reached its peer limit.

The 10 reserved connections are drawn from the reserved FD pool of 256. In my testing, an algod node running on an M2 Mac from within GoLand uses 56 open connections, so there is plenty of headroom. A limit of 10 reserved connections was chosen because multiple load balancers / status-check scripts may sit in front of an algod node.

The underlying issue was discussed in #5766, but the root cause (the health service endpoint being unreliable under load) was never remediated, and the solutions proposed there, based on tokens / limits, would not work for the newly introduced public health service endpoint.
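To illustrate the idea, here is a minimal sketch in Go (not the PR's actual code; `allowConnection` and `reservedHealthConnections` are hypothetical names, and the hard-coded `/status` comparison stands in for the real route matching):

```go
package main

import (
	"fmt"
	"net/http"
)

// reservedHealthConnections mirrors the proposal: a small headroom, drawn
// from the 256 reserved file descriptors, usable only by health checks.
const reservedHealthConnections = 10

// allowConnection reports whether an inbound request should be accepted,
// given the current connection count and the configured incoming limit.
func allowConnection(r *http.Request, current, limit int) bool {
	if current < limit {
		return true // normal capacity still available
	}
	// At or above the peer limit: only /status may use the reserved headroom.
	return r.URL.Path == "/status" && current < limit+reservedHealthConnections
}

func main() {
	status, _ := http.NewRequest("GET", "http://relay.example/status", nil)
	gossip, _ := http.NewRequest("GET", "http://relay.example/v1/testnet-v1.0/gossip", nil)
	fmt.Println(allowConnection(status, 100, 100)) // true: reserved headroom
	fmt.Println(allowConnection(gossip, 100, 100)) // false: rejected at capacity
}
```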

FAQ

Q: Why not introduce a new IncomingGossipLimit parameter?
A: This could be done; however, additional headroom would still have to be permitted somehow. Changing the semantics of existing configuration may leave node runners behind, since they may not adopt new configuration parameters, and I see no good way to change the current parameter's semantics without breaking backwards compatibility.

Q: Don't all servers have plenty of headroom with regards to peer limit?
A: This may or may not be the case depending on the network that the relays are operating on.

Q: Could this not be done on a different service endpoint?
A: The health service endpoint runs as part of the gossip service, so connections must be allowed through to that specific endpoint.

Q: Are there other places where the existing limit is enforced?
A: Yes. The existing limit is also enforced by the checkIncomingConnectionLimits function in wsNetwork.go, which runs later in the code path, after rejectingLimitListener has executed.

Q: Will this cause an issue with file descriptors?
A: It should not. There is a pool of 256 FDs reserved for short-lived operations, from which up to 10 additional FDs may be used in a short-lived fashion.

Q: Why not make the change in rejectingLimitListener.go?
A: rejectingLimitListener runs before the request route is known, so making the change there would require parsing the HTTP payload before the middleware can access it.
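As a simplified illustration of that layering (a hypothetical semaphore-based listener, similar in spirit to, but not copied from, rejectingLimitListener): Accept hands back a raw net.Conn, so at that layer no URL exists yet to match a /status exemption against.

```go
package main

import (
	"net"
	"sync"
)

// limitedListener caps concurrent connections. Accept returns a raw
// net.Conn: the HTTP request line has not been read at this point, so
// there is no route to inspect.
type limitedListener struct {
	net.Listener
	sem chan struct{} // counting semaphore sized to the connection limit
}

func (l *limitedListener) Accept() (net.Conn, error) {
	for {
		c, err := l.Listener.Accept()
		if err != nil {
			return nil, err
		}
		select {
		case l.sem <- struct{}{}:
			release := sync.OnceFunc(func() { <-l.sem })
			return &trackedConn{Conn: c, release: release}, nil
		default:
			c.Close() // over limit: dropped before any HTTP parsing
		}
	}
}

type trackedConn struct {
	net.Conn
	release func()
}

func (c *trackedConn) Close() error {
	c.release() // free the semaphore slot exactly once
	return c.Conn.Close()
}

func main() {
	inner, _ := net.Listen("tcp", "127.0.0.1:0")
	ln := &limitedListener{Listener: inner, sem: make(chan struct{}, 100)}
	_ = ln // a real server would pass this to http.Serve
}
```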

Q: Will this introduce breaking changes as some requests may be accepted and not served?
A: This changes the response mechanism for up to 10 requests, which may be accepted before being dropped. requestTracker already limits and responds with HTTP failure codes (4xx / 5xx) when a client is being throttled, and responding with a 503 did not change client behavior in my testing.
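A minimal sketch of that accept-then-reject behavior (hypothetical handler code with a placeholder overLimit check; not the PR's implementation):

```go
package main

import "net/http"

func main() {
	overLimit := func() bool { return true } // placeholder for the tracker's check
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		if overLimit() {
			// The TCP connection was accepted, but the tracker is over limit,
			// so the client gets an explicit 503 instead of a silent drop.
			http.Error(w, "connection limit exceeded", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	http.ListenAndServe(":8080", nil)
}
```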

Q: Why make a change in requestTracker.go?
A: The exception path in requestTracker.go enforces limits for other services. The gossip protocol already checks connection limits, but without the check in the request tracker, other services would still be able to serve requests.

Q: Why not use HealthServiceStatusPath?
A: That would create a circular dependency. I would love to extract paths across all services into a route-path configuration if that makes sense; for now I opted for duplication to reduce the amount of changes, as this is the first change where routes are checked directly in the requestTracker. If additional route checks are needed, refactoring at that point would be advisable. The current PR operates on the "2 > 0" principle: an improvement, even if duplicated, beats no improvement.

Test Plan

Set up a relay server by configuring NetAddress, and set "IncomingConnectionsLimit": 0.

Make requests to:

  • /v1/{genesis}/gossip: expect failure with HTTP response code 503
  • /v1/{genesis}/block/1: expect failure with HTTP response code 503
  • /status: expect success with HTTP response code 200

Set "IncomingConnectionsLimit": 1.
Make requests to

  • /v1/{genesis}/gossip. Expected response code 412 as pre-conditions are not fulfilled.
  • /v1/{genesis}/block/1. Expected response code 404 as block does not exist
  • /status. Expected success with HTTP response code 200
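For convenience, a small Go client that exercises these checks (the relay address and genesis ID are placeholders; expected status codes are as listed above):

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	base := "http://127.0.0.1:4160" // placeholder relay address
	paths := []string{
		"/v1/testnet-v1.0/gossip", // placeholder genesis ID
		"/v1/testnet-v1.0/block/1",
		"/status",
	}
	for _, p := range paths {
		resp, err := http.Get(base + p)
		if err != nil {
			fmt.Println(p, "error:", err)
			continue
		}
		resp.Body.Close()
		fmt.Println(p, "->", resp.StatusCode)
	}
}
```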

…capacity.

File descriptors used to allow these short-lived connections come from the reserved pool of 256 file descriptors.
@cce cce requested a review from ohill June 10, 2024 17:02
@gmalouf gmalouf changed the title Allow short-lived connections to query /status endpoint when at full capacity Monitoring: Allow short-lived connections to query /status endpoint when at full capacity Jun 10, 2024
@ohill
Contributor

ohill commented Jun 10, 2024

Appreciate all your digging into the guts of the network package, and for providing such a detailed explanation of your work!


codecov bot commented Jun 11, 2024

Codecov Report

Attention: Patch coverage is 25.00000% with 9 lines in your changes missing coverage. Please review.

Project coverage is 55.87%. Comparing base (88b0ca5) to head (6e40bc3).

Files                      Patch %   Lines
network/requestTracker.go  11.11%    7 Missing and 1 partial ⚠️
daemon/algod/server.go     0.00%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6009      +/-   ##
==========================================
- Coverage   55.89%   55.87%   -0.02%     
==========================================
  Files         482      482              
  Lines       68438    68447       +9     
==========================================
- Hits        38251    38248       -3     
- Misses      27592    27597       +5     
- Partials     2595     2602       +7     


@algorandskiy algorandskiy changed the title Monitoring: Allow short-lived connections to query /status endpoint when at full capacity network: Allow short-lived connections to query /status endpoint when at full capacity Jun 11, 2024
@algorandskiy algorandskiy merged commit 985512b into algorand:master Jun 11, 2024
19 checks passed
@hsoerensen hsoerensen deleted the allow-healthcheck-pass-when-busy branch June 11, 2024 22:15