
network: Allow short-lived connections to query /status endpoint when at full capacity #6009

Merged

Conversation

hsoerensen
Contributor

@hsoerensen hsoerensen commented May 31, 2024

Summary

This PR allows a small set of connections (10 reserved in this proposal) to be made to the websocket network even when it is at full connection capacity. This addresses an existing operational-excellence risk: the public /status endpoint used for health service checks fails to respond when a server is operating at maximum peer capacity. That in turn creates a risk that a server may be rebooted, or that the on-call rotation receives a paging alarm, neither of which should happen merely because a server has reached its peer limit.

The 10 reserved connections are drawn from the reserved FD pool of 256. In my testing, an algod node running on an M2 Mac from within GoLand uses 56 open connections, so there is plenty of headroom. A limit of 10 reserved connections was chosen because multiple load balancers / status-check scripts may sit in front of an algod node.

The underlying issue was discussed in #5766, but the root cause (the health service endpoint being unreliable under load) was never remediated, and the solutions proposed there, based on tokens / limits, would not work for the newly introduced public health service endpoint.
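To illustrate the idea, here is a minimal sketch in Go (not the PR's actual code; `allowConnection` and `reservedHealthConnections` are hypothetical names, and the hard-coded `/status` comparison stands in for the real route matching):

```go
package main

import (
	"fmt"
	"net/http"
)

// reservedHealthConnections mirrors the proposal: a small headroom, drawn
// from the 256 reserved file descriptors, usable only by health checks.
const reservedHealthConnections = 10

// allowConnection reports whether an inbound request should be accepted,
// given the current connection count and the configured incoming limit.
func allowConnection(r *http.Request, current, limit int) bool {
	if current < limit {
		return true // normal capacity still available
	}
	// At or above the peer limit: only /status may use the reserved headroom.
	return r.URL.Path == "/status" && current < limit+reservedHealthConnections
}

func main() {
	status, _ := http.NewRequest("GET", "http://relay.example/status", nil)
	gossip, _ := http.NewRequest("GET", "http://relay.example/v1/testnet-v1.0/gossip", nil)
	fmt.Println(allowConnection(status, 100, 100)) // true: reserved headroom
	fmt.Println(allowConnection(gossip, 100, 100)) // false: rejected at capacity
}
```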

FAQ

Q: Why not introduce a new IncomingGossipLimit parameter?
A: This could be done; however, additional headroom would still have to be permitted somehow. Changing the semantics of existing configuration may leave node runners behind, since they may not adopt new configuration parameters, and I see no good way to change the current parameter's semantics without breaking backwards compatibility.

Q: Don't all servers have plenty of headroom with regards to peer limit?
A: This may or may not be the case depending on the network that the relays are operating on.

Q: Could this not be done on a different service endpoint?
A: The health service endpoint runs as part of the gossip service, so connections must be allowed through to that specific endpoint.

Q: Are there other places where the existing limit is enforced?
A: Yes. The existing limit is also enforced by the checkIncomingConnectionLimits function in wsNetwork.go, which runs later in the code path, after rejectingLimitListener has executed.

Q: Will this cause an issue with file descriptors?
A: It should not. There is a pool of 256 FDs reserved for short-lived operations, from which up to 10 additional FDs may be used in a short-lived fashion.

Q: Why not make the change in rejectingLimitListener.go?
A: rejectingLimitListener runs before the request route is known, so making the change there would require parsing the HTTP payload before the middleware can access it.
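As a simplified illustration of that layering (a hypothetical semaphore-based listener, similar in spirit to, but not copied from, rejectingLimitListener): Accept hands back a raw net.Conn, so at that layer no URL exists yet to match a /status exemption against.

```go
package main

import (
	"net"
	"sync"
)

// limitedListener caps concurrent connections. Accept returns a raw
// net.Conn: the HTTP request line has not been read at this point, so
// there is no route to inspect.
type limitedListener struct {
	net.Listener
	sem chan struct{} // counting semaphore sized to the connection limit
}

func (l *limitedListener) Accept() (net.Conn, error) {
	for {
		c, err := l.Listener.Accept()
		if err != nil {
			return nil, err
		}
		select {
		case l.sem <- struct{}{}:
			release := sync.OnceFunc(func() { <-l.sem })
			return &trackedConn{Conn: c, release: release}, nil
		default:
			c.Close() // over limit: dropped before any HTTP parsing
		}
	}
}

type trackedConn struct {
	net.Conn
	release func()
}

func (c *trackedConn) Close() error {
	c.release() // free the semaphore slot exactly once
	return c.Conn.Close()
}

func main() {
	inner, _ := net.Listen("tcp", "127.0.0.1:0")
	ln := &limitedListener{Listener: inner, sem: make(chan struct{}, 100)}
	_ = ln // a real server would pass this to http.Serve
}
```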

Q: Will this introduce breaking changes as some requests may be accepted and not served?
A: This changes the response mechanism for up to 10 requests, which may be accepted before being dropped. requestTracker already limits and responds with HTTP failure codes (4xx / 5xx) when a client is being throttled, and responding with a 503 did not change client behavior in my testing.
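A minimal sketch of that accept-then-reject behavior (hypothetical handler code with a placeholder overLimit check; not the PR's implementation):

```go
package main

import "net/http"

func main() {
	overLimit := func() bool { return true } // placeholder for the tracker's check
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		if overLimit() {
			// The TCP connection was accepted, but the tracker is over limit,
			// so the client gets an explicit 503 instead of a silent drop.
			http.Error(w, "connection limit exceeded", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	http.ListenAndServe(":8080", nil)
}
```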

Q: Why make a change in requestTracker.go?
A: The exception path in requestTracker.go enforces limits for other services. The gossip protocol already checks connection limits, but without the check in the request tracker, other services would still be able to serve requests.

Q: Why not use HealthServiceStatusPath?
A: That would create a circular dependency. I would love to extract paths across all services into a route-path configuration if that makes sense; for now I opted for duplication to reduce the amount of changes, as this is the first change where routes are checked directly in the requestTracker. If additional route checks are needed, refactoring at that point would be advisable. The current PR operates on the "2 > 0" principle: an improvement, even if duplicated, beats no improvement.

Test Plan

Set up a relay server by configuring NetAddress, and set "IncomingConnectionsLimit": 0.

Make requests to:

  • /v1/{genesis}/gossip: expect failure with HTTP response code 503
  • /v1/{genesis}/block/1: expect failure with HTTP response code 503
  • /status: expect success with HTTP response code 200

Set "IncomingConnectionsLimit": 1.
Make requests to

  • /v1/{genesis}/gossip. Expected response code 412 as pre-conditions are not fulfilled.
  • /v1/{genesis}/block/1. Expected response code 404 as block does not exist
  • /status. Expected success with HTTP response code 200
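For convenience, a small Go client that exercises these checks (the relay address and genesis ID are placeholders; expected status codes are as listed above):

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	base := "http://127.0.0.1:4160" // placeholder relay address
	paths := []string{
		"/v1/testnet-v1.0/gossip", // placeholder genesis ID
		"/v1/testnet-v1.0/block/1",
		"/status",
	}
	for _, p := range paths {
		resp, err := http.Get(base + p)
		if err != nil {
			fmt.Println(p, "error:", err)
			continue
		}
		resp.Body.Close()
		fmt.Println(p, "->", resp.StatusCode)
	}
}
```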

…capacity.

File descriptors used to allow these short-lived connections come from the reserved pool of 256 file descriptors.
@cce cce requested a review from ohill June 10, 2024 17:02
@gmalouf gmalouf changed the title Allow short-lived connections to query /status endpoint when at full capacity Monitoring: Allow short-lived connections to query /status endpoint when at full capacity Jun 10, 2024
@ohill
Contributor

ohill commented Jun 10, 2024

Appreciate all your digging into the guts of the network package, and for providing such a detailed explanation of your work!


codecov bot commented Jun 11, 2024

Codecov Report

Attention: Patch coverage is 25.00000% with 9 lines in your changes missing coverage. Please review.

Project coverage is 55.87%. Comparing base (88b0ca5) to head (6e40bc3).

Files                      Patch %   Lines
network/requestTracker.go  11.11%    7 Missing and 1 partial ⚠️
daemon/algod/server.go     0.00%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6009      +/-   ##
==========================================
- Coverage   55.89%   55.87%   -0.02%     
==========================================
  Files         482      482              
  Lines       68438    68447       +9     
==========================================
- Hits        38251    38248       -3     
- Misses      27592    27597       +5     
- Partials     2595     2602       +7     


@algorandskiy algorandskiy changed the title Monitoring: Allow short-lived connections to query /status endpoint when at full capacity network: Allow short-lived connections to query /status endpoint when at full capacity Jun 11, 2024
@algorandskiy algorandskiy merged commit 985512b into algorand:master Jun 11, 2024
19 checks passed
@hsoerensen hsoerensen deleted the allow-healthcheck-pass-when-busy branch June 11, 2024 22:15