network: Allow short-lived connections to query /status endpoint when at full capacity #6009
Merged: algorandskiy merged 8 commits into algorand:master from hsoerensen:allow-healthcheck-pass-when-busy on Jun 11, 2024
Conversation
…capacity. File descriptors used for these short-lived connections come from the reserved pool of 256 file descriptors.
gmalouf changed the title from "Allow short-lived connections to query /status endpoint when at full capacity" to "Monitoring: Allow short-lived connections to query /status endpoint when at full capacity" on Jun 10, 2024
ohill reviewed on Jun 10, 2024
Appreciate all your digging into the guts of the network package, and for providing such a detailed explanation of your work!
ohill reviewed on Jun 11, 2024
…telemetry for dropped connections
ohill reviewed on Jun 11, 2024
ohill approved these changes on Jun 11, 2024
algorandskiy approved these changes on Jun 11, 2024
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6009      +/-  ##
==========================================
- Coverage   55.89%   55.87%   -0.02%
==========================================
  Files         482      482
  Lines       68438    68447       +9
==========================================
- Hits        38251    38248       -3
- Misses      27592    27597       +5
- Partials     2595     2602       +7

☔ View full report in Codecov by Sentry.
algorandskiy changed the title from "Monitoring: Allow short-lived connections to query /status endpoint when at full capacity" to "network: Allow short-lived connections to query /status endpoint when at full capacity" on Jun 11, 2024
Summary
This PR allows a small set of connections (10 reserved in the proposal) to be made to the websocket network even when the node is at full connection capacity. This addresses an existing operational-excellence risk: the public /status endpoint used for health checks fails to respond when a server is operating at maximum peer capacity, which in turn risks the server being rebooted or the on-call rotation being paged; neither should happen simply because a server is at its peer limit. The 10 reserved connections are taken from the reserved FD pool of 256. In my testing, an algod node uses 56 open connections when running on an M2 Mac from within GoLand, so there is plenty of headroom. A limit of 10 reserved connections was chosen because multiple load balancers / status-check scripts may sit in front of an algod node.
The underlying issue was discussed in #5766, but with no remediation of the underlying issue (the health service endpoint being unreliable under load), and the solutions proposed there (tokens / limits) would not work for the newly introduced public health service endpoint.
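To make the mechanism concrete, here is a minimal sketch of the general idea, assuming hypothetical identifiers (LimitListener, reserved, trackedConn); it is illustrative only and not the PR's actual code:

```go
// Illustrative sketch only: the identifiers here are hypothetical, not the
// PR's actual code. It shows the general shape of a listener that enforces
// an incoming-connection limit but tolerates a small reserved headroom so
// health probes can still connect when the node is at capacity.
package netlimit

import (
	"net"
	"sync"
	"sync/atomic"
)

// LimitListener caps accepted connections at limit, plus up to reserved
// extra short-lived connections (e.g. for /status probes).
type LimitListener struct {
	net.Listener
	limit    int64
	reserved int64
	active   int64
}

func (l *LimitListener) Accept() (net.Conn, error) {
	for {
		conn, err := l.Listener.Accept()
		if err != nil {
			return nil, err
		}
		if atomic.AddInt64(&l.active, 1) <= l.limit+l.reserved {
			// Within the normal limit or the reserved headroom: accept.
			// Connections in the headroom are expected to be short-lived;
			// a later layer decides whether to serve or drop them.
			return &trackedConn{Conn: conn, owner: l}, nil
		}
		// No capacity left at all: undo the count, close, keep accepting.
		atomic.AddInt64(&l.active, -1)
		conn.Close()
	}
}

// trackedConn decrements the active counter exactly once on Close.
type trackedConn struct {
	net.Conn
	owner *LimitListener
	once  sync.Once
}

func (c *trackedConn) Close() error {
	c.once.Do(func() { atomic.AddInt64(&c.owner.active, -1) })
	return c.Conn.Close()
}
```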
FAQ
Q: Why not introduce a new IncomingGossipLimit parameter?
A: This could be done; however, additional headroom would still have to be permitted somehow, because changing the semantics of existing configuration may leave node runners behind if they do not adopt new configuration parameters. I see no good way to change the semantics of the current parameter without breaking backwards compatibility.
Q: Don't all servers have plenty of headroom with regard to the peer limit?
A: This may or may not be the case depending on the network that the relays are operating on.
Q: Could this not be done on a different service endpoint?
A: To fit into where the health service endpoint is running, which is part of the gossip service, we need to allow connections to that specific endpoint.
Q: Are there other places where the existing limit is enforced?
A: Yes. The existing limit is enforced by the checkIncomingConnectionLimits function in wsNetwork.go, which happens later in the code path, after the rejectingLimitListener has been executed.
Q: Will this cause an issue with file descriptors?
A: It should not. There is a pool of 256 FDs reserved for short-lived operations, of which up to 10 additional FDs may be used in a short-lived fashion.
Q: Why not make the change in rejectingLimitListener.go?
A: That code runs before we know the request route, so it would require parsing HTTP payloads before the middleware can access them (see the sketch below).
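A rough illustration of that constraint, using made-up names rather than the PR's code: a connection-level listener sees only a raw net.Conn, while the route only becomes available once net/http has parsed the request and handed it to a handler or middleware.

```go
// Illustrative sketch (names are hypothetical): contrast what a
// listener-level wrapper can see with what HTTP middleware can see.
package example

import (
	"net"
	"net/http"
)

// acceptLoop works at the connection level, like a limiting listener does:
// there is only a raw net.Conn here, so no URL or headers are available
// without reading and parsing the HTTP request bytes ourselves.
func acceptLoop(l net.Listener) error {
	for {
		conn, err := l.Accept()
		if err != nil {
			return err
		}
		// Deciding "is this a /status request?" here would require a
		// hand-rolled HTTP parse before net/http ever sees the connection.
		go func() { _ = conn.Close() }() // placeholder handling
	}
}

// routeAwareMiddleware runs after net/http has parsed the request, so the
// route is known and a per-path exception (as in the request tracker) is easy.
func routeAwareMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		_ = r.URL.Path // the route is available here
		next.ServeHTTP(w, r)
	})
}
```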
Q: Will this introduce breaking changes, since some requests may be accepted and not served?
A: This changes the response mechanism for up to 10 requests that may be accepted before being dropped. The requestTracker already limits and responds with HTTP failure codes (4xx / 5xx) when a client is being throttled. Responding with a 503 did not change client behavior in my testing.
Q: Why make a change in requestTracker.go?
A: The exception path in requestTracker.go is there to enforce limits for other services. The gossip protocol currently checks connection limits, but other services would be able to service requests without the check in the request tracker.
Q: Why not use HealthServiceStatusPath?
A: That leads to a circular dependency. I would love to extract the paths across all services, if that makes sense, and put them into a route-path configuration. For now I opted for duplication to reduce the amount of changes, as this is the first change where routes are checked directly in the requestTracker. If additional route checks are needed, refactoring at that point in time would be advisable. The current PR operates with the 2 > 0 principle in mind: it's better to have an improvement, even if duplicated, than no improvement at all. A sketch of the resulting check is shown below.
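Tying the last two answers together, a minimal sketch of the idea under stated assumptions: the names (requestTracker, overLimit, statusPath) are hypothetical, the path constant is deliberately duplicated to mirror the circular-dependency concern, and this is not the PR's actual requestTracker code.

```go
// Illustrative sketch only: overLimit, statusPath and requestTracker are
// hypothetical names. It shows the shape of a per-route exception in an
// HTTP tracking layer: when the node is at capacity, everything except the
// health/status route gets a 503, while the probe is served on one of the
// reserved short-lived connections.
package tracker

import "net/http"

// statusPath is deliberately duplicated here rather than imported,
// mirroring the circular-dependency concern described above.
const statusPath = "/status"

type requestTracker struct {
	overLimit func() bool  // reports whether the node is at full capacity
	next      http.Handler // the rest of the handler chain
}

func (t *requestTracker) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	if t.overLimit() && r.URL.Path != statusPath {
		// Throttled clients already receive 4xx/5xx responses today; a 503
		// keeps that behavior for everything except the status probe.
		http.Error(w, "too many connections", http.StatusServiceUnavailable)
		return
	}
	// Health probes (and in-limit traffic) are served normally; connections
	// admitted from the reserved headroom are expected to be short-lived.
	t.next.ServeHTTP(w, r)
}
```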
Test Plan
Set up a relay server by configuring NetAddress, and set "IncomingConnectionsLimit": 0. Make requests to
Set "IncomingConnectionsLimit": 1. Make requests to