Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: wait to close connection until incoming socket is drained (with timeout) #6977

Merged
merged 2 commits into from
Feb 12, 2024

Conversation

dfawley
Copy link
Member

@dfawley dfawley commented Feb 9, 2024

Replaces #6957

Fixes #5358 (A modified version of the test in that issue runs for 5+ minutes without failing with these changes. On a broken branch it fails reliably within 30s, and often much faster.)

See also grpc/grpc-java#9566 and golang/net@cd69bc3 / golang/go#18701

This change also adjusts the ping timer after sending the initial GOAWAY from 1 minute to 5 seconds at @ejona86's recommendation.

RELEASE NOTES:

  • server: wait to close connection until incoming socket is drained (with timeout) to prevent data loss on client-side

@dfawley dfawley added this to the 1.62 Release milestone Feb 9, 2024
Copy link

codecov bot commented Feb 9, 2024

Codecov Report

Merging #6977 (ff4b564) into master (f135e98) will increase coverage by 0.07%.
The diff coverage is 100.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6977      +/-   ##
==========================================
+ Coverage   82.38%   82.45%   +0.07%     
==========================================
  Files         296      296              
  Lines       31432    31450      +18     
==========================================
+ Hits        25896    25933      +37     
+ Misses       4477     4460      -17     
+ Partials     1059     1057       -2     
Files Coverage Δ
internal/transport/controlbuf.go 87.30% <ø> (-1.72%) ⬇️
internal/transport/http2_client.go 90.64% <100.00%> (-0.51%) ⬇️
internal/transport/http2_server.go 89.98% <100.00%> (+0.41%) ⬆️

... and 14 files with indirect coverage changes

Copy link
Member

@arvindbr8 arvindbr8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! LGTM

@arvindbr8 arvindbr8 assigned dfawley and unassigned arvindbr8 Feb 10, 2024
@dfawley dfawley merged commit 05db80f into grpc:master Feb 12, 2024
14 checks passed
@dfawley dfawley deleted the drain branch February 12, 2024 16:39
dfawley added a commit to dfawley/grpc-go that referenced this pull request Feb 12, 2024
dfawley added a commit that referenced this pull request Feb 12, 2024
@ash2k
Copy link
Contributor

ash2k commented Feb 13, 2024

Can we get a release with this fix please? Keen to try and see if it fixes our troubles.

@dfawley
Copy link
Member Author

dfawley commented Feb 13, 2024

It will be in 1.62.0 next week. I'll see if it's easy to backport to 1.61.x and push a 1.61 patch release if so.

dfawley added a commit to dfawley/grpc-go that referenced this pull request Feb 13, 2024
dfawley added a commit that referenced this pull request Feb 13, 2024
@dfawley
Copy link
Member Author

dfawley commented Feb 14, 2024

I pushed a patch release with the fix: https://github.com/grpc/grpc-go/releases/tag/v1.61.1

@ash2k
Copy link
Contributor

ash2k commented Feb 14, 2024

Thank you!

// https://github.com/grpc/grpc-go/issues/5358
select {
case <-t.readerDone:
case <-time.After(time.Second):
Copy link

@mtekeli mtekeli Feb 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that no one can stop this scheduled timer when the reader is done first (which would probably happen most of the times). I would suggest to change it to a timer object so it can be stopped manually (and therefore released) when the reader is done first.

In a scenario where this happens with many clients (or the same client) repeatedly wouldn't you just schedule many timers that cannot be stopped until they fire (i.e a temporary memory leak)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True that it isn't stopped, but it is fixed for 1 second, so I don't believe this could be a real problem for anyone.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even when many connections having non IO errors? Do you think it's unlikely or even not possible?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the key here is that it's a non-I/O error, which means it would be initiated by the server itself. If the server is terminating already-handshaked connections at a rate of >1 per second, then that seems like a real problem.

That said, I think it's OK to just forbid the use of time.After in our repo (outside of tests) since it's always a potential concern. I'll add a vet.sh check for this and change it on master.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 14, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

gRPC connection closing with unexpected EOF on high-latency connections
5 participants