Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

grpc: disable and document overrides of OS default TCP keepalive by Go #6672

Merged
merged 7 commits into from
Nov 7, 2023
2 changes: 1 addition & 1 deletion benchmark/benchmain/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -373,7 +373,7 @@ func makeClients(bf stats.Features) ([]testgrpc.BenchmarkServiceClient, func())
logger.Fatalf("Failed to listen: %v", err)
}
opts = append(opts, grpc.WithContextDialer(func(ctx context.Context, address string) (net.Conn, error) {
return nw.ContextDialer((&net.Dialer{}).DialContext)(ctx, "tcp", lis.Addr().String())
return nw.ContextDialer((&net.Dialer{KeepAlive: time.Duration(-1)}).DialContext)(ctx, "tcp", lis.Addr().String())
}))
}
lis = nw.Listener(lis)
Expand Down
9 changes: 9 additions & 0 deletions dialoptions.go
Original file line number Diff line number Diff line change
Expand Up @@ -413,6 +413,15 @@ func WithTimeout(d time.Duration) DialOption {
// connections. If FailOnNonTempDialError() is set to true, and an error is
// returned by f, gRPC checks the error's Temporary() method to decide if it
// should try to reconnect to the network address.
//
// Note: As of Go 1.21, the standard library overrides the OS defaults for
// TCP keepalive time and interval to 15s.
// To retain OS defaults, use a net.Dialer with the KeepAlive field set to a
// negative value.
//
// For more information, please see [issue 23459] in the Go github repo.
//
// [issue 23459]: https://github.com/golang/go/issues/23459
arvindbr8 marked this conversation as resolved.
Show resolved Hide resolved
func WithContextDialer(f func(context.Context, string) (net.Conn, error)) DialOption {
return newFuncDialOption(func(o *dialOptions) {
o.copts.Dialer = f
Expand Down
4 changes: 3 additions & 1 deletion internal/transport/http2_client.go
Original file line number Diff line number Diff line change
Expand Up @@ -176,7 +176,9 @@ func dial(ctx context.Context, fn func(context.Context, string) (net.Conn, error
if networkType == "tcp" && useProxy {
return proxyDial(ctx, address, grpcUA)
}
return (&net.Dialer{}).DialContext(ctx, networkType, address)
// KeepAlive is set to a negative value to prevent Go's override of the TCP
// keepalive time and interval; retain the OS default.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this comment is technically correct, I'm not sure this is what you intended and I think the comment could be more clear: the default on linux is to disable keepalives unless opted in via SO_KEEPALIVE, which then uses the os-level keepalive configuration.

At least this is different from Java and what Eric proposed in #6250 (comment), which would be to set SO_KEEPALIVE to true unconditionally. This could be done by adding something like:

		Control: func(network, address string, c syscall.RawConn) error {
			return c.Control(func(fd uintptr) {
				syscall.SetsockoptInt(int(fd), syscall.SOL_SOCKET, syscall.SO_KEEPALIVE, 1)
			})
		},

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the default on linux is to disable keepalives unless opted in via SO_KEEPALIVE, which then uses the os-level keepalive configuration.

Hmm, I believe you are right. We were under the impression that keepalives were always enabled and that a negative value would not disable them, but would use the OS defaults for the timers. However, this doesn't call setKeepAlive when the value is negative:

https://go.dev/src/net/tcpsock.go#L238

And they set SO_KEEPALIVE explicitly for posix systems in setKeepAlive:

https://go.dev/src/net/sockopt_posix.go#L116

So presumably not doing that means keepalives will be disabled otherwise.

Copy link
Collaborator

@atollena atollena Nov 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of unconditionally enabling keepalives via Dialer control then?

(3) is something that Java's been doing forever, and still do independent of (1). Even with (1), (3) is still useful as some people configure their OS to use more aggressive settings for their specific environment (e.g., AWS). For those environments (3) behaves well without any extra work from users. And if you aren't in such an environment, then (3) is very low or zero cost.

I think we should try to do that before the release.

edit: nevermind, just saw your comment on #6250 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of unconditionally enabling keepalives via Dialer control then?

A little bit more thought is required I guess. We do set the Dialer.Control to a different function when running inside of the Google production network to set certain socket options that are appropriate for that environment. Unconditionally enabling TCP keepalives using the Dialer.Control here could still work (if we enhance the above function to also do the same). Thoughts @dfawley ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it appears that the proposal mentioned here has made significant progress and has a pending change. This would allow us to disable TCP keepalive or use OS defaults with a simple API. We would have to wait for a new Go version though, to be able to use this and would be a while until that Go version becomes the least supported Go version for grpc.

return (&net.Dialer{KeepAlive: time.Duration(-1)}).DialContext(ctx, networkType, address)
JaydenTeoh marked this conversation as resolved.
Show resolved Hide resolved
JaydenTeoh marked this conversation as resolved.
Show resolved Hide resolved
}

func isTemporary(err error) bool {
Expand Down
3 changes: 2 additions & 1 deletion internal/transport/proxy.go
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,7 @@ import (
"net/http"
"net/http/httputil"
"net/url"
"time"
)

const proxyAuthHeaderKey = "Proxy-Authorization"
Expand Down Expand Up @@ -122,7 +123,7 @@ func proxyDial(ctx context.Context, addr string, grpcUA string) (conn net.Conn,
newAddr = proxyURL.Host
}

conn, err = (&net.Dialer{}).DialContext(ctx, "tcp", newAddr)
conn, err = (&net.Dialer{KeepAlive: time.Duration(-1)}).DialContext(ctx, "tcp", newAddr)
if err != nil {
return
}
Expand Down
9 changes: 9 additions & 0 deletions server.go
Original file line number Diff line number Diff line change
Expand Up @@ -813,6 +813,15 @@ func (l *listenSocket) Close() error {
// Serve returns when lis.Accept fails with fatal errors. lis will be closed when
// this method returns.
// Serve will return a non-nil error unless Stop or GracefulStop is called.
//
// Note: As of Go 1.21, the standard library overrides the OS defaults for
// TCP keepalive time and interval to 15s.
// To retain OS defaults, pass a net.Listener created by calling the Listen method
// on a net.ListenConfig with the `KeepAlive` field set to a negative value.
//
// For more information, please see [issue 23459] in the Go github repo.
//
// [issue 23459]: https://github.com/golang/go/issues/23459
arvindbr8 marked this conversation as resolved.
Show resolved Hide resolved
func (s *Server) Serve(lis net.Listener) error {
s.mu.Lock()
s.printf("serving")
Expand Down
Loading