-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
grpclb: teach the manual resolver to handle restarts #6635
Conversation
balancer/grpclb/grpclb_util.go
Outdated
@@ -62,12 +62,25 @@ import ( | |||
type lbManualResolver struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh... this has its own manual resolver too? Can we reuse the one in resolver/manual
?
If not, we probably should fix that to be tolerant to idleness by setting bootstrapState
here:
grpc-go/resolver/manual/manual.go
Line 102 in 62726d4
err := r.CC.UpdateState(s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we can reuse the manual
resolver here with some changes. I will send a separate PR for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it to use the manual
resolver, but we still need this lbManualResolver
type since it provides some level of wrapping of the one particular piece of functionality that is provided here. Otherwise, we will have to move all this functionality to the top-level lbBalancer
type and also store the resolver.ClientConn
in there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it provides some level of wrapping of the one particular piece of functionality that is provided here
...which is what, specifically? Calling ResolveNow
on the balancer.ClientConn
when the ResolveNowCallback
is invoked? Does that necessitate a wrapping type though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. We don't need this type at all. Got it to work by just using the manual.Resolver
.
balancer/grpclb/grpclb_util.go
Outdated
r.lastState = s | ||
r.stateUpdateRecv = true | ||
r.stateMu.Unlock() | ||
|
||
r.ccr.UpdateState(s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... so what if the channel this is attached to went idle and then UpdateState
is called during that time? This will just be ignored? Maybe that's fine? But there's definitely a race here with leaving idleness now, with Build
being called again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ccr
here is the resolver.ClientConn
passed to the manual resolver, and is the one implemented by the resolver_conn_wrapper
on the grpclb
channel. The parent channel going IDLE should not affect any of this interaction. If the grpclb
channel went IDLE, we shouldn't have this resolver active.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the grpclb channel went IDLE, we shouldn't have this resolver active.
I'm not sure what this means. If the grpclb channel is idle (due to a bug, since we always have an active stream) then we could still try to update it while it's exiting idle, causing a race between calling the old CC's UpdateState
and the new CC being set in ccr
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have this type anymore. But thinking about the case you are mentioning, let's say the following happens:
grpclb
channel isIDLE
because of a bug.- the parent
grpclb
LB policy receives aClientConnUpdate
and it has a new set of balancer addresses, and it callsUpdateState
on the manual resolver - At the same time, the
grpclb
channel is exiting IDLE.- This will cause a new
manual.Resolver
to be built and will update theresolver.ClientConn
inside it.
- This will cause a new
- If the call to
UpdateState
happens first, then forwarding of that update to the oldresolver.ClientConn
within themanual.Resolver
will have no effect. But it will still update thelastSeenState
field. So, when thegrpclb
channel eventually exits IDLE and a newmanual.Resolver
is built, it will still send this most recentlastSeenState
to the channel. - If the
grpclb
channel exits IDLE first, we will see two updates pushed on to the channel. One with the previouslastSeenState
and one with the new state pushed by thegrpclb
LB policy.
Eventually, both are consistent. Am I still missing something?
5d8f0ae
to
9b88f99
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL
balancer/grpclb/grpclb_util.go
Outdated
@@ -62,12 +62,25 @@ import ( | |||
type lbManualResolver struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it to use the manual
resolver, but we still need this lbManualResolver
type since it provides some level of wrapping of the one particular piece of functionality that is provided here. Otherwise, we will have to move all this functionality to the top-level lbBalancer
type and also store the resolver.ClientConn
in there.
balancer/grpclb/grpclb_util.go
Outdated
r.lastState = s | ||
r.stateUpdateRecv = true | ||
r.stateMu.Unlock() | ||
|
||
r.ccr.UpdateState(s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the grpclb channel went IDLE, we shouldn't have this resolver active.
I'm not sure what this means. If the grpclb channel is idle (due to a bug, since we always have an active stream) then we could still try to update it while it's exiting idle, causing a race between calling the old CC's UpdateState
and the new CC being set in ccr
.
balancer/grpclb/grpclb_util.go
Outdated
r.ccb.ResolveNow(o) | ||
mr.BuildCallback = func(_ resolver.Target, ccr resolver.ClientConn, _ resolver.BuildOptions) { | ||
r.mu.Lock() | ||
r.ccr = ccr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this functionality (and UpdateState
) handled by the manual resolver internals? Or is there something different here that I'm missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, it is. I didn't realize it was being handled there. So, we don't this type anymore.
balancer/grpclb/grpclb_util.go
Outdated
resolver.Builder // The underlying manual resolver builder. | ||
resolver.Resolver // The underlying manual resolver. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this just embed *manual.Resolver
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have this type anymore.
balancer/grpclb/grpclb_util.go
Outdated
mr.ResolveNowCallback = func(o resolver.ResolveNowOptions) { | ||
r.ccb.ResolveNow(o) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can just do: mr.ResolveNowCallback = r.ccb.ResolveNow
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
balancer/grpclb/grpclb_util.go
Outdated
@@ -62,12 +62,25 @@ import ( | |||
type lbManualResolver struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it provides some level of wrapping of the one particular piece of functionality that is provided here
...which is what, specifically? Calling ResolveNow
on the balancer.ClientConn
when the ResolveNowCallback
is invoked? Does that necessitate a wrapping type though?
// The parent ClientConn should re-resolve when grpclb loses connection to the | ||
// remote balancer. When the ClientConn inside grpclb gets a TransientFailure, | ||
// it calls lbManualResolver.ResolveNow(), which calls parent ClientConn's | ||
// ResolveNow, and eventually results in re-resolve happening in parent | ||
// ClientConn's resolver (DNS for example). | ||
// | ||
// parent | ||
// ClientConn | ||
// +-----------------------------------------------------------------+ | ||
// | parent +---------------------------------+ | | ||
// | DNS ClientConn | grpclb | | | ||
// | resolver balancerWrapper | | | | ||
// | + + | grpclb grpclb | | | ||
// | | | | ManualResolver ClientConn | | | ||
// | | | | + + | | | ||
// | | | | | | Transient | | | ||
// | | | | | | Failure | | | ||
// | | | | | <--------- | | | | ||
// | | | <--------------- | ResolveNow | | | | ||
// | | <--------- | ResolveNow | | | | | | ||
// | | ResolveNow | | | | | | | ||
// | | | | | | | | | ||
// | + + | + + | | | ||
// | +---------------------------------+ | | ||
// +-----------------------------------------------------------------+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this not worth keeping? It still seems largely accurate for now, except s/lbManualResolver/manual.Resolver/
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I retained only the last paragraph from this comment, and added that to the place where we set the ResolveNowCallback
on the manual resolver. I feel that it conveys enough and that probably don't need this nicely drawn ASCII art. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems fine. This comment was in a pretty random place anyway.
An instance of
lbManualResolver
is created when thegrpclb
policy is built. After this, every time the LB policy receives new balancer addresses as part of its configuration, it pushes those addresses to thegrpclb
channel via this manual resolver.If the
grpclb
channel goesIDLE
, which it technically should never because it always has a stream open to the remote balancer, grpc shuts down the manual name resolver and when it exitsIDLE
, grpc recreates the name resolver. But at this point, there is no configuration update on the parentgrpclb
LB policy. So, thegrpclb
channel is stuck forever waiting for addresses.This PR caches the most recent list of addresses received in the manual resolver, and replays the same when it is recreated (when exiting IDLE).
RELEASE NOTES:
grpclb
channel goes IDLE