Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

grpclb: teach the manual resolver to handle restarts #6635

Merged
merged 3 commits into from
Sep 18, 2023

Conversation

easwars
Copy link
Contributor

@easwars easwars commented Sep 14, 2023

An instance of lbManualResolver is created when the grpclb policy is built. After this, every time the LB policy receives new balancer addresses as part of its configuration, it pushes those addresses to the grpclb channel via this manual resolver.

If the grpclb channel goes IDLE, which it technically should never because it always has a stream open to the remote balancer, grpc shuts down the manual name resolver and when it exits IDLE, grpc recreates the name resolver. But at this point, there is no configuration update on the parent grpclb LB policy. So, the grpclb channel is stuck forever waiting for addresses.

This PR caches the most recent list of addresses received in the manual resolver, and replays the same when it is recreated (when exiting IDLE).

RELEASE NOTES:

  • grpclb: fix a corner case bug when the grpclb channel goes IDLE

@easwars easwars requested a review from dfawley September 14, 2023 20:03
@easwars easwars added this to the 1.59 Release milestone Sep 14, 2023
@@ -62,12 +62,25 @@ import (
type lbManualResolver struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh... this has its own manual resolver too? Can we reuse the one in resolver/manual?

If not, we probably should fix that to be tolerant to idleness by setting bootstrapState here:

err := r.CC.UpdateState(s)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can reuse the manual resolver here with some changes. I will send a separate PR for that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it to use the manual resolver, but we still need this lbManualResolver type since it provides some level of wrapping of the one particular piece of functionality that is provided here. Otherwise, we will have to move all this functionality to the top-level lbBalancer type and also store the resolver.ClientConn in there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it provides some level of wrapping of the one particular piece of functionality that is provided here

...which is what, specifically? Calling ResolveNow on the balancer.ClientConn when the ResolveNowCallback is invoked? Does that necessitate a wrapping type though?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. We don't need this type at all. Got it to work by just using the manual.Resolver.

r.lastState = s
r.stateUpdateRecv = true
r.stateMu.Unlock()

r.ccr.UpdateState(s)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... so what if the channel this is attached to went idle and then UpdateState is called during that time? This will just be ignored? Maybe that's fine? But there's definitely a race here with leaving idleness now, with Build being called again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ccr here is the resolver.ClientConn passed to the manual resolver, and is the one implemented by the resolver_conn_wrapper on the grpclb channel. The parent channel going IDLE should not affect any of this interaction. If the grpclb channel went IDLE, we shouldn't have this resolver active.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the grpclb channel went IDLE, we shouldn't have this resolver active.

I'm not sure what this means. If the grpclb channel is idle (due to a bug, since we always have an active stream) then we could still try to update it while it's exiting idle, causing a race between calling the old CC's UpdateState and the new CC being set in ccr.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have this type anymore. But thinking about the case you are mentioning, let's say the following happens:

  • grpclb channel is IDLE because of a bug.
  • the parent grpclb LB policy receives a ClientConnUpdate and it has a new set of balancer addresses, and it calls UpdateState on the manual resolver
  • At the same time, the grpclb channel is exiting IDLE.
    • This will cause a new manual.Resolver to be built and will update the resolver.ClientConn inside it.
  • If the call to UpdateState happens first, then forwarding of that update to the old resolver.ClientConn within the manual.Resolver will have no effect. But it will still update the lastSeenState field. So, when the grpclb channel eventually exits IDLE and a new manual.Resolver is built, it will still send this most recent lastSeenState to the channel.
  • If the grpclb channel exits IDLE first, we will see two updates pushed on to the channel. One with the previous lastSeenState and one with the new state pushed by the grpclb LB policy.

Eventually, both are consistent. Am I still missing something?

@dfawley dfawley assigned easwars and unassigned dfawley Sep 14, 2023
@easwars easwars force-pushed the grpclb_manual_resolver branch from 5d8f0ae to 9b88f99 Compare September 15, 2023 17:54
Copy link
Contributor Author

@easwars easwars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL

@@ -62,12 +62,25 @@ import (
type lbManualResolver struct {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it to use the manual resolver, but we still need this lbManualResolver type since it provides some level of wrapping of the one particular piece of functionality that is provided here. Otherwise, we will have to move all this functionality to the top-level lbBalancer type and also store the resolver.ClientConn in there.

@easwars easwars assigned dfawley and unassigned easwars Sep 15, 2023
r.lastState = s
r.stateUpdateRecv = true
r.stateMu.Unlock()

r.ccr.UpdateState(s)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the grpclb channel went IDLE, we shouldn't have this resolver active.

I'm not sure what this means. If the grpclb channel is idle (due to a bug, since we always have an active stream) then we could still try to update it while it's exiting idle, causing a race between calling the old CC's UpdateState and the new CC being set in ccr.

r.ccb.ResolveNow(o)
mr.BuildCallback = func(_ resolver.Target, ccr resolver.ClientConn, _ resolver.BuildOptions) {
r.mu.Lock()
r.ccr = ccr
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this functionality (and UpdateState) handled by the manual resolver internals? Or is there something different here that I'm missing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, it is. I didn't realize it was being handled there. So, we don't this type anymore.

Comment on lines 66 to 67
resolver.Builder // The underlying manual resolver builder.
resolver.Resolver // The underlying manual resolver.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this just embed *manual.Resolver instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have this type anymore.

Comment on lines 89 to 91
mr.ResolveNowCallback = func(o resolver.ResolveNowOptions) {
r.ccb.ResolveNow(o)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just do: mr.ResolveNowCallback = r.ccb.ResolveNow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -62,12 +62,25 @@ import (
type lbManualResolver struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it provides some level of wrapping of the one particular piece of functionality that is provided here

...which is what, specifically? Calling ResolveNow on the balancer.ClientConn when the ResolveNowCallback is invoked? Does that necessitate a wrapping type though?

@dfawley dfawley assigned easwars and unassigned dfawley Sep 15, 2023
@easwars easwars assigned dfawley and unassigned easwars Sep 15, 2023
Comment on lines -30 to -54
// The parent ClientConn should re-resolve when grpclb loses connection to the
// remote balancer. When the ClientConn inside grpclb gets a TransientFailure,
// it calls lbManualResolver.ResolveNow(), which calls parent ClientConn's
// ResolveNow, and eventually results in re-resolve happening in parent
// ClientConn's resolver (DNS for example).
//
// parent
// ClientConn
// +-----------------------------------------------------------------+
// | parent +---------------------------------+ |
// | DNS ClientConn | grpclb | |
// | resolver balancerWrapper | | |
// | + + | grpclb grpclb | |
// | | | | ManualResolver ClientConn | |
// | | | | + + | |
// | | | | | | Transient | |
// | | | | | | Failure | |
// | | | | | <--------- | | |
// | | | <--------------- | ResolveNow | | |
// | | <--------- | ResolveNow | | | | |
// | | ResolveNow | | | | | |
// | | | | | | | |
// | + + | + + | |
// | +---------------------------------+ |
// +-----------------------------------------------------------------+
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this not worth keeping? It still seems largely accurate for now, except s/lbManualResolver/manual.Resolver/?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I retained only the last paragraph from this comment, and added that to the place where we set the ResolveNowCallback on the manual resolver. I feel that it conveys enough and that probably don't need this nicely drawn ASCII art. What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems fine. This comment was in a pretty random place anyway.

@dfawley dfawley assigned easwars and unassigned dfawley Sep 18, 2023
@easwars easwars merged commit 3156151 into grpc:master Sep 18, 2023
1 check passed
@easwars easwars deleted the grpclb_manual_resolver branch September 18, 2023 21:04
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants