Keep LB policy alive during high freq of resolver updates #10645
Conversation
Comments are all minor.
GPR_ASSERT(wc_arg->wrapped_closure != NULL);
GPR_ASSERT(wc_arg->lb_policy != NULL);
GPR_ASSERT(wc_arg->free_when_done != NULL);
grpc_closure_sched(exec_ctx, wc_arg->wrapped_closure, GRPC_ERROR_REF(error));
I think we can use grpc_closure_run() here.
Done.
grpc_lb_policy *lb_policy;
/* heap memory to be freed upon closure execution. Usually this arg. */
void *free_when_done;
Given that we're always setting this to the wrapped_on_pick_closure_arg struct, we probably don't need it. Instead, we can just have the callback unconditionally free the arg.
Done.
/* mask= */ GRPC_INITIAL_METADATA_WAIT_FOR_READY,
/* check= */ 0, GRPC_ERROR_REF(error));
if (chand->lb_policy != NULL) {
  if (state == GRPC_CHANNEL_TRANSIENT_FAILURE) {
Let's put a TODO here about improving how we handle failure cases. In particular:
- At the moment, I suspect it's not possible for any LB policy to actually return TRANSIENT_FAILURE. We should probably fix that, which I think will require changing round_robin to be able to proactively unref and recreate subchannels for failed connections.
- If we are replacing the LB policy, we should ideally re-submit any pending picks to the new LB policy instead of failing them directly.
Done.
Trigger:
Resolvers that provide very frequent updates, O(ms), cause LB policies to be recreated at that same rate.
Current state:
An older policy may have received pick requests. With no time to make progress on them, the application-provided callbacks sit in the policy's "pending picks" list. When the new policy arrives shortly thereafter, the old policy's single reference is unref'd, triggering the policy's shutdown. As part of that shutdown, the pending picks' application callbacks are invoked with an error.
Solution:
A pending application pick should take a reference on the policy it is requesting the pick from; the corresponding unref happens right before the application callback is scheduled. In other words, an application's pending pick must keep its associated LB policy alive.