Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Server stall fix #4453

Merged
merged 10 commits into from
Dec 16, 2015
Merged

Server stall fix #4453

merged 10 commits into from
Dec 16, 2015

Conversation

ctiller
Copy link
Member

@ctiller ctiller commented Dec 14, 2015

Built on #4451, #4452 which should be merged first

@sreecha - if I don't get the above merged before you're gone, I'll find another reviewer

exec_ctx, &stream_global->recv_initial_metadata_finished, 0);
grpc_chttp2_complete_closure_step(
exec_ctx, &stream_global->recv_trailing_metadata_finished, 0);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should recv_message_ready closure also be queued here ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. This is built on top of #4451 - and that's what was required
for those tests. I'm going to start merging your fixes into this and see if
I can come up with a grand unified PR that addresses all the lost callback
bugs.

On Mon, Dec 14, 2015 at 2:02 PM Sree Kuchibhotla notifications@github.com
wrote:

In src/core/transport/chttp2_transport.c
#4453 (comment):

@@ -748,6 +753,21 @@ void grpc_chttp2_complete_closure_step(grpc_exec_ctx *exec_ctx,
*pclosure = NULL;
}

+static void fail_all_outstanding_ops(

  • grpc_exec_ctx *exec_ctx, grpc_chttp2_transport_global *transport_global,
  • grpc_chttp2_stream_global *stream_global) {
  • grpc_chttp2_complete_closure_step(
  •  exec_ctx, &stream_global->send_initial_metadata_finished, 0);
    
  • grpc_chttp2_complete_closure_step(
  •  exec_ctx, &stream_global->send_trailing_metadata_finished, 0);
    
  • grpc_chttp2_complete_closure_step(exec_ctx,
  •                                &stream_global->send_message_finished, 0);
    
  • grpc_chttp2_complete_closure_step(
  •  exec_ctx, &stream_global->recv_initial_metadata_finished, 0);
    
  • grpc_chttp2_complete_closure_step(
  •  exec_ctx, &stream_global->recv_trailing_metadata_finished, 0);
    
    +}

Should recv_message_ready closure also be queued here ?


Reply to this email directly or view it on GitHub
https://github.com/grpc/grpc/pull/4453/files#r47565257.

@ctiller ctiller changed the title Server stall reproduction Server stall fix Dec 14, 2015
@ctiller
Copy link
Member Author

ctiller commented Dec 14, 2015

It'd be worth @sreecha looking at 38edec6 however.

if (from_parsing_thread) {
gpr_mu_lock(&bs->transport->mu);
}
grpc_exec_ctx_enqueue(exec_ctx, bs->on_next, 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it guaranteed that bs->on_next would be non-null here ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

grpc_exec_ctx_enqueue accepts a null closure (and does nothing) -- since
it's awfully convenient.

On Mon, Dec 14, 2015 at 3:36 PM Sree Kuchibhotla notifications@github.com
wrote:

In src/core/transport/chttp2_transport.c
#4453 (comment):

@@ -1513,7 +1552,29 @@ void grpc_chttp2_incoming_byte_stream_push(grpc_exec_ctx *exec_ctx,
}

void grpc_chttp2_incoming_byte_stream_finished(

  • grpc_exec_ctx *exec_ctx, grpc_chttp2_incoming_byte_stream *bs) {
  • grpc_exec_ctx *exec_ctx, grpc_chttp2_incoming_byte_stream *bs, int success,
  • int from_parsing_thread) {
  • if (!success) {
  • if (from_parsing_thread) {
  •  gpr_mu_lock(&bs->transport->mu);
    
  • }
  • grpc_exec_ctx_enqueue(exec_ctx, bs->on_next, 0);

Is it guaranteed that bs->on_next would be non-null here ?


Reply to this email directly or view it on GitHub
https://github.com/grpc/grpc/pull/4453/files#r47576693.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

@sreecha
Copy link
Contributor

sreecha commented Dec 14, 2015

Looks good but I do not see this addressing the other reason for server stall i.e trying to execute pending requests on a closed transport. i.e in queue_call_request() function

static grpc_call_error queue_call_request(...) {
...
} else {
 GPR_ASSERT(calld->state == PENDING);
        calld->state = ACTIVATED;
        gpr_mu_unlock(&calld->mu_state);
        begin_call(exec_ctx, server, calld,
                   &server->requested_calls[request_id]);

I know that this will eventually call perform_stream_op_locked() function in chttp2_transport.c but that function checks for contains_non_ok_status() in only two cases (i.e op->sending_initial_metadata or op->send_trailing_metadata) and in this particular case, i believe they both would be set to NULL and its the op->recv_message that is non-null.

How will it handle this case ?

@ctiller
Copy link
Member Author

ctiller commented Dec 15, 2015

Accidentally I think this is covered (although I'd prefer it be called out better: perhaps future work).

We either explicity check read_closed or write_closed and early out the completion, or add the stream to the check_read_ops list... and then do the verification there.

@ctiller
Copy link
Member Author

ctiller commented Dec 15, 2015

Such green.

@sreecha
Copy link
Contributor

sreecha commented Dec 15, 2015

The changes LGTM. Btw, do we also need to make changes at the C++ level ? https://github.com/grpc/grpc/pull/4450/files#diff-7b669dd2948b796b465aa95ec1c42962

(when I checked in my testing mrd->Request(server_, cq_.cq()); was not called if ok == false after auto* mrd = SyncRequest::Wait(&cq_, &ok); (which would be the case here).

@ctiller
Copy link
Member Author

ctiller commented Dec 15, 2015

I'm not sure honestly. I think I'd like to get a test up for this for C++ also.

@sreecha
Copy link
Contributor

sreecha commented Dec 16, 2015

The interop test failure seems unrelated to this change. It fails even for other pull requests. This change can be merged.
I did not get a chance to test this today (but the instructions are here: #4179 ) although I think unless we make changes in C++ code, we will continue seeing this error..

sreecha added a commit that referenced this pull request Dec 16, 2015
@sreecha sreecha merged commit e5cdbea into grpc:master Dec 16, 2015
@ctiller
Copy link
Member Author

ctiller commented Dec 16, 2015

I'm going to experiment if I get a chance in the next few days... I expect
you're right, but want to prove it to myself first.

On Tue, Dec 15, 2015, 6:23 PM Sree Kuchibhotla notifications@github.com
wrote:

Merged #4453 #4453.


Reply to this email directly or view it on GitHub
#4453 (comment).

@sreecha
Copy link
Contributor

sreecha commented Dec 16, 2015

Sounds good Craig!.. thanks for taking care of this and see you next year!

@lock lock bot locked as resolved and limited conversation to collaborators Jan 29, 2019
@lock lock bot unassigned sreecha Jan 29, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants