
Execution context sanitizer #5474

Merged
merged 7 commits on Mar 16, 2016

Conversation

ctiller
Member

@ctiller ctiller commented Feb 29, 2016

Shows up a bunch of bugs. C tests currently suffer from some dodgy pollset usage; C++ tests are gtg though it seems.

Fixes #5432.

Basic approach: for each grpc_exec_ctx_enqueue, spawn a new thread to execute the associated closure.
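A rough sketch of that approach, for readers following along (simplified types: `closure` and `enqueue_on_new_thread` here are illustrative stand-ins, not the PR's actual code):

```c
#include <pthread.h>
#include <stdlib.h>

/* Stand-in for a grpc closure: a callback plus its argument and status. */
typedef struct {
  void (*cb)(void *arg, int success);
  void *arg;
  int success;
} closure;

static void *run_closure(void *p) {
  closure *c = p;
  c->cb(c->arg, c->success); /* execute the closure on its own thread */
  free(c);
  return NULL;
}

/* Instead of queuing the closure to run later on the current thread,
   spawn a fresh thread for it, so any hidden ordering assumption
   between caller and callback becomes a visible race. */
static pthread_t enqueue_on_new_thread(void (*cb)(void *, int), void *arg,
                                       int success) {
  closure *c = malloc(sizeof(*c));
  c->cb = cb;
  c->arg = arg;
  c->success = success;
  pthread_t tid;
  pthread_create(&tid, NULL, run_closure, c);
  return tid;
}
```

The point of the sanitizer is exactly this decoupling: code that silently relied on the closure running on the enqueuing thread (or after some later event) now fails loudly under tsan/asan.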

Configs made available: etsan, easan, and edbg - extending the associated previous configs with this new internal thread scheme.

@ctiller ctiller changed the title ESAN prototype Execution context sanitizer Feb 29, 2016
@vjpai
Member

vjpai commented Feb 29, 2016

Very nice! I would definitely like to kick the tires on this.

@vjpai
Member

vjpai commented Feb 29, 2016

I just want to clarify: Issue #5432 suggested adding a random delay. This PR achieves some amount of decoupling and randomization by using threads for each callback but doesn't explicitly delay events. I don't know if that additional step is worth it, but it may be worth considering as well if the current paths aren't enough to shake out the sorts of timing assumptions. But that's yet another knob to turn (probably 2 knobs - one to set the frequency of perturbation and one to set a delay parameter).

@ctiller
Member Author

ctiller commented Feb 29, 2016

Yeah. I did an experiment adding a usleep(rand()) to the spawned thread and
didn't see any new failures. It's probably worth trying again once the
noise level from this is reduced.


@ctiller
Member Author

ctiller commented Feb 29, 2016

Also: one way to think about this is that the kernel scheduler adds some
level of random perturbation :)


@ctiller
Member Author

ctiller commented Feb 29, 2016

Also also: it'd be good to get this in pre-code freeze I suspect, as it'll
be useful during the week as we nail down some of the failures we've had.


@vjpai
Member

vjpai commented Feb 29, 2016

Yes, I was thinking that also wrt code freeze.


@vjpai
Member

vjpai commented Feb 29, 2016

LGTM for both code and results. The number of failures that this yields is beautiful. This is showing up races that I'm pretty sure we've never seen before. We need this in. If you do the clang-format and re-merge with master, I'll do a final review and merge before the freeze.

@vjpai
Member

vjpai commented Feb 29, 2016

Wait, one second. The configuration edbg sees an error. Looks like a clang vs gcc thing:

test/core/util/port_posix.c: In function ‘free_port_using_server’:
test/core/util/port_posix.c:98:3: error: missing initializer for field ‘unused’ of ‘grpc_exec_ctx’ [-Werror=missing-field-initializers]
grpc_exec_ctx exec_ctx = GRPC_EXEC_CTX_INIT;
^
In file included from ./src/core/iomgr/pollset.h:41:0,
from ./src/core/iomgr/endpoint.h:37,
from ./src/core/httpcli/httpcli.h:41,
from test/core/util/port_posix.c:53:
./src/core/iomgr/exec_ctx.h:71:28: note: ‘unused’ declared here
struct grpc_exec_ctx { int unused; };

@ctiller
Member Author

ctiller commented Mar 1, 2016

I'm unlikely to get to a keyboard before 8-9pm. It'll either be a late
merge or a resource we can patch in to help during the week.


@vjpai
Member

vjpai commented Mar 1, 2016

"I can send you a PR on your PR" - vjpai. Nope. I don't have write access to your repo, but the change is simple; just fill in the number 0 between the open and close braces of GRPC_EXEC_CTX_INIT and all is well.
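A reduced reproduction of that fix (simplified names; the real definitions live in ./src/core/iomgr/exec_ctx.h): gcc warns on empty braces when the struct has fields, and -Werror=missing-field-initializers promotes that to an error, while clang accepts the empty initializer.

```c
/* Simplified stand-in for the struct in exec_ctx.h. */
struct exec_ctx_like {
  int unused;
};

/* An empty initializer {} trips gcc's -Werror=missing-field-initializers;
   spelling out the 0 satisfies both gcc and clang. */
#define EXEC_CTX_INIT \
  { 0 }
```

Usage then matches the failing line in port_posix.c: `struct exec_ctx_like ctx = EXEC_CTX_INIT;`.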

@ctiller
Member Author

ctiller commented Mar 1, 2016

You can still send a PR, but let's hold off on submitting this for this week. It's easy to patch in for investigative purposes, and I'd like to scrub some of the lower-hanging fruit out first.

@vjpai
Member

vjpai commented Mar 1, 2016

That's fine, but I will keep using it to find flakes!

@vjpai vjpai closed this Mar 1, 2016
@vjpai vjpai reopened this Mar 1, 2016
@vjpai
Member

vjpai commented Mar 10, 2016

LGTM

@jtattermusch
Contributor

Suspiciously, a lot of tests have failed on mac opt with this build:
tests.bins/opt/h2_full_nosec_test graceful_server_shutdown
tests.bins/opt/h2_sockpair+trace_test cancel_after_invoke
tests.bins/opt/h2_compress_nosec_test hpack_size
tests.bins/opt/h2_sockpair+trace_test compressed_payload
tests.bins/opt/h2_sockpair+trace_test cancel_before_invoke
tests.bins/opt/h2_sockpair+trace_test graceful_server_shutdown
tests.bins/opt/h2_sockpair+trace_nosec_test cancel_after_invoke

https://grpc-testing.appspot.com/job/gRPC_pull_requests/6846/

Let's take another look.

@vjpai
Member

vjpai commented Mar 15, 2016

LGTM.
FWIW, things are much greener now. Only tsan qps is bad, and that is infrastructure related.

@ctiller
Member Author

ctiller commented Mar 16, 2016

@jtattermusch @nicolasnoble - ready to merge?

jtattermusch added a commit that referenced this pull request Mar 16, 2016
Execution context sanitizer
@jtattermusch jtattermusch merged commit f4f57f0 into grpc:master Mar 16, 2016
@lock lock bot locked as resolved and limited conversation to collaborators Jan 28, 2019