
Add flaky test mitigation #261

Merged: 6 commits into onsi:master on Jul 16, 2016
Conversation

@lavalamp (Contributor)

Implements #260

@tedsuo (Contributor) commented Jun 28, 2016

I see no integration tests for this feature, and I suspect that these new flags are not actually propagated. Can you add a test to integration/flags_test to prove that this works?

@lavalamp (Contributor, Author)

OK, I'll add a test asap.

@lavalamp changed the title from "WIP: add flaky test mitigation" to "Add flaky test mitigation" on Jun 29, 2016
@lavalamp (Contributor, Author)

Integration tests added. @tedsuo was there some particular thing you were concerned about? It passed as-is.

@tedsuo (Contributor) commented Jun 29, 2016

At a glance, it looked like these flags would have to propagate through BuildFlagArgs in order to work, at least in some cases: https://github.com/lavalamp/ginkgo/blob/cd831bb9a6e6a46115ca953b13b4a5b61d2c2ca1/config/config.go#L100
https://github.com/onsi/ginkgo/blob/master/ginkgo/testrunner/test_runner.go#L226

Thanks for the integration tests. I'll manually review tomorrow.
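
For context, the propagation in question: any new field added to GinkgoConfigType must also be emitted by config.BuildFlagArgs, or the ginkgo CLI will never forward it to the compiled test binary. A minimal sketch, trimmed to the one flag this PR adds (the flakeAttempts name mirrors the PR; everything else is simplified for illustration):

```go
package config

import "fmt"

// GinkgoConfigType is trimmed to the one field relevant here; the real
// struct in ginkgo's config package carries many more fields.
type GinkgoConfigType struct {
	FlakeAttempts int
}

// BuildFlagArgs is a simplified stand-in for the real function, which also
// takes reporter config and emits every other configured flag.
func BuildFlagArgs(prefix string, ginkgo GinkgoConfigType) []string {
	result := []string{}
	// ... existing flags appended here ...
	// Forward the new flag to the compiled test binary when it is set;
	// without this, the flag is parsed by the CLI but silently dropped.
	if ginkgo.FlakeAttempts > 1 {
		result = append(result, fmt.Sprintf("--%sflakeAttempts=%d", prefix, ginkgo.FlakeAttempts))
	}
	return result
}
```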

@lavalamp (Contributor, Author)

Ah, I didn't see that section. I will add them there tomorrow; that looks important.


@lavalamp (Contributor, Author)

@tedsuo I've added to that function and filed #263.

@lavalamp (Contributor, Author)

@tedsuo sorry to be pushy-- any more comments?

@onsi (Owner) commented Jul 1, 2016

hi @lavalamp - sorry for the delay.

I'll commit to merging this in, but some comments:

  • Let's remove FlakePassesRequired. I'm not completely convinced of the use case and would prefer to see it emerge from real-world usage.
  • I think we're missing some messaging to convey "this test has failed and is being retried". Doing this well may entail changing the spec summary object, or possibly the reporter interface itself. As it stands, I'm concerned folks will simply see the test suite pass while some tests look like they've failed. I won't gate the PR on this, but if you could spend a few minutes looking at the reporter/stenographer and consider whether there's a simple way to convey this context, I'd appreciate it.

@lavalamp (Contributor, Author) commented Jul 2, 2016

remove FlakePassesRequired

Done-- I was thinking I might want to require more than a 50% pass rate but we can cross that bridge when it becomes an actual problem.

changing the spec summary object or possibly the reporter interface

I've run out of steam tonight but I'll try to take a look at this soon.

@lavalamp (Contributor, Author) commented Jul 2, 2016

I guess the obvious thing to do is add a "NumberOfFlakySpecs" to the suite summary and let the reporters handle that as appropriate.

The stenographer interface is harder, since a "flake" is really N failures followed by a pass for the same spec. I'll think about this more tomorrow.

At the very least I can add some documentation warning users about the behavior of this flag (pushed just now).
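
As a rough sketch of that first idea (the field name follows the comment above; the exact shape that merges may differ):

```go
package types

// SuiteSummary, trimmed to the counters relevant to this discussion.
// NumberOfFlakySpecs would count specs that failed at least once, were
// retried, and ultimately passed.
type SuiteSummary struct {
	NumberOfTotalSpecs  int
	NumberOfPassedSpecs int
	NumberOfFailedSpecs int
	NumberOfFlakySpecs  int
}
```

A reporter could then print a separate flaked count in its summary line whenever NumberOfFlakySpecs is nonzero, addressing the concern that the suite passes while some specs look failed.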

@lavalamp (Contributor, Author) commented Jul 6, 2016

I added the flake concept to the suite summary and to the spec object.

Please let me know if you think this is good, or if I should do something like run the test N times and then report the result a single time.

@lavalamp (Contributor, Author) commented Jul 9, 2016

@onsi anything I can do to move this along?

@onsi (Owner) commented Jul 11, 2016

Hey @lavalamp, bear with me. We're exploring a couple of different patterns for doing this. I have a commit on a branch:

https://github.com/onsi/ginkgo/tree/runlists

That branch adds the ability to emit failed tests to a file and then run only the tests in that file. Details:

ginkgo --writeFailuresFile=path/to/file.json

This will write to file.json if and only if there are test failures. It exports a list of failed tests detailing each test's name, its coordinates, and the failure reason. If nothing fails, no file is written; however, if a file with that name already exists, it will not be removed (should it be?).

ginkgo --runFailuresFile=path/to/file.json

This will run only the tests listed in file.json. If file.json doesn't exist or is malformed, ginkgo panics (so it's on you to write the code that asks "is this file here? should I rerun the tests?"). I think passing both flags in will work, so you can rerun until there are no failures.

We're going to try this interface IRL and weigh it against your approach. It requires a bit more work on the test runner's side (e.g. a bash script that reruns ginkgo until it exits 0), but it leaves a better trail of breadcrumbs and will be harder for folks to abuse.
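
For illustration only, the rerun loop described above could live in a small Go wrapper instead of a bash script. This is a hypothetical sketch: the flag names come from the runlists branch and may change, and the retry cap is arbitrary.

```go
package main

import (
	"log"
	"os"
	"os/exec"
)

func main() {
	const failuresFile = "failures.json"
	// First pass runs the whole suite and records any failures.
	args := []string{"--writeFailuresFile=" + failuresFile}
	for attempt := 1; attempt <= 3; attempt++ {
		cmd := exec.Command("ginkgo", args...)
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err := cmd.Run(); err == nil {
			return // ginkgo exited 0: suite passed
		}
		// Subsequent passes rerun only the recorded failures, rewriting
		// the file so the set shrinks as tests start passing.
		args = []string{
			"--writeFailuresFile=" + failuresFile,
			"--runFailuresFile=" + failuresFile,
		}
	}
	log.Fatal("suite still failing after 3 attempts")
}
```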

@lavalamp (Contributor, Author)

That approach will call BeforeSuite/AfterSuite multiple times. Also, putting logic in bash is unfortunate. Maybe I can look into using the logic in this change but emitting a second JUnit file for reruns?


@tedsuo (Contributor) commented Jul 11, 2016

FWIW, there are ways to wed the two approaches without any extra mechanisms:

ginkgo --writeReport=path/to/file.json
ginkgo --writeReport=path/to/file.json --retryFailures=2
ginkgo --retryReport=path/to/file.json --retryFailures=1

ginkgo --writeReport=path/to/file.json [[--retryFailures=0]]

writeReport writes a Ginkgo Report File just before calling AfterSuite. It contains test telemetry. If ginkgo exits with success, the report contains details of a successful pass. If any tests fail, the report records the reason and trace for each failure.

If retryFailures is set to a positive integer, failures in the report are then consumed.

ginkgo --retryReport=path/to/file.json [[--retryFailures=0]]

retryReport loads a Ginkgo Report File. If any failures were recorded, BeforeSuite is called, and the recorded failures are rerun according to the behavior of retryFailures.

@lavalamp (Contributor, Author)

I guess I don't understand what that gains us-- can't I just do that now by parsing the JUnit file and rerunning the failures?

@onsi (Owner) commented Jul 16, 2016

Hey @lavalamp, I'm going to pull this in. Do you mind writing up some docs for the gh-pages branch when you have a moment?

@onsi merged commit 7d32401 into onsi:master on Jul 16, 2016
@lavalamp (Contributor, Author)

Thanks! I will send you some docs.

k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this pull request on Jul 18, 2016:
Automatic merge from submit-queue

Detect flakes in PR builder e2e runs

Won't be mergeable until onsi/ginkgo#261 is agreed upon and merged.

Tossing a PR here to get the e2e test to run on it.