Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add automatic logging of e2e flakes, and automatic restarts #10654

Closed
wants to merge 58 commits into from

Conversation

nithusha21
Copy link
Contributor

Overview

  1. This PR fixes or fixes part of N/A
  2. This PR does the following: We track all e2e flakes in a sheet. This PR edits the e2e tests runner to interact with the sheet to automatically detect if a failure is a flake. If so, it will automatically log that in the sheet, and restart the test upto 3 runs. This helps with better management and prioritization of the list of flakes, and automatically restarts, making the flakes less annoying for contributors.

Essential Checklist

  • The PR title starts with "Fix #bugnum: ", followed by a short, clear summary of the changes. (If this PR fixes part of an issue, prefix the title with "Fix part of #bugnum: ...".)
  • The linter/Karma presubmit checks have passed locally on your machine.
  • "Allow edits from maintainers" is checked. (See here for instructions on how to enable it.)
    • This lets reviewers restart your CircleCI tests for you.
  • The PR is made from a branch that's not called "develop".

PR Pointers

  • Make sure to follow the instructions for making a code change.
  • Oppiabot will notify you when you don't add a PR_CHANGELOG label. If you are unable to do so, please @-mention a code owner (who will be in the Reviewers list), or ask on Gitter.
  • For what code owners will expect, see the Code Owner's wiki page.
  • Make sure your PR follows conventions in the style guide, otherwise this will lead to review delays.
  • Never force push. If you do, your PR will be closed.

@oppiabot
Copy link

oppiabot bot commented Sep 12, 2020

Hi, @nithusha21, this pull request does not have a "CHANGELOG: ..." label as mentioned in the PR checkbox list. Assigning @nithusha21to add the required label. PRs without this label will not be merged. If you are unsure of which label to add, please ask the reviewers for guidance. Thanks!

@kevintab95
Copy link
Member

Thanks @nithusha21! I don't think my review is required here -- so unassigning myself.

@kevintab95 kevintab95 removed their assignment Sep 13, 2020
@U8NWXD
Copy link
Member

U8NWXD commented Sep 14, 2020

@nithusha21 since we still have to fix the backend tests, can this be made a draft?

We need to import `google.oauth2` after we import `python_utils` because
importing `python_utils` sets `sys.path` as needed for `google.oauth2`.
This commit replicates changes from oppia#11018 that fix the linter to
recognize google.oauth2 as a third-party library.
@U8NWXD
Copy link
Member

U8NWXD commented Oct 22, 2020

If you need this to get in soonish, maybe just cherrypick or redo the changes from #11018 here?

There'll be merge conflicts down the road, but if getting this in is worth it, then that's one way to speed things up.

done in 24adaff

Copy link
Contributor

@Hudda Hudda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Hudda Hudda removed their assignment Oct 22, 2020
Copy link
Member

@DubeySandeep DubeySandeep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a pass and left a few comments. Most of the comments are related to code structure, if required we can resolve that comment after getting this PR merged as this PR will help every other contributors. Let me know if you have any question/suggestion. Thanks!

app_dev.yaml Outdated
Comment on lines 303 to 304
- third_party/python_libs/google/oauth2/
- third_party/python_libs/google/api_core/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are adding 4 new lib but removing 2 dirs here? Is that expected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. The path was incorrect.

requirements.in Outdated
@@ -27,3 +27,7 @@ requests-toolbelt==0.9.1
google-cloud-tasks==1.5.0
googleapis_common_protos==1.52.0
webapp2==3.0.0b1
google-auth==1.21.1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: Nit: Should we add this in sorted order* (Though it's only sorted till line 22)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -30,10 +33,12 @@ graphy==1.0.0 # via -r requirements.in, googleappenginemapreduce
grpc-google-iam-v1==0.12.3 # via google-cloud-tasks
grpcio==1.32.0 # via google-api-core, googleapis-common-protos, grpc-google-iam-v1
html5lib==1.0.1 # via -r requirements.in
httplib2==0.18.1 # via google-api-python-client, google-auth-httplib2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also remove this from the build container? (i.e, add this in -ski-file list of app.yaml)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

idna==2.10 # via requests
mock==3.0.5 # via googleappenginemapreduce
mox==0.5.3 # via googleappenginemapreduce
mutagen==1.43.0 # via -r requirements.in
oauthlib==3.1.0 # via requests-oauthlib
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto here and elsewhere*

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

os.getcwd(), 'third_party', 'python_libs')
sys.path.insert(0, _PYTHON_LIBS_PATH)

import googleapiclient.discovery # isort:skip pylint: disable=wrong-import-position, wrong-import-order
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should move the google spreadsheet related code to a different file? (We can do it later if required but make sure to create an issue for the same!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created #11038 to resolve all the below comments as I don't have time right now to perform this refactor. I will do that when I am back.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for creating ths issue, I'm working on it and I'll create a separate PR.

"""
sheet_id = os.getenv('FLAKY_E2E_TEST_SHEET_ID')
flaky_tests_list = []
if sheet_id is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if sheet_id is None:
    return [] 

restult = ....

[To avoid lots of code inside if branch!]


import googleapiclient.discovery # isort:skip pylint: disable=wrong-import-position, wrong-import-order

MAXIMUM_RUNS = 3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe MAX_RETRY_COUNT or something around that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


flaky_tests_list = []
google_auth_decode_password = os.getenv('GOOGLE_AUTH_DECODE_PASSWORD')
if google_auth_decode_password is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move the code 647-660 to a function get_flkay_test_data_sheet?

"""Run tests, rerunning at most MAXIMUM_RUNS times if they flake."""
for _ in python_utils.RANGE(MAXIMUM_RUNS):
flake_state = run_tests(args=args)
if flake_state != 'flake':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we structure it this way:

get_flkay_test_data_sheet:
    data = get_flkay_test_data_sheet()
    return get_flaky_tests_data_from_sheets(data)

is_test_output_flaky(output):
    logs = get_flkay_test_logs()
    for log in logs:
         if is_output_similar_to_log(output, log)
             return True
    return False

main:
    return_code = 1
    Loop max_tries:
         output, return_code = run_test()
         if is_test_output_faky(output):
             update_falkey_test_count() [OR do it this inside the is_test_output_flkay() call]
             print 'The test result looks flaky, rerunning the tests (%<attempt number>)'
         else:
             break
 
    sys.exist(return_code)

[This is just to make the code structured and readable!]

@oppiabot
Copy link

oppiabot bot commented Oct 22, 2020

Unassigning @DubeySandeep since the review is done.

Copy link
Contributor

@vojtechjelinek vojtechjelinek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@oppiabot
Copy link

oppiabot bot commented Oct 22, 2020

Unassigning @vojtechjelinek since they have already approved the PR.

@U8NWXD
Copy link
Member

U8NWXD commented Oct 23, 2020

I'm seeing an import error again, but only on TravisCI:

ImportError: No module named oauth2

@seanlip
Copy link
Member

seanlip commented Oct 23, 2020

This works on my local machine (or at least, it gets past the point on Travis where it's currently stuck). Per @kevintab95 I'm going to try closing and reopening this PR in order to (hopefully) break the cache (assuming caching is the issue).

@U8NWXD, if this persists, maybe you could just test out (on a separate branch) what happens for a single Travis test, and maybe add logs as needed to get more insight into what's going on?

@seanlip seanlip closed this Oct 23, 2020
@seanlip seanlip reopened this Oct 23, 2020
@oppiabot
Copy link

oppiabot bot commented Oct 24, 2020

Hi @nithusha21. Due to recent changes in the "develop" branch, this PR now has a merge conflict. Please follow this link if you need help resolving the conflict, so that the PR can be merged. Thanks!

@oppiabot
Copy link

oppiabot bot commented Oct 26, 2020

Hi @nithusha21. Due to recent changes in the "develop" branch, this PR now has a merge conflict. Please follow this link if you need help resolving the conflict, so that the PR can be merged. Thanks!

@seanlip
Copy link
Member

seanlip commented Oct 28, 2020

Closing -- superseded by #11056.

@seanlip seanlip closed this Oct 28, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants