-
Notifications
You must be signed in to change notification settings - Fork 40.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix error handling in gc e2e test #60671
Fix error handling in gc e2e test #60671
Conversation
@jennybuckley: Adding do-not-merge/release-note-label-needed because the release note process has not been followed. One of the following labels is required "release-note", "release-note-action-required", or "release-note-none". Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/lgtm |
/release-note-none |
If this weren't a test I'd ask for this to use the error list concept that (I think) the validation code defines (uses?) :) |
/kind bug |
@@ -610,19 +611,20 @@ var _ = SIGDescribe("Garbage collector", func() { | |||
return verifyRemainingDeploymentsReplicaSetsPods(f, clientSet, deployment, 0, 1, 2) | |||
}) | |||
if err != nil { | |||
err = fmt.Errorf("Failed to wait to see if the garbage collecter mistakenly deletes the rs: %v", err) | |||
errList := err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
accumulating into list would probably make more sense than passing it back into fmt.Errorf
@@ -553,13 +553,14 @@ var _ = SIGDescribe("Garbage collector", func() { | |||
return verifyRemainingDeploymentsReplicaSetsPods(f, clientSet, deployment, 0, 0, 0) | |||
}) | |||
if err == wait.ErrWaitTimeout { | |||
err = fmt.Errorf("Failed to wait for all rs to be garbage collected: %v", err) | |||
errList := err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here... accumulate?
6dfc643
to
af684f9
Compare
err = wait.PollImmediate(5*time.Second, 2*time.Minute, func() (bool, error) { | ||
return verifyRemainingDeploymentsReplicaSetsPods(f, clientSet, deployment, 0, 1, 2) | ||
}) | ||
By("wait for 10s to see if the garbage collector mistakenly deletes the rs") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is changed because it doesn't make sense to use PollImmediate to make sure something doesn't happen (at least the way it was being used here). If it hasn't happened by the first polling the PollImmediate will succeed and then what we are watching for may end up occurring later and we will miss it. How we were doing it, the test would only wait either ~0, or 5 seconds (depending on how long the deletion of the Deployment took, usually the 0s case), so this is actually an increase in wait time despite the log message diff indicating a decrease.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Could you keep the code consistent with https://github.com/kubernetes/kubernetes/blame/6546b69964f96ff6dd7bfdd8e72f89f09bd3e305/test/e2e/apimachinery/garbage_collector.go#L447-L459?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@liggitt I updated to use aggregated errors and also to fix two more places where errors could be missed and the test could pass when it should fail |
@@ -552,15 +552,17 @@ var _ = SIGDescribe("Garbage collector", func() { | |||
err = wait.PollImmediate(500*time.Millisecond, 1*time.Minute, func() (bool, error) { | |||
return verifyRemainingDeploymentsReplicaSetsPods(f, clientSet, deployment, 0, 0, 0) | |||
}) | |||
if err == wait.ErrWaitTimeout { | |||
err = fmt.Errorf("Failed to wait for all rs to be garbage collected: %v", err) | |||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix for error swallowing missed by f08f504
ece5a70
to
218199c
Compare
}) | ||
if err != nil { | ||
err = fmt.Errorf("Failed to wait to see if the garbage collecter mistakenly deletes the rs: %v", err) | ||
By("wait for 30 seconds to see if the garbage collector mistakenly deletes the rs") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why poll at all? why not just sleep 30 seconds, then list and ensure length is 1? expecting to get a timeout error is confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liggitt
that is how I had it but changed it in favor of consistency with other similar code in this file
#60671 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this be better? jennybuckley@03b4b23#diff-4a7d8678f8505e86c1c6cc626cdc5eeeR613
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jennybuckley your code is better, let's change the rest of code to comply with yours.
@sig-testing-approvers PTAL ASAP |
this is an improvement, but is it a release blocker? |
218199c
to
5f518e6
Compare
@liggitt no, I don't think it's a release blocker |
let's move it out of the milestone, then. we're down to the wire on 1.10 and only essential things should be in at this point /milestone clear |
/retest |
/test pull-kubernetes-e2e-gce |
return false, nil | ||
}); err != nil && err != wait.ErrWaitTimeout { | ||
framework.Failf("%v", err) | ||
time.Sleep(30 * time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: polling helps us fail earlier in case of error. but in reality we expect these tests to pass consistently, so there is no difference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a while ago, but I think the reason for that change was to keep consistency with the change I made to https://github.com/kubernetes/kubernetes/pull/60671/files#diff-4a7d8678f8505e86c1c6cc626cdc5eeeL608
In that case the way the polling was set up was completely broken, and it would always succeed within the first 0-5 seconds, even if the garbage collector mistakenly deleted the replica set.
Discussion about it here: #60671 (comment)
/lgtm |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: caesarxuchao, jennybuckley, liggitt, roycaihw The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest |
bringing in to help diagnose flakes in this test (c.f. https://storage.googleapis.com/k8s-gubernator/triage/index.html?ci=0&pr=1&sig=api-machinery&text=post%20mortem&job=pull-kubernetes-e2e-gce&test=Garbage%20collector) /milestone v1.11 |
[MILESTONENOTIFIER] Milestone Pull Request: Up-to-date for process @caesarxuchao @jennybuckley @liggitt @roycaihw Pull Request Labels
|
/test all [submit-queue is verifying that this PR is safe to merge] |
/retest |
fyi, fix in progress for that flake in #64726 |
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here. |
What this PR does / why we need it:
Error messages were not being surfaced in log of GC e2e test, part of fixing #60463