-
Notifications
You must be signed in to change notification settings - Fork 39.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updating EndpointSlice e2e tests to be less flaky and easier to debug #93402
Updating EndpointSlice e2e tests to be less flaky and easier to debug #93402
Conversation
/sig networking |
@robscott: The label(s) In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/remove-kind feature |
test/e2e/network/endpointslice.go
Outdated
// Expect Endpoints resource to not be deleted when Service is. | ||
// TODO(robscott): Update this test when this bug is fixed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we expect that the Endpoints
resource will not be deleted here wouldn't this test always fail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, I don't understand the comment either, the code seems to check they are deleted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤦 Apparently I momentarily confused this part with another part that is waiting for an Endpoints related fix and thought I'd forgotten to add a TODO here. As it turns out the TODO already existed in the appropriate place: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/network/endpointslice.go#L296-L297. That bug will actually be fixed with an open PR so I need to remember to connect those dots.
3278b0d
to
10be96b
Compare
test/e2e/network/endpointslice.go
Outdated
@@ -143,7 +143,7 @@ var _ = SIGDescribe("EndpointSlice", func() { | |||
} | |||
|
|||
// Expect EndpointSlice resource to be deleted when Service is. | |||
if err := wait.PollImmediate(2*time.Second, 12*time.Second, func() (bool, error) { | |||
if err := wait.PollImmediate(2*time.Second, 18*time.Second, func() (bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a surprisingly specific time... why not wait.ForeverTestTimeout (30 seconds)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somehow I think I'd thought the only reason it would take longer than this would be a legitimate failure, clearly that's not always the case. Also, I didn't know about wait.ForeverTestTimeout
- I've updated all 4 instances of 12 second times to use that instead, thanks!
10be96b
to
3fea696
Compare
This adjusts tests that were waiting for Pods to be ready to instead just wait for them to have IPs assigned to them. This relies on the associated publishNotReadyAddresses field on Services. Additionally this increases the the length of time we'll wait for EndpointSlices to be garbage collected from 12s to 30s. Finally, this adds additional logging to ExpectNoError calls so it's easier to understand where and why a test failed.
3fea696
to
3e4745d
Compare
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: liggitt, robscott The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest Review the full test history for this PR. Silence the bot with an |
/retest |
/retest Review the full test history for this PR. Silence the bot with an |
/retest |
/retest Review the full test history for this PR. Silence the bot with an |
1 similar comment
/retest Review the full test history for this PR. Silence the bot with an |
What type of PR is this?
/kind cleanup
/kind failing-test
What this PR does / why we need it:
This adjusts tests that were waiting for Pods to be ready to instead just wait for them to have IPs assigned to them. This relies on the associated publishNotReadyAddresses field on Services. Additionally this increases the the length of time we'll wait for EndpointSlices to be garbage collected from 12s to 18s. Finally, this adds additional logging to ExpectNoError calls so it's easier to understand where and why a test failed.
Which issue(s) this PR fixes:
I'm hoping this PR will fix #93374 and #92776, but all I can say for sure is that it "works on my cluster".
Special notes for your reviewer:
Although it seems backwards to even be testing for EndpointSlice garbage collection we need some kind of test coverage to make sure that the EndpointSlice controller does not conflict with the garbage collector. Maybe there's a better way to do that than have this long timeout here. It could potentially be more straightforward to just immediately delete both Endpoints and EndpointSlices when a Service is marked for deletion, but that could potentially be undesirable if there are long running finalizers in place.
Does this PR introduce a user-facing change?:
/cc @aojea @freehan @RobertKielty @hasheddan
/assign @liggitt