
[Flaky test] EndpointSlice should create Endpoints and EndpointSlices for Pods matching a Service #93374

Closed
liggitt opened this issue Jul 23, 2020 · 26 comments · Fixed by #93402 or #93907
Labels
kind/flake Categorizes issue or PR as related to a flaky test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/network Categorizes an issue or PR as relevant to SIG Network.

@liggitt
Member

liggitt commented Jul 23, 2020

Which jobs are flaking:

  • ci-kubernetes-kind-e2e-parallel
  • pr:pull-kubernetes-e2e-gce
  • pr:pull-kubernetes-e2e-gce-ubuntu-containerd
  • ci-kubernetes-kind-ipv6-e2e-parallel

Which test(s) are flaking:

Second flakiest test in the e2e job according to http://storage.googleapis.com/k8s-metrics/flakes-latest.json

Testgrid link:

https://testgrid.k8s.io/presubmits-kubernetes-blocking#pull-kubernetes-e2e-gce&include-filter-by-regex=EndpointSlice%20should%20create%20Endpoints%20and%20EndpointSlices%20for%20Pods%20matching%20a%20Service&width=5

Reason for failure:

test/e2e/network/endpointslice.go:70
Jul 22 22:13:01.180: EndpointSlice resource not deleted after Service endpointslice-4585/example-empty-selector was deleted: timed out waiting for the condition
test/e2e/network/endpointslice.go:158

Anything else we need to know:

https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&test=EndpointSlice%20should%20create%20Endpoints%20and%20EndpointSlices%20for%20Pods%20matching%20a%20Service

/sig network
/assign @robscott

@liggitt liggitt added the kind/flake Categorizes issue or PR as related to a flaky test. label Jul 23, 2020
@k8s-ci-robot k8s-ci-robot added the sig/network Categorizes an issue or PR as relevant to SIG Network. label Jul 23, 2020
@liggitt liggitt added this to the v1.19 milestone Jul 23, 2020
@liggitt
Member Author

liggitt commented Jul 23, 2020

needs investigation since we're switching kube-proxy to endpointslice in this release

@athenabot

/triage unresolved

Comment /remove-triage unresolved when the issue is assessed and confirmed.

🤖 I am a bot run by vllry. 👩‍🔬

@k8s-ci-robot k8s-ci-robot added the triage/unresolved Indicates an issue that can not or will not be resolved. label Jul 23, 2020
@BenTheElder
Member

made me think of #91399, but that's in already.

@liggitt
Member Author

liggitt commented Jul 23, 2020

It's possible it's a test timeout that's just too close, but we need to verify that by checking the controller and apiserver logs to make sure there isn't a latent bug.

@robscott
Member

Thanks for catching this! It looks like the timeouts are actually associated with waiting for 2 pods to be ready, not with anything related to EndpointSlice functionality - the timeout is here: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/network/endpointslice.go#L261-L285. Given that, it seems like the obvious solution would be to update the Services to set publishNotReadyAddresses to true and update this block to wait for the Pods to have IPs, not for them to be ready. If that sounds reasonable, I'll work on getting a fix in tomorrow.
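For reference, a minimal sketch of what setting that field looks like with the core/v1 Go types (the Service name, selector, and ports below are illustrative placeholders, not the actual test fixtures):

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// exampleService returns a Service with publishNotReadyAddresses set, so the
// EndpointSlice controller publishes endpoints for Pods that have an IP but
// are not yet passing readiness checks. Names and ports are illustrative only.
func exampleService() *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "example-named-port"},
		Spec: corev1.ServiceSpec{
			Selector:                 map[string]string{"app": "example"},
			PublishNotReadyAddresses: true, // don't wait for Pod readiness
			Ports: []corev1.ServicePort{{
				Name:       "http",
				Port:       80,
				TargetPort: intstr.FromString("example-port"),
				Protocol:   corev1.ProtocolTCP,
			}},
		},
	}
}

func main() { _ = exampleService() }
```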

@robscott
Member

/remove-triage unresolved

@k8s-ci-robot k8s-ci-robot removed the triage/unresolved Indicates an issue that can not or will not be resolved. label Jul 23, 2020
@liggitt
Member Author

liggitt commented Jul 23, 2020

Given that, it seems like the obvious solution would be to update the Services to set publishNotReadyAddresses to true and update this block to wait for the Pods to have IPs, not for them to be ready.

If that still exercises the same conditions (I vaguely remember publishNotReadyAddresses changing code paths in the endpoints controller... does it do the same in the endpointslice controller?), that seems reasonable to me.

@robscott
Member

I think the code path is very similar here; the only real difference should be how the EndpointSlice controller calculates readiness. I've gone ahead and filed a PR that uses this approach, hopefully it helps: #93402

@hasheddan
Contributor

This flaked again this morning

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Jul 27, 2020
@k8s-ci-robot
Contributor

@hasheddan: Reopened this issue.

In response to this:

This flaked again this morning

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tpepper
Member

tpepper commented Jul 27, 2020

This should be
/priority critical-urgent
given #93374 (comment)

@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Jul 27, 2020
@liggitt
Member Author

liggitt commented Jul 27, 2020

https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1286795958520123392

test/e2e/network/endpointslice.go:162
Jul 24 23:26:39.937: Timed out waiting for matching EndpointSlices to exist: timed out waiting for the condition
test/e2e/network/endpointslice.go:321

@robscott
Member

The top level error message does not make this clear but the issue here is that too many EndpointSlices are being created - 3 instead of 2.

@liggitt
Member Author

liggitt commented Jul 27, 2020

The top level error message does not make this clear but the issue here is that too many EndpointSlices are being created - 3 instead of 2.

Interesting... that condition should not persist for O(minutes), right? That seems like it might be a bug, not just a test issue.

@robscott
Member

It's possible that the controller is not consistently merging 2 ports with 1 endpoint into a single slice and is instead creating a slice per port. It also could represent a larger bug. Trying to find a way to consistently reproduce this.

@robscott
Member

Despite my best efforts I have been unable to reproduce this on my own e2e cluster. I've had 10 test runs going at the same time repeatedly without any luck; any ideas for how to reproduce flaky test failures locally would be welcome. In the meantime, I've created a follow-up PR that will provide more helpful information if/when this test fails again: #93491.

My leading theory right now is that this could be a caching issue. The EndpointSlice controller runs a sync, sees that we don't have an EndpointSlice for the given port combination in cache, and then goes ahead and creates one, even though a matching EndpointSlice already exists and has not been cached yet. That would result in an annoying situation where an essentially duplicate EndpointSlice was created due to a stale cache. I would think the EndpointSliceTracker would resolve this issue for us, though, so I'm not sure. Getting the above PR in will help validate this theory.
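To make the suspected race concrete, here's a deliberately simplified, self-contained sketch; the fake cache and names are made up for illustration and are not the actual controller code:

```go
package main

import "fmt"

// slice stands in for an EndpointSlice; fakeCache stands in for the informer cache.
type slice struct{ name, ports string }

type fakeCache struct{ slices []slice }

func (c *fakeCache) slicesForPorts(ports string) []slice {
	var out []slice
	for _, s := range c.slices {
		if s.ports == ports {
			out = append(out, s)
		}
	}
	return out
}

// syncService mimics the suspected behavior: if the cache shows no slice for this
// port combination, create one -- even if an identical slice was created moments
// ago and simply hasn't been observed by the cache yet.
func syncService(c *fakeCache, ports string, n int) slice {
	if len(c.slicesForPorts(ports)) == 0 {
		return slice{name: fmt.Sprintf("example-named-port-%d", n), ports: ports}
	}
	return slice{}
}

func main() {
	cache := &fakeCache{}

	first := syncService(cache, "http", 1) // sync #1: cache empty, slice created
	// The create succeeds on the API server, but the watch event has not reached
	// the informer cache before the next sync for this Service runs.
	second := syncService(cache, "http", 2) // sync #2: cache still empty, duplicate created

	fmt.Println(first.name, second.name) // two effectively identical slices
}
```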

It's worth noting that if this is actually a duplicate EndpointSlice, clients like kube-proxy will handle this seamlessly because they already have to handle potential duplication in EndpointSlices. That would potentially explain why a test failure like this does not seem to be causing other things to break.

@robscott
Member

It appears that my theory was correct, as witnessed here: https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/93491/pull-kubernetes-e2e-gce/1287970453826048000/. I created a gist with what the 3 EndpointSlices look like. The key points I've noticed so far:

  • 1 EndpointSlice references pod2 (example-named-port-f7vq5)
  • 2 EndpointSlices reference pod1 (example-named-port-fkgcc and example-named-port-vfp2q)
  • 2 EndpointSlices were created at "2020-07-28T04:57:51Z":
    • example-named-port-f7vq5 (references pod2)
    • example-named-port-vfp2q (references pod1)
  • 1 EndpointSlice was created 1 second later ("2020-07-28T04:57:52Z")
    • example-named-port-fkgcc (references pod1)
  • All 3 EndpointSlices are:
    • On generation 1 - no updates have occurred
    • With a state of ready: true
  • The 2 EndpointSlices referencing pod1 are effectively identical

Of course this is something the EndpointSliceTracker should be handling for us, but even that approach seems less than ideal here. It wouldn't actually prevent the duplicate EndpointSlices from being created; it would just clean up duplicates after they happen. This, combined with the race conditions referenced in #92928, makes me think we may want to find a way to slow down the sync cycle for each Service here. This also feels at least tangentially related to the need for better rate limiting in both this and the Endpoints controller discussed in #88967.

/cc @wojtek-t @mborsz

@wojtek-t
Member

Hmm - actually I don't understand why this is happening. Maybe I'm missing something, but I don't think slowing down the controller is needed to solve this problem. Even though the controller usually is not lagging by more than 1s, it is possible that it can be lagging by a minute or more in pathological cases. So we can't just rely on it being synced.

Can you explain in a bit more depth why this is happening? I.e. a more detailed timeline like:

  • T1: state of cache is XYZ
  • T2: controller computes the ES x1, x2
  • T3: state of cache is ABC
  • T4: controller computes ES y1 [why?]

@liggitt
Member Author

liggitt commented Jul 28, 2020

I don't think slowing down controller is needed to solve this problem

Agree, I'm not in favor of slowing down the controller globally to address this. Delaying resync of a particular service until we observe the existence of new endpointslices we know we created (with some max delay) could be reasonable.

@robscott
Member

Delaying resync of a particular service until we observe the existence of new endpointslices we know we created (with some max delay) could be reasonable.

That's exactly where I was hoping to end up here, although I'm not quite sure how to do that cleanly yet. To be clear, this is not just to fix this flaky test. This flaky test is a symptom of a much larger and more annoying issue with the current implementation. The EndpointSlice controller is consistently running into errors where it's trying to update EndpointSlices that are stale. This is caused by follow-up calls to syncService running before the informer cache is up to date, which happens all the time during rolling updates, where rapid churn of EndpointSlices is expected. Although this eventually resolves itself, it causes a lot more load on the API server than necessary and can potentially result in some bad states where an update fails but a delete succeeds.

In this specific instance, it looks like a syncService run is occurring after the first 2 EndpointSlices have been created but before the one referencing pod1 is in the informer cache. That results in the controller creating a new EndpointSlice to reference pod1 and thus we end up with a duplicate. If anything in the future triggers another syncService call that would be resolved.

I'd initially thought that the EndpointSliceTracker would be the thing to trigger that follow-up syncService, but I was wrong. That tracker just tracks the latest resource version for each EndpointSlice the controller has created/updated and compares that with EndpointSlices received through create/update events. If the controller doesn't think that EndpointSlice should exist, or the controller's resource version for that slice does not match, it triggers another syncService. In this case neither of those conditions would be met, so no additional syncService call would be triggered.
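Roughly, that check behaves like the following sketch; the type and method names here are invented for illustration and do not match the real EndpointSliceTracker:

```go
package main

import "fmt"

// Simplified sketch of the resource-version tracking described above.
type sliceKey struct{ namespace, name string }

type tracker struct {
	// last resource version the controller itself wrote for each slice
	expected map[sliceKey]string
}

// record is called after the controller creates or updates an EndpointSlice.
func (t *tracker) record(key sliceKey, rv string) { t.expected[key] = rv }

// needsSync is called for EndpointSlice create/update events from the watch.
// It only reports true if the controller didn't expect the slice to exist, or
// if someone else changed it (resource version mismatch).
func (t *tracker) needsSync(key sliceKey, observedRV string) bool {
	expectedRV, ok := t.expected[key]
	if !ok {
		return true
	}
	return expectedRV != observedRV
}

func main() {
	t := &tracker{expected: map[sliceKey]string{}}
	key := sliceKey{namespace: "endpointslice-4585", name: "example-named-port-fkgcc"}

	// The controller created the duplicate slice itself, so the watch event for
	// it matches what the tracker expects and no extra syncService is triggered.
	t.record(key, "101")
	fmt.Println(t.needsSync(key, "101")) // false
}
```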

So that's a long way of saying that I think our best possible solution here is to introduce some delay between when Services can be synced. I think the easiest solution that would solve 99% of our problems would be to add something like a 1s delay in between syncs simply by delaying when Done was called. This would not be ideal, but it would result in an immediate improvement and be easy to backport. It's worth noting that although duplicate EndpointSlices like this or even attempts to update a stale EndpointSlice are not ideal, they are not actually invalid and will eventually resolve themselves. Consumers of this API already have to do some form of deduping.
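As a rough illustration of the "delay Done" idea, here's a sketch built on a standard client-go workqueue worker loop; this is not the actual EndpointSlice controller code, and the 1s value and key names are placeholders:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func worker(queue workqueue.RateLimitingInterface, syncService func(key string) error) {
	for {
		key, quit := queue.Get()
		if quit {
			return
		}

		err := syncService(key.(string))

		// The workqueue holds any re-adds of a key that arrive while that key is
		// being processed and only makes it available again after Done. Delaying
		// Done therefore enforces a minimum gap between syncs of the same Service,
		// giving the informer cache a chance to observe the slices just written.
		time.Sleep(1 * time.Second)

		if err != nil {
			queue.AddRateLimited(key)
		} else {
			queue.Forget(key)
		}
		queue.Done(key)
	}
}

func main() {
	queue := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "service")
	queue.Add("endpointslice-4585/example-named-port")

	go worker(queue, func(key string) error {
		fmt.Println("synced", key)
		return nil
	})

	time.Sleep(2 * time.Second)
	queue.ShutDown()
}
```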

With that said, I'm hopeful that there's a clean way to implement a solution that actually waits for each EndpointSlice change the controller has made to be reflected in the cache, because that would be a better approach. I'll spend some time looking into possible implementations here.

@robscott
Member

I've created a draft PR that contains my initial progress towards the approach I described above. I don't particularly love this approach so I'm very open to other ways we could solve this issue. It's also still very much a WIP and has significant work remaining. I opened the PR early just so I could get some initial feedback on the direction and potentially inspire other ideas around approaches we could take here. #93520

@robscott
Member

robscott commented Aug 5, 2020

#93520 is still the best approach I can come up with here. It is ready for review if anyone has time to take a look.

@robscott
Member

Quick update here: I've limited the scope to the following changes, as discussed in #93520 (comment):

Reviews would be appreciated on all of the above.

@wojtek-t
Member

Reopening since #93908 is not yet merged.

@wojtek-t wojtek-t reopened this Aug 12, 2020
@robscott
Member

Thanks for catching that @wojtek-t! Closing now that #93908 is merged.

/close

@k8s-ci-robot
Contributor

@robscott: Closing this issue.

In response to this:

Thanks for catching that @wojtek-t! Closing now that #93908 is merged.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
