
[Flaky test] EndpointSlice should create Endpoints and EndpointSlices for Pods matching a Service #93374

Closed
liggitt opened this issue Jul 23, 2020 · 26 comments · Fixed by #93402 or #93907
Labels
kind/flake Categorizes issue or PR as related to a flaky test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/network Categorizes an issue or PR as relevant to SIG Network.

@liggitt
Member

liggitt commented Jul 23, 2020

Which jobs are flaking:

  • ci-kubernetes-kind-e2e-parallel
  • pr:pull-kubernetes-e2e-gce
  • pr:pull-kubernetes-e2e-gce-ubuntu-containerd
  • ci-kubernetes-kind-ipv6-e2e-parallel

Which test(s) are flaking:

Second flakiest test in the e2e job according to http://storage.googleapis.com/k8s-metrics/flakes-latest.json

Testgrid link:

https://testgrid.k8s.io/presubmits-kubernetes-blocking#pull-kubernetes-e2e-gce&include-filter-by-regex=EndpointSlice%20should%20create%20Endpoints%20and%20EndpointSlices%20for%20Pods%20matching%20a%20Service&width=5

Reason for failure:

test/e2e/network/endpointslice.go:70
Jul 22 22:13:01.180: EndpointSlice resource not deleted after Service endpointslice-4585/example-empty-selector was deleted: timed out waiting for the condition
test/e2e/network/endpointslice.go:158

Anything else we need to know:

https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&test=EndpointSlice%20should%20create%20Endpoints%20and%20EndpointSlices%20for%20Pods%20matching%20a%20Service

/sig network
/assign @robscott

@liggitt liggitt added the kind/flake Categorizes issue or PR as related to a flaky test. label Jul 23, 2020
@k8s-ci-robot k8s-ci-robot added the sig/network Categorizes an issue or PR as relevant to SIG Network. label Jul 23, 2020
@liggitt liggitt added this to the v1.19 milestone Jul 23, 2020
@liggitt
Member Author

liggitt commented Jul 23, 2020

needs investigation since we're switching kube-proxy to endpointslice in this release

@athenabot

/triage unresolved

Comment /remove-triage unresolved when the issue is assessed and confirmed.

🤖 I am a bot run by vllry. 👩‍🔬

@k8s-ci-robot k8s-ci-robot added the triage/unresolved Indicates an issue that can not or will not be resolved. label Jul 23, 2020
@BenTheElder
Member

made me think of #91399, but that's in already.

@liggitt
Member Author

liggitt commented Jul 23, 2020

It's possible it's a test timeout that's just too close, but we need to verify that by checking the controller and apiserver logs to make sure there isn't a latent bug.

@robscott
Member

Thanks for catching this! It looks like the timeouts are actually associated with waiting for 2 pods to be ready, not with anything related to EndpointSlice functionality - the timeout is here: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/network/endpointslice.go#L261-L285. Given that, it seems like the obvious solution would be to update the Services to set publishNotReadyAddresses to true and update this block to wait for the Pods to have IPs, not for them to be ready. If that sounds reasonable, I'll work on getting a fix in tomorrow.
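For reference, a minimal sketch of what setting that field looks like with the core/v1 Go types (the Service name, selector, and ports below are illustrative placeholders, not the actual test fixtures):

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// exampleService returns a Service with publishNotReadyAddresses set, so the
// EndpointSlice controller publishes endpoints for Pods that have an IP but
// are not yet passing readiness checks. Names and ports are illustrative only.
func exampleService() *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "example-named-port"},
		Spec: corev1.ServiceSpec{
			Selector:                 map[string]string{"app": "example"},
			PublishNotReadyAddresses: true, // don't wait for Pod readiness
			Ports: []corev1.ServicePort{{
				Name:       "http",
				Port:       80,
				TargetPort: intstr.FromString("example-port"),
				Protocol:   corev1.ProtocolTCP,
			}},
		},
	}
}

func main() { _ = exampleService() }
```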

@robscott
Member

/remove-triage unresolved

@k8s-ci-robot k8s-ci-robot removed the triage/unresolved Indicates an issue that can not or will not be resolved. label Jul 23, 2020
@liggitt
Member Author

liggitt commented Jul 23, 2020

Given that, it seems like the obvious solution would be to update the Services to set publishNotReadyAddresses to true and update this block to wait for the Pods to have IPs, not for them to be ready.

If that still exercises the same conditions (I vaguely remember publishNotReadyAddresses changing code paths in the endpoints controller... does it do the same in the endpointslice controller?), that seems reasonable to me.

@robscott
Member

I think the code path is very similar here; the only real difference should be how the EndpointSlice controller calculates readiness. I've gone ahead and filed a PR that uses this approach, hopefully it helps: #93402

@hasheddan
Contributor

This flaked again this morning

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Jul 27, 2020
@k8s-ci-robot
Contributor

@hasheddan: Reopened this issue.

In response to this:

This flaked again this morning

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tpepper
Member

tpepper commented Jul 27, 2020

This should be
/priority critical-urgent
given #93374 (comment)

@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Jul 27, 2020
@liggitt
Member Author

liggitt commented Jul 27, 2020

https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1286795958520123392

test/e2e/network/endpointslice.go:162
Jul 24 23:26:39.937: Timed out waiting for matching EndpointSlices to exist: timed out waiting for the condition
test/e2e/network/endpointslice.go:321

@robscott
Member

The top level error message does not make this clear but the issue here is that too many EndpointSlices are being created - 3 instead of 2.

@liggitt
Member Author

liggitt commented Jul 27, 2020

The top level error message does not make this clear but the issue here is that too many EndpointSlices are being created - 3 instead of 2.

Interesting... that condition should not persist for O(minutes), right? That seems like it might be a bug, not just a test issue.

@robscott
Member

It's possible that the controller is not consistently merging 2 ports with 1 endpoint into a single slice and is instead creating a slice per port. It also could represent a larger bug. Trying to find a way to consistently reproduce this.

@robscott
Member

Despite my best efforts I have been unable to reproduce this on my own e2e cluster. I've had 10 test runs going at the same time repeatedly without any luck; any ideas for how to reproduce flaky test failures locally would be welcome. In the meantime, I've created a follow-up PR that will provide more helpful information if/when this test fails again: #93491.

My leading theory right now is that this could be a caching issue. The EndpointSlice controller runs a sync, sees that we don't have an EndpointSlice for the given port combination in cache, and then goes ahead and creates one, even though a matching EndpointSlice already exists and has not been cached yet. That would result in an annoying situation where an essentially duplicate EndpointSlice was created due to a stale cache. I would think the EndpointSliceTracker would resolve this issue for us, though, so I'm not sure. Getting the above PR in will help validate this theory.
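To make the suspected race concrete, here's a deliberately simplified, self-contained sketch; the fake cache and names are made up for illustration and are not the actual controller code:

```go
package main

import "fmt"

// slice stands in for an EndpointSlice; fakeCache stands in for the informer cache.
type slice struct{ name, ports string }

type fakeCache struct{ slices []slice }

func (c *fakeCache) slicesForPorts(ports string) []slice {
	var out []slice
	for _, s := range c.slices {
		if s.ports == ports {
			out = append(out, s)
		}
	}
	return out
}

// syncService mimics the suspected behavior: if the cache shows no slice for this
// port combination, create one -- even if an identical slice was created moments
// ago and simply hasn't been observed by the cache yet.
func syncService(c *fakeCache, ports string, n int) slice {
	if len(c.slicesForPorts(ports)) == 0 {
		return slice{name: fmt.Sprintf("example-named-port-%d", n), ports: ports}
	}
	return slice{}
}

func main() {
	cache := &fakeCache{}

	first := syncService(cache, "http", 1) // sync #1: cache empty, slice created
	// The create succeeds on the API server, but the watch event has not reached
	// the informer cache before the next sync for this Service runs.
	second := syncService(cache, "http", 2) // sync #2: cache still empty, duplicate created

	fmt.Println(first.name, second.name) // two effectively identical slices
}
```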

It's worth noting that if this is actually a duplicate EndpointSlice, clients like kube-proxy will handle this seamlessly because they already have to handle potential duplication in EndpointSlices. That would potentially explain why a test failure like this does not seem to be causing other things to break.

@robscott
Member

It appears that my theory was correct, as witnessed here: https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/93491/pull-kubernetes-e2e-gce/1287970453826048000/. I created a gist with what the 3 EndpointSlices look like. The key points I've noticed so far:

  • 1 EndpointSlice references pod2 (example-named-port-f7vq5)
  • 2 EndpointSlices reference pod1 (example-named-port-fkgcc and example-named-port-vfp2q)
  • 2 EndpointSlices were created at "2020-07-28T04:57:51Z":
    • example-named-port-f7vq5 (references pod2)
    • example-named-port-vfp2q (references pod1)
  • 1 EndpointSlice was created 1 second later ("2020-07-28T04:57:52Z")
    • example-named-port-fkgcc (references pod1)
  • All 3 EndpointSlices are:
    • On generation 1 - no updates have occurred
    • With a state of ready: true
  • The 2 EndpointSlices referencing pod1 are effectively identical

Of course this is something the EndpointSliceTracker should be handling for us, but even that approach seems less than ideal here. It wouldn't actually prevent the duplicate EndpointSlices from being created; it would just clean up duplicates after they happen. This, combined with the race conditions referenced in #92928, makes me think we may want to find a way to slow down the sync cycle for each Service here. This also feels at least tangentially related to the need for better rate limiting in both this and the Endpoints controller discussed in #88967.

/cc @wojtek-t @mborsz

@wojtek-t
Member

Hmm - actually I don't understand why this is happening. Maybe I'm missing something, but I don't think slowing down the controller is needed to solve this problem. Even though the controller usually is not lagging by more than 1s, it is possible that it can be lagging by a minute or more in pathological cases. So we can't just rely on it being synced.

Can you explain in a bit more depth why this is happening? I.e. a more detailed timeline like:

  • T1: state of cache is XYZ
  • T2: controller computes the ES x1, x2
  • T3: state of cache is ABC
  • T4: controller computes ES y1 [why?]

@liggitt
Member Author

liggitt commented Jul 28, 2020

I don't think slowing down controller is needed to solve this problem

Agree, I'm not in favor of slowing down the controller globally to address this. Delaying resync of a particular service until we observe the existence of new endpointslices we know we created (with some max delay) could be reasonable.

@robscott
Member

Delaying resync of a particular service until we observe the existence of new endpointslices we know we created (with some max delay) could be reasonable.

That's exactly where I was hoping to end up here, although I'm not quite sure how to do that cleanly yet. To be clear, this is not just to fix this flaky test. This flaky test is a symptom of a much larger and more annoying issue with the current implementation. The EndpointSlice controller is consistently running into errors where it's trying to update EndpointSlices that are stale. This is caused by follow-up calls to syncService running before the informer cache is up to date, which happens all the time during rolling updates, where rapid churn of EndpointSlices is expected. Although this eventually resolves itself, it causes a lot more load on the API server than necessary and can potentially result in some bad states where an update fails but a delete succeeds.

In this specific instance, it looks like a syncService run is occurring after the first 2 EndpointSlices have been created but before the one referencing pod1 is in the informer cache. That results in the controller creating a new EndpointSlice to reference pod1 and thus we end up with a duplicate. If anything in the future triggers another syncService call that would be resolved.

I'd initially thought that the EndpointSliceTracker would be the thing to trigger that follow-up syncService, but I was wrong. That tracker just tracks the latest resource version for each EndpointSlice the controller has created/updated and compares that with EndpointSlices received through create/update events. If the controller doesn't think that EndpointSlice should exist, or the controller's resource version for that slice does not match, it triggers another syncService. In this case neither of those conditions would be met, so no additional syncService call would be triggered.
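Roughly, that check behaves like the following sketch; the type and method names here are invented for illustration and do not match the real EndpointSliceTracker:

```go
package main

import "fmt"

// Simplified sketch of the resource-version tracking described above.
type sliceKey struct{ namespace, name string }

type tracker struct {
	// last resource version the controller itself wrote for each slice
	expected map[sliceKey]string
}

// record is called after the controller creates or updates an EndpointSlice.
func (t *tracker) record(key sliceKey, rv string) { t.expected[key] = rv }

// needsSync is called for EndpointSlice create/update events from the watch.
// It only reports true if the controller didn't expect the slice to exist, or
// if someone else changed it (resource version mismatch).
func (t *tracker) needsSync(key sliceKey, observedRV string) bool {
	expectedRV, ok := t.expected[key]
	if !ok {
		return true
	}
	return expectedRV != observedRV
}

func main() {
	t := &tracker{expected: map[sliceKey]string{}}
	key := sliceKey{namespace: "endpointslice-4585", name: "example-named-port-fkgcc"}

	// The controller created the duplicate slice itself, so the watch event for
	// it matches what the tracker expects and no extra syncService is triggered.
	t.record(key, "101")
	fmt.Println(t.needsSync(key, "101")) // false
}
```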

So that's a long way of saying that I think our best possible solution here is to introduce some delay between when Services can be synced. I think the easiest solution that would solve 99% of our problems would be to add something like a 1s delay in between syncs simply by delaying when Done was called. This would not be ideal, but it would result in an immediate improvement and be easy to backport. It's worth noting that although duplicate EndpointSlices like this or even attempts to update a stale EndpointSlice are not ideal, they are not actually invalid and will eventually resolve themselves. Consumers of this API already have to do some form of deduping.
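As a rough illustration of the "delay Done" idea, here's a sketch built on a standard client-go workqueue worker loop; this is not the actual EndpointSlice controller code, and the 1s value and key names are placeholders:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func worker(queue workqueue.RateLimitingInterface, syncService func(key string) error) {
	for {
		key, quit := queue.Get()
		if quit {
			return
		}

		err := syncService(key.(string))

		// The workqueue holds any re-adds of a key that arrive while that key is
		// being processed and only makes it available again after Done. Delaying
		// Done therefore enforces a minimum gap between syncs of the same Service,
		// giving the informer cache a chance to observe the slices just written.
		time.Sleep(1 * time.Second)

		if err != nil {
			queue.AddRateLimited(key)
		} else {
			queue.Forget(key)
		}
		queue.Done(key)
	}
}

func main() {
	queue := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "service")
	queue.Add("endpointslice-4585/example-named-port")

	go worker(queue, func(key string) error {
		fmt.Println("synced", key)
		return nil
	})

	time.Sleep(2 * time.Second)
	queue.ShutDown()
}
```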

With that said, I'm hopeful that there's a clean way to implement a solution that actually waits for each EndpointSlice change the controller has made to be reflected in the cache, because that would be a better approach. I'll spend some time looking into possible implementations here.

@robscott
Member

I've created a draft PR that contains my initial progress towards the approach I described above. I don't particularly love this approach so I'm very open to other ways we could solve this issue. It's also still very much a WIP and has significant work remaining. I opened the PR early just so I could get some initial feedback on the direction and potentially inspire other ideas around approaches we could take here. #93520

@robscott
Member

robscott commented Aug 5, 2020

#93520 is still the best approach I can come up with here. It is ready for review if anyone has time to take a look.

@robscott
Member

Quick update here: I've limited the scope to the following changes, as discussed in #93520 (comment):

Reviews would be appreciated on all of the above.

@wojtek-t
Member

Reopening since #93908 is not yet merged.

@wojtek-t wojtek-t reopened this Aug 12, 2020
@robscott
Member

Thanks for catching that @wojtek-t! Closing now that #93908 is merged.

/close

@k8s-ci-robot
Contributor

@robscott: Closing this issue.

In response to this:

Thanks for catching that @wojtek-t! Closing now that #93908 is merged.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
