Delaying EndpointSlice controller syncs until cache is updated #93520

Closed
wants to merge 1 commit into from

Conversation

robscott
Member

@robscott robscott commented Jul 29, 2020

What type of PR is this?
/kind bug
/kind failing-test

What this PR does / why we need it:
This updates the EndpointSlice controller to wait until the cache is updated (or a max delay of 3 seconds is exceeded) before running a sync for a Service. This should help avoid race conditions that would result in duplicate EndpointSlices or failed attempts to update stale EndpointSlices.
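Roughly, the gating logic is along these lines (a minimal sketch, not the actual code in this PR; shouldSyncNow, cacheUpToDate, and maxCacheDelay are illustrative names):

```go
package endpointslicesketch

import "time"

const maxCacheDelay = 3 * time.Second // assumed max wait, per the description above

// shouldSyncNow decides whether a queued sync for a Service may run yet.
// cacheUpToDate reports whether the informer cache reflects the controller's
// last EndpointSlice writes for the Service; lastWrite is when that write
// happened. If the cache hasn't caught up and the max delay hasn't elapsed,
// the item is requeued after a short delay instead of being synced.
func shouldSyncNow(cacheUpToDate bool, lastWrite time.Time) (runNow bool, requeueAfter time.Duration) {
	if cacheUpToDate || time.Since(lastWrite) >= maxCacheDelay {
		return true, 0
	}
	return false, 100 * time.Millisecond
}
```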

Which issue(s) this PR fixes:
This will eventually fix #93374 and #92928.

Special notes for your reviewer:
This is very much still a WIP. I filed this early to get some initial feedback and to at least see if I'm working in a reasonable direction. I do not love this approach but haven't thought of a better one yet, so I'm very open to alternative ideas here.

I'd initially hoped that I could rely on the retry mechanism in the queue to simplify this implementation, but that simply does not work since so many events can trigger a new sync. Currently I'm planning on updating the endpointSliceTracker to also track the last time the controller modified an EndpointSlice for each Service, and then updating the logic to bypass the outdated-cache check after a set number of seconds have passed since the last update. Other significant todos here include adding some test coverage and ensuring that a cache being outdated at one point in time does not forever doom a Service to slow sync times.

Does this PR introduce a user-facing change?:

The EndpointSlice controller now waits for its informer cache to be updated (up to a maximum delay) before syncing a Service, reducing duplicate EndpointSlices and failed updates caused by stale caches.

/sig network
/cc @liggitt @freehan

@k8s-ci-robot k8s-ci-robot requested review from freehan and liggitt July 29, 2020 01:26
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Jul 29, 2020
@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. sig/network Categorizes an issue or PR as relevant to SIG Network. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/apps Categorizes an issue or PR as relevant to SIG Apps. labels Jul 29, 2020
@robscott
Member Author

So I think this is a bit closer now. I've updated the logic to use a max delay instead of a max number of retries, and I've added some admittedly less-than-ideal logic to ensure that a single failure doesn't doom a Service to slow syncs forever. I've been testing this with kind and e2e clusters to try to understand the behavior of an informer, and it's been frustratingly inconsistent: sometimes adds or updates take significantly longer than my 3 second max delay to show up. At this point that is purely anecdotal and I don't have sufficient data/evidence to back it up. I'll keep working on figuring out exactly what is going on here, but would certainly appreciate a second set of eyes on this in case I'm missing something obvious, or even just if this direction doesn't actually make sense.

@robscott
Member Author

robscott commented Jul 31, 2020

I talked a bit with @wojtek-t this morning and he suggested trying a larger master. I've moved this up to an n2-standard-4 and am only scaling up a 100-pod deployment. Here's a copy of the kube-controller-manager logs from that (using the debug logging added in this PR). The good news is that this approach generally seems to work. The bad news is that even with a larger master I'm still missing cache updates. The key lines in these logs are TRACKER UPDATE (the controller has added/updated an EndpointSlice and the tracker has tracked that update) and CACHE UPDATED FOR SLICE (the controller has received an EndpointSlice add/update event).
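For reference, the CACHE UPDATED FOR SLICE side is conceptually something like this (a simplified sketch with an illustrative tracker interface, not the exact code in this PR):

```go
package endpointslicesketch

import discovery "k8s.io/api/discovery/v1beta1"

// tracker is a stand-in for the endpointSliceTracker; only the idea matters here.
type tracker interface {
	// Has reports whether the controller is expecting a cache update for this slice.
	Has(slice *discovery.EndpointSlice) bool
	// MarkCacheUpdated records that the informer cache now reflects the slice.
	MarkCacheUpdated(slice *discovery.EndpointSlice)
}

// onEndpointSliceAddOrUpdate is what the informer add/update handlers would
// call: if the controller wrote this slice, note that the cache has caught up.
func onEndpointSliceAddOrUpdate(t tracker, obj interface{}) {
	slice, ok := obj.(*discovery.EndpointSlice)
	if !ok {
		return
	}
	if t.Has(slice) {
		t.MarkCacheUpdated(slice)
	}
}
```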

There are some examples where this works exactly as expected:
https://gist.github.com/robscott/df1832cd9a6d5a69adeccfa2830edf9a#file-kube-controller-manager-log-L427-L434
https://gist.github.com/robscott/df1832cd9a6d5a69adeccfa2830edf9a#file-kube-controller-manager-log-L488-L505

There are also other examples where the cache still takes >3s to be updated though:
~3.2s: https://gist.github.com/robscott/df1832cd9a6d5a69adeccfa2830edf9a#file-kube-controller-manager-log-L451-L489
~5s: https://gist.github.com/robscott/df1832cd9a6d5a69adeccfa2830edf9a#file-kube-controller-manager-log-L506-L526

Although this code is still quite messy, this is still the best solution I can think of. Unless @liggitt or @wojtek-t have better ideas I'll try to get this cleaned up and add some tests so this can actually be ready for review.

@robscott robscott marked this pull request as ready for review July 31, 2020 21:59
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 31, 2020
@robscott robscott marked this pull request as draft July 31, 2020 22:00
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 31, 2020
@robscott
Member Author

robscott commented Aug 1, 2020

Added some additional test coverage; going to move this out of draft mode so I can get full e2e tests on this. This breaks some EndpointSlice controller tests added by @mborsz around Pod batching, since those are heavily dependent on timing, so at least those will fail. I think there's a way to get those to work with this change; I just haven't figured it out yet.

@robscott robscott marked this pull request as ready for review August 1, 2020 01:18
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 1, 2020
@robscott robscott force-pushed the endpointslice-wait branch from 23d8208 to 8bb6b10 Compare August 5, 2020 01:04
@robscott
Member Author

robscott commented Aug 5, 2020

I think this PR is ready for an initial round of review. I ended up updating the pod batch timing tests to essentially bypass this functionality by setting a maxCacheDelay of 0 seconds. Unfortunately, since those tests are heavily dependent on timing and this change introduces some inconsistency in when syncService() will be called, any non-zero value for maxCacheDelay could have introduced flakes here.

As much as I would appreciate review on this PR now, I know that I still need to add test coverage for MarkServiceCacheUpdated and ServiceCacheUpdatedSince. Additionally, if this new approach makes sense we should likely apply the same updates to the EndpointSliceMirroring controller. At that point my work towards moving this out to a separate package that can be referenced by both controllers becomes relevant. Adding a new package does seem like a bit of a stretch for a bug fix though.
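For context, those two helpers are roughly along these lines (a sketch only; the struct and locking details are assumptions, just the method names come from this PR):

```go
package endpointslicesketch

import (
	"sync"
	"time"
)

type serviceCacheState struct {
	lock    sync.Mutex
	updated map[string]time.Time // service key -> time the cache last caught up
}

// MarkServiceCacheUpdated records that the informer cache has caught up with
// the controller's writes for the given Service.
func (s *serviceCacheState) MarkServiceCacheUpdated(serviceKey string) {
	s.lock.Lock()
	defer s.lock.Unlock()
	if s.updated == nil {
		s.updated = map[string]time.Time{}
	}
	s.updated[serviceKey] = time.Now()
}

// ServiceCacheUpdatedSince reports whether the cache has caught up for the
// Service at or after the given time.
func (s *serviceCacheState) ServiceCacheUpdatedSince(serviceKey string, t time.Time) bool {
	s.lock.Lock()
	defer s.lock.Unlock()
	u, ok := s.updated[serviceKey]
	return ok && !u.Before(t)
}
```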

With that said, my tests have shown that this does fix #93374 and the related bugs. I know this is not an ideal solution though, so if anyone has other ideas I'm very open to them.

/assign @liggitt @freehan @wojtek-t

@robscott
Member Author

robscott commented Aug 5, 2020

/priority critical-urgent

@robscott
Member Author

Talked with @bowei this morning about this some more and he pointed out that this specific PR is more of an optimization than a correction. With that said, I think I'm OK with leaving this specific PR/optimization problem for later and making the following smaller fixes:

  • Update the flaky test to consider duplicate EndpointSlices valid.
  • Update the docs to make it very clear that consumers of the EndpointSlice API will have to do deduplication (as kube-proxy already does); see the sketch after this list.
  • Update the controller error handling to return immediately if a create, update, or delete errors (in that order).
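
For the second bullet, consumer-side deduplication would look something like this (an illustrative sketch, not kube-proxy's actual implementation):

```go
package endpointslicesketch

import discovery "k8s.io/api/discovery/v1beta1"

// dedupedReadyAddresses merges the ready endpoint addresses from all
// EndpointSlices of a Service and drops duplicates, so a transient duplicate
// slice from the controller doesn't result in duplicate endpoints.
func dedupedReadyAddresses(slices []*discovery.EndpointSlice) []string {
	seen := map[string]bool{}
	var addrs []string
	for _, slice := range slices {
		for _, ep := range slice.Endpoints {
			if ep.Conditions.Ready != nil && !*ep.Conditions.Ready {
				continue
			}
			for _, addr := range ep.Addresses {
				if !seen[addr] {
					seen[addr] = true
					addrs = append(addrs, addr)
				}
			}
		}
	}
	return addrs
}
```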

I'm hoping to have PRs for all of the above ready later today.

@robscott
Member Author

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Aug 11, 2020
@robscott
Member Author

/remove-priority critical-urgent

@liggitt
Member

liggitt commented Aug 13, 2020

/milestone v1.20

@k8s-ci-robot k8s-ci-robot modified the milestones: v1.19, v1.20 Aug 13, 2020
@liggitt liggitt removed this from the v1.20 milestone Aug 27, 2020
@liggitt liggitt added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 27, 2020
@liggitt
Member

liggitt commented Aug 27, 2020

explicitly holding until we decide if we want to go with this approach

@k8s-ci-robot
Contributor

@robscott: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 17, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 16, 2020
@thockin
Member

thockin commented Dec 16, 2020 via email

@robscott
Member Author

Thanks for the reminder on this one @thockin! This specific approach is likely not ideal, but it does seem like we'll need some optimizations here. At present, we're doing a lot of unnecessary work due to the controller operating with an out of date cache. Although certainly not elegant, I believe even this PR would represent an improvement over the status quo. I think @lavalamp's suggestion above to try to use/implement something resembling a write-through cache could be helpful.

Fundamentally I don't think we want to move away from the informer cache as the source of truth here, so we just need to find a way to reduce the odds that we'll run a sync before that cache has caught up. Those odds are currently rather high during a rolling update, and I don't think it would take a very complex solution to provide a significant improvement here.

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 18, 2020
@aojea
Member

aojea commented Jan 11, 2021

A cheap hack is to at least make actions deterministic enough that they conflict

@robscott if the endpoint slice name is deterministic, will this be solved?

kubectl get endpointslices -o wide
NAME       ADDRESSTYPE   PORTS     ENDPOINTS   AGE
test-001   IPv4          <unset>   <unset>     19d
test-002   IPv4          <unset>   <unset>     19d

@robscott
Member Author

Unfortunately I don't think the name itself is sufficient. If I'm remembering the problem correctly, we're also seeing a large number of these warnings from the controller attempting to update a stale copy of an EndpointSlice.
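
To make the stale-copy part concrete: the update fails because the copy read from the informer cache is older than what the controller last wrote. A trivial sketch of that check (a hypothetical helper, not code from this PR):

```go
package endpointslicesketch

import discovery "k8s.io/api/discovery/v1beta1"

// cachedCopyIsStale reports whether the cached slice is not the version the
// controller last wrote (lastWrittenRV); updating from such a copy fails with
// a conflict, producing the warnings mentioned above.
func cachedCopyIsStale(cached *discovery.EndpointSlice, lastWrittenRV string) bool {
	return cached.ResourceVersion != lastWrittenRV
}
```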

@robscott
Member Author

Created a new PR that takes a bit of a simpler approach here: #99345. Closing this one in favor of that.

/close

@k8s-ci-robot
Contributor

@robscott: Closed this PR.

In response to this:

Created a new PR that takes a bit of a simpler approach here: #99345. Closing this one in favor of that.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@robscott robscott deleted the endpointslice-wait branch March 11, 2021 04:56
Successfully merging this pull request may close these issues:
[Flaky test] EndpointSlice should create Endpoints and EndpointSlices for Pods matching a Service