Delaying EndpointSlice controller syncs until cache is updated #93520

Closed
wants to merge 1 commit into from

Conversation

robscott
Member

@robscott robscott commented Jul 29, 2020

What type of PR is this?
/kind bug
/kind failing-test

What this PR does / why we need it:
This updates the EndpointSlice controller to wait until the cache is updated (or a max delay of 3 seconds is exceeded) before running a sync for a Service. This should help avoid race conditions that would result in duplicate EndpointSlices or failed attempts to update stale EndpointSlices.
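Roughly, the gating logic is along these lines (a minimal sketch, not the actual code in this PR; shouldSyncNow, cacheUpToDate, and maxCacheDelay are illustrative names):

```go
package endpointslicesketch

import "time"

const maxCacheDelay = 3 * time.Second // assumed max wait, per the description above

// shouldSyncNow decides whether a queued sync for a Service may run yet.
// cacheUpToDate reports whether the informer cache reflects the controller's
// last EndpointSlice writes for the Service; lastWrite is when that write
// happened. If the cache hasn't caught up and the max delay hasn't elapsed,
// the item is requeued after a short delay instead of being synced.
func shouldSyncNow(cacheUpToDate bool, lastWrite time.Time) (runNow bool, requeueAfter time.Duration) {
	if cacheUpToDate || time.Since(lastWrite) >= maxCacheDelay {
		return true, 0
	}
	return false, 100 * time.Millisecond
}
```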

Which issue(s) this PR fixes:
This will eventually fix #93374 and #92928.

Special notes for your reviewer:
This is very much still a WIP. I filed this early to get some initial feedback and to at least see if I'm working in a reasonable direction. I do not love this approach but haven't thought of a better one yet, so I'm very open to alternative ideas here.

I'd initially hoped that I could rely on the retry mechanism in the queue to simplify this implementation, but that simply does not work since so many events can trigger a new sync. Currently I'm planning on updating the endpointSliceTracker to also track the last time the controller modified an EndpointSlice for each Service, and then updating the logic to bypass the outdated-cache check after a set number of seconds have passed since the last update. Other significant todos here include adding some test coverage and ensuring that a cache being outdated at one point in time does not forever doom a Service to slow sync times.

Does this PR introduce a user-facing change?:

The EndpointSlice controller now waits for its informer cache to be updated (up to a maximum delay) before syncing a Service, reducing duplicate EndpointSlices and failed updates caused by stale caches.

/sig network
/cc @liggitt @freehan

@k8s-ci-robot k8s-ci-robot requested review from freehan and liggitt July 29, 2020 01:26
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Jul 29, 2020
@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. sig/network Categorizes an issue or PR as relevant to SIG Network. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/apps Categorizes an issue or PR as relevant to SIG Apps. labels Jul 29, 2020
@robscott
Member Author

So I think this is a bit closer now. I've updated the logic to use a max delay instead of a max number of retries, and I've added some admittedly less-than-ideal logic to ensure that a single failure doesn't doom a Service to slow syncs forever. I've been testing this with kind and e2e clusters to try to understand the behavior of an informer, and it's been frustratingly inconsistent: sometimes adds or updates take significantly longer than my 3 second max delay to show up. At this point that is purely anecdotal and I don't have sufficient data/evidence to back it up. I'll keep working on figuring out exactly what is going on here, but would certainly appreciate a second set of eyes on this in case I'm missing something obvious, or even just if this direction doesn't actually make sense.

@robscott
Member Author

robscott commented Jul 31, 2020

I talked a bit with @wojtek-t this morning and he suggested trying a larger master. I've moved this up to an n2-standard-4 and am only scaling up a 100-pod deployment. Here's a copy of the kube-controller-manager logs from that (using the debug logging added in this PR). The good news is that this approach generally seems to work. The bad news is that even with a larger master I'm still missing cache updates. The key lines in these logs are TRACKER UPDATE (the controller has added/updated an EndpointSlice and the tracker has tracked that update) and CACHE UPDATED FOR SLICE (the controller has received an EndpointSlice add/update event).
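For reference, the CACHE UPDATED FOR SLICE side is conceptually something like this (a simplified sketch with an illustrative tracker interface, not the exact code in this PR):

```go
package endpointslicesketch

import discovery "k8s.io/api/discovery/v1beta1"

// tracker is a stand-in for the endpointSliceTracker; only the idea matters here.
type tracker interface {
	// Has reports whether the controller is expecting a cache update for this slice.
	Has(slice *discovery.EndpointSlice) bool
	// MarkCacheUpdated records that the informer cache now reflects the slice.
	MarkCacheUpdated(slice *discovery.EndpointSlice)
}

// onEndpointSliceAddOrUpdate is what the informer add/update handlers would
// call: if the controller wrote this slice, note that the cache has caught up.
func onEndpointSliceAddOrUpdate(t tracker, obj interface{}) {
	slice, ok := obj.(*discovery.EndpointSlice)
	if !ok {
		return
	}
	if t.Has(slice) {
		t.MarkCacheUpdated(slice)
	}
}
```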

There are some examples where this works exactly as expected:
https://gist.github.com/robscott/df1832cd9a6d5a69adeccfa2830edf9a#file-kube-controller-manager-log-L427-L434
https://gist.github.com/robscott/df1832cd9a6d5a69adeccfa2830edf9a#file-kube-controller-manager-log-L488-L505

There are also other examples where the cache still takes >3s to be updated though:
~3.2s: https://gist.github.com/robscott/df1832cd9a6d5a69adeccfa2830edf9a#file-kube-controller-manager-log-L451-L489
~5s: https://gist.github.com/robscott/df1832cd9a6d5a69adeccfa2830edf9a#file-kube-controller-manager-log-L506-L526

Although this code is still quite messy, this is still the best solution I can think of. Unless @liggitt or @wojtek-t have better ideas I'll try to get this cleaned up and add some tests so this can actually be ready for review.

@robscott robscott marked this pull request as ready for review July 31, 2020 21:59
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 31, 2020
@robscott robscott marked this pull request as draft July 31, 2020 22:00
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 31, 2020
@robscott
Member Author

robscott commented Aug 1, 2020

Added some additional test coverage; going to move this out of draft mode so I can get full e2e tests on this. This breaks some EndpointSlice controller tests added by @mborsz around Pod batching, since those are heavily dependent on timing, so at least those will fail. I think there's a way to get those to work with this change; I just haven't figured it out yet.

@robscott robscott marked this pull request as ready for review August 1, 2020 01:18
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 1, 2020
@robscott robscott force-pushed the endpointslice-wait branch from 23d8208 to 8bb6b10 Compare August 5, 2020 01:04
@robscott
Member Author

robscott commented Aug 5, 2020

I think this PR is ready for an initial round of review. I ended up updating the pod batch timing tests to essentially bypass this functionality by setting a maxCacheDelay of 0 seconds. Unfortunately, since those tests are heavily dependent on timing and this change introduces some inconsistency in when syncService() will be called, any non-zero value for maxCacheDelay could have introduced flakes here.

As much as I would appreciate review on this PR now, I know that I still need to add test coverage for MarkServiceCacheUpdated and ServiceCacheUpdatedSince. Additionally, if this new approach makes sense we should likely apply the same updates to the EndpointSliceMirroring controller. At that point my work towards moving this out to a separate package that can be referenced by both controllers becomes relevant. Adding a new package does seem like a bit of a stretch for a bug fix though.
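For context, those two helpers are roughly along these lines (a sketch only; the struct and locking details are assumptions, just the method names come from this PR):

```go
package endpointslicesketch

import (
	"sync"
	"time"
)

type serviceCacheState struct {
	lock    sync.Mutex
	updated map[string]time.Time // service key -> time the cache last caught up
}

// MarkServiceCacheUpdated records that the informer cache has caught up with
// the controller's writes for the given Service.
func (s *serviceCacheState) MarkServiceCacheUpdated(serviceKey string) {
	s.lock.Lock()
	defer s.lock.Unlock()
	if s.updated == nil {
		s.updated = map[string]time.Time{}
	}
	s.updated[serviceKey] = time.Now()
}

// ServiceCacheUpdatedSince reports whether the cache has caught up for the
// Service at or after the given time.
func (s *serviceCacheState) ServiceCacheUpdatedSince(serviceKey string, t time.Time) bool {
	s.lock.Lock()
	defer s.lock.Unlock()
	u, ok := s.updated[serviceKey]
	return ok && !u.Before(t)
}
```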

With that said, my tests have shown that this does fix #93374 and the related bugs. I know this is not an ideal solution though, so if anyone has other ideas I'm very open to them.

/assign @liggitt @freehan @wojtek-t

@robscott
Member Author

robscott commented Aug 5, 2020

/priority critical-urgent

@robscott
Member Author

Talked with @bowei this morning about this some more and he pointed out that this specific PR is more of an optimization than a correction. With that said, I think I'm OK with leaving this specific PR/optimization problem for later and making the following smaller fixes:

  • Update the flaky test to consider duplicate EndpointSlices valid.
  • Update the docs to make it very clear that consumers of the EndpointSlice API will have to do deduplication (as kube-proxy already does); see the sketch after this list.
  • Update the controller error handling to return immediately if a create, update, or delete errors (in that order).
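
For the second bullet, consumer-side deduplication would look something like this (an illustrative sketch, not kube-proxy's actual implementation):

```go
package endpointslicesketch

import discovery "k8s.io/api/discovery/v1beta1"

// dedupedReadyAddresses merges the ready endpoint addresses from all
// EndpointSlices of a Service and drops duplicates, so a transient duplicate
// slice from the controller doesn't result in duplicate endpoints.
func dedupedReadyAddresses(slices []*discovery.EndpointSlice) []string {
	seen := map[string]bool{}
	var addrs []string
	for _, slice := range slices {
		for _, ep := range slice.Endpoints {
			if ep.Conditions.Ready != nil && !*ep.Conditions.Ready {
				continue
			}
			for _, addr := range ep.Addresses {
				if !seen[addr] {
					seen[addr] = true
					addrs = append(addrs, addr)
				}
			}
		}
	}
	return addrs
}
```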

I'm hoping to have PRs for all of the above ready later today.

@robscott
Member Author

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Aug 11, 2020
@robscott
Member Author

/remove-priority critical-urgent

@liggitt
Member

liggitt commented Aug 13, 2020

/milestone v1.20

@k8s-ci-robot k8s-ci-robot modified the milestones: v1.19, v1.20 Aug 13, 2020
@liggitt liggitt removed this from the v1.20 milestone Aug 27, 2020
@liggitt liggitt added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 27, 2020
@liggitt
Member

liggitt commented Aug 27, 2020

explicitly holding until we decide if we want to go with this approach

@k8s-ci-robot
Contributor

@robscott: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 17, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 16, 2020
@thockin
Member

thockin commented Dec 16, 2020 via email

@robscott
Member Author

Thanks for the reminder on this one @thockin! This specific approach is likely not ideal, but it does seem like we'll need some optimizations here. At present, we're doing a lot of unnecessary work due to the controller operating with an out of date cache. Although certainly not elegant, I believe even this PR would represent an improvement over the status quo. I think @lavalamp's suggestion above to try to use/implement something resembling a write-through cache could be helpful.

Fundamentally I don't think we want to move away from the informer cache as the source of truth here, so we just need to find a way to reduce the odds that we'll run a sync before that cache has caught up. Those odds are currently rather high during a rolling update, and I don't think it would take a very complex solution to provide a significant improvement here.

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 18, 2020
@aojea
Member

aojea commented Jan 11, 2021

A cheap hack is to at least make actions deterministic enough that they conflict

@robscott if the endpoint slice name is deterministic, will this be solved?

kubectl get endpointslices -o wide
NAME       ADDRESSTYPE   PORTS     ENDPOINTS   AGE
test-001   IPv4          <unset>   <unset>     19d
test-002   IPv4          <unset>   <unset>     19d

@robscott
Member Author

Unfortunately I don't think the name itself is sufficient. If I'm remembering the problem correctly, we're also seeing a large number of these warnings from the controller attempting to update a stale copy of an EndpointSlice.
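
To make the stale-copy part concrete: the update fails because the copy read from the informer cache is older than what the controller last wrote. A trivial sketch of that check (a hypothetical helper, not code from this PR):

```go
package endpointslicesketch

import discovery "k8s.io/api/discovery/v1beta1"

// cachedCopyIsStale reports whether the cached slice is not the version the
// controller last wrote (lastWrittenRV); updating from such a copy fails with
// a conflict, producing the warnings mentioned above.
func cachedCopyIsStale(cached *discovery.EndpointSlice, lastWrittenRV string) bool {
	return cached.ResourceVersion != lastWrittenRV
}
```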

@robscott
Member Author

Created a new PR that takes a bit of a simpler approach here: #99345. Closing this one in favor of that.

/close

@k8s-ci-robot
Contributor

@robscott: Closed this PR.

In response to this:

Created a new PR that takes a bit of a simpler approach here: #99345. Closing this one in favor of that.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@robscott robscott deleted the endpointslice-wait branch March 11, 2021 04:56
Successfully merging this pull request may close these issues:
[Flaky test] EndpointSlice should create Endpoints and EndpointSlices for Pods matching a Service