Delaying EndpointSlice controller syncs until cache is updated #93520
Conversation
Skipping CI for Draft Pull Request.
I think this is a bit closer now. I've updated the logic to use a max delay instead of a max number of retries, and I've added some admittedly less-than-ideal logic to ensure that a single failure doesn't doom a Service to slow syncs forever. I've been testing this with kind and e2e clusters to try to understand informer behavior, and it's been frustratingly inconsistent: sometimes adds or updates take significantly longer than my 3 second max delay to show up in the cache. At this point that's purely anecdotal and I don't have sufficient data/evidence to back it up. I'll keep trying to figure out exactly what's going on here, but I'd certainly appreciate a second set of eyes on this in case I'm missing something obvious, or in case this direction doesn't actually make sense.
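For reference, a minimal sketch of how a "max delay instead of max retries" gate could look, assuming a per-Service record of when the first deferral happened. The names (staleSyncTracker, maxSyncDelay) and the 3s constant are illustrative, not the actual controller code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const maxSyncDelay = 3 * time.Second

// staleSyncTracker records when the controller first deferred a sync for a
// Service because its cache looked out of date.
type staleSyncTracker struct {
	mu            sync.Mutex
	firstDeferral map[string]time.Time // keyed by Service namespace/name
}

func newStaleSyncTracker() *staleSyncTracker {
	return &staleSyncTracker{firstDeferral: map[string]time.Time{}}
}

// shouldDefer reports whether a sync for the Service should be requeued. A
// fresh cache clears any recorded deferral so that one slow cache update does
// not doom the Service to slow syncs forever.
func (t *staleSyncTracker) shouldDefer(service string, cacheIsStale bool) bool {
	t.mu.Lock()
	defer t.mu.Unlock()

	if !cacheIsStale {
		delete(t.firstDeferral, service)
		return false
	}
	first, ok := t.firstDeferral[service]
	if !ok {
		t.firstDeferral[service] = time.Now()
		return true
	}
	if time.Since(first) >= maxSyncDelay {
		// Waited long enough; sync with whatever the cache has.
		delete(t.firstDeferral, service)
		return false
	}
	return true
}

func main() {
	tr := newStaleSyncTracker()
	fmt.Println(tr.shouldDefer("default/my-svc", true))  // true: cache stale, start waiting
	fmt.Println(tr.shouldDefer("default/my-svc", false)) // false: cache caught up, deferral cleared
}
```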
I talked a bit with @wojtek-t this morning and he suggested trying a larger master, so I've moved this up to a larger one. There are some examples where this works exactly as expected, but there are also other examples where the cache still takes >3s to be updated. Although this code is still quite messy, it's the best solution I can think of. Unless @liggitt or @wojtek-t have better ideas, I'll try to get this cleaned up and add some tests so it can actually be ready for review.
Added some additional test coverage; moving this out of draft mode so I can get full e2e tests running on it. This breaks some EndpointSlice controller tests that @mborsz added around Pod batching, so at least those will fail: they're heavily dependent on timing, and this change disrupts that. I think there's a way to get them to work with this; I just haven't figured it out yet.
Force-pushed from 23d8208 to 8bb6b10.
I think this PR is ready for an initial round of review. I ended up updating the pod batch timing tests to essentially bypass this functionality. As much as I would appreciate review on this PR now, I know that I still need to add more test coverage. With that said, my tests have shown that this does fix #93374 and the related bugs. I know this is not an ideal solution, though, so if anyone has other ideas I'm very open to them.
/priority critical-urgent
I talked with @bowei this morning about this some more, and he pointed out that this specific PR is more of an optimization than a correction. With that said, I think I'm OK with leaving this optimization for later and making some smaller fixes first. I'm hoping to have PRs for all of those ready later today.
/priority important-longterm
/remove-priority critical-urgent
/milestone v1.20
explicitly holding until we decide if we want to go with this approach
@robscott: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Rob - keep or close?
Thanks for the reminder on this one @thockin! This specific approach is likely not ideal, but it does seem like we'll need some optimizations here. At present we're doing a lot of unnecessary work because the controller operates with an out-of-date cache. Although certainly not elegant, I believe even this PR would be an improvement over the status quo. I think @lavalamp's suggestion above to use or implement something resembling a write-through cache could be helpful. Fundamentally I don't think we want to move away from the informer cache as the source of truth here, so we just need to find a way to reduce the odds that we'll run a sync before that cache has caught up. Those odds are currently rather high during a rolling update, and I don't think it would take a very complex solution to provide a significant improvement here.
/remove-lifecycle stale
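As a rough illustration of the write-through-style idea, here is a small sketch that assumes the controller records the resourceVersion of each EndpointSlice it writes and only treats the informer cache as caught up once the cached copies match. The expectedState type and its methods are invented for this example, not the real endpointSliceTracker API:

```go
package main

import "fmt"

// expectedState remembers the resourceVersion of every EndpointSlice the
// controller has written; the cache counts as caught up only when the
// informer's copies match those versions.
type expectedState struct {
	written map[string]string // slice namespace/name -> resourceVersion we wrote
}

// recordWrite is called after a successful create or update.
func (e *expectedState) recordWrite(key, resourceVersion string) {
	e.written[key] = resourceVersion
}

// cacheCaughtUp reports whether the informer cache (key -> resourceVersion)
// reflects every write the controller made.
func (e *expectedState) cacheCaughtUp(cached map[string]string) bool {
	for key, rv := range e.written {
		if cached[key] != rv {
			return false // informer still has an older or missing copy
		}
	}
	return true
}

func main() {
	e := &expectedState{written: map[string]string{}}
	e.recordWrite("default/my-svc-abc12", "1005")
	fmt.Println(e.cacheCaughtUp(map[string]string{"default/my-svc-abc12": "1002"})) // false
	fmt.Println(e.cacheCaughtUp(map[string]string{"default/my-svc-abc12": "1005"})) // true
}
```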
@robscott if the endpoint slice name is deterministic, will this be solved?
Unfortunately I don't think the name itself is sufficient. If I'm remembering the problem correctly, we're also seeing a large number of these warnings from the controller attempting to update a stale copy of an EndpointSlice.
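For context, a toy model of why those warnings appear, assuming standard optimistic concurrency on resourceVersion. This is not the apiserver's actual code, just an illustration of an update built from a stale informer copy being rejected with a conflict:

```go
package main

import (
	"errors"
	"fmt"
)

var errConflict = errors.New("conflict: the object has been modified; please apply your changes to the latest version")

// apiServer is a toy stand-in that enforces optimistic concurrency the way
// the real API server does: an update carrying an old resourceVersion fails.
type apiServer struct {
	resourceVersions map[string]string // EndpointSlice name -> current resourceVersion
}

func (s *apiServer) update(name, clientResourceVersion string) error {
	if s.resourceVersions[name] != clientResourceVersion {
		// The controller built this update from a stale informer copy.
		return errConflict
	}
	s.resourceVersions[name] = clientResourceVersion + "1" // toy "bump"
	return nil
}

func main() {
	s := &apiServer{resourceVersions: map[string]string{"my-svc-abc12": "1005"}}
	// The informer cache still holds resourceVersion 1002, so the update fails.
	fmt.Println(s.update("my-svc-abc12", "1002")) // conflict error
	fmt.Println(s.update("my-svc-abc12", "1005")) // <nil>
}
```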
Created a new PR that takes a bit of a simpler approach here: #99345. Closing this one in favor of that.
/close
@robscott: Closed this PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What type of PR is this?
/kind bug
/kind failing-test
What this PR does / why we need it:
This updates the EndpointSlice controller to wait until the cache is updated (or a max delay of 3 seconds is exceeded) before running a sync for a Service. This should help avoid race conditions that would result in duplicate EndpointSlices or failed attempts to update stale EndpointSlices.
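To make the race concrete, here is a toy model (not the controller's real logic) of how a stale cache can lead to a duplicate EndpointSlice: the controller's write succeeds immediately, but the informer cache lags, so the next sync sees no slice and creates another one.

```go
package main

import "fmt"

// controllerState is a toy model of the race: writes reach the "API server"
// immediately, but the informer cache only reflects them after a delay.
type controllerState struct {
	apiServer     []string // slices that actually exist
	informerCache []string // slices the controller believes exist
	counter       int
}

// sync creates a slice whenever the (possibly stale) cache shows none.
func (c *controllerState) sync(service string) {
	if len(c.informerCache) == 0 {
		c.counter++
		name := fmt.Sprintf("%s-%d", service, c.counter)
		c.apiServer = append(c.apiServer, name) // write succeeds immediately
		// ...but the informer cache has not caught up yet.
	}
}

func main() {
	c := &controllerState{}
	c.sync("my-svc") // creates my-svc-1
	c.sync("my-svc") // cache still empty, so a duplicate my-svc-2 is created
	fmt.Println(c.apiServer) // [my-svc-1 my-svc-2]
}
```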
Which issue(s) this PR fixes:
This will eventually fix #93374 and #92928.
Special notes for your reviewer:
This is very much still a WIP. I filed this early to get some initial feedback and to see if I'm working in a reasonable direction. I don't love this approach but haven't thought of a better one yet, so I'm very open to alternative ideas here.
I'd initially hoped that I could rely on the retry mechanism in the queue to simplify this implementation, but that simply doesn't work because so many events can trigger a new sync. Currently I'm planning to update the endpointSliceTracker to also track the last time the controller modified an EndpointSlice for each Service, and then update the logic to bypass the outdated-cache check after a set number of seconds have passed since the last update. Other significant TODOs here include adding test coverage and ensuring that a cache being outdated at one point in time does not forever doom a Service to slow sync times.
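A rough sketch of that plan, with invented names (sliceWriteTracker, maxCacheStaleness) rather than the actual endpointSliceTracker fields: record when the controller last wrote a slice for each Service, and bypass the outdated-cache check once enough time has passed since that write.

```go
package main

import (
	"fmt"
	"time"
)

const maxCacheStaleness = 3 * time.Second

// sliceWriteTracker remembers when the controller last wrote an EndpointSlice
// for each Service. Field and function names are invented for this sketch.
type sliceWriteTracker struct {
	lastWrite map[string]time.Time // Service namespace/name -> last controller write
}

// markWritten is called whenever the controller creates or updates a slice.
func (t *sliceWriteTracker) markWritten(service string, now time.Time) {
	t.lastWrite[service] = now
}

// shouldTrustCache reports whether a sync should proceed even if the cache
// looks outdated: once enough time has passed since our last write, waiting
// any longer is unlikely to help.
func (t *sliceWriteTracker) shouldTrustCache(service string, now time.Time) bool {
	last, ok := t.lastWrite[service]
	if !ok {
		return true // we never wrote anything, so the cache cannot lag our writes
	}
	return now.Sub(last) >= maxCacheStaleness
}

func main() {
	t := &sliceWriteTracker{lastWrite: map[string]time.Time{}}
	now := time.Now()
	t.markWritten("default/my-svc", now)
	fmt.Println(t.shouldTrustCache("default/my-svc", now.Add(time.Second)))   // false: keep waiting
	fmt.Println(t.shouldTrustCache("default/my-svc", now.Add(5*time.Second))) // true: bypass the check
}
```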
Does this PR introduce a user-facing change?:
/sig network
/cc @liggitt @freehan