Updating EndpointSlice controller to wait for cache to be updated #99345
Conversation
@robscott: GitHub didn't allow me to request PR reviews from the following users: swetharepakula. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: robscott. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/priority important-soon
Force-pushed from 3129cb8 to d9a1834
/retest
I've only looked at the non-test code so far, but I really like this approach.
@@ -280,6 +280,12 @@ func (c *Controller) handleErr(err error, key interface{}) {
		return
	}

	if isStaleInformerCacheErr(err) {
I'm very ignorant in the workqueue area, but should we check the NumRequeues too?
Good question - I added a comment here to clarify that this was intentional. I think we'd want to wait indefinitely for the informer cache to update but that may be naive. @lavalamp do you have any tips here?
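For context, a minimal sketch of the special case being discussed here (and removed later in this thread): stale-cache errors are re-queued without being counted against the retry limit, so the controller effectively keeps waiting for the informer cache to catch up. Everything other than the workqueue calls is simplified or hypothetical (the error sentinel, maxRetries, the trimmed-down controller struct); this is not the code from this PR.

```go
package endpointslice

import (
	"errors"

	"k8s.io/client-go/util/workqueue"
	"k8s.io/klog/v2"
)

const maxRetries = 15 // hypothetical retry budget

// errStaleInformerCache stands in for "an EndpointSlice update we made has not
// shown up in the informer cache yet" (name invented for this sketch).
var errStaleInformerCache = errors.New("EndpointSlice informer cache is out of date")

func isStaleInformerCacheErr(err error) bool {
	return errors.Is(err, errStaleInformerCache)
}

// controller is a trimmed-down stand-in for the real EndpointSlice controller.
type controller struct {
	queue workqueue.RateLimitingInterface
}

func (c *controller) handleErr(err error, key interface{}) {
	if err == nil {
		c.queue.Forget(key)
		return
	}

	// Stale cache: re-queue and wait for the informer to catch up. This
	// intentionally skips the NumRequeues check, which is what the question
	// above is asking about.
	if isStaleInformerCacheErr(err) {
		c.queue.AddRateLimited(key)
		return
	}

	if c.queue.NumRequeues(key) < maxRetries {
		klog.Warningf("Error syncing endpoint slices for service %q, retrying: %v", key, err)
		c.queue.AddRateLimited(key)
		return
	}

	klog.Warningf("Dropping service %q out of the queue: %v", key, err)
	c.queue.Forget(key)
}
```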
If the informer cache can't update you'll have bigger problems. That kind of problem should be caught by monitoring the metrics.
@lavalamp What kind of metric could I expose here? If I simply increment a counter every time we run into this error I think we'd end up with a lot of noise. Maybe a metric that tracks every time we go past max retries and give up for a Service?
Update - I've removed this extra check and this error will not get any kind of special treatment.
If the only possible cause is the informer not having received an update, that will be reflected in existing metrics (it will be missing all updates in that case). If there's any chance of another kind of error, my first thought is to make two gauge metrics, one that counts the number of items in this condition, and one that adds up how many seconds those items have been in the condition. That way you can tell the difference between 1 persistently bad item and having a different item in this condition every time (which is not a problem).
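A rough sketch of what those two gauges could look like (all metric and type names invented for illustration; this is not code from the PR). A small map records when each service entered the stale-cache state, and a periodic refresh recomputes both gauges, which is what the goroutine question below is getting at.

```go
package endpointslicemetrics

import (
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	// Both metric names are placeholders for this sketch.
	staleServices = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "endpoint_slice_controller_stale_services",
		Help: "Number of services currently waiting on a stale informer cache.",
	})
	staleServiceSeconds = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "endpoint_slice_controller_stale_service_seconds",
		Help: "Combined time the currently-stale services have spent waiting, in seconds.",
	})
)

func init() {
	prometheus.MustRegister(staleServices, staleServiceSeconds)
}

// staleTracker remembers when each service key first hit the stale-cache condition.
type staleTracker struct {
	mu    sync.Mutex
	since map[string]time.Time
}

func newStaleTracker() *staleTracker {
	return &staleTracker{since: map[string]time.Time{}}
}

func (t *staleTracker) markStale(key string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, ok := t.since[key]; !ok {
		t.since[key] = time.Now()
	}
}

func (t *staleTracker) clear(key string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.since, key)
}

// refresh recomputes both gauges: one persistently stale service keeps pushing
// the seconds gauge up, while churn across many different services does not.
func (t *staleTracker) refresh() {
	t.mu.Lock()
	defer t.mu.Unlock()
	total := 0.0
	for _, start := range t.since {
		total += time.Since(start).Seconds()
	}
	staleServices.Set(float64(len(t.since)))
	staleServiceSeconds.Set(total)
}
```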
I think right now the other potential cause of an issue here would be someone manually setting an EndpointSlice generation lower than a previous value. As long as we can seal that up with a separate PR to prevent that in the strategy logic, I don't think we'll need much in the way of extra metrics.
For the two gauge metrics would you start a separate goroutine that did these calculations at some kind of fixed interval for all Services?
I've been thinking about adding a new metric that would track the number of syncs the controller has performed, along with the outcome as a label with 3 potential values: success, staleCache, and error. That would not be as helpful in identifying outliers but would provide some broad insight into the health of the controller.
Since I'm interested in attempting to backport this, I think adding a metric directly to this PR would complicate that, so maybe this can be done as a follow up.
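For illustration only, the success/staleCache/error idea described above could look roughly like the sketch below; the metric name, label name, and helper are all invented here, and the PR itself deliberately leaves metrics to a follow-up.

```go
package endpointslicemetrics

import "github.com/prometheus/client_golang/prometheus"

// syncsTotal is a placeholder name for the per-sync outcome counter described
// above; none of these names come from the merged code.
var syncsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "endpoint_slice_controller_syncs_total",
		Help: "Number of EndpointSlice controller syncs, labeled by outcome.",
	},
	[]string{"result"}, // "success", "staleCache", or "error"
)

func init() {
	prometheus.MustRegister(syncsTotal)
}

// recordSync would be called at the end of each sync attempt.
func recordSync(err error, staleCache bool) {
	switch {
	case err == nil:
		syncsTotal.WithLabelValues("success").Inc()
	case staleCache:
		syncsTotal.WithLabelValues("staleCache").Inc()
	default:
		syncsTotal.WithLabelValues("error").Inc()
	}
}
```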
Force-pushed from 8c5a4b6 to 5f4c4c0
Force-pushed from 5f4c4c0 to e8fb4b4
After a bit more conversation with @wojtek-t I've moved away from RWMutex here.
Just to explain for posterity - RWMutex is basically more expensive (from a CPU cost POV), so unless it provides a visibly better experience (e.g. reduced e2e latency), it's not worth using. And given that we generally have QPS limits in the endpointslice controller, which is producing them (and the fact that the critical sections are super small), it doesn't make much sense.
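To illustrate the trade-off (types and names invented for this sketch): with critical sections that only touch a map entry, a plain sync.Mutex is the simpler and cheaper choice, and sync.RWMutex would only start to pay off with many concurrent readers holding the lock for longer.

```go
package endpointslice

import "sync"

// generationsBySlice is a sketch of a tiny tracker guarded by a plain Mutex.
// Each method only reads or writes one map entry, so the critical sections are
// very short and the extra bookkeeping of an RWMutex is not worth the cost.
type generationsBySlice struct {
	lock        sync.Mutex
	generations map[string]int64 // EndpointSlice name -> last generation written
}

func (g *generationsBySlice) set(name string, generation int64) {
	g.lock.Lock()
	defer g.lock.Unlock()
	g.generations[name] = generation
}

func (g *generationsBySlice) get(name string) (int64, bool) {
	g.lock.Lock()
	defer g.lock.Unlock()
	gen, ok := g.generations[name]
	return gen, ok
}
```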
/lgtm Thanks!
@robscott: The following tests failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/retest
…dated This matches the recent updates to the EndpointSliceTracker for the EndpointSlice controller in kubernetes#99345 that accomplished the same thing.
…345-release-1.18 Automated cherry pick of #99345: Updating EndpointSlice controller to wait for cache to be updated
…345-release-1.20 Automated cherry pick of #99345: Updating EndpointSlice controller to wait for cache to be updated
…345-release-1.19 Automated cherry pick of #99345: Updating EndpointSlice controller to wait for cache to be updated
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
This updates the EndpointSlice controller to make use of the EndpointSlice tracker to identify when expected changes are not present in the cache yet. If this is detected, the controller will wait to sync until all expected updates have been received. This should help avoid race conditions that would result in duplicate EndpointSlices or failed attempts to update stale EndpointSlices. To simplify this logic, this also moves the EndpointSlice tracker from relying on resource versions to generations.
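As a rough sketch of the core idea (simplified types, hypothetical names, no locking shown; this is not the actual tracker API from the PR): the controller remembers the generation it last wrote for each EndpointSlice, and a sync bails out with an error, and therefore re-queues, whenever the informer cache still shows an older generation.

```go
package endpointslice

import (
	"fmt"

	discovery "k8s.io/api/discovery/v1beta1"
)

// expectedGenerations maps EndpointSlice name to the generation the controller
// expects to eventually observe through its informer (simplified to a single
// service; names are invented for this sketch).
type expectedGenerations map[string]int64

// hasStaleSlices reports whether any cached slice is older than what the
// controller last wrote for it, i.e. the informer has not caught up yet.
func (e expectedGenerations) hasStaleSlices(cached []*discovery.EndpointSlice) bool {
	for _, slice := range cached {
		if expected, ok := e[slice.Name]; ok && slice.Generation < expected {
			return true
		}
	}
	return false
}

// syncService shows where the check would sit: returning an error re-queues
// the service, so the real reconcile work only runs against an up-to-date cache.
func syncService(expected expectedGenerations, cached []*discovery.EndpointSlice) error {
	if expected.hasStaleSlices(cached) {
		return fmt.Errorf("EndpointSlice informer cache is out of date, retrying")
	}
	// ... reconcile EndpointSlices for the service here ...
	return nil
}
```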
Which issue(s) this PR fixes:
Fixes #92928
Special notes for your reviewer:
This is a follow up to #93520 that shares some similar concepts but is simpler thanks to some tips from @lavalamp. The EndpointSlice mirroring controller will need an identical change. The EndpointSlice trackers in each controller should really be merged, but I didn't want to bundle that into this change for the sake of simplicity and potentially ease of backporting.
Does this PR introduce a user-facing change?
/sig network
/cc @bowei @swetharepakula @aojea
/assign @lavalamp