Updating EndpointSlice controller to wait for cache to be updated #99345
Conversation
@robscott: GitHub didn't allow me to request PR reviews from the following users: swetharepakula. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: robscott. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/priority important-soon
Force-pushed from 3129cb8 to d9a1834
/retest
I've only looked at the non-test code so far, but I really like this approach.
@@ -280,6 +280,12 @@ func (c *Controller) handleErr(err error, key interface{}) {
		return
	}

	if isStaleInformerCacheErr(err) {
I'm very ignorant in the workqueue area, but should we check the NumRequeues too?
Good question - I added a comment here to clarify that this was intentional. I think we'd want to wait indefinitely for the informer cache to update but that may be naive. @lavalamp do you have any tips here?
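For context, a minimal sketch of the special case being discussed here (and removed later in this thread): stale-cache errors are re-queued without being counted against the retry limit, so the controller effectively keeps waiting for the informer cache to catch up. Everything other than the workqueue calls is simplified or hypothetical (the error sentinel, maxRetries, the trimmed-down controller struct); this is not the code from this PR.

```go
package endpointslice

import (
	"errors"

	"k8s.io/client-go/util/workqueue"
	"k8s.io/klog/v2"
)

const maxRetries = 15 // hypothetical retry budget

// errStaleInformerCache stands in for "an EndpointSlice update we made has not
// shown up in the informer cache yet" (name invented for this sketch).
var errStaleInformerCache = errors.New("EndpointSlice informer cache is out of date")

func isStaleInformerCacheErr(err error) bool {
	return errors.Is(err, errStaleInformerCache)
}

// controller is a trimmed-down stand-in for the real EndpointSlice controller.
type controller struct {
	queue workqueue.RateLimitingInterface
}

func (c *controller) handleErr(err error, key interface{}) {
	if err == nil {
		c.queue.Forget(key)
		return
	}

	// Stale cache: re-queue and wait for the informer to catch up. This
	// intentionally skips the NumRequeues check, which is what the question
	// above is asking about.
	if isStaleInformerCacheErr(err) {
		c.queue.AddRateLimited(key)
		return
	}

	if c.queue.NumRequeues(key) < maxRetries {
		klog.Warningf("Error syncing endpoint slices for service %q, retrying: %v", key, err)
		c.queue.AddRateLimited(key)
		return
	}

	klog.Warningf("Dropping service %q out of the queue: %v", key, err)
	c.queue.Forget(key)
}
```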
If the informer cache can't update you'll have bigger problems. That kind of problem should be caught by monitoring the metrics.
@lavalamp What kind of metric could I expose here? If I simply increment a counter every time we run into this error I think we'd end up with a lot of noise. Maybe a metric that tracks every time we go past max retries and give up for a Service?
Update - I've removed this extra check and this error will not get any kind of special treatment.
If the only possible cause is the informer not having received an update, that will be reflected in existing metrics (it will be missing all updates in that case). If there's any chance of another kind of error, my first thought is to make two gauge metrics, one that counts the number of items in this condition, and one that adds up how many seconds those items have been in the condition. That way you can tell the difference between 1 persistently bad item and having a different item in this condition every time (which is not a problem).
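A rough sketch of what those two gauges could look like (all metric and type names invented for illustration; this is not code from the PR). A small map records when each service entered the stale-cache state, and a periodic refresh recomputes both gauges, which is what the goroutine question below is getting at.

```go
package endpointslicemetrics

import (
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	// Both metric names are placeholders for this sketch.
	staleServices = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "endpoint_slice_controller_stale_services",
		Help: "Number of services currently waiting on a stale informer cache.",
	})
	staleServiceSeconds = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "endpoint_slice_controller_stale_service_seconds",
		Help: "Combined time the currently-stale services have spent waiting, in seconds.",
	})
)

func init() {
	prometheus.MustRegister(staleServices, staleServiceSeconds)
}

// staleTracker remembers when each service key first hit the stale-cache condition.
type staleTracker struct {
	mu    sync.Mutex
	since map[string]time.Time
}

func newStaleTracker() *staleTracker {
	return &staleTracker{since: map[string]time.Time{}}
}

func (t *staleTracker) markStale(key string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, ok := t.since[key]; !ok {
		t.since[key] = time.Now()
	}
}

func (t *staleTracker) clear(key string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.since, key)
}

// refresh recomputes both gauges: one persistently stale service keeps pushing
// the seconds gauge up, while churn across many different services does not.
func (t *staleTracker) refresh() {
	t.mu.Lock()
	defer t.mu.Unlock()
	total := 0.0
	for _, start := range t.since {
		total += time.Since(start).Seconds()
	}
	staleServices.Set(float64(len(t.since)))
	staleServiceSeconds.Set(total)
}
```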
I think right now the other potential cause of an issue here would be someone manually setting an EndpointSlice generation lower than a previous value. As long as we can seal that up with a separate PR to prevent that in the strategy logic, I don't think we'll need much in the way of extra metrics.
For the two gauge metrics would you start a separate goroutine that did these calculations at some kind of fixed interval for all Services?
I've been thinking about adding a new metric that would track the number of syncs the controller has performed, along with the outcome as a label with 3 potential values: success, staleCache, and error. That would not be as helpful in identifying outliers but would provide some broad insight into the health of the controller.
Since I'm interested in attempting to backport this, I think adding a metric directly to this PR would complicate that, so maybe this can be done as a follow up.
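For illustration only, the success/staleCache/error idea described above could look roughly like the sketch below; the metric name, label name, and helper are all invented here, and the PR itself deliberately leaves metrics to a follow-up.

```go
package endpointslicemetrics

import "github.com/prometheus/client_golang/prometheus"

// syncsTotal is a placeholder name for the per-sync outcome counter described
// above; none of these names come from the merged code.
var syncsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "endpoint_slice_controller_syncs_total",
		Help: "Number of EndpointSlice controller syncs, labeled by outcome.",
	},
	[]string{"result"}, // "success", "staleCache", or "error"
)

func init() {
	prometheus.MustRegister(syncsTotal)
}

// recordSync would be called at the end of each sync attempt.
func recordSync(err error, staleCache bool) {
	switch {
	case err == nil:
		syncsTotal.WithLabelValues("success").Inc()
	case staleCache:
		syncsTotal.WithLabelValues("staleCache").Inc()
	default:
		syncsTotal.WithLabelValues("error").Inc()
	}
}
```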
Force-pushed from 8c5a4b6 to 5f4c4c0
Force-pushed from 5f4c4c0 to e8fb4b4
After a bit more conversation with @wojtek-t I've moved away from RWMutex here.
Just to explain for posterity - RWMutex is basically more expensive (from a CPU cost POV), so unless it provides a visibly better experience (e.g. reduced e2e latency), it's not worth using. And given that we generally have QPS limits in the endpointslice controller, which is producing them (and the fact that the critical sections are super small), it doesn't make much sense.
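To illustrate the trade-off (types and names invented for this sketch): with critical sections that only touch a map entry, a plain sync.Mutex is the simpler and cheaper choice, and sync.RWMutex would only start to pay off with many concurrent readers holding the lock for longer.

```go
package endpointslice

import "sync"

// generationsBySlice is a sketch of a tiny tracker guarded by a plain Mutex.
// Each method only reads or writes one map entry, so the critical sections are
// very short and the extra bookkeeping of an RWMutex is not worth the cost.
type generationsBySlice struct {
	lock        sync.Mutex
	generations map[string]int64 // EndpointSlice name -> last generation written
}

func (g *generationsBySlice) set(name string, generation int64) {
	g.lock.Lock()
	defer g.lock.Unlock()
	g.generations[name] = generation
}

func (g *generationsBySlice) get(name string) (int64, bool) {
	g.lock.Lock()
	defer g.lock.Unlock()
	gen, ok := g.generations[name]
	return gen, ok
}
```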
/lgtm Thanks!
@robscott: The following tests failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/retest
…dated This matches the recent updates to the EndpointSliceTracker for the EndpointSlice controller in kubernetes#99345 that accomplished the same thing.
…345-release-1.18 Automated cherry pick of #99345: Updating EndpointSlice controller to wait for cache to be updated
…345-release-1.20 Automated cherry pick of #99345: Updating EndpointSlice controller to wait for cache to be updated
…345-release-1.19 Automated cherry pick of #99345: Updating EndpointSlice controller to wait for cache to be updated
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
This updates the EndpointSlice controller to make use of the EndpointSlice tracker to identify when expected changes are not present in the cache yet. If this is detected, the controller will wait to sync until all expected updates have been received. This should help avoid race conditions that would result in duplicate EndpointSlices or failed attempts to update stale EndpointSlices. To simplify this logic, this also moves the EndpointSlice tracker from relying on resource versions to generations.
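As a rough sketch of the core idea (simplified types, hypothetical names, no locking shown; this is not the actual tracker API from the PR): the controller remembers the generation it last wrote for each EndpointSlice, and a sync bails out with an error, and therefore re-queues, whenever the informer cache still shows an older generation.

```go
package endpointslice

import (
	"fmt"

	discovery "k8s.io/api/discovery/v1beta1"
)

// expectedGenerations maps EndpointSlice name to the generation the controller
// expects to eventually observe through its informer (simplified to a single
// service; names are invented for this sketch).
type expectedGenerations map[string]int64

// hasStaleSlices reports whether any cached slice is older than what the
// controller last wrote for it, i.e. the informer has not caught up yet.
func (e expectedGenerations) hasStaleSlices(cached []*discovery.EndpointSlice) bool {
	for _, slice := range cached {
		if expected, ok := e[slice.Name]; ok && slice.Generation < expected {
			return true
		}
	}
	return false
}

// syncService shows where the check would sit: returning an error re-queues
// the service, so the real reconcile work only runs against an up-to-date cache.
func syncService(expected expectedGenerations, cached []*discovery.EndpointSlice) error {
	if expected.hasStaleSlices(cached) {
		return fmt.Errorf("EndpointSlice informer cache is out of date, retrying")
	}
	// ... reconcile EndpointSlices for the service here ...
	return nil
}
```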
Which issue(s) this PR fixes:
Fixes #92928
Special notes for your reviewer:
This is a follow up to #93520 that shares some similar concepts but is simpler thanks to some tips from @lavalamp. The EndpointSlice mirroring controller will need an identical change. The EndpointSlice trackers in each controller should really be merged, but I didn't want to bundle that into this change for the sake of simplicity and potentially ease of backporting.
Does this PR introduce a user-facing change?
/sig network
/cc @bowei @swetharepakula @aojea
/assign @lavalamp