
FailedToUpdateEndpointSlices Error updating Endpoint Slices for Service #92928

Closed
jijojv opened this issue Jul 9, 2020 · 29 comments · Fixed by #99345
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/network Categorizes an issue or PR as relevant to SIG Network. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@jijojv

jijojv commented Jul 9, 2020

What happened:
Hi folks, after every deployment we see this for about an hour. It seems to be harmless, but we're wondering whether this is a bug in v1.17.3.
kubectl describe svc my-svc

Events:
  Type     Reason                        Age   From                       Message
  ----     ------                        ----  ----                       -------
  Warning  FailedToUpdateEndpointSlices  35m   endpoint-slice-controller  Error updating Endpoint Slices for Service my-svc/my-app: Error updating my-app-h7q6v EndpointSlice for Service my-svc/my-app: Operation cannot be fulfilled on endpointslices.discovery.k8s.io "my-app-h7q6v": the object has been modified; please apply your changes to the latest version and try again

What you expected to happen:
Events: <none>

How to reproduce it (as minimally and precisely as possible):
kubectl rollout restart deployment my_deploy

Anything else we need to know?:

Environment:
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"archive", BuildDate:"2020-03-20T16:41:14Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}

@jijojv jijojv added the kind/bug Categorizes issue or PR as related to a bug. label Jul 9, 2020
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jul 9, 2020
@jijojv
Author

jijojv commented Jul 9, 2020

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 9, 2020
@liggitt
Member

liggitt commented Jul 9, 2020

/remove-sig api-machinery
/sig network
/assign @robscott

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Jul 9, 2020
@athenabot

/triage unresolved

Comment /remove-triage unresolved when the issue is assessed and confirmed.

🤖 I am a bot run by vllry. 👩‍🔬

@k8s-ci-robot k8s-ci-robot added the triage/unresolved Indicates an issue that can not or will not be resolved. label Jul 9, 2020
@robscott
Member

robscott commented Jul 9, 2020

Hey @jijojv, thanks for reporting this! This is not actually anything to worry about, and I think the best solution will be for us to stop publishing that event when the error is caused by an out-of-date cache like this. Because the controller reacts to changes in Services and attempts to update related EndpointSlices, it can run into problems if its locally cached copy of the EndpointSlices is out of date. It will naturally retry and resolve the issue when the cache updates. I'll work on a fix here to lower the logging and see if there are ways to reduce the probability of this happening.
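
For anyone curious what that retry looks like on the client side, here is a minimal, illustrative sketch of the standard client-go retry-on-conflict pattern. This is not the controller's actual code; it assumes a client-go version with the discovery.k8s.io/v1 typed client, and updateSliceWithRetry is a made-up name.

// Illustrative only: re-read the object before each attempt so a stale
// resourceVersion is refreshed instead of surfacing
// "the object has been modified" conflicts.
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

func updateSliceWithRetry(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	// RetryOnConflict re-runs the closure with backoff whenever the API
	// server rejects the update with a 409 Conflict.
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		slice, err := client.DiscoveryV1().EndpointSlices(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		// ... mutate slice.Endpoints / slice.Ports here ...
		_, err = client.DiscoveryV1().EndpointSlices(namespace).Update(ctx, slice, metav1.UpdateOptions{})
		return err
	})
}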

/remove-triage unresolved

@robscott
Member

robscott commented Aug 7, 2020

As a quick update, #93520 should fix this.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 5, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 5, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@robscott
Member

robscott commented Jan 4, 2021

/reopen
/remove-lifecycle rotten

@k8s-ci-robot
Contributor

@robscott: Reopened this issue.

In response to this:

/reopen
/remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Jan 4, 2021
@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 4, 2021
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jan 4, 2021
@thockin
Member

thockin commented Jan 8, 2021

@robscott is this a proper bug? Or does it still need triage?

@robscott
Member

robscott commented Jan 8, 2021

I think it's a proper bug, but my initial fix for this got rather bogged down. This is the kind of error that most controllers emit; it just happens significantly more frequently due to the nature of EndpointSlices. My first attempt was to limit the number of times we sync before the informer cache is up to date, but that could have potentially introduced more bugs and was certainly not a full solution here. I wouldn't categorize this as a critical bug, and I'm not sure I'll have time to prioritize it short term, but it is a real issue.
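
For context, the "informer cache is up to date" check in client-go terms usually means cache.WaitForCacheSync, roughly as in the sketch below. This is a generic pattern, not the endpointslice controller's real code or the eventual fix; the kubeconfig handling is simplified and the names are made up.

// Rough illustration of gating work on informer cache sync.
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at the default location; error handling trimmed.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	sliceInformer := factory.Discovery().V1().EndpointSlices().Informer()

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)

	// Don't start reconciling until the EndpointSlice cache has caught up;
	// updates issued against a stale cached copy are what produce the
	// "object has been modified" conflicts above. This narrows, but does
	// not eliminate, the window for such conflicts.
	if !cache.WaitForCacheSync(stopCh, sliceInformer.HasSynced) {
		panic("timed out waiting for caches to sync")
	}
	fmt.Println("EndpointSlice cache synced; workers can start")
}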

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 8, 2021
@aizuddin85

@robscott any plan to get this fixed permanently?

@robscott
Member

@aizuddin85 Unfortunately I don't think I'll ever be able to completely fix this without some significant help from apimachinery. With that said, I'm hopeful that I can find some ways to reduce the frequency of these events. My previous attempt (#93520) was unfortunately not seen as the correct approach, but it's been difficult to find a better one.

With the upcoming work for Topology for 1.21 I'll be spending some quality time working on the controller again. I'm hopeful that I'll be able to find a somewhat less controversial way to mitigate this as part of that work, but no guarantees. Very open to ideas here.

@uvwild

uvwild commented Apr 6, 2021

I have this issue after trying to reinitialize a Gardener k8s cluster after a scheduled password change.
k8s 1.18.6 using the OpenStack provider
docker.io/kubermatic/machine-controller:v1.25.0

Error updating Endpoint Slices for Service kube-system/machine-controller-webhook: Error updating machine-controller-webhook-5tjj2 EndpointSlice for Service kube-system/machine-controller-webhook: Operation cannot be fulfilled on endpointslices.discovery.k8s.io "machine-controller-webhook-5tjj2": the object has been modified; please apply your changes to the latest version and try again

@uvwild

uvwild commented Apr 6, 2021

I deleted the EndpointSlices and the machine-controller-webhook and ran kubeone apply again, which fixed the EndpointSlice error but not my scaling issue:
kubermatic/machine-controller#939

@drawn4427

We are still having this issue on AKS. When releasing a new service, we get this EndpointSlice error as the pods come up. What is the fix for it?

@robscott
Member

@drawn4427 what version of Kubernetes are you using? For reference, the oldest version of Kubernetes that got this patch was v1.19.9.

@ltagliamonte-dd
Contributor

I'm on Kubernetes v1.18.9 and have also started noticing the same errors in the event stream.

@robscott
Member

Hey @ltagliamonte-dd, unfortunately we can no longer patch v1.18 so the mitigation for this only made it back as far as 1.19.

@alberthuang24

I also have this problem in v1.20.4

@robscott
Member

robscott commented Jul 21, 2021

@HongjiangHuang unfortunately this fix didn't make it in until 1.20.5. For reference, this should be mitigated in:

  • v1.19.9+
  • v1.20.5+
  • v1.21+

That fix does not guarantee that these events will never happen, they should just be significantly less common.

@haorenfsa
Contributor

@robscott Hi, my team seems to have just been caught by this issue. Is it possible to make a complete fix for this?

@aojea
Member

aojea commented Feb 26, 2023

@robscott Hi, my team seems to have just been caught by this issue. Is it possible to make a complete fix for this?

this was fixed in all supported versions, which version are you using?

@haorenfsa
Contributor

haorenfsa commented Feb 26, 2023

this was fixed in all supported versions,

Is it already totally fixed by now? I'm a little confused because, according to @robscott, "That fix does not guarantee that these events will never happen, they should just be significantly less common".

which version are you using?

Server Version: version.Info{Major:"1", Minor:"22+", GitVersion:"v1.22.16-eks-ffeb93d", GitCommit:"52e500d139bdef42fbc4540c357f0565c7867a81", GitTreeState:"clean", BuildDate:"2022-11-29T18:41:42Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}

@aojea
Member

aojea commented Feb 26, 2023

That fix does not guarantee that these events will never happen, they should just be significantly less common

Oh yeah, sorry, but see Rob's other comment:

This is not actually anything to worry about

@smyja

smyja commented Jul 13, 2023

I am still getting this error.

@aufarg

aufarg commented Jul 24, 2023

I'm currently facing the same error, but not after every deployment. The error suddenly shows up hours after the deployment is done. Some pods emit DNS resolution error logs within 10s after the FailedToUpdateEndpointSlices event is emitted. Example log:

caused by: Post "https://sts.ap-southeast-1.amazonaws.com/": dial tcp: lookup sts.ap-southeast-1.amazonaws.com: i/o timeout

Is this related or a different issue? This is on v1.23 EKS.
