
FailedToUpdateEndpointSlices Error updating Endpoint Slices for Service #92928

Closed
jijojv opened this issue Jul 9, 2020 · 29 comments · Fixed by #99345
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/network Categorizes an issue or PR as relevant to SIG Network. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@jijojv

jijojv commented Jul 9, 2020

What happened:
Hi folks, after every deployment we see this for about an hour. It seems to be harmless, but we're wondering whether this is a bug in v1.17.3.
kubectl describe svc my-svc

Events:
  Type     Reason                        Age   From                       Message
  ----     ------                        ----  ----                       -------
  Warning  FailedToUpdateEndpointSlices  35m   endpoint-slice-controller  Error updating Endpoint Slices for Service my-svc/my-app: Error updating my-app-h7q6v EndpointSlice for Service my-svc/my-app: Operation cannot be fulfilled on endpointslices.discovery.k8s.io "my-app-h7q6v": the object has been modified; please apply your changes to the latest version and try again

What you expected to happen:
Events: <none>

How to reproduce it (as minimally and precisely as possible):
kubectl rollout restart deployment my_deploy

Anything else we need to know?:

Environment:
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"archive", BuildDate:"2020-03-20T16:41:14Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}

@jijojv jijojv added the kind/bug Categorizes issue or PR as related to a bug. label Jul 9, 2020
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jul 9, 2020
@jijojv
Author

jijojv commented Jul 9, 2020

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 9, 2020
@liggitt
Member

liggitt commented Jul 9, 2020

/remove-sig api-machinery
/sig network
/assign @robscott

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Jul 9, 2020
@athenabot

/triage unresolved

Comment /remove-triage unresolved when the issue is assessed and confirmed.

🤖 I am a bot run by vllry. 👩‍🔬

@k8s-ci-robot k8s-ci-robot added the triage/unresolved Indicates an issue that can not or will not be resolved. label Jul 9, 2020
@robscott
Member

robscott commented Jul 9, 2020

Hey @jijojv, thanks for reporting this! This is not actually anything to worry about, and I think the best solution will be for us to stop publishing that event when the error is caused by an out-of-date cache like this. Because the controller reacts to changes in Services and attempts to update related EndpointSlices, it can run into problems if its locally cached copy of the EndpointSlices is out of date. It will naturally retry and resolve the issue when the cache updates. I'll work on a fix here to lower the logging and see if there are ways to reduce the probability of this happening.
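
For anyone curious what that retry looks like on the client side, here is a minimal, illustrative sketch of the standard client-go retry-on-conflict pattern. This is not the controller's actual code; it assumes a client-go version with the discovery.k8s.io/v1 typed client, and updateSliceWithRetry is a made-up name.

// Illustrative only: re-read the object before each attempt so a stale
// resourceVersion is refreshed instead of surfacing
// "the object has been modified" conflicts.
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

func updateSliceWithRetry(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	// RetryOnConflict re-runs the closure with backoff whenever the API
	// server rejects the update with a 409 Conflict.
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		slice, err := client.DiscoveryV1().EndpointSlices(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		// ... mutate slice.Endpoints / slice.Ports here ...
		_, err = client.DiscoveryV1().EndpointSlices(namespace).Update(ctx, slice, metav1.UpdateOptions{})
		return err
	})
}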

/remove-triage unresolved

@robscott
Member

robscott commented Aug 7, 2020

As a quick update, #93520 should fix this.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 5, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 5, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@robscott
Member

robscott commented Jan 4, 2021

/reopen
/remove-lifecycle rotten

@k8s-ci-robot
Contributor

@robscott: Reopened this issue.

In response to this:

/reopen
/remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Jan 4, 2021
@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 4, 2021
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jan 4, 2021
@thockin
Member

thockin commented Jan 8, 2021

@robscott is this a proper bug? Or does it still need triage?

@robscott
Member

robscott commented Jan 8, 2021

I think it's a proper bug, but my initial fix for this got rather bogged down. This is the kind of error that most controllers emit; it just happens significantly more frequently due to the nature of EndpointSlices. My first attempt was to limit the number of times we sync before the informer cache is up to date, but that could have potentially introduced more bugs and was certainly not a full solution here. I wouldn't categorize this as a critical bug, and I'm not sure I'll have time to prioritize it short term, but it is a real issue.
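
For context, the "informer cache is up to date" check in client-go terms usually means cache.WaitForCacheSync, roughly as in the sketch below. This is a generic pattern, not the endpointslice controller's real code or the eventual fix; the kubeconfig handling is simplified and the names are made up.

// Rough illustration of gating work on informer cache sync.
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at the default location; error handling trimmed.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	sliceInformer := factory.Discovery().V1().EndpointSlices().Informer()

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)

	// Don't start reconciling until the EndpointSlice cache has caught up;
	// updates issued against a stale cached copy are what produce the
	// "object has been modified" conflicts above. This narrows, but does
	// not eliminate, the window for such conflicts.
	if !cache.WaitForCacheSync(stopCh, sliceInformer.HasSynced) {
		panic("timed out waiting for caches to sync")
	}
	fmt.Println("EndpointSlice cache synced; workers can start")
}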

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 8, 2021
@aizuddin85

@robscott any plan to get this fixed permanently?

@robscott
Member

@aizuddin85 Unfortunately I don't think I'll ever be able to completely fix this without some significant help from apimachinery. With that said, I'm hopeful that I can find some ways to reduce the frequency of these events. My previous attempt (#93520) was unfortunately not seen as the correct approach, but it's been difficult to find a better one.

With the upcoming work for Topology for 1.21 I'll be spending some quality time working on the controller again. I'm hopeful that I'll be able to find a somewhat less controversial way to mitigate this as part of that work, but no guarantees. Very open to ideas here.

@uvwild

uvwild commented Apr 6, 2021

I have this issue after trying to reinitialize a Gardener k8s cluster after a scheduled password change.
k8s 1.18.6 using the OpenStack provider
docker.io/kubermatic/machine-controller:v1.25.0

Error updating Endpoint Slices for Service kube-system/machine-controller-webhook: Error updating machine-controller-webhook-5tjj2 EndpointSlice for Service kube-system/machine-controller-webhook: Operation cannot be fulfilled on endpointslices.discovery.k8s.io "machine-controller-webhook-5tjj2": the object has been modified; please apply your changes to the latest version and try again

@uvwild

uvwild commented Apr 6, 2021

I deleted the EndpointSlices and the machine-controller-webhook and ran kubeone apply again, which fixed the EndpointSlice error but not my scaling issue:
kubermatic/machine-controller#939

@drawn4427

We are still having this issue on AKS. When releasing a new service, we get this EndpointSlice error as the pods come up. What is the fix for it?

@robscott
Member

@drawn4427 what version of Kubernetes are you using? For reference, the oldest version of Kubernetes that got this patch was v1.19.9.

@ltagliamonte-dd
Contributor

I'm on Kubernetes v1.18.9 and have also started noticing the same errors in the event stream.

@robscott
Member

Hey @ltagliamonte-dd, unfortunately we can no longer patch v1.18 so the mitigation for this only made it back as far as 1.19.

@alberthuang24

I also have this problem in v1.20.4

@robscott
Member

robscott commented Jul 21, 2021

@HongjiangHuang unfortunately this fix didn't make it in until 1.20.5. For reference, this should be mitigated in:

  • v1.19.9+
  • v1.20.5+
  • v1.21+

That fix does not guarantee that these events will never happen, they should just be significantly less common.

@haorenfsa
Contributor

@robscott Hi, my team seems to have just been caught by this issue. Is it possible to make a complete fix for this?

@aojea
Member

aojea commented Feb 26, 2023

@robscott Hi, my team seems to have just been caught by this issue. Is it possible to make a complete fix for this?

this was fixed in all supported versions, which version are you using?

@haorenfsa
Contributor

haorenfsa commented Feb 26, 2023

this was fixed in all supported versions,

Is it already totally fixed by now? I'm a little confused because, according to @robscott, "That fix does not guarantee that these events will never happen, they should just be significantly less common".

which version are you using?

Server Version: version.Info{Major:"1", Minor:"22+", GitVersion:"v1.22.16-eks-ffeb93d", GitCommit:"52e500d139bdef42fbc4540c357f0565c7867a81", GitTreeState:"clean", BuildDate:"2022-11-29T18:41:42Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}

@aojea
Member

aojea commented Feb 26, 2023

That fix does not guarantee that these events will never happen, they should just be significantly less common

Oh yeah, sorry, but see Rob's other comment:

This is not actually anything to worry about

@smyja

smyja commented Jul 13, 2023

I am still getting this error.

@aufarg

aufarg commented Jul 24, 2023

I'm currently facing the same error, but not after every deployment. The error suddenly shows up hours after the deployment is done. Some pods emit DNS resolution error logs within 10s after the FailedToUpdateEndpointSlices event is emitted. Example log:

caused by: Post "https://sts.ap-southeast-1.amazonaws.com/": dial tcp: lookup sts.ap-southeast-1.amazonaws.com: i/o timeout

Is this related or a different issue? This is on v1.23 EKS.
