FailedToUpdateEndpointSlices Error updating Endpoint Slices for Service #92928
/sig api-machinery
/remove-sig api-machinery
/triage unresolved
🤖 I am a bot run by vllry. 👩🔬
Hey @jijojv, thanks for reporting this! This is not actually anything to worry about, and I think the best solution will be for us to stop publishing that event if the error is related to an out-of-date cache like this. Due to the nature of the controller reacting to changes in Services and attempting to update related EndpointSlices, it can run into problems if the locally cached copy of EndpointSlices it has is out of date. It will naturally retry and resolve the issue when the cache updates. I'll work on a fix here to lower the logging and see if there are some ways to reduce the probability of this happening.
/remove-triage unresolved
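To make the stale-cache explanation above concrete, here is a minimal client-go sketch of the failure mode. It is not the actual endpointslice controller code; the package, the syncSlice helper, and the requeueService callback are hypothetical, and it assumes the discovery.k8s.io/v1 API (v1.21+ client libraries).

```go
package endpointslicesketch

import (
	"context"

	discoveryv1 "k8s.io/api/discovery/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// syncSlice is a made-up helper: it builds the desired EndpointSlice from a
// copy taken out of the local informer cache and writes it back. If that
// copy's resourceVersion is behind the API server's, the Update is rejected
// with a Conflict error, which is the situation that surfaces as the
// FailedToUpdateEndpointSlices event. Requeueing and retrying once the cache
// catches up resolves it without any intervention.
func syncSlice(ctx context.Context, cs kubernetes.Interface, cached *discoveryv1.EndpointSlice, requeueService func(key string)) error {
	desired := cached.DeepCopy()
	// ...recompute desired.Endpoints / desired.Ports from the current Pods...

	_, err := cs.DiscoveryV1().EndpointSlices(desired.Namespace).Update(ctx, desired, metav1.UpdateOptions{})
	if apierrors.IsConflict(err) {
		// Stale cached copy: harmless, just retry after the cache syncs.
		requeueService(desired.Namespace + "/" + desired.Labels[discoveryv1.LabelServiceName])
		return nil
	}
	return err
}
```

Application code that hits the same Conflict errors typically wraps the update in retry.RetryOnConflict from k8s.io/client-go/util/retry; the controller gets the same effect from its workqueue requeue.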
As a quick update, #93520 should fix this.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
@robscott: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@robscott is this a proper bug? Or does it still need triage?
I think it's a proper bug, but my initial fix for it got rather bogged down. This is the kind of error that most controllers emit; it just happens significantly more frequently due to the nature of EndpointSlices. My first attempt was to try to limit the number of times we sync before the informer cache is up to date, but that could have potentially introduced more bugs and it was certainly not a full solution here. I wouldn't categorize this as a critical bug, and I'm not sure I'll have time to prioritize it short term, but it is a real issue.
/triage accepted
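As background for the informer-cache timing mentioned above, here is a small, generic client-go sketch of waiting for the EndpointSlice informer cache to sync before doing any work. This is the standard startup pattern, not the fix that was attempted for this issue (the problem here is the cache lagging behind the controller's own recent writes, which this alone does not address). It assumes in-cluster credentials and the discovery.k8s.io/v1 informers.

```go
package main

import (
	"fmt"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// Assumption: running inside a cluster with RBAC to list/watch EndpointSlices.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// A shared informer factory with no resync; the EndpointSlice informer
	// maintains the local cache that a controller would read from.
	factory := informers.NewSharedInformerFactory(client, 0)
	sliceInformer := factory.Discovery().V1().EndpointSlices().Informer()

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)

	// Until this returns true, the cached EndpointSlices may be arbitrarily stale.
	if !cache.WaitForCacheSync(stopCh, sliceInformer.HasSynced) {
		panic("failed to sync EndpointSlice cache")
	}
	fmt.Println("cache synced; safe to start workers")
}
```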
@robscott any plan to fix this permanently?
@aizuddin85 Unfortunately I don't think I'll ever be able to completely fix this without some significant help from apimachinery. With that said, I'm hopeful that I can find some ways to reduce the frequency of these events. My previous attempt (#93520) was unfortunately not seen as the correct approach, but it's been difficult to find a better one. With the upcoming work on Topology for 1.21 I'll be spending some quality time working on the controller again. I'm hopeful that I'll be able to find a somewhat less controversial way to mitigate this as part of that work, but no guarantees. Very open to ideas here.
I have this issue after trying to reinitialize a gardener k8s cluster after a scheduled password change.
I deleted the EndpointSlices and the machine-controller-webhook and ran kubeone apply again.
We are still seeing this issue on AKS. When releasing a new service, we get this EndpointSlice error as the pods come up, and would like to know what the fix is for it.
@drawn4427 what version of Kubernetes are you using? For reference, the oldest version of Kubernetes that got this patch was v1.19.9.
I'm on Kubernetes
Hey @ltagliamonte-dd, unfortunately we can no longer patch v1.18 so the mitigation for this only made it back as far as 1.19.
I also have this problem in v1.20.4
@HongjiangHuang unfortunately this fix didn't make it in until 1.20.5. For reference, this should be mitigated in:
That fix does not guarantee that these events will never happen; they should just be significantly less common.
@robscott Hi, my team seems to have just been caught by this issue. Is a complete fix possible?
This was fixed in all supported versions; which version are you using?
Is it totally fixed by now? I'm a little confused, because according to @robscott, 「That fix does not guarantee that these events will never happen, they should just be significantly less common」
Server Version: version.Info{Major:"1", Minor:"22+", GitVersion:"v1.22.16-eks-ffeb93d", GitCommit:"52e500d139bdef42fbc4540c357f0565c7867a81", GitTreeState:"clean", BuildDate:"2022-11-29T18:41:42Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
Oh, yeah, sorry, but see the other Rob's comment.
I am still getting this error.
I'm currently facing the same error, but not after every deployment. The error suddenly shows up hours after a deployment is done. There are some pods that emit DNS resolution error logs within 10s after the
Is this related or a different issue? This is on v1.23 EKS.
What happened:
Hi folks, after every deployment we see this for about an hour. It seems to be harmless, but I'm wondering if this is a bug in v1.17.3.
kubectl describe svc my-svc
What you expected to happen:
Events: <none>
How to reproduce it (as minimally and precisely as possible):
kubectl rollout restart deployment my_deploy
Anything else we need to know?:
Environment:
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"archive", BuildDate:"2020-03-20T16:41:14Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}