nodes go missing if cloudprovider instances api returns an error #41850
Comments
cc: @justinsb Have you ever seen anything like this with AWS?
We've seen some cases where instances were incorrectly not returned by the list-instances call, but I don't recall seeing this particular one. I think, though, that ExternalID is now called in two places: on kubelet start-up, but also if the node goes NotReady, because it checks that the node still exists. Based on the timing, I think the nodes ran out of disk? But then I don't know why they didn't transition back...
Hm, okay. I'm not sure why you identified disk space, though. These nodes were idle; I don't think they'd have run out of disk. I was more keyed into the fact that the kubelet went KubeletNotReady on Monday 20 Feb 2017 and then seems never to have transitioned back to Healthy... yet the error there from ExternalID was a very intermittent issue. As if it got an error back and decided to stop one of its sync loops?
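(For anyone trying to confirm the same symptom, the Ready condition and its last transition time can be inspected directly. This is a generic diagnostic sketch, not something from the original thread; the node name is a placeholder.)

```sh
# List nodes, then look at the Conditions block on an affected node.
kubectl get nodes
kubectl describe node <node-name> | grep -A 8 'Conditions:'

# Or pull just the Ready condition's reason and last transition time:
kubectl get node <node-name> \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].reason} {.status.conditions[?(@.type=="Ready")].lastTransitionTime}{"\n"}'
```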
Just that the disk space transitioned at around the same time:
But there's something here - check out #41916 (probably not related to disk space)
@colemickens what version are you running?
Gah, I rushed the bug report. It was either 1.5.2 or 1.5.3.
Might be seeing the same behavior on 1.6.2. It's easy to reproduce on Azure, at least. In this example, the kubelet failed to get node info because the Azure subscription in use was being rate limited by the Azure API. @colemickens I can follow up with you on how to repro if you're curious.
@colemickens There are no sig labels on this issue. Please add a sig label by:
/sig azure
@seanknox I'm seeing a similar issue on acs-engine. Can you tell me how to reproduce?
I also ran into a similar problem, though it's not exactly the same as this one. When I configured a wrong node name, the kubelet couldn't get an external ID from Azure, so it refused to start. @colemickens Could you also provide the related kubelet logs?
I'm not really running in Azure, or any cloud, these days, so it won't be easy for me to provide these. Anyone with a cluster can emulate this and check: add an iptables rule blocking calls to the Azure control plane, wait a few minutes, and then remove the block (sketch below). If there is a bug, the kubelet will have stopped reporting status and you should start to see nodes go unhealthy after some period of time. If there is no bug, then the kubelet should recover from the momentary failure and move on. @feiskyer If there's not an open bug for that, can you please open one? @brendandburns and I have both observed and discussed that before. Even if the CP is throwing errors, the kubelet should arguably still finish initializing and run static pods (especially with the work going on to pull the CP mostly out of the kubelet initialization path).
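A minimal sketch of that repro, assuming an Azure node with iptables and dig available (the endpoint resolution and rule details vary by environment; this is illustrative, not a tested procedure):

```sh
# Resolve the Azure management endpoint once so the same addresses can be
# unblocked later (the resolved set can change between lookups).
ips=$(dig +short management.azure.com | grep -E '^[0-9.]+$')

# Block outbound HTTPS to those addresses to simulate a control-plane outage.
for ip in $ips; do sudo iptables -I OUTPUT -d "$ip" -p tcp --dport 443 -j REJECT; done

# Wait a few minutes, watching node status and the kubelet log, then restore.
for ip in $ips; do sudo iptables -D OUTPUT -d "$ip" -p tcp --dport 443 -j REJECT; done
```

If the bug is present, the affected node should stay NotReady even after the rules are removed; if not, the kubelet should resume posting status shortly after connectivity returns.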
Bringing to SIG-Azure meeting to discuss. |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle rotten
We are seeing the same behavior. Yesterday, we lost 10 nodes. The only info I have right now from kubelet is this line (with log level set to 4)
I'm now adding extra logging to kubelet since the Azure autorest error is getting swallowed... This will help me identify the root cause, but it doesn't explain why, after 1 hour, the nodes are still not back to Healthy and the controller-manager marks them all as NotReady.
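(As a stopgap before patching the kubelet, grepping its log at higher verbosity for Azure/autorest errors can at least show what, if anything, is being surfaced. The unit name below is an assumption for systemd-based deployments.)

```sh
# Assumes the kubelet runs as a systemd unit named "kubelet"; adjust otherwise.
sudo journalctl -u kubelet --since "1 hour ago" --no-pager | grep -iE 'azure|autorest|cloudprovider'
```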
@brendanburns thanks for the reply. After we upgraded to 1.7.13, the issue where the kubelet disappears forever was resolved. We are still affected by the kubelet being marked as NotReady, and we were able to pinpoint it to management.azure.com becoming unavailable across multiple subscription IDs for ~5 minutes. This is causing PodEviction across 15-50% of our nodes once a day. I have engaged Azure Support since I do not think this is related to upstream at this point. Regarding this issue, I would say that the NotReady status sticking forever is indeed fixed.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT
Kubernetes version (use kubectl version):

Environment:
Kernel (e.g. uname -a): latest xenial LTS kernel

What happened:
Two of my three nodes went NotReady and have not returned.

What you expected to happen:
Them not to have gone NotReady, or for them to have transitioned back out.

My guess is that the kubelet got an error back from Azure and then got "stuck". This is similar to how the kubelet gets stuck on boot-up if the cloudprovider instances call returns a 401 error... In that case, the kubelet grinds to a halt and doesn't start up the static pods. That's why I'm suspicious of a bug here.
How to reproduce it (as minimally and precisely as possible):
Unclear at this moment.
Anything else we need to know:
n/a