Cloud provider API downtime still leads to NodeNotReady #19124
Comments
Indeed, I already asked for the AWS ticket to be reopened, since the current fix makes the entire cluster depend on the availability of a single API. It's extremely dangerous in production.
@mwielgus I think the
cc/ @gmarek
It may be time to rethink some parts of the NodeController's logic. When it was written, the only source of a NotReady Node status was the NodeController itself (the Kubelet never wrote NotReady), but those times are long gone, and I think we might act slightly differently when NotReady was set by the NodeController (the machine is unresponsive) than when it was set by the Kubelet itself (some software problem: the cloud provider is not responsive, the Docker daemon has problems, etc.). The current logic of the NodeController was designed to handle the first case, for single-node failures. Rate limiting was introduced as a way to protect us from destroying the whole cluster in case of networking problems on the master machine (@brendandburns is right that we should test this). The current limits were set ad hoc in a world of 100-node clusters; this certainly needs a fix and I'll open an issue for it.

As for the main problem: I think we should distinguish single(-ish) Node failures from whole-cluster ones. When all Nodes are NotReady, the ControllerManager can assume it's some kind of global problem (e.g. the cloud provider is not responsive) or that the master machine is separated from the cluster. In both cases it shouldn't take any action, as that won't do any good. One open question is how to proceed when Nodes start to reappear gradually. Maybe we need a separate 'healing' state in the NodeController, or maybe it can simply be incorporated into the normal logic.

An orthogonal issue is whether we want to distinguish 'hard' failures (NotReady set by the NodeController) from 'soft' ones (NotReady set by the Kubelet). We may want to give the Kubelet additional time to try to fix itself (e.g. by restarting Docker or something), but I'm not sure it's worth the effort.

Action items:

It's not possible to do all of this for 1.2 (i.e. until tomorrow ;)). cc @davidopp @bgrant0607 @wojtek-t @fgrzadkowski @smarterclayton @quinton-hoole
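To make the whole-cluster check described above concrete, here is a minimal sketch in Go. It is not NodeController code: `node`, `evictionAllowed`, and the 0.5 threshold are illustrative names and values, not real Kubernetes types or flags. The idea is simply that when (nearly) every node looks NotReady, the controller assumes a global problem and holds off on evictions.

```go
package main

import "fmt"

// node is a minimal stand-in for a cluster node; the real controller works
// on full Node objects, but readiness is all this sketch needs.
type node struct {
	name  string
	ready bool
}

// evictionAllowed returns false when the fraction of NotReady nodes exceeds
// the threshold, on the assumption that such a failure is global (cloud
// provider outage, master-side network partition) and evicting pods would
// only make things worse. The threshold is an arbitrary illustrative value.
func evictionAllowed(nodes []node, maxNotReadyFraction float64) bool {
	if len(nodes) == 0 {
		return false
	}
	notReady := 0
	for _, n := range nodes {
		if !n.ready {
			notReady++
		}
	}
	return float64(notReady)/float64(len(nodes)) <= maxNotReadyFraction
}

func main() {
	cluster := []node{
		{name: "node-1", ready: false},
		{name: "node-2", ready: false},
		{name: "node-3", ready: true},
	}
	// Two of three nodes NotReady: above a 0.5 threshold, so hold off.
	fmt.Println("eviction allowed:", evictionAllowed(cluster, 0.5))
}
```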
The more data we have, the better decisions we can make. It's useful to distinguish VM is unreachable vs. Kubelet not communicating. It's useful to distinguish Kubelet can't run any containers vs. can't start new ones. It's useful to know why: Docker dead/unresponsive, sys OOM, OOD, volumes can't be mounted, container network unavailable, etc. It's useful to determine which failures are correlated: all, by fault domain, with software rollout, particular dedicated pools, with a Container of Death, etc. In response, we might want to make the node unschedulable, evict scheduled pods, reboot the node, replace the node, revert the rollout, etc. Rate limiting is just a way to slow down to human speeds in the hope of human analysis or intervention. It's necessary in some cases, but we could do better in others.
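As a rough illustration of the distinctions above, here is a hedged Go sketch mapping failure causes to responses. Every name in it (`notReadyCause`, `suggestAction`, the action strings) is hypothetical and not part of Kubernetes; the point is only that the chosen action depends both on the cause and on whether the failure is correlated across nodes.

```go
package main

import "fmt"

// notReadyCause is a hypothetical classification of why a node is NotReady,
// mirroring the distinctions listed in the comment above.
type notReadyCause int

const (
	vmUnreachable     notReadyCause = iota // the machine itself is unreachable
	kubeletSilent                          // Kubelet stopped posting status
	runtimeDown                            // Docker dead or unresponsive
	cloudProviderDown                      // status update blocked on the cloud API
)

// suggestAction is an illustrative mapping from cause to response. A real
// controller would also weigh fault domains, rollouts, and pod priorities.
func suggestAction(cause notReadyCause, correlatedAcrossNodes bool) string {
	if correlatedAcrossNodes {
		// Many nodes failing the same way points at infrastructure or a
		// rollout, not individual machines: slow down and involve a human.
		return "rate-limit, alert, take no per-node action"
	}
	switch cause {
	case vmUnreachable, kubeletSilent:
		return "evict pods and replace the node"
	case runtimeDown:
		return "mark unschedulable and try restarting the runtime"
	case cloudProviderDown:
		return "keep pods and retry status updates"
	default:
		return "mark unschedulable"
	}
}

func main() {
	fmt.Println(suggestAction(cloudProviderDown, true))
	fmt.Println(suggestAction(vmUnreachable, false))
}
```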
Fixed by #25571
I'm experiencing the same issue as #16593 but with OpenStack, using Kubernetes 1.1.3. Since that issue is closed and was filed as AWS-specific, I'm opening a new one. #13417 was supposed to fix it and is included in release 1.1.3 (despite having been reverted in master), but it doesn't seem to work properly (as @antoineco also reported in #16593). Also related to issues #13398 and #13412.
In short, cloud provider API downtime (seemingly) leads to minions failing to update their status, being marked as NodeNotReady, and having all of their pods evicted. This can happen to every minion in the cluster at about the same time and can lead to semi-disastrous situations, such as a pod being unable to start on a new node after recovery because its persistent volume couldn't be detached from the previous node while the API was down. A rough sketch of the timing involved is included below the logs.
Relevant logs:
(I'm not sure why the controller-manager changes the node status 19 seconds before the kubelet itself logs an error, but the two always happen at about the same time. The machines' clocks are properly synchronized, and the controller-manager has no cloud-provider config specified.)
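For context on the timing, here is a rough sketch (not actual Kubernetes code) of how the relevant thresholds interact. The constants are the defaults as far as I can tell (node-status-update-frequency 10s, node-monitor-grace-period 40s, pod-eviction-timeout 5m); treat them as assumptions, since deployments override them. The takeaway is that when the cloud provider API blocks status updates on every node, all heartbeats go stale together, so every node crosses these thresholds at roughly the same time.

```go
package main

import (
	"fmt"
	"time"
)

// Assumed defaults for the relevant flags; actual values may differ.
const (
	nodeStatusUpdateFrequency = 10 * time.Second // how often the kubelet posts status
	nodeMonitorGracePeriod    = 40 * time.Second // missed updates tolerated this long
	podEvictionTimeout        = 5 * time.Minute  // NotReady this long => pods evicted
)

// nodePhase describes what the node controller would roughly conclude about
// a node whose last successful status update was sinceLastHeartbeat ago.
func nodePhase(sinceLastHeartbeat time.Duration) string {
	switch {
	case sinceLastHeartbeat <= nodeMonitorGracePeriod:
		return "Ready"
	case sinceLastHeartbeat <= nodeMonitorGracePeriod+podEvictionTimeout:
		return "NotReady, pods still assigned"
	default:
		return "NotReady, pods being evicted"
	}
}

func main() {
	// If the cloud provider API is down and every kubelet's status update
	// blocks on it, every node's heartbeat goes stale at the same time.
	for _, elapsed := range []time.Duration{30 * time.Second, 2 * time.Minute, 10 * time.Minute} {
		fmt.Printf("%v since last heartbeat: %s\n", elapsed, nodePhase(elapsed))
	}
}
```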