
Cloud provider API downtime still leads to NodeNotReady #19124

Closed
pesho opened this issue Dec 28, 2015 · 6 comments
Assignees: gmarek
Labels: area/nodecontroller, priority/important-soon, sig/cluster-lifecycle

Comments


pesho commented Dec 28, 2015

I'm experiencing the same issue as #16593 but with OpenStack, using Kubernetes 1.1.3. Since that issue is closed and was filed as AWS-specific, I'm opening a new one. #13417 was supposed to fix it and is included in release 1.1.3 (despite having been reverted in master), but it doesn't seem to work properly (as @antoineco also reported in #16593). Also related to issues #13398 and #13412.

In short, cloud provider API downtime (seemingly) causes minions to fail to update their status; they are then marked NodeNotReady and all their pods are evicted. This can happen to all minions in the cluster at about the same time, and can lead to semi-disastrous situations, such as a pod being unable to start on a new node after recovery because its persistent volume couldn't be detached from the previous node while the API was down.

Relevant logs:

```
Dec 27 16:28:26 cluster03.s.xxxxxxxxx.com kubelet[929]: E1227 16:28:26.189013     929 kubelet.go:2340] Failed to get addresses from cloud provider, so node addresses will be stale: Expected HTTP response code [200 204] when accessing [GET https://compute.gra1.cloud.ovh.net/v2/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/servers/detail?name=%5Ecluster03%5C.s%5C.xxxxxxxxx%5C.com%24&status=ACTIVE], but got 500 instead
Dec 27 16:28:26 cluster03.s.xxxxxxxxx.com kubelet[929]: {"computeFault": {"message": "The server has either erred or is incapable of performing the requested operation.", "code": 500}}
Dec 27 16:28:26 cluster03.s.xxxxxxxxx.com kubelet[929]: E1227 16:28:26.230604     929 kubelet.go:2291] Error updating node status, will retry: node "cluster03.s.xxxxxxxxx.com" cannot be updated: the object has been modified; please apply your changes to the latest version and try again
Dec 27 16:28:07 master.xxxxxxxxx.com kube-controller-manager[513]: I1227 16:28:07.320025     513 event.go:206] Event(api.ObjectReference{Kind:"Node", Namespace:"", Name:"cluster03.s.xxxxxxxxx.com", UID:"cluster03.s.xxxxxxxxx.com", APIVersion:"", ResourceVersion:"", FieldPath:""}): reason: 'NodeNotReady' Node cluster03.s.xxxxxxxxx.com status is now: NodeNotReady
```

(I'm not sure why the controller-manager changes the node status 19 seconds before the kubelet itself logs an error, but the two always happen at about the same time. The machines' clocks are properly synchronized and the controller-manager has no cloud-provider config specified)
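
(For reference: the second kubelet error above, "the object has been modified", is an ordinary optimistic-concurrency conflict on the node status update and is unrelated to the cloud provider outage; the kubelet retries it, as the log says. Below is a minimal sketch of the generic conflict-retry pattern, written against today's client-go; the mutate callback and the clientset wiring are hypothetical, not the kubelet's actual code.)

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// updateNodeStatus re-reads the Node and retries whenever the apiserver
// answers with the "object has been modified" conflict seen in the log above.
// mutate is a hypothetical callback that edits the node status in place.
func updateNodeStatus(cs kubernetes.Interface, nodeName string, mutate func(*corev1.Node)) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		node, err := cs.CoreV1().Nodes().Get(context.TODO(), nodeName, metav1.GetOptions{})
		if err != nil {
			return err
		}
		mutate(node)
		_, err = cs.CoreV1().Nodes().UpdateStatus(context.TODO(), node, metav1.UpdateOptions{})
		return err
	})
}
```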

@antoineco (Contributor) commented:

Indeed, I already asked for the AWS ticket to be reopened, since the current fix makes the entire cluster depend on the availability of a single API. That is extremely dangerous in production.

@mwielgus mwielgus added the area/provider/openstack Issues or PRs related to openstack provider label Dec 28, 2015
@bgrant0607-nocc bgrant0607-nocc added sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. team/control-plane labels Jan 11, 2016

pesho commented Jan 17, 2016

@mwielgus I think the openstack label here is a little misleading by itself. The issue most likely affects all cloud providers, as evidenced by the symptoms and the other similar issues already opened.

@davidopp davidopp added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed area/provider/openstack Issues or PRs related to openstack provider labels Jan 25, 2016
@davidopp davidopp self-assigned this Jan 25, 2016
@davidopp (Member) commented:

cc/ @gmarek

@gmarek gmarek assigned gmarek and unassigned davidopp Feb 4, 2016
@gmarek gmarek added this to the next-candidate milestone Feb 4, 2016

gmarek commented Feb 4, 2016

It may be time to rethink some parts of the NodeController's logic. When it was written, the only source of a NotReady node status was the NodeController itself (the Kubelet never wrote NotReady), but those times are long gone, and I think we might act slightly differently when NotReady was set by the NodeController (the machine is unresponsive) and when it was set by the Kubelet itself (some software problem: the cloud provider is unresponsive, the Docker daemon has issues, etc.).

The current NodeController logic was designed to handle the first case, for single-node failures. Rate limiting was introduced to protect us from destroying the whole cluster when the master machine has networking problems (@brendandburns is right that we should test this). The current limits were set ad hoc in a world of 100-node clusters; this certainly needs a fix, and I'll open an issue for it.

As for the main problem: I think we should distinguish single(-ish) node failures from whole-cluster ones. When all nodes are NotReady, the ControllerManager can assume it's either some kind of global problem (e.g. the cloud provider is unresponsive) or that the master machine is partitioned from the cluster. In both cases it shouldn't take any action, as it won't do any good. One open question is how to proceed when nodes start to reappear gradually. Maybe we should have a separate 'healing' state in the NodeController, or maybe it can simply be incorporated into the normal logic.
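
A minimal sketch of what I mean (plain Go, not the real NodeController code; the Node type, the 80% threshold, and evictPods are made up for illustration):

```go
package sketch

// Node is a stand-in for the real node object; only readiness matters here.
type Node struct {
	Name  string
	Ready bool
}

// maybeEvict skips evictions entirely when most nodes are NotReady at once,
// treating that as a global failure (cloud provider API down, master
// partitioned from the cluster) rather than many independent node failures.
func maybeEvict(nodes []Node, evictPods func(Node)) {
	notReady := 0
	for _, n := range nodes {
		if !n.Ready {
			notReady++
		}
	}
	// Threshold picked arbitrarily for the sketch: if more than 80% of the
	// nodes are NotReady, assume the problem is global and do nothing.
	if len(nodes) > 0 && float64(notReady) > 0.8*float64(len(nodes)) {
		return
	}
	for _, n := range nodes {
		if !n.Ready {
			evictPods(n)
		}
	}
}
```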

An orthogonal issue is whether we want to distinguish 'hard' failures (NotReady set by the NodeController) from 'soft' ones (NotReady set by the Kubelet). We may want to give the Kubelet additional time to try to fix itself (e.g. by restarting Docker or something), but I'm not sure it's worth the effort.

Action items:

  • provide better limits for the eviction rate limiter,
  • write a test for eviction rate limiting (it doesn't need to be an e2e test; a unit test should be enough, since it doesn't require any component other than the NodeController; see the sketch after this list),
  • add logic for handling the 'whole cluster down' situation to the NodeController and make sure that recovery is possible,
  • decide whether we want to act differently on Kubelet-reported node problems.
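
A rough sketch of the kind of unit test the second item means, with a plain token bucket (golang.org/x/time/rate) standing in for the real eviction rate limiter:

```go
package sketch

import (
	"testing"

	"golang.org/x/time/rate"
)

// TestEvictionRateLimit checks that no more evictions are allowed than the
// limiter's burst, without needing any component besides the limiter itself.
// The real NodeController would have its limiter injected the same way.
func TestEvictionRateLimit(t *testing.T) {
	// Burst of 2, refilling so slowly that the loop below sees no refill.
	limiter := rate.NewLimiter(rate.Limit(0.001), 2)

	evicted := 0
	for i := 0; i < 10; i++ {
		if limiter.Allow() {
			evicted++
		}
	}
	if evicted != 2 {
		t.Fatalf("expected exactly 2 evictions to be allowed, got %d", evicted)
	}
}
```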

It's not possible to do all of this for 1.2 (i.e. by tomorrow ;)

cc @davidopp @bgrant0607 @wojtek-t @fgrzadkowski @smarterclayton @quinton-hoole

@bgrant0607 (Member) commented:

The more data we have, the better decisions we can make.

It's useful to distinguish VM is unreachable vs. Kubelet not communicating.

It's useful to distinguish Kubelet can't run any containers vs. can't start new ones.

It's useful to know why: Docker dead/unresponsive, system OOM, out of disk, volumes that can't be mounted, container network unavailable, etc.

It's useful to determine which failures are correlated: all, by fault domain, with software rollout, particular dedicated pools, with a Container of Death, etc.

In response, we might want to make the node unschedulable, evict scheduled pods, reboot the node, replace the node, revert the rollout, etc.

Rate limiting is just a way to slow down to human speeds in the hope of human analysis or intervention. It's necessary in some cases, but we could do better in others.
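
For the first distinction, the node's Ready condition already carries most of the signal. A small illustrative sketch, using today's k8s.io/api types; the exact Reason strings vary by version and are not meant authoritatively:

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// classify illustrates the "VM unreachable vs. Kubelet not communicating"
// style of distinction: an Unknown Ready condition means the node controller
// stopped hearing from the Kubelet at all, while a False Ready condition was
// reported by the Kubelet itself, with Reason/Message saying why (Docker
// down, disk pressure, network unavailable, ...).
func classify(node *corev1.Node) string {
	for _, c := range node.Status.Conditions {
		if c.Type != corev1.NodeReady {
			continue
		}
		switch c.Status {
		case corev1.ConditionUnknown:
			return "node unreachable / Kubelet not reporting: " + c.Reason
		case corev1.ConditionFalse:
			return "Kubelet reported NotReady: " + c.Reason
		default:
			return "ready"
		}
	}
	return "no Ready condition reported yet"
}
```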


gmarek commented Jun 2, 2016

Fixed by #25571
