Cloud provider API downtime still leads to NodeNotReady #19124
Comments
Indeed, I already asked for the AWS ticket to be reopened, since the current fix makes the entire cluster depend on the availability of a single API. It's extremely dangerous in production.
@mwielgus I think the
cc/ @gmarek
It may be time to rethink some parts of the NodeController's logic. When it was written, the only source of a NotReady Node status was the NodeController itself (the Kubelet never wrote NotReady), but those times are long gone, and I think we might act slightly differently when NotReady was set by the NodeController (the machine is unresponsive) than when it was set by the Kubelet itself (some software problem: the cloud provider is not responsive, the Docker daemon has problems, etc.). The current logic of the NodeController was designed to handle the first case, for single-node failures. Rate limiting was introduced as a way to protect us from destroying the whole cluster in case of networking problems on the master machine (@brendandburns is right that we should test this). The current limits were set ad hoc in a world of 100-node clusters; this certainly needs a fix and I'll open an issue for it.

As for the main problem: I think we should distinguish single(-ish) Node failures from whole-cluster ones. When all Nodes are NotReady, the ControllerManager can assume it's some kind of global problem (e.g. the cloud provider is not responsive) or that the master machine is separated from the cluster. In both cases it shouldn't take any action, as that won't do any good. One open question is how to proceed when Nodes start to reappear gradually. Maybe we need a separate 'healing' state in the NodeController, or maybe it can simply be incorporated into the normal logic.

An orthogonal issue is whether we want to distinguish 'hard' failures (NotReady set by the NodeController) from 'soft' ones (NotReady set by the Kubelet). We may want to give the Kubelet additional time to try to fix itself (e.g. by restarting Docker or something), but I'm not sure it's worth the effort.

Action items:

It's not possible to do all of this for 1.2 (i.e. until tomorrow ;)). cc @davidopp @bgrant0607 @wojtek-t @fgrzadkowski @smarterclayton @quinton-hoole
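To make the whole-cluster check described above concrete, here is a minimal sketch in Go. It is not NodeController code: `node`, `evictionAllowed`, and the 0.5 threshold are illustrative names and values, not real Kubernetes types or flags. The idea is simply that when (nearly) every node looks NotReady, the controller assumes a global problem and holds off on evictions.

```go
package main

import "fmt"

// node is a minimal stand-in for a cluster node; the real controller works
// on full Node objects, but readiness is all this sketch needs.
type node struct {
	name  string
	ready bool
}

// evictionAllowed returns false when the fraction of NotReady nodes exceeds
// the threshold, on the assumption that such a failure is global (cloud
// provider outage, master-side network partition) and evicting pods would
// only make things worse. The threshold is an arbitrary illustrative value.
func evictionAllowed(nodes []node, maxNotReadyFraction float64) bool {
	if len(nodes) == 0 {
		return false
	}
	notReady := 0
	for _, n := range nodes {
		if !n.ready {
			notReady++
		}
	}
	return float64(notReady)/float64(len(nodes)) <= maxNotReadyFraction
}

func main() {
	cluster := []node{
		{name: "node-1", ready: false},
		{name: "node-2", ready: false},
		{name: "node-3", ready: true},
	}
	// Two of three nodes NotReady: above a 0.5 threshold, so hold off.
	fmt.Println("eviction allowed:", evictionAllowed(cluster, 0.5))
}
```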
The more data we have, the better decisions we can make. It's useful to distinguish VM is unreachable vs. Kubelet not communicating. It's useful to distinguish Kubelet can't run any containers vs. can't start new ones. It's useful to know why: Docker dead/unresponsive, sys OOM, OOD, volumes can't be mounted, container network unavailable, etc. It's useful to determine which failures are correlated: all, by fault domain, with software rollout, particular dedicated pools, with a Container of Death, etc. In response, we might want to make the node unschedulable, evict scheduled pods, reboot the node, replace the node, revert the rollout, etc. Rate limiting is just a way to slow down to human speeds in the hope of human analysis or intervention. It's necessary in some cases, but we could do better in others.
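As a rough illustration of the distinctions above, here is a hedged Go sketch mapping failure causes to responses. Every name in it (`notReadyCause`, `suggestAction`, the action strings) is hypothetical and not part of Kubernetes; the point is only that the chosen action depends both on the cause and on whether the failure is correlated across nodes.

```go
package main

import "fmt"

// notReadyCause is a hypothetical classification of why a node is NotReady,
// mirroring the distinctions listed in the comment above.
type notReadyCause int

const (
	vmUnreachable     notReadyCause = iota // the machine itself is unreachable
	kubeletSilent                          // Kubelet stopped posting status
	runtimeDown                            // Docker dead or unresponsive
	cloudProviderDown                      // status update blocked on the cloud API
)

// suggestAction is an illustrative mapping from cause to response. A real
// controller would also weigh fault domains, rollouts, and pod priorities.
func suggestAction(cause notReadyCause, correlatedAcrossNodes bool) string {
	if correlatedAcrossNodes {
		// Many nodes failing the same way points at infrastructure or a
		// rollout, not individual machines: slow down and involve a human.
		return "rate-limit, alert, take no per-node action"
	}
	switch cause {
	case vmUnreachable, kubeletSilent:
		return "evict pods and replace the node"
	case runtimeDown:
		return "mark unschedulable and try restarting the runtime"
	case cloudProviderDown:
		return "keep pods and retry status updates"
	default:
		return "mark unschedulable"
	}
}

func main() {
	fmt.Println(suggestAction(cloudProviderDown, true))
	fmt.Println(suggestAction(vmUnreachable, false))
}
```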
Fixed by #25571
I'm experiencing the same issue as #16593 but with OpenStack, using Kubernetes 1.1.3. Since that issue is closed and was filed as AWS-specific, I'm opening a new one. #13417 was supposed to fix it and is included in release 1.1.3 (despite having been reverted in master), but it doesn't seem to work properly (as @antoineco also reported in #16593). Also related to issues #13398 and #13412.
In short, cloud provider API downtime (seemingly) leads to minions failing to update their status, being marked as NodeNotReady, and having all of their pods evicted. This can happen to every minion in the cluster at about the same time and can lead to semi-disastrous situations, such as a pod being unable to start on a new node after recovery because its persistent volume couldn't be detached from the previous node while the API was down. A rough sketch of the timing involved is included below the logs.
Relevant logs:
(I'm not sure why the controller-manager changes the node status 19 seconds before the kubelet itself logs an error, but the two always happen at about the same time. The machines' clocks are properly synchronized, and the controller-manager has no cloud-provider config specified.)
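For context on the timing, here is a rough sketch (not actual Kubernetes code) of how the relevant thresholds interact. The constants are the defaults as far as I can tell (node-status-update-frequency 10s, node-monitor-grace-period 40s, pod-eviction-timeout 5m); treat them as assumptions, since deployments override them. The takeaway is that when the cloud provider API blocks status updates on every node, all heartbeats go stale together, so every node crosses these thresholds at roughly the same time.

```go
package main

import (
	"fmt"
	"time"
)

// Assumed defaults for the relevant flags; actual values may differ.
const (
	nodeStatusUpdateFrequency = 10 * time.Second // how often the kubelet posts status
	nodeMonitorGracePeriod    = 40 * time.Second // missed updates tolerated this long
	podEvictionTimeout        = 5 * time.Minute  // NotReady this long => pods evicted
)

// nodePhase describes what the node controller would roughly conclude about
// a node whose last successful status update was sinceLastHeartbeat ago.
func nodePhase(sinceLastHeartbeat time.Duration) string {
	switch {
	case sinceLastHeartbeat <= nodeMonitorGracePeriod:
		return "Ready"
	case sinceLastHeartbeat <= nodeMonitorGracePeriod+podEvictionTimeout:
		return "NotReady, pods still assigned"
	default:
		return "NotReady, pods being evicted"
	}
}

func main() {
	// If the cloud provider API is down and every kubelet's status update
	// blocks on it, every node's heartbeat goes stale at the same time.
	for _, elapsed := range []time.Duration{30 * time.Second, 2 * time.Minute, 10 * time.Minute} {
		fmt.Printf("%v since last heartbeat: %s\n", elapsed, nodePhase(elapsed))
	}
}
```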