
Kubernetes doesn't handle autoscaled minions properly #2922

Closed

AdrienLemaire opened this issue Dec 14, 2014 · 11 comments
Assignees
Labels
kind/support Categorizes issue or PR as a support question. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.

Comments

@AdrienLemaire

Using Kubernetes v0.6.2 on an AWS CloudFormation stack:

core@ip-10-51-176-7 ~ $ kubecfg list minions
Minion identifier   Labels
----------          ----------
10.51.182.3
10.233.91.177

core@ip-10-51-176-7 ~ $ kubecfg list replicationControllers
Name                   Image(s)                      Selector            Replicas
----------             ----------                    ----------          ----------
postgresqlController   172.16.92.180:5000/database   name=postgresql     2

core@ip-10-51-176-7 ~ $ vim cfg/database-controller.yml  # Set 100 Replicas to trigger autoscaling policy
core@ip-10-51-176-7 ~ $ kubecfg -c cfg/database-controller.yml update replicationControllers/postgresqlController
I1214 00:12:37.981262    3925 restclient.go:133] Waiting for completion of operation 593
Name                   Image(s)                      Selector            Replicas
----------             ----------                    ----------          ----------
postgresqlController   172.16.92.180:5000/database   name=postgresql     100
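
The controller config being edited looked roughly like the following v1beta1 manifest (reconstructed here for illustration; only the id, image, selector, and replica count are exact, the container port and other details are placeholders):

    # Illustrative sketch of cfg/database-controller.yml (v1beta1 API).
    # Only id, image, selector, and replicas are taken from the output above;
    # the container port and remaining fields are placeholders.
    id: postgresqlController
    kind: ReplicationController
    apiVersion: v1beta1
    desiredState:
      replicas: 100
      replicaSelector:
        name: postgresql
      podTemplate:
        desiredState:
          manifest:
            version: v1beta1
            id: postgresql
            containers:
              - name: postgresql
                image: 172.16.92.180:5000/database
                ports:
                  - containerPort: 5432
        labels:
          name: postgresql
    labels:
      name: postgresql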

Once the hundred pods have been triggered, here is what happens:

  1. Kubernetes deploys all 100 pods on the 2 minions

  2. The minions become overloaded, their instance health checks fail, and kubecfg cannot list minions anymore:

      core@ip-10-51-176-7 ~ $ kubecfg list minions
      F1214 00:20:33.822859    5827 kubecfg.go:428] Got request error: 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
    
  3. Within 2 minutes, a CloudWatch alarm is triggered, which starts a ScaleUp policy and adds 2 more minions

  4. kubecfg still fails; I have to manually terminate the instance whose health check failed, and the autoscaling group respawns a new healthy instance

  5. kubecfg can list minions again, but times out when listing pods:

      core@ip-10-51-176-7 ~ $ kubecfg list minions
      Minion identifier   Labels
      ----------          ----------
      10.51.182.3
      10.144.223.221
      10.136.78.254
      10.178.161.113
    
      core@ip-10-51-176-7 ~ $ kubecfg list pods | grep 10.144.223.221
      ^C
    
  6. Trying to scale down the number of pods gives weird results:

      core@ip-10-51-176-7 ~ $ kubecfg list replicationControllers
      Name                   Image(s)                      Selector            Replicas
      ----------             ----------                    ----------          ----------
      postgresqlController   172.16.92.180:5000/database   name=postgresql     10
    
      core@ip-10-51-176-7 ~ $ fleetctl list-machines
      MACHINE         IP              METADATA
      198ae00d...     10.178.161.113  role=minion
      ca92355c...     10.144.223.221  role=minion
      d8ee87a6...     10.51.176.7     role=master
      e85d5976...     10.136.78.254   role=minion
      fc7f7c31...     10.51.182.3     role=minion
      core@ip-10-51-176-7 ~ $ for machine in 1 c e f; do echo "$machine -> $(fleetctl ssh $machine "
      docker ps | grep postgres | wc -l")"; done
      1 -> 0
      c -> 0
      e -> 0
      f -> 39
      core@ip-10-51-176-7 ~ $ kubecfg list pods 
      ^C
      core@ip-10-51-176-7 ~ $
    

Here is the list of problems I believe happened:

  • Kubernetes breaks if one of the registered minions becomes unavailable. I believe that when this happens, Kubernetes should smartly reassign the unavailable node's pods to other nodes, and a list could return:

      core@ip-10-51-176-7 ~ $ kubecfg list minions
      Minion identifier   Labels
      ----------          ----------
      10.51.182.3
      10.144.223.221
      10.136.78.254
       <unavailable>
    
  • Kubernetes doesn't have scheduling policies (e.g. do not add another pod to an instance with CPU > 90%). I'd really like to have something like this so it could wait for the new nodes to become available.

  • Kubernetes doesn't assign pods to nodes correctly. When resizing my controller from 100 down to 10, I expected each node to take on 2-3 pods, but here 39 (?) containers were listed on a single node (probably due to the previous bug, since I'm still unable to list pods from the master).

  • I had also tried a ScaleDown policy in my autoscaling group, and when a node was removed, kubernetes would break as well, instead of removing the minion and reassigning pods to other ones.

Are those features works in progress, or planned for the future? Do you have any suggestions on how to handle this stack in the meantime?

@jbeda
Contributor

jbeda commented Dec 15, 2014

Thanks for the in-depth bug report!

@lavalamp, do you have time to dig in to some of this?

One thing that I suspect is that you are running the master/etcd on the same machines that are running workloads. That will cause problems as things get overloaded, since right now there is no clear way to prioritize cluster management over the workload. I'd look at running etcd and the master components on a dedicated node to see if that helps. That is how we generally run things on GCE.
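
In a CoreOS/fleet setup like yours, one way to do that is to give the master its own fleet metadata in cloud-config and only target the Kubernetes worker units at role=minion machines (via MachineMetadata in the units' [X-Fleet] section). A rough sketch, assuming the role names from your fleetctl output; adjust to however your CloudFormation template writes cloud-config:

    #cloud-config
    # Sketch for the dedicated master node only; worker nodes would keep
    # role=minion so kubelet/proxy units are never scheduled here.
    coreos:
      fleet:
        metadata: role=master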

Quick read from me:

  • Kubernetes breaks if one of the registered minions becomes unavailable. I believe that when this happens, Kubernetes should smartly reassign the unavailable node's pods to other nodes, and a list could return:

This is a tricky one, to be honest. If that node comes back online, you now have a "ghost" with the same ID running. This creates a lot of issues internally for Google. The general plan here is to have a "controller" that is responsible for applying policy for spinning up a new replacement container for the one that died. Right now you can do that with a ReplicationController with 1 replica, but we have talked about a "SingletonPodController" for the case when you want one and only one pod running at a time. @bgrant0607, would you like to elaborate here?

  • Kubernetes doesn't have scheduling policies (e.g. do not add another pod to an instance with CPU > 90%). I'd really like to have something like this so it could wait for the new nodes to become available.

This is on our radar for v1. We want to take output from cAdvisor and schedule based on actual usage on a machine and not just "reserved usage". Having some threshold where we just won't schedule makes sense. @lavalamp?

  • Kubernetes doesn't assign pods to nodes correctly. When resizing my controller from 100 down to 10, I expected each node to take on 2-3 pods, but here 39 (?) containers were listed on a single node (probably due to the previous bug, since I'm still unable to list pods from the master).

We should get more detail here. There are a lot of conflicting requirements and no "right" answer for every situation. For instance, if you are specifying resource requirements (CPU/mem), this might not be a bad thing, especially if you are planning to shrink down the cluster. However, you may want to ensure that you have a good spread of pods across the nodes, and that would be a scheduler constraint (don't run more than 2 pods with these labels on any machine). Right now, when the ReplicationController sizes down, it kills at random; having this be smarter (based on age? machine load? explicit?) is going to be a feature that I think we'll need. @bgrant0607?
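
To illustrate the resource-requirements point: in the v1beta1 manifest you can put cpu/memory directly on each container, which gives the scheduler something to pack against instead of piling pods onto one node. A sketch only, with made-up numbers (cpu is in millicores, memory in bytes; double-check the field names for your API version):

    # Hypothetical pod-template fragment with per-container reservations.
    containers:
      - name: postgresql
        image: 172.16.92.180:5000/database
        cpu: 1000          # ~1 core reserved per pod
        memory: 536870912  # 512 MiB reserved per pod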

  • I had also tried a ScaleDown policy in my autoscaling group, and when a node was removed, kubernetes would break as well, instead of removing the minion and reassigning pods to other ones.

Ideally the minion should go into an "unknown" state as kubernetes doesn't know what happened and doesn't know if the minion is coming back. If we had an AWS cloud provider it could remove the minion and clean things up as it could tell, via the AWS API, that the minion is gone for good. That being said, the rescheduling of pods to new minions is currently done with a controller as specified above.

@Fandekasp, let me know if I'm missing some of the issues here.

@vishh
Contributor

vishh commented Dec 15, 2014

@Fandekasp: @rjnagal, @vmarmol and I have been doing some work around automatically scaling up the number of nodes in a kubernetes cluster. We will post our work soon.

@AdrienLemaire
Author

@jbeda thanks for the detailed answer, I don't think you missed any issues.

@vishh looking forward to reading your post!

@bgrant0607
Member

Listing minions not working:

  1. We shouldn't be fetching this state on demand. Same problem as #2951 (kubectl/apiserver problems if minion down).
  2. We should ensure Kubelet and Docker are protected from excessive resource usage by user containers. See #2852 (Configuring the OOM Killer for processes managed by kubelet).

Dealing with minions that are MIA:

  1. The plan is to have the minion/node controller "babysit" the minions and decide when to kill pods. See #1366 (Improve eviction of pods from unhealthy and vanished nodes).
  2. The "phantom pod" issue that @jbeda mentioned is real. (a) We should do our best to ensure Kubelet availability, but (b) applications that need application-level locking or storage-level atomicity need to use services that provide that functionality rather than depend on Kubernetes. Kubernetes favors availability over mutual exclusion.

Rather than "SingletonPodController" (aka "forever pod"), what we've discussed in more detail are:

  1. Nominal services (#260, later renamed PetSet): dynamically assign names/addresses to pods.
  2. Pod migration, once pods, pod IP addresses, and data volumes are migratable, by building upon existing "pod instances". I do want to be careful about encouraging people to treat pods as pets, however.

Scheduling:

  1. If you specified cpu and memory for containers, not all pods would land on the same node, and it would constrain usage and perhaps protect Kubelet somewhat.
  2. Using cAdvisor-based information is tracked by #274 (scheduler should leverage resource usage for pod placement).

Auto-scaling of cluster:
Some other users/partners want this. Definitely adding/removing minions should work.

Auto-scaling of replication controllers:
Discussion in progress on #2863.

Pluggable policy for killing pods when scaling down:
My proposal is to generate events that applications can respond to: #2804. The built-in policy should also be smarter, such as taking readiness (#620) into account.

@bgrant0607 bgrant0607 added the kind/support and priority/backlog labels Dec 16, 2014
@bgrant0607
Member

/cc @satnam6502 re. performance
/cc @dchen1107 re. Kubelet unresponsiveness

@lavalamp
Member

Sorry, I just noticed this. #2951 seems like a good thing for me to fix first.

F1214 00:20:33.822859 5827 kubecfg.go:428] Got request error: 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]

That's a problem reaching etcd, not a problem reaching minions, which is interesting. There may be more than one failure here.

This is on our radar for v1. We want to take output from cAdvisor and schedule based on actual usage on a machine and not just "reserved usage". Having some threshold where we just won't schedule makes sense. @lavalamp?

Our current logic should definitely be more advanced, but I suspect there's something broken if we ended up putting 39 pods on the same node. I think scheduler thresholds may be a bit iffy when we have multiple QoS tiers, but for now we can assume that there's one tier and we shouldn't oversubscribe, which would effectively behave like a threshold.

@bgrant0607
Member

Re. the etcd problem: @satnam6502 also hit that problem, and I see from past etcd issues that others have encountered it, too: too many open files (etcd-io/etcd#1959). Look for something like the following in the etcd log:

2014/12/18 03:36:43 http: Accept error: accept tcp [::]:4001: too many open files; retrying in 40ms
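
If that's what you're hitting, the usual mitigation is to raise etcd's file descriptor limit. A sketch as a cloud-config drop-in, assuming etcd runs as the etcd.service systemd unit and your coreos-cloudinit version supports unit drop-ins (the exact limit value is arbitrary):

    #cloud-config
    coreos:
      units:
        - name: etcd.service
          drop-ins:
            - name: 30-raise-nofile.conf
              content: |
                [Service]
                # Allow etcd to keep accepting connections under load.
                LimitNOFILE=40000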

@bgrant0607
Member

Regarding protecting the Kubelet from resource starvation, that's being discussed in #2852.

@bgrant0607
Member

Also note that the default scheduling policy was recently changed to spread pods of the same replication controller, in addition to ensuring that requested cpu and memory are respected.

@satnam6502
Contributor

I've not managed to do it yet, but I am hoping to try out an alpha of etcd 0.5 which may help to fix the too many open files problem. I don't think we can take a dependency on the alpha yet because it still seems in flux.

@goltermann goltermann added the priority/support label and removed the priority/backlog label Jan 7, 2015
@lavalamp lavalamp removed their assignment Jan 15, 2015
@roberthbailey roberthbailey added the sig/cluster-lifecycle label Feb 18, 2015
@bgrant0607 bgrant0607 self-assigned this Feb 28, 2015
@bgrant0607
Member

This is subsumed by other issues.
