
Kubernetes doesn't handle autoscaled minions properly #2922

Closed

AdrienLemaire opened this issue Dec 14, 2014 · 11 comments
Assignees
Labels
kind/support Categorizes issue or PR as a support question. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.

Comments

@AdrienLemaire

Using Kubernetes v0.6.2 on an AWS CloudFormation stack:

core@ip-10-51-176-7 ~ $ kubecfg list minions
Minion identifier   Labels
----------          ----------
10.51.182.3
10.233.91.177

core@ip-10-51-176-7 ~ $ kubecfg list replicationControllers
Name                   Image(s)                      Selector            Replicas
----------             ----------                    ----------          ----------
postgresqlController   172.16.92.180:5000/database   name=postgresql     2

core@ip-10-51-176-7 ~ $ vim cfg/database-controller.yml  # Set 100 Replicas to trigger autoscaling policy
core@ip-10-51-176-7 ~ $ kubecfg -c cfg/database-controller.yml update replicationControllers/postgresqlController
I1214 00:12:37.981262    3925 restclient.go:133] Waiting for completion of operation 593
Name                   Image(s)                      Selector            Replicas
----------             ----------                    ----------          ----------
postgresqlController   172.16.92.180:5000/database   name=postgresql     100
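
The controller config being edited looked roughly like the following v1beta1 manifest (reconstructed here for illustration; only the id, image, selector, and replica count are exact, the container port and other details are placeholders):

    # Illustrative sketch of cfg/database-controller.yml (v1beta1 API).
    # Only id, image, selector, and replicas are taken from the output above;
    # the container port and remaining fields are placeholders.
    id: postgresqlController
    kind: ReplicationController
    apiVersion: v1beta1
    desiredState:
      replicas: 100
      replicaSelector:
        name: postgresql
      podTemplate:
        desiredState:
          manifest:
            version: v1beta1
            id: postgresql
            containers:
              - name: postgresql
                image: 172.16.92.180:5000/database
                ports:
                  - containerPort: 5432
        labels:
          name: postgresql
    labels:
      name: postgresql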

Once the hundred pods have been triggered, here is what happens:

  1. Kubernetes deploys all 100 pods on the 2 minions

  2. The minions become overloaded, their instance health checks fail, and kubecfg cannot list minions anymore:

      core@ip-10-51-176-7 ~ $ kubecfg list minions
      F1214 00:20:33.822859    5827 kubecfg.go:428] Got request error: 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
    
  3. Within 2 minutes, a CloudWatch alarm is triggered, which starts a ScaleUp policy and adds 2 more minions

  4. kubecfg still fails; I have to manually terminate the instance whose health check failed, and the autoscaling group respawns a new healthy instance

  5. kubecfg can list minions again, but times out when listing pods:

      core@ip-10-51-176-7 ~ $ kubecfg list minions
      Minion identifier   Labels
      ----------          ----------
      10.51.182.3
      10.144.223.221
      10.136.78.254
      10.178.161.113
    
      core@ip-10-51-176-7 ~ $ kubecfg list pods | grep 10.144.223.221
      ^C
    
  6. Trying to scale down the number of pods gives weird results:

      core@ip-10-51-176-7 ~ $ kubecfg list replicationControllers
      Name                   Image(s)                      Selector            Replicas
      ----------             ----------                    ----------          ----------
      postgresqlController   172.16.92.180:5000/database   name=postgresql     10
    
      core@ip-10-51-176-7 ~ $ fleetctl list-machines
      MACHINE         IP              METADATA
      198ae00d...     10.178.161.113  role=minion
      ca92355c...     10.144.223.221  role=minion
      d8ee87a6...     10.51.176.7     role=master
      e85d5976...     10.136.78.254   role=minion
      fc7f7c31...     10.51.182.3     role=minion
      core@ip-10-51-176-7 ~ $ for machine in 1 c e f; do echo "$machine -> $(fleetctl ssh $machine "
      docker ps | grep postgres | wc -l")"; done
      1 -> 0
      c -> 0
      e -> 0
      f -> 39
      core@ip-10-51-176-7 ~ $ kubecfg list pods 
      ^C
      core@ip-10-51-176-7 ~ $
    

Here is the list of problems I believe happened:

  • Kubernetes breaks if one of the registered minions becomes unavailable. I believe that when this happens, Kubernetes should smartly reassign the unavailable node's pods to other nodes, and a list could return:

      core@ip-10-51-176-7 ~ $ kubecfg list minions
      Minion identifier   Labels
      ----------          ----------
      10.51.182.3
      10.144.223.221
      10.136.78.254
       <unavailable>
    
  • Kubernetes doesn't have scheduling policies (e.g. do not add another pod to an instance with CPU > 90%). I'd really like to have something like this so it could wait for the new nodes to become available.

  • Kubernetes doesn't assign pods to nodes correctly. When resizing my controller from 100 down to 10, I expected each node to take on 2-3 pods, but here 39 (?) containers were listed on a single node (probably due to the previous bug, since I'm still unable to list pods from the master).

  • I had also tried a ScaleDown policy in my autoscaling group, and when a node was removed, kubernetes would break as well, instead of removing the minion and reassigning pods to other ones.

Are those features works in progress, or planned for the future? Do you have any suggestions on how to handle this stack in the meantime?

@jbeda
Contributor

jbeda commented Dec 15, 2014

Thanks for the in-depth bug report!

@lavalamp, do you have time to dig in to some of this?

One thing that I suspect is that you are running the master/etcd on the same machines that are running workloads. That will cause problems as things get overloaded, since right now there is no clear way to prioritize cluster management over the workload. I'd look at running etcd and the master components on a dedicated node to see if that helps. That is how we generally run things on GCE.
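
In a CoreOS/fleet setup like yours, one way to do that is to give the master its own fleet metadata in cloud-config and only target the Kubernetes worker units at role=minion machines (via MachineMetadata in the units' [X-Fleet] section). A rough sketch, assuming the role names from your fleetctl output; adjust to however your CloudFormation template writes cloud-config:

    #cloud-config
    # Sketch for the dedicated master node only; worker nodes would keep
    # role=minion so kubelet/proxy units are never scheduled here.
    coreos:
      fleet:
        metadata: role=master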

Quick read from me:

  • Kubernetes breaks if one of the registered minions becomes unavailable. I believe that when this happens, Kubernetes should smartly reassign the unavailable node's pods to other nodes, and a list could return:

This is a tricky one, to be honest. If that node comes back online, you now have a "ghost" with the same ID running. This creates a lot of issues internally for Google. The general plan here is to have a "controller" that is responsible for applying policy for spinning up a new replacement container for the one that died. Right now you can do that with a ReplicationController with 1 replica, but we have talked about a "SingletonPodController" for the case when you want one and only one pod running at a time. @bgrant0607, would you like to elaborate here?

  • Kubernetes doesn't have scheduling policies (e.g. do not add another pod to an instance with CPU > 90%). I'd really like to have something like this so it could wait for the new nodes to become available.

This is on our radar for v1. We want to take output from cAdvisor and schedule based on actual usage on a machine and not just "reserved usage". Having some threshold where we just won't schedule makes sense. @lavalamp?

  • Kubernetes doesn't assign pods to nodes correctly. When resizing my controller from 100 down to 10, I expected each node to take on 2-3 pods, but here 39 (?) containers were listed on a single node (probably due to the previous bug, since I'm still unable to list pods from the master).

We should get more detail here. There are a lot of conflicting requirements and no "right" answer for every situation. For instance, if you are specifying resource requirements (CPU/mem), this might not be a bad thing, especially if you are planning to shrink down the cluster. However, you may want to ensure that you have a good spread of pods across the nodes, and that would be a scheduler constraint (don't run more than 2 pods with these labels on any machine). Right now, when the ReplicationController sizes down, it kills at random; having this be smarter (based on age? machine load? explicit?) is going to be a feature that I think we'll need. @bgrant0607?
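
To illustrate the resource-requirements point: in the v1beta1 manifest you can put cpu/memory directly on each container, which gives the scheduler something to pack against instead of piling pods onto one node. A sketch only, with made-up numbers (cpu is in millicores, memory in bytes; double-check the field names for your API version):

    # Hypothetical pod-template fragment with per-container reservations.
    containers:
      - name: postgresql
        image: 172.16.92.180:5000/database
        cpu: 1000          # ~1 core reserved per pod
        memory: 536870912  # 512 MiB reserved per pod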

  • I had also tried a ScaleDown policy in my autoscaling group, and when a node was removed, kubernetes would break as well, instead of removing the minion and reassigning pods to other ones.

Ideally the minion should go into an "unknown" state as kubernetes doesn't know what happened and doesn't know if the minion is coming back. If we had an AWS cloud provider it could remove the minion and clean things up as it could tell, via the AWS API, that the minion is gone for good. That being said, the rescheduling of pods to new minions is currently done with a controller as specified above.

@Fandekasp, let me know if I'm missing some of the issues here.

@vishh
Contributor

vishh commented Dec 15, 2014

@Fandekasp: @rjnagal, @vmarmol and I have been doing some work around automatically scaling up the number of nodes in a kubernetes cluster. We will post our work soon.

@AdrienLemaire
Author

@jbeda thanks for the detailed answer, I don't think you missed any issues.

@vishh looking forward to reading your post!

@bgrant0607
Member

Listing minions not working:

  1. We shouldn't be fetching this state on demand. Same problem as #2951 (kubectl/apiserver problems if minion down).
  2. We should ensure Kubelet and Docker are protected from excessive resource usage by user containers. See #2852 (Configuring the OOM Killer for processes managed by kubelet).

Dealing with minions that are MIA:

  1. The plan is to have the minion/node controller "babysit" the minions and decide when to kill pods. See #1366 (Improve eviction of pods from unhealthy and vanished nodes).
  2. The "phantom pod" issue that @jbeda mentioned is real. (a) We should do our best to ensure Kubelet availability, but (b) applications that need application-level locking or storage-level atomicity need to use services that provide that functionality rather than depend on Kubernetes. Kubernetes favors availability over mutual exclusion.

Rather than "SingletonPodController" (aka "forever pod"), what we've discussed in more detail are:

  1. Nominal services (#260, later renamed PetSet): dynamically assign names/addresses to pods.
  2. Pod migration, once pods, pod IP addresses, and data volumes are migratable, by building upon existing "pod instances". I do want to be careful about encouraging people to treat pods as pets, however.

Scheduling:

  1. If you specified cpu and memory for containers, not all pods would land on the same node, and it would constrain usage and perhaps protect Kubelet somewhat.
  2. Using cAdvisor-based information is tracked by #274 (scheduler should leverage resource usage for pod placement).

Auto-scaling of cluster:
Some other users/partners want this. Definitely adding/removing minions should work.

Auto-scaling of replication controllers:
Discussion in progress on #2863.

Pluggable policy for killing pods when scaling down:
My proposal is to generate events that applications can respond to: #2804. The built-in policy should also be smarter, such as taking readiness (#620) into account.

@bgrant0607 bgrant0607 added the kind/support and priority/backlog labels Dec 16, 2014
@bgrant0607
Member

/cc @satnam6502 re. performance
/cc @dchen1107 re. Kubelet unresponsiveness

@lavalamp
Member

Sorry, I just noticed this. #2951 seems like a good thing for me to fix first.

F1214 00:20:33.822859 5827 kubecfg.go:428] Got request error: 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]

That's a problem reaching etcd, not a problem reaching minions, which is interesting. There may be more than one failure here.

This is on our radar for v1. We want to take output from cAdvisor and schedule based on actual usage on a machine and not just "reserved usage". Having some threshold where we just won't schedule makes sense. @lavalamp?

Our current logic should definitely be more advanced, but I suspect there's something broken if we ended up putting 39 pods on the same node. I think scheduler thresholds may be a bit iffy when we have multiple QoS tiers, but for now we can assume that there's one tier and we shouldn't oversubscribe, which would effectively behave like a threshold.

@bgrant0607
Member

Re. the etcd problem: @satnam6502 also hit that problem, and I see from past etcd issues that others have encountered it, too: too many open files (etcd-io/etcd#1959). Look for something like the following in the etcd log:

2014/12/18 03:36:43 http: Accept error: accept tcp [::]:4001: too many open files; retrying in 40ms
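
If that's what you're hitting, the usual mitigation is to raise etcd's file descriptor limit. A sketch as a cloud-config drop-in, assuming etcd runs as the etcd.service systemd unit and your coreos-cloudinit version supports unit drop-ins (the exact limit value is arbitrary):

    #cloud-config
    coreos:
      units:
        - name: etcd.service
          drop-ins:
            - name: 30-raise-nofile.conf
              content: |
                [Service]
                # Allow etcd to keep accepting connections under load.
                LimitNOFILE=40000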

@bgrant0607
Member

Regarding protecting the Kubelet from resource starvation, that's being discussed in #2852.

@bgrant0607
Member

Also note that the default scheduling policy was recently changed to spread pods of the same replication controller, in addition to ensuring that requested cpu and memory are respected.

@satnam6502
Contributor

I've not managed to do it yet, but I am hoping to try out an alpha of etcd 0.5 which may help to fix the too many open files problem. I don't think we can take a dependency on the alpha yet because it still seems in flux.

@goltermann goltermann added the priority/support label and removed the priority/backlog label Jan 7, 2015
@lavalamp lavalamp removed their assignment Jan 15, 2015
@roberthbailey roberthbailey added the sig/cluster-lifecycle label Feb 18, 2015
@bgrant0607 bgrant0607 self-assigned this Feb 28, 2015
@bgrant0607
Member

This is subsumed by other issues.
