Kubernetes doesn't handle autoscaled minions properly #2922
Thanks for the in-depth bug report! @lavalamp, do you have time to dig in to some of this? One thing that I suspect is that you are running the master/etcd on the same machines that are running workloads. That will cause problems as things get overloaded, since right now there is no clear way to prioritize the cluster management over the workload. I'd look at running etcd and the master components on a dedicated node to see if that helps. That is how we generally run stuff in GCE. Quick read from me:
This is a tricky one, to be honest. If that node comes back online, you now have a "ghost" with the same ID running. This creates a lot of issues internally for Google. The general plan here is to have a "controller" that is responsible for applying policy for spinning up a new replacement container for the one that died. Right now you can do that with a ReplicationController with 1 replica, but we have talked about a "SingletonPodController" for the case when you want one and only one pod running at a time. @bgrant0607, would you like to elaborate here?
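For reference, a minimal sketch of what a single-replica ReplicationController could look like in the v1beta1 JSON that kubecfg accepts (the id, labels, and image below are placeholders for illustration, not anything from this issue):

```json
{
  "id": "my-singleton",
  "kind": "ReplicationController",
  "apiVersion": "v1beta1",
  "desiredState": {
    "replicas": 1,
    "replicaSelector": {"name": "my-singleton"},
    "podTemplate": {
      "desiredState": {
        "manifest": {
          "version": "v1beta1",
          "id": "my-singleton",
          "containers": [
            {"name": "app", "image": "example/app"}
          ]
        }
      },
      "labels": {"name": "my-singleton"}
    }
  },
  "labels": {"name": "my-singleton"}
}
```

Submitted with something like `kubecfg -c singleton.json create replicationControllers`, the controller will recreate the pod (possibly on another minion) if the single copy dies, which is the closest thing available today to the "forever pod" behavior discussed above.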
This is on our radar for v1. We want to take output from cAdvisor and schedule based on actual usage on a machine and not just "reserved usage". Having some threshold where we just won't schedule makes sense. @lavalamp?
We should get more detail here. There are a lot of conflicting requirements and no "right" answer for every situation. For instance, if you are specifying resource requirements (CPU/mem) this might not be a bad thing, especially if you are planning to shrink down the cluster. However, you may want to ensure that you have a good spread of pods across the nodes of the cluster, and that would be a scheduler constraint (don't run more than 2 pods with these labels on any machine). Right now when the ReplicationController sizes down, it kills at random; having this be smarter (based on age? machine load? explicit?) is going to be a feature that I think we'll need. @bgrant0607?
Ideally the minion should go into an "unknown" state, since Kubernetes doesn't know what happened and doesn't know if the minion is coming back. If we had an AWS cloud provider it could remove the minion and clean things up, since it could tell, via the AWS API, that the minion is gone for good. That being said, the rescheduling of pods to new minions is currently done with a controller as specified above. @Fandekasp, let me know if I'm missing some of the issues here.
Listing minions not working:
Dealing with minions that are MIA:
Rather than "SingletonPodController" (aka "forever pod"), what we've discussed in more detail are:
Scheduling:
Auto-scaling of cluster:
Auto-scaling of replication controllers:
Pluggable policy for killing pods when scaling down:
/cc @satnam6502 re. performance
Sorry, I just noticed this. #2951 seems like a good thing for me to fix first.
That's a problem reaching etcd, not a problem reaching minions, which is interesting. There may be more than one failure here.
Our current logic should definitely be more advanced, but I suspect there's something broken if we ended up putting 39 pods on the same node. I think scheduler thresholds may be a bit iffy when we have multiple QoS tiers, but for now we can assume that there's one tier and we shouldn't oversubscribe, which would effectively behave like a threshold.
Re. the etcd problem: @satnam6502 also hit that problem, and I see from past etcd issues that others have encountered it, too: too many open files (etcd-io/etcd#1959). Look for "too many open files" errors in the etcd log.
Regarding protecting the Kubelet from resource starvation, that's being discussed in #2852.
Also note that the default scheduling policy was recently changed to spread pods of the same replication controller, in addition to ensuring that requested cpu and memory are respected.
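For anyone following along, a rough sketch of what declaring those requests looks like in a v1beta1 pod manifest as of this release (the id, container name, image, and values are made-up placeholders; as I understand the current API, cpu is expressed in milli-cores and memory in bytes, though field names may shift in later API versions):

```json
{
  "version": "v1beta1",
  "id": "my-pod",
  "containers": [
    {
      "name": "app",
      "image": "example/app",
      "cpu": 250,
      "memory": 268435456
    }
  ]
}
```

With cpu/memory set on each container, the scheduler can account for what a minion has already committed to, rather than packing everything onto the first nodes it sees.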
I've not managed to do it yet, but I am hoping to try out an alpha of etcd 0.5, which may help to fix the too-many-open-files problem. I don't think we can take a dependency on the alpha yet because it still seems in flux.
This is subsumed by other issues.
Using Kubernetes v0.6.2 on an AWS CloudFormation stack with 2 minions in an autoscaling group:
Once a hundred pods have been triggered, here is what happens:
Kubernetes deploys all 100 pods on the 2 minions
The minions get overloaded, the instance status check fails, and kubecfg cannot list minions anymore:
Within 2 minutes, a CloudWatch alarm is triggered, which starts a ScaleUp policy and adds 2 more minions
kubecfg still fails; I have to manually terminate the instance whose check fails, and the autoscaling group respawns a new healthy instance
kubecfg can list minions, but times out when listing pods:
Trying to scale down the number of pods gives weird results:
Here is the list of problems I believe occurred:
Kubernetes breaks if one of the registered minions becomes unavailable. I believe that when this happens, Kubernetes should smartly reassign the unavailable node's pods to other nodes, and a list could return:
Kubernetes doesn't have scheduling policies (e.g. do not add another pod to an instance with CPU > 90%). I'd really like to have something like this, so that it could wait for the new nodes to be available.
Kubernetes doesn't assign pods to nodes correctly. When resizing my controller from 100 to 10, I would expect each node to take 2~3 pods, but here 39 (?) containers were listed on a single node (probably due to the previous bug, since I'm still unable to list pods from the master).
I had also tried a ScaleDown policy in my autoscaling group, and when a node was removed, Kubernetes would break as well, instead of removing the minion and reassigning its pods to the other ones.
Are these features work in progress, or planned for the future? Do you have any suggestions on how to handle this stack in the meantime?