AWS: Master stops responding on t2.micros #18975
Comments
I thought I had lost the error message from
|
This happened again on the same cluster. I was unable to SSH in, but this time I just left it alone (did not restart it). Some hours later it started responding to SSH again, and uptime indicated that it had not rebooted.
Looking at the logs, it might be that 1GB is simply not enough for the master, or that we should configure some swap space, although I believe we deliberately don't do that because the minute we start swapping we are likely to go unresponsive anyway.
Also, although kube-apiserver is technically running, the machine is very slow and kube-apiserver is not responsive in practice. The TLS handshakes still time out. There are long delays logged reading from etcd:
Unusually (at least in my experience), top shows a lot of time in the 'wa' (iowait) state:
dmesg output relating to kube-apiserver kill:
|
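A rough triage sketch for anyone landing on a wedged master, assuming an SSH session eventually comes back (as it did above); these are only standard Linux commands to confirm whether the stall is memory pressure plus iowait rather than something else:

```
# Check memory headroom and whether anything is swapping (there is no swap here by default)
free -m
# The 'wa' column is iowait; 'si'/'so' show swap-in/swap-out
vmstat 1 5
# Look for evidence that the kernel OOM killer took out kube-apiserver or friends
dmesg | grep -iE 'out of memory|oom-killer|killed process'
# Top memory consumers (kube-apiserver, etcd, kubelet, ...)
ps aux --sort=-%mem | head -n 10
```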
Yeah. We have the same problem: it is very obviously running out of memory and becomes unresponsive since it cannot swap. I've been looking for some information on what the expected memory usage is supposed to be, but haven't found any. Does the team have any thoughts on whether this is simply too small a machine to run it on? |
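For completeness, a standard Linux swap-file stopgap looks roughly like the sketch below; note the earlier comment that swap is deliberately left off because a swapping master may be effectively unresponsive anyway, and the 1G size is only illustrative:

```
sudo fallocate -l 1G /swapfile        # or: sudo dd if=/dev/zero of=/swapfile bs=1M count=1024
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Persist across reboots
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
```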
t2.micro might be too limited for Kubernetes. When running Kubernetes on AWS for a 24h test period, the master node was constantly hitting the CPU credit limit without heavy load on the cluster, and after a while it stopped responding. |
Currently we run a test cluster with etcd running on t2.nano instances and Kubernetes running on t2.small, and it seems to only just work. When Kubernetes is doing a lot of work it gets fairly close to the memory limit on both node types, but so far it hasn't broken.
Have also seen this issue on a fresh run with the default configuration. I could eventually get back onto the master, but docker was unresponsive and I had to reboot to recover. There were no swap / disk-space issues, but iowait was >90%. CloudWatch accounting suggests the master is running down its CPUCredits, so it will probably underperform, but there is one thing that might be amplifying the problem:
a) liveness checks are failing due to "use of closed network connection", causing the apiserver in particular to occasionally get restarted
b) the
c) the script appears to be constantly executing; there are long sequences of (taking ES as an example, but all 14 are having the same issue concurrently)
Note the interleaved entries. NB: not suggesting this is the root cause, but it might be part of the reason the server becomes completely unresponsive? |
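A sketch for checking whether the apiserver itself is the unresponsive piece, assuming the localhost insecure port (8080) and the /var/log/kube-apiserver.log path used by kube-up-style masters of this era; adjust both to your setup:

```
# Hit the health endpoint the way the liveness checks do; expect "ok"
curl -m 5 http://127.0.0.1:8080/healthz
# Confirm the apiserver and etcd containers are actually running
sudo docker ps | grep -E 'apiserver|etcd'
# Look for the "use of closed network connection" failures mentioned above
sudo grep -i 'use of closed network connection' /var/log/kube-apiserver.log | tail -n 5
```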
found a problem (#20219) that causes it's still burning down its |
I have the same problem. However, rebooting the instance and waiting for it to settle fixes this temporarily (and it does fail again after a few hours). |
We've also had experiences similar to above. With a t2.micro master, it will eventually go unresponsive. Often within 24 hours, sometimes longer. When unresponsive, we cannot use API or SSH, and are forced to issue a reboot from AWS directly. Cloudwatch suggests we did not exhaust CPU credits, but on the one or two times we've been lucky enough to be on the box, memory exhaustion seems to be the culprit. We've just recently transitioned the master to t2.small, and so far no problems, but it's been less than a day. |
How is it now? Does it still respond? |
I've been running on a t2.small instance for several days now with no particular problem, having had problems on a t2.micro instance. |
Thank you so much. I'll try on a small instance. |
Making the two changes I mentioned above, the master has been up for 12 days now and is still responsive (only light usage, however). I see several instances in the log where the add-on update failed; would suggest checking for the same failures
in the t2.small syslog just in case. The minions died (NotReady, SSH timed out, needed reboot) however; I have not investigated why. |
I'm using Google Cloud. They have a managed service for Kubernetes and it seems to do fairly well. |
@justinsb I see you added the v1.2 milestone. Do you think you will have time to look into this in the next few days? |
I've split this into two issues:
|
For large clusters the bigger spec is probably worthwhile (especially if scaling the master is hard), but it may be worth noting that with the fix for #20219 and a smaller number of salt-master threads, the master has run for 3 weeks on a t2.micro test cluster without this problem recurring.
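For illustration only, the salt-master thread count lives in /etc/salt/master via the standard worker_threads option; the value below is an example, not necessarily the one used in the fix referenced above:

```
# Reduce the salt-master worker pool on a memory-constrained master (example value)
echo 'worker_threads: 2' | sudo tee -a /etc/salt/master
sudo service salt-master restart
```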
m3.large for > 150 nodes. t2.micro often runs out of memory. The t2 class has very difficult-to-understand behaviour when it runs out of CPU. The m3.medium is reasonably affordable, and avoids these problems.
Fix kubernetes#21151
Issue kubernetes#18975
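A sketch of overriding the master instance type when bringing a cluster up with the AWS kube-up scripts of this era; MASTER_SIZE is the variable read by cluster/aws/config-default.sh, but verify the name against your checkout:

```
export KUBERNETES_PROVIDER=aws
export MASTER_SIZE=m3.medium   # or t2.small for a light test cluster
./cluster/kube-up.sh
```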
I'm trying to break #13997 into its component issues.
An issue which I was able to reproduce: when launching on t2.micros, after a little less than 24 hours the master stopped responding to `kubectl get nodes`. I was also unable to SSH in to the instance. When I rebooted the master instance through the AWS API it did come back online (the instance took 30 minutes to reboot, but I don't think that's a k8s issue). The client was giving a TLS handshake timeout error. I was able to capture the logs after reboot, and noticed these messages shortly before the node went offline:
A minute later I got the first of these errors:
Both of those messages started repeating, but then the service appeared to recover for a few minutes, before getting worse again.
Then finally I started seeing this sort of error:
And at this point I think the service was simply not responding.
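A quick client-side sketch for separating a slow-but-alive apiserver from a dead TLS endpoint; `<master-ip>` is a placeholder for the master's address:

```
# Logs each HTTP request and its latency
kubectl get nodes --v=6
# Raw TLS handshake plus health check, bypassing kubectl entirely
curl -k --connect-timeout 10 https://<master-ip>/healthz
```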
Although I suspected that we had run out of CPU credits (the t2 class is burstable), this did not appear to be the case based on the CloudWatch metrics.
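For reference, the credit theory can be checked directly from the AWS CLI along these lines; the region, instance id, and time window below are placeholders:

```
aws cloudwatch get-metric-statistics \
  --region us-west-2 \
  --namespace AWS/EC2 \
  --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time 2016-01-01T00:00:00Z --end-time 2016-01-02T00:00:00Z \
  --period 300 --statistics Average
```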
I did wonder whether this corresponded to an etcd compaction, and there was a compaction around that time, but it appears normal:
I captured all of /var/log, but I didn't see anything that leapt out at me as being suspicious.
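A rough sketch of checking etcd health and its v2 stats endpoints on the master, assuming etcd2-era tooling; whether etcd listens on 4001 or 2379 depends on how kube-up started it:

```
etcdctl --peers http://127.0.0.1:4001 cluster-health
# Raft status and uptime
curl -s http://127.0.0.1:4001/v2/stats/self
# Operation counters; useful alongside the compaction timing question above
curl -s http://127.0.0.1:4001/v2/stats/store
```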