Issues bringing up 200-node GCE cluster #4078
Is this with etcd 2.0 or the old version?
Creating a 100 node cluster got us a little further but there was an error about the monitoring pods not scheduling. Will check health otherwise.
That etcd error is really tiresome - can we try to understand what it is?
+1. This is at the root of many failures, and I think things have regressed a bit since the move to 2.0.0 rather than gotten better. It is also intermittent: running a command again might succeed.
From kube-apiserver.log, there are tons of 500 errors.
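One quick way to quantify those 500s is to group them by verb and path; a minimal sketch, using fabricated sample log lines (real kube-apiserver log formats vary by version, so the awk field positions would need adjusting):

```shell
# Fabricated apiserver log lines for illustration only; real
# kube-apiserver log formats differ by version.
cat > /tmp/sample-apiserver.log <<'EOF'
I0203 15:08:01 apiserver.go:101] GET /api/v1beta1/minions (504ms) 500
I0203 15:08:02 apiserver.go:101] GET /api/v1beta1/pods (12ms) 200
I0203 15:08:03 apiserver.go:101] LIST /api/v1beta1/minions (733ms) 500
EOF

# Count 500 responses, grouped by verb and path.
grep ' 500$' /tmp/sample-apiserver.log \
  | awk '{print $(NF-3), $(NF-2)}' \
  | sort | uniq -c | sort -rn
```

A skew toward one path (e.g. the minions list) would point at the same etcd bottleneck suspected below.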
Dawn: is this from the 200-node cluster?
@satnam6502: From the 200-node cluster. I haven't looked at the 100-node one yet.
Again, with a 50-node cluster.
What is the output of kubectl describe pods? I noticed from the logs pasted above that there is an error related to scheduling: Monitoring pods failed to be scheduled!
Please re-run kubectl get pods to see if you still observe the above issue.
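To avoid eyeballing hundreds of rows on a large cluster, the get-pods output can be filtered down to just the pods that are not Running; a sketch over a fabricated sample (column layout differs across kubectl versions, so the awk field index is an assumption):

```shell
# Fabricated `kubectl get pods` output for illustration; column
# layout varies across kubectl versions.
cat > /tmp/sample-pods.txt <<'EOF'
POD                    STATUS
monitoring-heapster-1  Unknown
elasticsearch-abc12    Running
kibana-def34           Pending
EOF

# Print every pod whose status is not Running (skip the header row).
awk 'NR > 1 && $2 != "Running" {print $1, $2}' /tmp/sample-pods.txt
```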
For the 50-node cluster:
From the latest output of kubectl describe, I think the 50-node cluster should work as intended. Here is the hypothesis I have: when you first ran kubectl get pods, the nodes' status had only just changed to Ready, so the scheduler couldn't schedule the pods yet.
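If that hypothesis is right, a crude workaround during bring-up is to wait until every node reports Ready before creating the addon pods. A sketch of such a polling loop, where list_nodes is a stub standing in for a real node-listing command (it is not a real kubectl invocation):

```shell
# Stub standing in for a real node-listing command; here it simply
# reports two Ready nodes so the loop terminates immediately.
list_nodes() {
  printf 'node-1 Ready\nnode-2 Ready\n'
}

# Poll until every node reports Ready, up to max_attempts tries.
wait_for_ready() {
  max_attempts=10
  attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    not_ready=$(list_nodes | awk '$2 != "Ready"' | wc -l)
    [ "$not_ready" -eq 0 ] && return 0
    attempt=$((attempt + 1))
    sleep 1
  done
  return 1
}

wait_for_ready && echo "all nodes Ready"
```

A retry loop in the scheduler itself (rescheduling pods that initially failed) would remove the need for this kind of workaround.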
So the pods that were created at first are still Unknown but pods I launched later seem to have gone into the Running state (although the service is not working yet since they probably depend on some of the Unknown status pods working).
Looking at a specific instance of
Although the service does not seem to be reachable via its external IP.
Looking into it (the default firewall is set up for tcp 9500).
Does the Unknown state apply only to those built-in pods? If so, it's likely we have a race in cluster bootstrapping between the scheduler and the node controller, similar to what @dchen1107 suggested. Looking at the scheduler code, it seems it doesn't try to reschedule the failed pods. Can you check the controller-manager log and compare the timestamps with the scheduler's to rule out this possibility (or hopefully find the problem)? I don't have access to the large cluster unfortunately.
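The timestamp comparison can be done mechanically by interleaving the two logs; a sketch over fabricated excerpts (real component log formats differ, the only assumption is a sortable leading timestamp):

```shell
# Fabricated log excerpts for illustration; real scheduler and
# controller-manager log formats differ.
cat > /tmp/scheduler.log <<'EOF'
15:08:01 scheduler: failed to schedule monitoring pod: no fitting minion
EOF
cat > /tmp/controller-manager.log <<'EOF'
15:08:05 nodecontroller: minion node-1 became Ready
EOF

# Merge both logs by the leading timestamp to see event ordering.
sort -k1,1 /tmp/scheduler.log /tmp/controller-manager.log
```

In this fabricated sample the scheduling failure precedes the node becoming Ready, which is exactly the bootstrap race described above.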
Well those pods are just regular pods that are kicked off by the addons process that @zmerlynn added so there is nothing that special about most of them (unlike the pods launched from manifest files). I'll poke around the logs when I get a moment -- still trying to debug why I can't get the external IP to work. The create services call did crash with an unhelpful error :-)
You can check if there was anything weird about the startup by checking the logs.
See also #4095
I can now contact the service using the external IP and get a valid response. Perhaps this bug needs to be renamed -- but the add-ons don't seem to work -- at least Elasticsearch and Kibana are not accessible via their external IPs.
When I dial the replicas up from 100 to 1,000, I sometimes get this error when I try to list any kind of resource:
I am going to close this issue and do more investigation early next week and file more specific issues. |
I attempted to bring up a 200-node GCE cluster, something that has worked fairly reliably for me before (indeed, I've previously managed to bring up 1,000-node clusters that pass cluster validation), but now it no longer works. After the initial cluster validation failure I waited and re-tried but got no further. The step that fails is the enumeration of the minions (we've previously done some work to speed this up), and the actual error we see has its source in etcd.
Action item: investigate LIST MINIONS and etcd to debug performance.
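An easy starting point for that action item is to time repeated list calls and look at the spread; a sketch where list_minions is a no-op stub (substituting the real listing command, or a direct etcd read, is left to whoever debugs the live cluster):

```shell
# Stub for the real list call; swap in the actual minion-listing
# command to measure a live cluster.
list_minions() { :; }

# Time five consecutive list calls, reporting milliseconds each.
i=0
while [ "$i" -lt 5 ]; do
  start=$(date +%s%N)
  list_minions
  end=$(date +%s%N)
  echo "call $i: $(( (end - start) / 1000000 )) ms"
  i=$((i + 1))
done
```

Large variance between calls would match the intermittent behaviour reported earlier in the thread; consistently slow calls would point at list/serialization cost rather than an etcd race.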