
Issues bringing up 200-node GCE cluster #4078

Closed
satnam6502 opened this issue Feb 3, 2015 · 24 comments
Labels
area/reliability priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.

Comments

@satnam6502
Contributor

I attempted to bring up a 200-node GCE cluster, something that has worked fairly reliably for me before (indeed, I've previously brought up 1,000-node clusters that pass cluster validation), but now it does not seem to work anymore. After the initial cluster-validation failure I waited and retried, but got no further. The step that fails is the enumeration of the minions (we've previously done some work to speed this up), and the actual error we see has its source in etcd.

Action item: investigate LIST MINIONS and etcd to debug performance.

Attempt 1 to check Docker on node kubernetes-minion-196 ... [working]
Attempt 1 to check Docker on node kubernetes-minion-197 ... [working]
Attempt 1 to check Docker on node kubernetes-minion-198 ... [working]
Attempt 1 to check Docker on node kubernetes-minion-199 ... [working]
Attempt 1 to check Docker on node kubernetes-minion-200 ... [working]

Kubernetes cluster is running.  The master is running at:

  https://104.155.239.43

The user name and password to use is located in ~/.kubernetes_auth.

... calling validate-cluster
Project: kubernetes-scale
Zone: asia-east1-b
Running: cluster/../cluster/gce/../../cluster/../cluster/gce/../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get minions -o template -t {{range.items}}{{.id}}
{{end}}
Running: cluster/../cluster/gce/../../cluster/../cluster/gce/../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get minions -o template -t {{range.items}}{{.id}}
{{end}}
Running: cluster/../cluster/gce/../../cluster/../cluster/gce/../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get minions -o template -t {{range.items}}{{.id}}
{{end}}
Running: cluster/../cluster/gce/../../cluster/../cluster/gce/../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get minions -o template -t {{range.items}}{{.id}}
{{end}}
F0203 14:01:25.284908   29117 get.go:143] 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
satnam@satnam-linux:~/gocode/src/github.com/GoogleCloudPlatform/kubernetes$ cluster/validate-cluster.sh 
Project: kubernetes-scale
Zone: asia-east1-b
Running: cluster/../cluster/gce/../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get minions -o template -t {{range.items}}{{.id}}
{{end}}
Running: cluster/../cluster/gce/../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get minions -o template -t {{range.items}}{{.id}}
{{end}}
Running: cluster/../cluster/gce/../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get minions -o template -t {{range.items}}{{.id}}
{{end}}
Running: cluster/../cluster/gce/../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get minions -o template -t {{range.items}}{{.id}}
{{end}}
Running: cluster/../cluster/gce/../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get minions -o template -t {{range.items}}{{.id}}
{{end}}
Running: cluster/../cluster/gce/../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get minions -o template -t {{range.items}}{{.id}}
{{end}}
Running: cluster/../cluster/gce/../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get minions -o template -t {{range.items}}{{.id}}
{{end}}
F0203 14:15:33.088271    3206 get.go:143] 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
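
One way to act on the action item above is to time the LIST minions path directly against the apiserver, outside of validate-cluster.sh, so apiserver/etcd latency can be separated from the script's retry loop. The following is only a sketch, not part of the original report: the master address comes from the output above, the /api/v1beta1/minions path is inferred from the v1beta1 endpoints visible in the apiserver log later in this thread, and the basic-auth credentials are placeholders for whatever is in ~/.kubernetes_auth.

package main

import (
	"crypto/tls"
	"fmt"
	"io/ioutil"
	"net/http"
	"time"
)

func main() {
	// The master serves a self-signed certificate, so skip verification for this probe only.
	client := &http.Client{
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
		Timeout:   30 * time.Second,
	}

	// Hypothetical endpoint: the logs in this issue only show /api/v1beta1/pods and /services.
	req, err := http.NewRequest("GET", "https://104.155.239.43/api/v1beta1/minions", nil)
	if err != nil {
		panic(err)
	}
	req.SetBasicAuth("admin", "PASSWORD_FROM_KUBERNETES_AUTH") // placeholder credentials

	start := time.Now()
	resp, err := client.Do(req)
	elapsed := time.Since(start)
	if err != nil {
		fmt.Printf("LIST minions failed after %v: %v\n", elapsed, err)
		return
	}
	defer resp.Body.Close()
	body, _ := ioutil.ReadAll(resp.Body)
	fmt.Printf("LIST minions: status=%d, %d bytes, took %v\n", resp.StatusCode, len(body), elapsed)
}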
@satnam6502 satnam6502 added sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. area/reliability labels Feb 3, 2015
@thockin
Member

thockin commented Feb 3, 2015

Is this with etcd 2.0 or the old version?

@satnam6502
Contributor Author

satnam@kubernetes-master:/var/log$ etcd --version
etcd version 2.0.0

@satnam6502
Contributor Author

Creating a 100-node cluster got us a little further, but there was an error about the monitoring pods not being scheduled. I will check the cluster's health otherwise.

Attempt 1 at checking Kubelet installation on node k100-minion-98 ... [working]
Attempt 1 at checking Kubelet installation on node k100-minion-99 ... [working]
Attempt 1 at checking Kubelet installation on node k100-minion-100 ... [working]
Cluster validation succeeded
... calling setup-monitoring-firewall
Setting up firewalls to Heapster based cluster monitoring.
Created [https://www.googleapis.com/compute/v1/projects/kubernetes-scale/global/firewalls/k100-monitoring-heapster].
NAME                     NETWORK SRC_RANGES RULES                    SRC_TAGS TARGET_TAGS
k100-monitoring-heapster default 0.0.0.0/0  tcp:80,tcp:8083,tcp:8086          k100-minion
waiting for monitoring pods to be scheduled.
Running: cluster/../cluster/gce/../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get pods -l name=influxGrafana -o template -t {{range.items}}{{.currentState.hostIP}}:{{end}}
Running: cluster/../cluster/gce/../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get pods -l name=influxGrafana -o template -t {{range.items}}{{.currentState.hostIP}}:{{end}}
Running: cluster/../cluster/gce/../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get pods -l name=influxGrafana -o template -t {{range.items}}{{.currentState.hostIP}}:{{end}}
Running: cluster/../cluster/gce/../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get pods -l name=influxGrafana -o template -t {{range.items}}{{.currentState.hostIP}}:{{end}}
Running: cluster/../cluster/gce/../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get pods -l name=influxGrafana -o template -t {{range.items}}{{.currentState.hostIP}}:{{end}}
Running: cluster/../cluster/gce/../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get pods -l name=influxGrafana -o template -t {{range.items}}{{.currentState.hostIP}}:{{end}}
Running: cluster/../cluster/gce/../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get pods -l name=influxGrafana -o template -t {{range.items}}{{.currentState.hostIP}}:{{end}}
Running: cluster/../cluster/gce/../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get pods -l name=influxGrafana -o template -t {{range.items}}{{.currentState.hostIP}}:{{end}}
Running: cluster/../cluster/gce/../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get pods -l name=influxGrafana -o template -t {{range.items}}{{.currentState.hostIP}}:{{end}}
Running: cluster/../cluster/gce/../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get pods -l name=influxGrafana -o template -t {{range.items}}{{.currentState.hostIP}}:{{end}}
Monitoring pods failed to be scheduled!
... calling setup-logging-firewall
Created [https://www.googleapis.com/compute/v1/projects/kubernetes-scale/global/firewalls/k100-fluentd-elasticsearch-logging].
NAME                               NETWORK SRC_RANGES RULES                      SRC_TAGS TARGET_TAGS
k100-fluentd-elasticsearch-logging default 0.0.0.0/0  tcp:5601,tcp:9200,tcp:9300          k100-minion
waiting for logging services to be created by the master.
Running: cluster/../cluster/gce/../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get services -l name=kibana-logging -o template -t {{range.items}}{{.id}}{{end}}
Running: cluster/../cluster/gce/../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get services -l name=elasticsearch-logging -o template -t {{range.items}}{{.id}}{{end}}

Cluster logs are ingested into Elasticsearch running at http://104.155.200.110:9200
Kibana logging dashboard will be available at http://107.167.186.135:5601

@satnam6502
Contributor Author

$ cluster/kubectl.sh get pods
Running: cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get pods
POD                                          IP                  CONTAINER(S)            IMAGE(S)                            HOST                                         LABELS                                   STATUS
elasticsearch-logging-controller-fplln                           elasticsearch-logging   dockerfile/elasticsearch            k100-minion-4.c.kubernetes-scale.internal/   name=elasticsearch-logging               Unknown
kibana-logging-controller-gziey                                  kibana-logging          kubernetes/kibana:1.0               k100-minion-1.c.kubernetes-scale.internal/   name=kibana-logging                      Unknown
monitoring-heapster-controller-oh43e                             heapster                kubernetes/heapster:v0.6            k100-minion-6.c.kubernetes-scale.internal/   name=heapster,uses=monitoring-influxdb   Unknown
monitoring-influx-grafana-controller-0133o                       influxdb                kubernetes/heapster_influxdb:v0.3   k100-minion-3.c.kubernetes-scale.internal/   name=influxGrafana                       Unknown
                                                                 grafana                 kubernetes/heapster_grafana:v0.3                                                                                          
skydns-ls6k1                                                     etcd                    quay.io/coreos/etcd:latest          k100-minion-4.c.kubernetes-scale.internal/   k8s-app=skydns                           Unknown
                                                                 kube2sky                kubernetes/kube2sky:1.0                                                                                                   
                                                                 skydns                  kubernetes/skydns:2014-12-23-001 

@thockin
Member

thockin commented Feb 3, 2015

That etcd error is really tiresome. Can we try to understand what it really means and what to do when it happens? Maybe engage the CoreOS folks and see: is this something we should retry? Is it fatal?


@satnam6502
Contributor Author

+1. This is at the root of many failures, and I think things have regressed a bit since the move to 2.0.0 rather than gotten better. It is also intermittent: running a command again might succeed (e.g. get pods). Next week I am happy to devote some time to this, because for me it is intricately tied up with the scalability work. Until then, perhaps we should make a wrapper library for etcd that builds in retries and other resilience?
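
To make the suggestion above concrete, here is a minimal sketch of what such a retry wrapper might look like. It deliberately avoids the real etcd client API: the operation is passed in as a plain function so the same helper could wrap any etcd read or write, and the attempt count and backoff values are illustrative rather than tuned.

package main

import (
	"errors"
	"fmt"
	"time"
)

// withRetries runs op up to attempts times, backing off between failures.
func withRetries(attempts int, initialBackoff time.Duration, op func() error) error {
	backoff := initialBackoff
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		fmt.Printf("attempt %d failed: %v (retrying in %v)\n", i+1, err, backoff)
		time.Sleep(backoff)
		backoff *= 2 // exponential backoff
	}
	return fmt.Errorf("all %d attempts failed, last error: %v", attempts, err)
}

func main() {
	// Example: wrap a flaky operation that stands in for an etcd read.
	calls := 0
	err := withRetries(5, 100*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errors.New("501: All the given peers are not reachable")
		}
		return nil
	})
	fmt.Println("result:", err)
}

In practice such a helper would also want to distinguish errors worth retrying (peers unreachable, timeouts) from permanent ones (key not found), which is exactly the "what does this error really mean" question raised above.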

@dchen1107
Member

  • Ruled out an etcd issue: etcd_server is running, and the minion-related records are created properly (verified with etcdctl).
  • Ruled out a disk issue as well.

From kube-apiserver.log, there are tons of 500s:

I0203 23:11:12.486678    5260 handlers.go:107] GET /api/v1beta1/pods?fields=Status.Host%3Dkubernetes-minion-40.c.kubernetes-scale.internal: (331.764µs) 500
goroutine 9550 [running]:
github.com/GoogleCloudPlatform/kubernetes/pkg/httplog.(*respLogger).WriteHeader(0xc20d6d18c0, 0x1f4)
        /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/pkg/httplog/log.go:182 +0x9a
github.com/GoogleCloudPlatform/kubernetes/pkg/apiserver.writeJSON(0x1f4, 0x7f09776526a0, 0xc20801c5e0, 0x7f0977652868, 0xc208a54c00, 0x7f0977660568, 0xc20d6d18c0)
        /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/pkg/apiserver/apiserver.go:447 +0x232
github.com/GoogleCloudPlatform/kubernetes/pkg/apiserver.errorJSON(0x7f0977656370, 0xc20b8e0b10, 0x7f09776526a0, 0xc20801c5e0, 0x7f0977660568, 0xc20d6d18c0)
        /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/pkg/apiserver/apiserver.go:454 +0xa3
github.com/GoogleCloudPlatform/kubernetes/pkg/apiserver.(*RESTHandler).handleRESTStorage(0xc208130cc0, 0xc2081f7ab0, 0x1, 0x1, 0xc209dd40d0, 0x7f0977660568, 0xc20d6d18c0, 0x7f09776564c0, 0xc208
0e3820, 0x0, ...)
        /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/pkg/apiserver/resthandler.go:173 +0xbe9
github.com/GoogleCloudPlatform/kubernetes/pkg/apiserver.(*RESTHandler).ServeHTTP(0xc208130cc0, 0x7f0977660568, 0xc20d6d18c0, 0xc209dd40d0)
        /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/pkg/apiserver/resthandler.go:56 +0x1f7
net/http.func·012(0x7f0977660568, 0xc20d6d18c0, 0xc209dd40d0)
        /usr/src/go/src/net/http/server.go:1297 +0xe1
net/http.HandlerFunc.ServeHTTP(0xc208170020, 0x7f0977660568, 0xc20d6d18c0, 0xc209dd40d0)
        /usr/src/go/src/net/http/server.go:1265 +0x41
github.com/GoogleCloudPlatform/kubernetes/pkg/apiserver.func·001(0xc20b8e0750, 0xc2084fd9a0)
        /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/pkg/apiserver/a
E0203 23:11:12.502507    5260 errors.go:52] apiserver received an error that is not an api.Status: 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0
]
I0203 23:11:12.502673    5260 handlers.go:107] GET /api/v1beta1/services: (346.697µs) 500
goroutine 691658 [running]:
github.com/GoogleCloudPlatform/kubernetes/pkg/httplog.(*respLogger).WriteHeader(0xc20db621e0, 0x1f4)
        /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/pkg/httplog/log.go:182 +0x9a
github.com/GoogleCloudPlatform/kubernetes/pkg/apiserver.writeJSON(0x1f4, 0x7f09776526a0, 0xc20801c5e0, 0x7f0977652868, 0xc208a54d00, 0x7f0977660568, 0xc20db621e0)
        /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/pkg/apiserver/apiserver.go:447 +0x232
github.com/GoogleCloudPlatform/kubernetes/pkg/apiserver.errorJSON(0x7f0977656370, 0xc20b8e1260, 0x7f09776526a0, 0xc20801c5e0, 0x7f0977660568, 0xc20db621e0)
        /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/pkg/apiserver/apiserver.go:454 +0xa3
github.com/GoogleCloudPlatform/kubernetes/pkg/apiserver.(*RESTHandler).handleRESTStorage(0xc208130cc0, 0xc2081c7b30, 0x1, 0x1, 0xc20b8ee820, 0x7f0977660568, 0xc20db621e0, 0x7f0977656510, 0xc208
0eb8c0, 0x0, ...)

@satnam6502
Contributor Author

Dawn: is this from the 200-node cluster with master kubernetes-master, or from the 100-node cluster with master k100-master? Thanks.

@dchen1107
Member

@satnam6502: From the 200-node cluster. I haven't looked at the 100-node one yet.

@goltermann goltermann added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Feb 4, 2015
@satnam6502
Contributor Author

Again, with a 50-node cluster.

Attempt 1 at checking Kubelet installation on node k50-minion-46 ... [working]
Attempt 1 at checking Kubelet installation on node k50-minion-47 ... [working]
Attempt 1 at checking Kubelet installation on node k50-minion-48 ... [working]
Attempt 1 at checking Kubelet installation on node k50-minion-49 ... [working]
Attempt 1 at checking Kubelet installation on node k50-minion-50 ... [working]
Cluster validation succeeded
... calling setup-monitoring-firewall
Setting up firewalls to Heapster based cluster monitoring.
Created [https://www.googleapis.com/compute/v1/projects/kubernetes-scale/global/firewalls/k50-monitoring-heapster].
NAME                    NETWORK SRC_RANGES RULES                    SRC_TAGS TARGET_TAGS
k50-monitoring-heapster default 0.0.0.0/0  tcp:80,tcp:8083,tcp:8086          k50-minion
waiting for monitoring pods to be scheduled.
Running: cluster/../cluster/gce/../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get pods -l name=influxGrafana -o template -t {{range.items}}{{.currentState.hostIP}}:{{end}}
Running: cluster/../cluster/gce/../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get pods -l name=influxGrafana -o template -t {{range.items}}{{.currentState.hostIP}}:{{end}}
Running: cluster/../cluster/gce/../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get pods -l name=influxGrafana -o template -t {{range.items}}{{.currentState.hostIP}}:{{end}}
Running: cluster/../cluster/gce/../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get pods -l name=influxGrafana -o template -t {{range.items}}{{.currentState.hostIP}}:{{end}}
Running: cluster/../cluster/gce/../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get pods -l name=influxGrafana -o template -t {{range.items}}{{.currentState.hostIP}}:{{end}}
Running: cluster/../cluster/gce/../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get pods -l name=influxGrafana -o template -t {{range.items}}{{.currentState.hostIP}}:{{end}}
Running: cluster/../cluster/gce/../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get pods -l name=influxGrafana -o template -t {{range.items}}{{.currentState.hostIP}}:{{end}}
Running: cluster/../cluster/gce/../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get pods -l name=influxGrafana -o template -t {{range.items}}{{.currentState.hostIP}}:{{end}}
Running: cluster/../cluster/gce/../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get pods -l name=influxGrafana -o template -t {{range.items}}{{.currentState.hostIP}}:{{end}}
Running: cluster/../cluster/gce/../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get pods -l name=influxGrafana -o template -t {{range.items}}{{.currentState.hostIP}}:{{end}}
Monitoring pods failed to be scheduled!
... calling setup-logging-firewall
Created [https://www.googleapis.com/compute/v1/projects/kubernetes-scale/global/firewalls/k50-fluentd-elasticsearch-logging].
NAME                              NETWORK SRC_RANGES RULES                      SRC_TAGS TARGET_TAGS
k50-fluentd-elasticsearch-logging default 0.0.0.0/0  tcp:5601,tcp:9200,tcp:9300          k50-minion
waiting for logging services to be created by the master.
Running: cluster/../cluster/gce/../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get services -l name=kibana-logging -o template -t {{range.items}}{{.id}}{{end}}
Running: cluster/../cluster/gce/../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get services -l name=elasticsearch-logging -o template -t {{range.items}}{{.id}}{{end}}

Cluster logs are ingested into Elasticsearch running at http://104.155.200.110:9200
Kibana logging dashboard will be available at http://107.167.186.135:5601

Done
satnam@satnam-linux:~/gocode/src/github.com/GoogleCloudPlatform/kubernetes$ cluster/kubectl.sh get pods
Running: cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get pods
POD                                          IP                  CONTAINER(S)            IMAGE(S)                            HOST                                        LABELS                                   STATUS
elasticsearch-logging-controller-fplln                           elasticsearch-logging   dockerfile/elasticsearch            k50-minion-6.c.kubernetes-scale.internal/   name=elasticsearch-logging               Unknown
kibana-logging-controller-gziey                                  kibana-logging          kubernetes/kibana:1.0               k50-minion-6.c.kubernetes-scale.internal/   name=kibana-logging                      Unknown
monitoring-heapster-controller-0133o                             heapster                kubernetes/heapster:v0.6            k50-minion-4.c.kubernetes-scale.internal/   name=heapster,uses=monitoring-influxdb   Unknown
monitoring-influx-grafana-controller-ls6k1                       influxdb                kubernetes/heapster_influxdb:v0.3   k50-minion-4.c.kubernetes-scale.internal/   name=influxGrafana                       Unknown
                                                                 grafana                 kubernetes/heapster_grafana:v0.3                                                                                         
skydns-oh43e                                                     etcd                    quay.io/coreos/etcd:latest          k50-minion-4.c.kubernetes-scale.internal/   k8s-app=skydns                           Unknown
                                                                 kube2sky                kubernetes/kube2sky:1.0                                                                                                  
                                                                 skydns                  kubernetes/skydns:2014-12-23-001                                                                                         
satnam@satnam-linux:~/gocode/src/github.com/GoogleCloudPlatform/kubernetes$ cluster/kubectl.sh get minions
Running: cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get minions
NAME                                        LABELS              STATUS
k50-minion-1.c.kubernetes-scale.internal    <none>              Ready
k50-minion-10.c.kubernetes-scale.internal   <none>              Ready
k50-minion-11.c.kubernetes-scale.internal   <none>              Ready
k50-minion-12.c.kubernetes-scale.internal   <none>              Ready
k50-minion-13.c.kubernetes-scale.internal   <none>              Ready
k50-minion-14.c.kubernetes-scale.internal   <none>              Ready
k50-minion-15.c.kubernetes-scale.internal   <none>              Ready
k50-minion-16.c.kubernetes-scale.internal   <none>              Ready
k50-minion-17.c.kubernetes-scale.internal   <none>              Ready
k50-minion-18.c.kubernetes-scale.internal   <none>              Ready
k50-minion-19.c.kubernetes-scale.internal   <none>              Ready
k50-minion-2.c.kubernetes-scale.internal    <none>              Ready
k50-minion-20.c.kubernetes-scale.internal   <none>              Ready
k50-minion-21.c.kubernetes-scale.internal   <none>              Ready
k50-minion-22.c.kubernetes-scale.internal   <none>              Ready
k50-minion-23.c.kubernetes-scale.internal   <none>              Ready
k50-minion-24.c.kubernetes-scale.internal   <none>              Ready
k50-minion-25.c.kubernetes-scale.internal   <none>              Ready
k50-minion-26.c.kubernetes-scale.internal   <none>              Ready
k50-minion-27.c.kubernetes-scale.internal   <none>              Ready
k50-minion-28.c.kubernetes-scale.internal   <none>              Ready
k50-minion-29.c.kubernetes-scale.internal   <none>              Ready
k50-minion-3.c.kubernetes-scale.internal    <none>              Ready
k50-minion-30.c.kubernetes-scale.internal   <none>              Ready
k50-minion-31.c.kubernetes-scale.internal   <none>              Ready
k50-minion-32.c.kubernetes-scale.internal   <none>              Ready
k50-minion-33.c.kubernetes-scale.internal   <none>              Ready
k50-minion-34.c.kubernetes-scale.internal   <none>              Ready
k50-minion-35.c.kubernetes-scale.internal   <none>              Ready
k50-minion-36.c.kubernetes-scale.internal   <none>              Ready
k50-minion-37.c.kubernetes-scale.internal   <none>              Ready
k50-minion-38.c.kubernetes-scale.internal   <none>              Ready
k50-minion-39.c.kubernetes-scale.internal   <none>              Ready
k50-minion-4.c.kubernetes-scale.internal    <none>              Ready
k50-minion-40.c.kubernetes-scale.internal   <none>              Ready
k50-minion-41.c.kubernetes-scale.internal   <none>              Ready
k50-minion-42.c.kubernetes-scale.internal   <none>              Ready
k50-minion-43.c.kubernetes-scale.internal   <none>              Ready
k50-minion-44.c.kubernetes-scale.internal   <none>              Ready
k50-minion-45.c.kubernetes-scale.internal   <none>              Ready
k50-minion-46.c.kubernetes-scale.internal   <none>              Ready
k50-minion-47.c.kubernetes-scale.internal   <none>              Ready
k50-minion-48.c.kubernetes-scale.internal   <none>              Ready
k50-minion-49.c.kubernetes-scale.internal   <none>              Ready
k50-minion-5.c.kubernetes-scale.internal    <none>              Ready
k50-minion-50.c.kubernetes-scale.internal   <none>              Ready
k50-minion-6.c.kubernetes-scale.internal    <none>              Ready
k50-minion-7.c.kubernetes-scale.internal    <none>              Ready
k50-minion-8.c.kubernetes-scale.internal    <none>              Ready
k50-minion-9.c.kubernetes-scale.internal    <none>              Ready

@satnam6502
Contributor Author

kubectl.sh get minions seems fast (so the cache is working). However, the pods don't seem to transition out of the Unknown state, even with a 50-node cluster.

@dchen1107
Member

What is the output of kubectl describe pods? I noticed from the logs pasted above that there is an error related to scheduling:

Monitoring pods failed to be scheduled!

@dchen1107
Member

  • The 50-node cluster doesn't have the 500 stack traces in kube-apiserver.log that were pasted above.
  • Checked etcd again: no crash, no disk issue.
  • Taking the pod monitoring-heapster-controller as an example, its boundpods record is created properly under /registry/nodes/k50-minion-4/boundpods.
  • Checked the kubelet; the pod is running happily:
root@k50-minion-4:/var/log# docker ps -a |grep controller-0133o
00a5b661a4e6        kubernetes/heapster:v0.6               "/run.sh"              35 minutes ago      Up 35 minutes                                                        
                            k8s_heapster.ae7eda7c_monitoring-heapster-controller-0133o.default.api_675004db-ac03-11e4-a607-42010af0ccba_1ae30c2a                      
9fa98f6038df        kubernetes/pause:go                    "/pause"               37 minutes ago      Up 37 minutes                                                        
                            k8s_POD.332bdea_monitoring-heapster-controller-0133o.default.api_675004db-ac03-11e4-a607-42010af0ccba_b77eaf62                            

Please re-run kubectl get pods and see whether you still observe the above issue.
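
For reference, the registry record mentioned above can also be spot-checked over etcd's v2 keys HTTP API rather than with etcdctl. This is a sketch only: it assumes it is run on the master and that etcd is listening on the 127.0.0.1:4001 client port used by this setup, and the key path is the one quoted in the comment above.

package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
)

func main() {
	// Key path taken from the comment above; adjust per node/pod being checked.
	key := "/registry/nodes/k50-minion-4/boundpods"
	resp, err := http.Get("http://127.0.0.1:4001/v2/keys" + key)
	if err != nil {
		fmt.Println("etcd request failed:", err)
		return
	}
	defer resp.Body.Close()
	body, _ := ioutil.ReadAll(resp.Body)
	fmt.Printf("GET %s -> HTTP %d\n%s\n", key, resp.StatusCode, body)
}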

@satnam6502
Contributor Author

For the 50-node cluster:

satnam@satnam-linux:~/gocode/src/github.com/GoogleCloudPlatform/kubernetes$ cluster/kubectl.sh get pods
Running: cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get pods
POD                                          IP                  CONTAINER(S)            IMAGE(S)                            HOST                                        LABELS                                   STATUS
elasticsearch-logging-controller-fplln                           elasticsearch-logging   dockerfile/elasticsearch            k50-minion-6.c.kubernetes-scale.internal/   name=elasticsearch-logging               Unknown
kibana-logging-controller-gziey                                  kibana-logging          kubernetes/kibana:1.0               k50-minion-6.c.kubernetes-scale.internal/   name=kibana-logging                      Unknown
monitoring-heapster-controller-0133o                             heapster                kubernetes/heapster:v0.6            k50-minion-4.c.kubernetes-scale.internal/   name=heapster,uses=monitoring-influxdb   Unknown
monitoring-influx-grafana-controller-ls6k1                       influxdb                kubernetes/heapster_influxdb:v0.3   k50-minion-4.c.kubernetes-scale.internal/   name=influxGrafana                       Unknown
                                                                 grafana                 kubernetes/heapster_grafana:v0.3                                                                                         
skydns-oh43e                                                     etcd                    quay.io/coreos/etcd:latest          k50-minion-4.c.kubernetes-scale.internal/   k8s-app=skydns                           Unknown
                                                                 kube2sky                kubernetes/kube2sky:1.0                                                                                                  
                                                                 skydns                  kubernetes/skydns:2014-12-23-001                                                                                         
satnam@satnam-linux:~/gocode/src/github.com/GoogleCloudPlatform/kubernetes$ cluster/kubectl.sh describe pods monitoring-heapster-controller-0133o
Running: cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl describe pods monitoring-heapster-controller-0133o
Name:                           monitoring-heapster-controller-0133o
Image(s):                       kubernetes/heapster:v0.6
Host:                           k50-minion-4.c.kubernetes-scale.internal/
Labels:                         name=heapster,uses=monitoring-influxdb
Status:                         Unknown
Replication Controllers:        monitoring-heapster-controller (1/1 replicas created)
Events:
Time                            From                                                    SubobjectPath                           Reason                  Message
Wed, 04 Feb 2015 00:18:54 +0000 {scheduler }                                                                                    failedScheduling        Error scheduling: no minions available to schedule pods
Wed, 04 Feb 2015 00:18:55 +0000 {scheduler }                                                                                    failedScheduling        Error scheduling: no minions available to schedule pods
Wed, 04 Feb 2015 00:18:57 +0000 {scheduler }                                                                                    failedScheduling        Error scheduling: no minions available to schedule pods
Wed, 04 Feb 2015 00:19:01 +0000 {scheduler }                                                                                    failedScheduling        Error scheduling: no minions available to schedule pods
Wed, 04 Feb 2015 00:19:09 +0000 {scheduler }                                                                                    failedScheduling        Error scheduling: no minions available to schedule pods
Wed, 04 Feb 2015 00:19:25 +0000 {scheduler }                                                                                    failedScheduling        Error scheduling: no minions available to schedule pods
Wed, 04 Feb 2015 00:19:57 +0000 {scheduler }                                                                                    failedScheduling        Error scheduling: no minions available to schedule pods
Wed, 04 Feb 2015 00:20:46 +0000 {scheduler }                                                                                    scheduled               Successfully assigned monitoring-heapster-controller-0133o to k50-minion-4.c.kubernetes-scale.internal
Wed, 04 Feb 2015 00:22:00 +0000 {kubelet k50-minion-4.c.kubernetes-scale.internal}      implicitly required container POD       pulled                  Successfully pulled image "kubernetes/pause:latest"
Wed, 04 Feb 2015 00:22:00 +0000 {kubelet k50-minion-4.c.kubernetes-scale.internal}      implicitly required container POD       created                 Created with docker id 9fa98f6038dfdaa04b46c8521c7ce7bdcf4037e0970607f86e03ba3f741a4d87
Wed, 04 Feb 2015 00:22:01 +0000 {kubelet k50-minion-4.c.kubernetes-scale.internal}      implicitly required container POD       started                 Started with docker id 9fa98f6038dfdaa04b46c8521c7ce7bdcf4037e0970607f86e03ba3f741a4d87
Wed, 04 Feb 2015 00:24:19 +0000 {kubelet k50-minion-4.c.kubernetes-scale.internal}      spec.containers{heapster}               pulled                  Successfully pulled image "kubernetes/heapster:v0.6"
Wed, 04 Feb 2015 00:24:19 +0000 {kubelet k50-minion-4.c.kubernetes-scale.internal}      spec.containers{heapster}               created                 Created with docker id 00a5b661a4e6398107fe68347dd0705d67fd7f5a46d33a668fcbe085ac740309
Wed, 04 Feb 2015 00:24:19 +0000 {kubelet k50-minion-4.c.kubernetes-scale.internal}      spec.containers{heapster}               started                 Started with docker id 00a5b661a4e6398107fe68347dd0705d67fd7f5a46d33a668fcbe085ac740309

satnam@satnam-linux:~/gocode/src/github.com/GoogleCloudPlatform/kubernetes$ cluster/kubectl.sh describe pods monitoring-influx-grafana-controller-ls6k1
Running: cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl describe pods monitoring-influx-grafana-controller-ls6k1
Name:                           monitoring-influx-grafana-controller-ls6k1
Image(s):                       kubernetes/heapster_influxdb:v0.3,kubernetes/heapster_grafana:v0.3
Host:                           k50-minion-4.c.kubernetes-scale.internal/
Labels:                         name=influxGrafana
Status:                         Unknown
Replication Controllers:        monitoring-influx-grafana-controller (1/1 replicas created)
Events:
Time                            From                                                    SubobjectPath                           Reason                  Message
Wed, 04 Feb 2015 00:18:54 +0000 {scheduler }                                                                                    failedScheduling        Error scheduling: no minions available to schedule pods
Wed, 04 Feb 2015 00:18:55 +0000 {scheduler }                                                                                    failedScheduling        Error scheduling: no minions available to schedule pods
Wed, 04 Feb 2015 00:18:57 +0000 {scheduler }                                                                                    failedScheduling        Error scheduling: no minions available to schedule pods
Wed, 04 Feb 2015 00:19:01 +0000 {scheduler }                                                                                    failedScheduling        Error scheduling: no minions available to schedule pods
Wed, 04 Feb 2015 00:19:09 +0000 {scheduler }                                                                                    failedScheduling        Error scheduling: no minions available to schedule pods
Wed, 04 Feb 2015 00:19:25 +0000 {scheduler }                                                                                    failedScheduling        Error scheduling: no minions available to schedule pods
Wed, 04 Feb 2015 00:19:57 +0000 {scheduler }                                                                                    failedScheduling        Error scheduling: no minions available to schedule pods
Wed, 04 Feb 2015 00:20:46 +0000 {scheduler }                                                                                    scheduled               Successfully assigned monitoring-influx-grafana-controller-ls6k1 to k50-minion-4.c.kubernetes-scale.internal
Wed, 04 Feb 2015 00:22:00 +0000 {kubelet k50-minion-4.c.kubernetes-scale.internal}      implicitly required container POD       pulled                  Successfully pulled image "kubernetes/pause:latest"
Wed, 04 Feb 2015 00:22:01 +0000 {kubelet k50-minion-4.c.kubernetes-scale.internal}      implicitly required container POD       started                 Started with docker id c938c536c2dfd8f1800bf267e4a37e4b8bd271102720862925cf19adf1d86c4c
Wed, 04 Feb 2015 00:22:01 +0000 {kubelet k50-minion-4.c.kubernetes-scale.internal}      implicitly required container POD       created                 Created with docker id c938c536c2dfd8f1800bf267e4a37e4b8bd271102720862925cf19adf1d86c4c
Wed, 04 Feb 2015 00:24:13 +0000 {kubelet k50-minion-4.c.kubernetes-scale.internal}      spec.containers{influxdb}               pulled                  Successfully pulled image "kubernetes/heapster_influxdb:v0.3"
Wed, 04 Feb 2015 00:24:14 +0000 {kubelet k50-minion-4.c.kubernetes-scale.internal}      spec.containers{influxdb}               started                 Started with docker id 20759f682ebc317c5b2bf6866840bb201bf703bfc6add73cca4a3a7626c03172
Wed, 04 Feb 2015 00:24:14 +0000 {kubelet k50-minion-4.c.kubernetes-scale.internal}      spec.containers{influxdb}               created                 Created with docker id 20759f682ebc317c5b2bf6866840bb201bf703bfc6add73cca4a3a7626c03172
Wed, 04 Feb 2015 00:26:09 +0000 {kubelet k50-minion-4.c.kubernetes-scale.internal}      spec.containers{grafana}                pulled                  Successfully pulled image "kubernetes/heapster_grafana:v0.3"
Wed, 04 Feb 2015 00:26:09 +0000 {kubelet k50-minion-4.c.kubernetes-scale.internal}      spec.containers{grafana}                created                 Created with docker id d4ca55abf6b1ff7b76c84fe5600085b20877c9a0edc26836ba5c4e93d6c35b64
Wed, 04 Feb 2015 00:26:09 +0000 {kubelet k50-minion-4.c.kubernetes-scale.internal}      spec.containers{grafana}                started                 Started with docker id d4ca55abf6b1ff7b76c84fe5600085b20877c9a0edc26836ba5c4e93d6c35b64

@dchen1107
Member

From the latest output of kubectl describe, I think the 50-node cluster is working as intended. Here is my hypothesis: the nodes' status had only just changed to Ready when those pods were first created, so the scheduler couldn't schedule the pods at the time.

@satnam6502
Contributor Author

So the pods that were created at first are still Unknown, but the pods I launched later seem to have gone into the Running state (although the service is not working yet, since it probably depends on some of the Unknown-status pods working).

$../../../cluster/kubectl.sh get pods
Running: ../../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get pods
POD                                          IP                  CONTAINER(S)            IMAGE(S)                            HOST                                                        LABELS                                   STATUS
elasticsearch-logging-controller-fplln                           elasticsearch-logging   dockerfile/elasticsearch            k50-minion-6.c.kubernetes-scale.internal/                   name=elasticsearch-logging               Unknown
kibana-logging-controller-gziey                                  kibana-logging          kubernetes/kibana:1.0               k50-minion-6.c.kubernetes-scale.internal/                   name=kibana-logging                      Unknown
monitoring-heapster-controller-0133o                             heapster                kubernetes/heapster:v0.6            k50-minion-4.c.kubernetes-scale.internal/                   name=heapster,uses=monitoring-influxdb   Unknown
monitoring-influx-grafana-controller-ls6k1                       influxdb                kubernetes/heapster_influxdb:v0.3   k50-minion-4.c.kubernetes-scale.internal/                   name=influxGrafana                       Unknown
                                                                 grafana                 kubernetes/heapster_grafana:v0.3                                                                                                         
serve-hostnames-02qp4                        10.244.26.7         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-27.c.kubernetes-scale.internal/104.155.229.250   name=serve-hostnames                     Running
serve-hostnames-14m58                        10.244.4.5          serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-5.c.kubernetes-scale.internal/104.155.224.178    name=serve-hostnames                     Running
serve-hostnames-182tv                        10.244.15.5         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-16.c.kubernetes-scale.internal/104.155.228.213   name=serve-hostnames                     Running
serve-hostnames-1gypq                        10.244.29.5         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-30.c.kubernetes-scale.internal/104.155.235.176   name=serve-hostnames                     Running
serve-hostnames-26873                        10.244.35.5         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-36.c.kubernetes-scale.internal/104.155.209.126   name=serve-hostnames                     Running
serve-hostnames-29g6u                        10.244.1.6          serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-2.c.kubernetes-scale.internal/130.211.252.85     name=serve-hostnames                     Running
serve-hostnames-2nrtl                        10.244.8.5          serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-9.c.kubernetes-scale.internal/104.155.233.140    name=serve-hostnames                     Running
serve-hostnames-2plmk                        10.244.20.6         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-21.c.kubernetes-scale.internal/104.155.216.232   name=serve-hostnames                     Running
serve-hostnames-40noe                        10.244.21.6         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-22.c.kubernetes-scale.internal/104.155.226.2     name=serve-hostnames                     Running
serve-hostnames-4s9mi                        10.244.42.6         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-43.c.kubernetes-scale.internal/104.155.224.104   name=serve-hostnames                     Running
serve-hostnames-51wgg                        10.244.16.5         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-17.c.kubernetes-scale.internal/104.155.204.241   name=serve-hostnames                     Running
serve-hostnames-5m1zc                        10.244.17.5         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-18.c.kubernetes-scale.internal/104.155.224.240   name=serve-hostnames                     Running
serve-hostnames-5v48c                        10.244.31.5         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-32.c.kubernetes-scale.internal/104.155.236.248   name=serve-hostnames                     Running
serve-hostnames-7kq8u                        10.244.9.5          serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-10.c.kubernetes-scale.internal/104.155.235.220   name=serve-hostnames                     Running
serve-hostnames-9gvmk                        10.244.40.5         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-41.c.kubernetes-scale.internal/104.155.205.61    name=serve-hostnames                     Running
serve-hostnames-apz01                        10.244.0.6          serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-1.c.kubernetes-scale.internal/104.155.232.195    name=serve-hostnames                     Running
serve-hostnames-ca0qm                        10.244.19.6         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-20.c.kubernetes-scale.internal/104.155.206.144   name=serve-hostnames                     Running
serve-hostnames-cegta                        10.244.13.5         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-14.c.kubernetes-scale.internal/104.155.234.52    name=serve-hostnames                     Running
serve-hostnames-ckjdw                        10.244.22.5         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-23.c.kubernetes-scale.internal/104.155.235.155   name=serve-hostnames                     Running
serve-hostnames-ckn42                        10.244.27.5         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-28.c.kubernetes-scale.internal/104.155.231.245   name=serve-hostnames                     Running
serve-hostnames-dbfhf                        10.244.39.5         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-40.c.kubernetes-scale.internal/104.155.238.158   name=serve-hostnames                     Running
serve-hostnames-ehq23                        10.244.14.5         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-15.c.kubernetes-scale.internal/104.155.230.103   name=serve-hostnames                     Running
serve-hostnames-eql4y                        10.244.12.5         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-13.c.kubernetes-scale.internal/104.155.227.170   name=serve-hostnames                     Running
serve-hostnames-f0fyy                        10.244.6.5          serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-7.c.kubernetes-scale.internal/104.155.231.158    name=serve-hostnames                     Running
serve-hostnames-gb50a                        10.244.38.6         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-39.c.kubernetes-scale.internal/104.155.206.39    name=serve-hostnames                     Running
serve-hostnames-gfw5j                        10.244.45.5         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-46.c.kubernetes-scale.internal/107.167.187.171   name=serve-hostnames                     Running
serve-hostnames-h3x9k                        10.244.46.5         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-47.c.kubernetes-scale.internal/104.155.239.116   name=serve-hostnames                     Running
serve-hostnames-hh2gd                        10.244.33.5         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-34.c.kubernetes-scale.internal/104.155.238.195   name=serve-hostnames                     Running
serve-hostnames-i7hvs                        10.244.49.6         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-50.c.kubernetes-scale.internal/104.155.231.94    name=serve-hostnames                     Running
serve-hostnames-ilbb7                        10.244.23.5         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-24.c.kubernetes-scale.internal/104.155.237.48    name=serve-hostnames                     Running
serve-hostnames-ir0xc                        10.244.30.5         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-31.c.kubernetes-scale.internal/104.155.232.55    name=serve-hostnames                     Running
serve-hostnames-jmxbo                        10.244.36.6         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-37.c.kubernetes-scale.internal/104.155.232.31    name=serve-hostnames                     Running
serve-hostnames-jzjpe                        10.244.47.5         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-48.c.kubernetes-scale.internal/104.155.235.21    name=serve-hostnames                     Running
serve-hostnames-kjqd6                        10.244.43.5         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-44.c.kubernetes-scale.internal/104.155.224.188   name=serve-hostnames                     Running
serve-hostnames-mhlnm                        10.244.7.6          serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-8.c.kubernetes-scale.internal/130.211.243.27     name=serve-hostnames                     Running
serve-hostnames-nxnds                        10.244.34.5         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-35.c.kubernetes-scale.internal/104.155.239.178   name=serve-hostnames                     Running
serve-hostnames-nyxxv                        10.244.2.5          serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-3.c.kubernetes-scale.internal/130.211.251.61     name=serve-hostnames                     Running
serve-hostnames-pjzb3                        10.244.41.5         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-42.c.kubernetes-scale.internal/104.155.234.173   name=serve-hostnames                     Running
serve-hostnames-sb5qi                        10.244.37.5         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-38.c.kubernetes-scale.internal/104.155.235.188   name=serve-hostnames                     Running
serve-hostnames-ta0op                        10.244.10.5         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-11.c.kubernetes-scale.internal/107.167.190.16    name=serve-hostnames                     Running
serve-hostnames-unlpf                        10.244.11.5         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-12.c.kubernetes-scale.internal/104.155.233.222   name=serve-hostnames                     Running
serve-hostnames-uxshy                        10.244.32.5         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-33.c.kubernetes-scale.internal/104.155.235.132   name=serve-hostnames                     Running
serve-hostnames-v5drx                        10.244.28.6         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-29.c.kubernetes-scale.internal/104.155.216.186   name=serve-hostnames                     Running
serve-hostnames-wn1dx                        10.244.24.6         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-25.c.kubernetes-scale.internal/104.155.226.126   name=serve-hostnames                     Running
serve-hostnames-xuc9u                        10.244.26.6         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-27.c.kubernetes-scale.internal/104.155.229.250   name=serve-hostnames                     Running
serve-hostnames-y9eqi                        10.244.21.5         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-22.c.kubernetes-scale.internal/104.155.226.2     name=serve-hostnames                     Running
serve-hostnames-yp7fy                        10.244.25.5         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-26.c.kubernetes-scale.internal/104.155.202.16    name=serve-hostnames                     Running
serve-hostnames-zi4hx                        10.244.18.5         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-19.c.kubernetes-scale.internal/104.155.238.83    name=serve-hostnames                     Running
serve-hostnames-zicap                        10.244.48.5         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-49.c.kubernetes-scale.internal/130.211.248.137   name=serve-hostnames                     Running
serve-hostnames-zwk1b                        10.244.44.5         serve-hostname          kubernetes/serve_hostname:1.1       k50-minion-45.c.kubernetes-scale.internal/104.155.226.118   name=serve-hostnames                     Running
skydns-oh43e                                                     etcd                    quay.io/coreos/etcd:latest          k50-minion-4.c.kubernetes-scale.internal/                   k8s-app=skydns                           Unknown
                                                                 kube2sky                kubernetes/kube2sky:1.0                                                                                                                  
                                                                 skydns                  kubernetes/skydns:2014-12-23-001                                                                                                         
satnam@satnam-linux:~/gocode/src/github.com/GoogleCloudPlatform/kubernetes/test/scalability/serve_hostnames$ ../../../cluster/kubectl.sh get replicationControllers
Running: ../../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get replicationControllers
CONTROLLER                             CONTAINER(S)            IMAGE(S)                            SELECTOR                     REPLICAS
elasticsearch-logging-controller       elasticsearch-logging   dockerfile/elasticsearch            name=elasticsearch-logging   1
kibana-logging-controller              kibana-logging          kubernetes/kibana:1.0               name=kibana-logging          1
monitoring-heapster-controller         heapster                kubernetes/heapster:v0.6            name=heapster                1
monitoring-influx-grafana-controller   influxdb                kubernetes/heapster_influxdb:v0.3   name=influxGrafana           1
                                       grafana                 kubernetes/heapster_grafana:v0.3                                 
serve-hostnames                        serve-hostname          kubernetes/serve_hostname:1.1       name=serve-hostnames         50
skydns                                 etcd                    quay.io/coreos/etcd:latest          k8s-app=skydns               1
                                       kube2sky                kubernetes/kube2sky:1.0                                          
                                       skydns                  kubernetes/skydns:2014-12-23-001                                 
satnam@satnam-linux:~/gocode/src/github.com/GoogleCloudPlatform/kubernetes/test/scalability/serve_hostnames$ ../../../cluster/kubectl.sh get services
Running: ../../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get services
NAME                    LABELS                                    SELECTOR                     IP                  PORT
elasticsearch-logging   name=elasticsearch-logging                name=elasticsearch-logging   10.0.27.250         9200
kibana-logging          name=kibana-logging                       name=kibana-logging          10.0.36.98          5601
kubernetes              component=apiserver,provider=kubernetes   <none>                       10.0.0.2            443
kubernetes-ro           component=apiserver,provider=kubernetes   <none>                       10.0.0.1            80
monitoring-grafana      <none>                                    name=influxGrafana           10.0.12.41          80
monitoring-heapster     <none>                                    name=heapster                10.0.159.101        80
monitoring-influxdb     <none>                                    name=influxGrafana           10.0.9.170          80
serve-hostnames         name=serve-hostnames                      name=serve-hostnames         10.0.63.137         9500
skydns                  k8s-app=skydns                            k8s-app=skydns               10.0.0.10           53
satnam@satnam-linux:~/gocode/src/github.com/GoogleCloudPlatform/kubernetes/test/scalability/serve_hostnames$ make net
gcloud compute forwarding-rules --project kubernetes-scale describe --region asia-east1 serve-hostnames
IPAddress: 104.155.232.255
IPProtocol: TCP
creationTimestamp: '2015-02-03T17:20:59.788-08:00'
id: '3940834679997910829'
kind: compute#forwardingRule
name: serve-hostnames
portRange: 9500-9500
region: https://www.googleapis.com/compute/v1/projects/kubernetes-scale/regions/asia-east1
selfLink: https://www.googleapis.com/compute/v1/projects/kubernetes-scale/regions/asia-east1/forwardingRules/serve-hostnames
target: https://www.googleapis.com/compute/v1/projects/kubernetes-scale/regions/asia-east1/targetPools/serve-hostnames
satnam@satnam-linux:~/gocode/src/github.com/GoogleCloudPlatform/kubernetes/test/scalability/serve_hostnames$ curl 104.155.232.255:9500
curl: (7) Failed to connect to 104.155.232.255 port 9500: Connection refused

@satnam6502
Contributor Author

Looking at a specific instance of serve_hostname via the service, the right thing seems to be happening:

root@k50-minion-6:/home/satnam# docker exec -ti 30dca480436b bash
$ curl 10.0.63.137:9500
serve-hostnames-f0fyy$  curl 10.0.63.137:9500
serve-hostnames-ir0xc$  curl 10.0.63.137:9500                                                                                                      
serve-hostnames-gfw5j

However, the service does not seem to be reachable via its external IP.

satnam@satnam-linux:~/gocode/src/github.com/GoogleCloudPlatform/kubernetes/test/scalability/serve_hostnames$ make net
gcloud compute forwarding-rules --project kubernetes-scale describe --region asia-east1 serve-hostnames
IPAddress: 104.155.232.255
IPProtocol: TCP
creationTimestamp: '2015-02-03T17:20:59.788-08:00'
id: '3940834679997910829'
kind: compute#forwardingRule
name: serve-hostnames
portRange: 9500-9500
region: https://www.googleapis.com/compute/v1/projects/kubernetes-scale/regions/asia-east1
selfLink: https://www.googleapis.com/compute/v1/projects/kubernetes-scale/regions/asia-east1/forwardingRules/serve-hostnames
target: https://www.googleapis.com/compute/v1/projects/kubernetes-scale/regions/asia-east1/targetPools/serve-hostnames
satnam@satnam-linux:~/gocode/src/github.com/GoogleCloudPlatform/kubernetes/test/scalability/serve_hostnames$ curl 104.155.232.255:9500
curl: (7) Failed to connect to 104.155.232.255 port 9500: Connection refused

Looking into it (default firewall is set up for tcp 9500).

@ddysher
Contributor

ddysher commented Feb 4, 2015

Is the Unknown state only seen for those built-in pods? If so, it's likely we have a race in cluster bootstrapping between the scheduler and the node controller, similar to what @dchen1107 suggested. Looking at the scheduler code, it seems it doesn't try to reschedule the failed pods. Can you check the controller-manager log and compare its timestamps with the scheduler's to rule out that possibility (or, hopefully, find the problem)? Unfortunately I don't have access to the large cluster.

@satnam6502
Contributor Author

Well, those pods are just regular pods kicked off by the addons process that @zmerlynn added, so there is nothing that special about most of them (unlike the pods launched from manifest files). I'll poke around the logs when I get a moment; I'm still trying to debug why I can't get the external IP to work. The create services call did crash with an unhelpful error :-)

Running: ../../../cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl create -f serve-hostnames-service.yaml
F0203 18:21:01.268562   17955 create.go:78] 
make: *** [create] Error 255

@zmerlynn
Member

zmerlynn commented Feb 4, 2015

You can check if there was anything weird about the startup by checking the /var/log/kube-addons.log on the master.


@satnam6502
Contributor Author

See also #4095

@satnam6502
Contributor Author

I can now contact the service using the external IP and get a valid response. Perhaps this bug needs to be renamed, but the add-ons don't seem to work: at least Elasticsearch and Kibana are not accessible via their external IPs.

@satnam6502
Contributor Author

When I dial the replicas up from 100 to 1,000, I sometimes get this error when I try to list any kind of resource: 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]

@satnam6502
Contributor Author

I am going to close this issue, do more investigation early next week, and file more specific issues.
