
1000 node cluster: GCE API rate limit exceeded #21563

Closed
alex-mohr opened this issue Feb 19, 2016 · 28 comments
Assignees
Labels
area/platform/gce priority/important-soon sig/cluster-lifecycle

Comments

@alex-mohr
Contributor

I set up a 1000 node cluster on GKE using standard GCE advanced route networking. It wasn't possible for @dchen1107 to ssh to a node because gcloud was unable to add her ssh key to the project -- rate limit exceeded. Note that some other calls like list instances and resizing the MIG were not impacted, so it might be a particular category of API call that's getting rate limited?

We need to make sure 1000 node k8s clusters don't trigger GCE's rate limiter and make the user unable to use parts of the GCE api for their project.

Is there a vaguely-sane way to understand how much GCE API traffic each of our N controller objects is sending? Do we, e.g., have Prometheus metrics of sent calls (and perhaps result codes for success/error)?

Note this may share an underlying cause with #21561 where large numbers of GCE calls result in api errors.
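
For what it's worth, the kind of counter being asked about could look roughly like the sketch below, assuming the Prometheus Go client. The metric name, labels, and the recordGCECall helper are illustrative only, not something the GCE provider actually exports today.

```go
package gce

import (
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
)

// gceAPIRequests counts outgoing GCE API calls by method and HTTP status code.
// Metric and helper names here are hypothetical, for illustration only.
var gceAPIRequests = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "gce_api_requests_total",
		Help: "Number of GCE API calls, partitioned by method and HTTP status code.",
	},
	[]string{"method", "code"},
)

func init() {
	prometheus.MustRegister(gceAPIRequests)
}

// recordGCECall would be invoked from each cloud-provider wrapper around a GCE call.
func recordGCECall(method string, httpCode int) {
	gceAPIRequests.WithLabelValues(method, strconv.Itoa(httpCode)).Inc()
}
```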

@alex-mohr alex-mohr added priority/important-soon and team/control-plane labels Feb 19, 2016
@alex-mohr alex-mohr added this to the v1.2 milestone Feb 19, 2016
@alex-mohr
Contributor Author

@wojtek-t @gmarek @kubernetes/goog-control-plane for 1k nodes and calling GCE api excessively

@gmarek
Contributor

gmarek commented Feb 19, 2016

I don't think we have metrics for it. I'll try to add this on Monday.

@alex-mohr
Contributor Author

FWIW, some initial internal numbers show we did 1136411 compute.instances.get, 966858 compute.routes.insert, and 11360 compute.routes.list. I thought I had used a version of k8s that included @cjcullen 's #21172 and #21012 patches, but perhaps not. I'll repro using a build from 9am today.

@alex-mohr
Contributor Author

@gmarek found a way to see method calls: go to console, API Manager, Enabled APIs, Google Compute Engine, Usage, and toggle the "Response Codes" dropdown to be "Methods".

@alex-mohr
Contributor Author

... and why are there so many compute.instances.get calls? If it's the master, can it get by with a compute.instances.list? And if it's the kubelet making them, perhaps we can either throttle that rate down or change the logic to rely on the metadata server in the VM or on the k8s master instead?

@gmarek
Contributor

gmarek commented Feb 19, 2016

I think some of it may be the NodeController, which queries the cloud provider every sync period, but that seems way too much for the NodeController itself. Does Kubelet communicate with the cloud provider in any way? @dchen1107

@alex-mohr
Contributor Author

FYI, I'm in the middle of turning up a new 1k node cluster using today's ~9am cut.

@davidopp
Member

cc/ @roberthbailey

@davidopp
Member

AFAICT NodeController only calls into the cloud provider from one place,
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/node/nodecontroller.go#L471
but this should happen only once every nc.nodeMonitorPeriod, which defaults to 5 seconds. And it's just one call (every 5s), not one call per node.
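
For reference, that pattern boils down to something like the sketch below: a single periodic sync against the cloud provider, not a per-node call. This is a simplified illustration using the pkg/util/wait helper, not the actual nodecontroller.go code, and syncCloudNodes is a hypothetical stand-in.

```go
package node

import (
	"time"

	"k8s.io/kubernetes/pkg/cloudprovider"
	"k8s.io/kubernetes/pkg/util/wait"
)

// runCloudSyncLoop is a simplified sketch of the pattern described above:
// one cloud-provider sync per nodeMonitorPeriod, not one call per node.
// syncCloudNodes is a hypothetical stand-in for the real sync logic.
func runCloudSyncLoop(cloud cloudprovider.Interface, nodeMonitorPeriod time.Duration,
	stop <-chan struct{}, syncCloudNodes func(cloudprovider.Instances) error) {
	wait.Until(func() {
		instances, ok := cloud.Instances()
		if !ok {
			return // provider does not expose an Instances interface
		}
		// One sync covers the whole cluster for this period.
		_ = syncCloudNodes(instances)
	}, nodeMonitorPeriod, stop)
}
```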

@davidopp
Member

I think I found a problem in route controller (see #21561). It would also explain the bazillion calls to compute.instances.get, since GCECloud.CreateRoute() calls GCECloud.getInstanceByName() which calls gce.service.Instances.Get().

I looked through the Kubelet code for calls to the cloud provider and didn't find anything worrisome.
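
To illustrate the amplification, here is a sketch of that call pattern, assuming the google.golang.org/api/compute/v1 client. It is not the literal GCECloud code, and instanceNameFromRoute is a stand-in helper for this sketch.

```go
package gce

import (
	"strings"

	compute "google.golang.org/api/compute/v1"
)

// createRoutes sketches the amplification described above: every route
// creation resolves its target instance with a separate Instances.Get, so N
// routes cost roughly 2N API calls (N Gets plus N Inserts). Not the real code.
func createRoutes(svc *compute.Service, project, zone string, routes []*compute.Route) error {
	for _, r := range routes {
		// CreateRoute -> getInstanceByName -> gce.service.Instances.Get
		if _, err := svc.Instances.Get(project, zone, instanceNameFromRoute(r)).Do(); err != nil {
			return err
		}
		// ...followed by one Routes.Insert per route.
		if _, err := svc.Routes.Insert(project, r).Do(); err != nil {
			return err
		}
	}
	return nil
}

// instanceNameFromRoute is a stand-in helper for this sketch.
func instanceNameFromRoute(r *compute.Route) string {
	parts := strings.Split(r.NextHopInstance, "/")
	return parts[len(parts)-1]
}
```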

@goltermann
Contributor

Everything looks merged on #21561. Anything remaining there or here?

@alex-mohr
Contributor Author

I'm doing one last test to verify now that #21561 has merged; I'll mark this closed and re-open if any other issues come up.

@alex-mohr
Contributor Author

Oops, for posterity, it was PR #22094 together with PR #22044 that seems to have fixed both issues.

@alex-mohr alex-mohr reopened this Mar 1, 2016
@roberthbailey
Contributor

@alex-mohr did you mean to reopen this with your last comment?

@davidopp
Member

davidopp commented Mar 2, 2016

I think he did, but the general opinion (not yet verified, AFAIK) was that the more recent problem he was having was that his master size was too small. So maybe he can verify that and then close this if that's the case.

We should probably add an optional flag to kube-up that asks it to pick MASTER_SIZE for you based on NUM_NODES. At the very least, we should add some recommended master sizes (ideally for both GCP and AWS) to
https://github.com/kubernetes/kubernetes/blob/master/docs/admin/cluster-large.md

@gmarek
Contributor

gmarek commented Mar 2, 2016

@davidopp PTAL #22261

@alex-mohr
Contributor Author

Sorry re: the last comment. There are still a bunch of rate-limit-exceeded errors when creating a cluster with an n1-standard-32 master, but it does seem to eventually come up, and the errors eventually stop. It's not great, but it does eventually result in a functional cluster, so it may be enough to unblock 1.2 and pick improvements into 1.2.x?

FWIW, I suspect the current triggers for rate limits are some of the following:

  • GetInstanceByName on each item in the node or route list, where GetInstanceByName makes a (synchronous) call to gce.service.Instances.Get. I didn't read the logic closely enough to see whether that's necessary; if it is, perhaps we could replace it with a single gce.service.Instances.List before the loop (see the sketch after this list)?
  • We spawn a bunch of operations (like creating routes) in goroutines, and each one independently polls its operation for completion. I think we'd benefit from some form of shared rate limiter on polling operations to cap that at something like 5 qps, or from some form of backoff when we get that error. As is, the client seems to speed up when it gets rate-limit-exceeded errors, because the GCE API rejects them faster than it processes admitted requests -- yay.
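
A minimal sketch of the first suggestion (one List per zone instead of one Get per node), assuming the google.golang.org/api/compute/v1 client; the function name is illustrative, not an existing GCECloud method:

```go
package gce

import compute "google.golang.org/api/compute/v1"

// listInstancesByName sketches the "List once, then look up" idea: a single
// (paged) Instances.List per zone replaces one Instances.Get per node.
// Illustrative only; not an existing GCECloud method.
func listInstancesByName(svc *compute.Service, project, zone string) (map[string]*compute.Instance, error) {
	byName := make(map[string]*compute.Instance)
	call := svc.Instances.List(project, zone)
	for {
		page, err := call.Do()
		if err != nil {
			return nil, err
		}
		for _, inst := range page.Items {
			byName[inst.Name] = inst
		}
		if page.NextPageToken == "" {
			return byName, nil
		}
		call = call.PageToken(page.NextPageToken)
	}
}
```

The route-creation loop could then consult the returned map instead of issuing a synchronous Instances.Get per node.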

@roberthbailey
Contributor

Have you tested since #22099 was merged? Prior to that PR every node was making calls to the GCE API on startup.

@alex-mohr
Contributor Author

@roberthbailey Yes, it was sync from Monday evening, so did include #22099.

@roberthbailey
Contributor

Bummer.

@alex-mohr
Contributor Author

@lavalamp Perhaps you'd be interested in extending the charter of the (Google) CSI team to include the GCE cloudprovider bits? Seems like there's a lot of overlap in terms of approach between preventing k8s clients from swamping k8s master and preventing k8s from swamping GCE?

@justinsb
Member

justinsb commented Mar 2, 2016

Did you see the shared-rate limiter we put in for AWS? It's not the bestest, but maybe we could use it for GCE also! Maybe with some improvements from getting more eyes on it!

https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/retry_handler.go
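
On the GCE side, a comparable shared limiter on operation polling might look roughly like the sketch below, using golang.org/x/time/rate for illustration. The AWS retry_handler linked above implements its own backoff/delay logic; this is not that code, and the type here is hypothetical.

```go
package gce

import (
	"context"
	"time"

	"golang.org/x/time/rate"
	compute "google.golang.org/api/compute/v1"
)

// operationPoller shares one token bucket across every goroutine that polls a
// GCE operation, so N concurrent route creations generate at most ~qps polling
// calls in aggregate. Hypothetical sketch, not the actual cloud-provider code.
type operationPoller struct {
	svc     *compute.Service
	limiter *rate.Limiter // e.g. rate.NewLimiter(5, 5) for ~5 qps with a small burst
}

// waitForGlobalOperation polls a global operation (routes are global resources)
// until it reports DONE, pacing each poll through the shared limiter.
func (p *operationPoller) waitForGlobalOperation(ctx context.Context, project, opName string) error {
	for {
		if err := p.limiter.Wait(ctx); err != nil { // blocks until a token is available
			return err
		}
		op, err := p.svc.GlobalOperations.Get(project, opName).Do()
		if err != nil {
			return err
		}
		if op.Status == "DONE" {
			return nil
		}
		time.Sleep(time.Second) // simple fixed pause between polls; backoff could go here
	}
}
```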

@davidopp
Member

davidopp commented Mar 2, 2016

I think it makes sense to keep all of the cloud provider stuff as part of control-plane team for now, and I agree with Justin we should look at these problems holistically since the same themes will probably crop up on every cloud provider.

Whether anyone (on any team) has time to make the improvements in time for 1.2 is another story....

@lavalamp
Member

lavalamp commented Mar 2, 2016

agree w/ @davidopp. Happy to consult though, if needed/desired.

@davidopp davidopp added priority/backlog and removed priority/important-soon labels Mar 7, 2016
@davidopp
Member

Moving to next-candidate, upgrading to P1.

@davidopp davidopp modified the milestones: next-candidate, v1.2 Mar 15, 2016
@davidopp davidopp added priority/important-soon and removed priority/backlog labels Mar 15, 2016
@k8s-github-robot k8s-github-robot added the needs-sig label May 31, 2017
@spiffxp
Member

spiffxp commented Jun 15, 2017

/sig cluster-lifecycle
/area platform/gce
since we lack a sig-gcp at the moment

@k8s-ci-robot k8s-ci-robot added sig/cluster-lifecycle and area/platform/gce labels Jun 15, 2017
@k8s-github-robot k8s-github-robot removed the needs-sig label Jun 15, 2017
@spiffxp
Member

spiffxp commented Jun 15, 2017

/assign
actually, I think this was fixed by the linked PR

@spiffxp
Member

spiffxp commented Jun 15, 2017

/close
