
1000 node cluster: GCE API rate limit exceeded #21563

Closed
alex-mohr opened this issue Feb 19, 2016 · 28 comments
Assignees
Labels
area/platform/gce priority/important-soon sig/cluster-lifecycle

Comments

@alex-mohr
Contributor

I set up a 1000 node cluster on GKE using standard GCE advanced route networking. It wasn't possible for @dchen1107 to ssh to a node because gcloud was unable to add her ssh key to the project -- rate limit exceeded. Note that some other calls like list instances and resizing the MIG were not impacted, so it might be a particular category of API call that's getting rate limited?

We need to make sure 1000 node k8s clusters don't trigger GCE's rate limiter and make the user unable to use parts of the GCE api for their project.

Is there a vaguely-sane way to understand how much GCE API traffic each of our N controller objects is sending? Do we, e.g., have Prometheus metrics of sent calls (and perhaps result codes for success/error)?

Note this may share an underlying cause with #21561 where large numbers of GCE calls result in api errors.
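
For what it's worth, the kind of counter being asked about could look roughly like the sketch below, assuming the Prometheus Go client. The metric name, labels, and the recordGCECall helper are illustrative only, not something the GCE provider actually exports today.

```go
package gce

import (
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
)

// gceAPIRequests counts outgoing GCE API calls by method and HTTP status code.
// Metric and helper names here are hypothetical, for illustration only.
var gceAPIRequests = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "gce_api_requests_total",
		Help: "Number of GCE API calls, partitioned by method and HTTP status code.",
	},
	[]string{"method", "code"},
)

func init() {
	prometheus.MustRegister(gceAPIRequests)
}

// recordGCECall would be invoked from each cloud-provider wrapper around a GCE call.
func recordGCECall(method string, httpCode int) {
	gceAPIRequests.WithLabelValues(method, strconv.Itoa(httpCode)).Inc()
}
```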

@alex-mohr alex-mohr added priority/important-soon and team/control-plane labels Feb 19, 2016
@alex-mohr alex-mohr added this to the v1.2 milestone Feb 19, 2016
@alex-mohr
Contributor Author

@wojtek-t @gmarek @kubernetes/goog-control-plane for 1k nodes and calling GCE api excessively

@gmarek
Contributor

gmarek commented Feb 19, 2016

I don't think we have metrics for it. I'll try to add this on Monday.

@alex-mohr
Contributor Author

FWIW, some initial internal numbers show we did 1136411 compute.instances.get, 966858 compute.routes.insert, and 11360 compute.routes.list. I thought I had used a version of k8s that included @cjcullen 's #21172 and #21012 patches, but perhaps not. I'll repro using a build from 9am today.

@alex-mohr
Contributor Author

@gmarek found a way to see method calls: go to console, API Manager, Enabled APIs, Google Compute Engine, Usage, and toggle the "Response Codes" dropdown to be "Methods".

@alex-mohr
Contributor Author

... and why are there so many compute.instances.get calls? If it's the master, can it get by with a compute.instances.list? And if it's the kubelet making them, perhaps we can either throttle that rate down or change the logic to rely on the metadata server in the VM or on the k8s master instead?

@gmarek
Contributor

gmarek commented Feb 19, 2016

I think some of it may be the NodeController, which queries the cloud provider every sync period, but that seems way too much for the NodeController itself. Does Kubelet communicate with the cloud provider in any way? @dchen1107

@alex-mohr
Contributor Author

FYI, I'm in the middle of turning up a new 1k node cluster using today's ~9am cut.

@davidopp
Member

cc/ @roberthbailey

@davidopp
Member

AFAICT NodeController only calls into the cloud provider from one place,
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/node/nodecontroller.go#L471
but this should happen only once every nc.nodeMonitorPeriod, which defaults to 5 seconds. And it's just one call (every 5s), not one call per node.
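
For reference, that pattern boils down to something like the sketch below: a single periodic sync against the cloud provider, not a per-node call. This is a simplified illustration using the pkg/util/wait helper, not the actual nodecontroller.go code, and syncCloudNodes is a hypothetical stand-in.

```go
package node

import (
	"time"

	"k8s.io/kubernetes/pkg/cloudprovider"
	"k8s.io/kubernetes/pkg/util/wait"
)

// runCloudSyncLoop is a simplified sketch of the pattern described above:
// one cloud-provider sync per nodeMonitorPeriod, not one call per node.
// syncCloudNodes is a hypothetical stand-in for the real sync logic.
func runCloudSyncLoop(cloud cloudprovider.Interface, nodeMonitorPeriod time.Duration,
	stop <-chan struct{}, syncCloudNodes func(cloudprovider.Instances) error) {
	wait.Until(func() {
		instances, ok := cloud.Instances()
		if !ok {
			return // provider does not expose an Instances interface
		}
		// One sync covers the whole cluster for this period.
		_ = syncCloudNodes(instances)
	}, nodeMonitorPeriod, stop)
}
```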

@davidopp
Member

I think I found a problem in route controller (see #21561). It would also explain the bazillion calls to compute.instances.get, since GCECloud.CreateRoute() calls GCECloud.getInstanceByName() which calls gce.service.Instances.Get().

I looked through the Kubelet code for calls to the cloud provider and didn't find anything worrisome.
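
To illustrate the amplification, here is a sketch of that call pattern, assuming the google.golang.org/api/compute/v1 client. It is not the literal GCECloud code, and instanceNameFromRoute is a stand-in helper for this sketch.

```go
package gce

import (
	"strings"

	compute "google.golang.org/api/compute/v1"
)

// createRoutes sketches the amplification described above: every route
// creation resolves its target instance with a separate Instances.Get, so N
// routes cost roughly 2N API calls (N Gets plus N Inserts). Not the real code.
func createRoutes(svc *compute.Service, project, zone string, routes []*compute.Route) error {
	for _, r := range routes {
		// CreateRoute -> getInstanceByName -> gce.service.Instances.Get
		if _, err := svc.Instances.Get(project, zone, instanceNameFromRoute(r)).Do(); err != nil {
			return err
		}
		// ...followed by one Routes.Insert per route.
		if _, err := svc.Routes.Insert(project, r).Do(); err != nil {
			return err
		}
	}
	return nil
}

// instanceNameFromRoute is a stand-in helper for this sketch.
func instanceNameFromRoute(r *compute.Route) string {
	parts := strings.Split(r.NextHopInstance, "/")
	return parts[len(parts)-1]
}
```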

@goltermann
Contributor

Everything looks merged on #21561. Anything remaining there or here?

@alex-mohr
Contributor Author

I'm doing one last test to verify now that #21561 has merged; I'll mark this closed and re-open if any other issues come up.

@alex-mohr
Contributor Author

Oops, for posterity, it was PR #22094 together with PR #22044 that seems to have fixed both issues.

@alex-mohr alex-mohr reopened this Mar 1, 2016
@roberthbailey
Contributor

@alex-mohr did you mean to reopen this with your last comment?

@davidopp
Member

davidopp commented Mar 2, 2016

I think he did, but the general opinion (not yet verified, AFAIK) was that the more recent problem he was having was that his master size was too small. So maybe he can verify that and then close this if that's the case.

We should probably add an optional flag to kube-up that asks it to pick MASTER_SIZE for you based on NUM_NODES. At the very least, we should add some recommended master sizes (ideally for both GCP and AWS) to
https://github.com/kubernetes/kubernetes/blob/master/docs/admin/cluster-large.md

@gmarek
Contributor

gmarek commented Mar 2, 2016

@davidopp PTAL #22261

@alex-mohr
Contributor Author

Sorry re: the last comment. There are still a bunch of rate-limit-exceeded errors when creating a cluster with an n1-standard-32 master, but it does seem to eventually come up, and the errors eventually stop. It's not great, but it does eventually result in a functional cluster, so it may be enough to unblock 1.2 and pick improvements into 1.2.x?

FWIW, I suspect the current triggers for rate limits are some of the following:

  • GetInstanceByName on each item in the node or route list, where GetInstanceByName makes a (synchronous) call to gce.service.Instances.Get. I didn't read the logic closely enough to see whether that's necessary; if it is, perhaps we could replace it with a single gce.service.Instances.List before the loop (see the sketch after this list)?
  • We spawn a bunch of operations (like creating routes) in goroutines, and each one independently polls its operation for completion. I think we'd benefit from some form of shared rate limiter on polling operations to cap that at something like 5 qps, or from some form of backoff when we get that error. As is, the client seems to speed up when it gets rate-limit-exceeded errors, because the GCE API rejects them faster than it processes admitted requests -- yay.
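
A minimal sketch of the first suggestion (one List per zone instead of one Get per node), assuming the google.golang.org/api/compute/v1 client; the function name is illustrative, not an existing GCECloud method:

```go
package gce

import compute "google.golang.org/api/compute/v1"

// listInstancesByName sketches the "List once, then look up" idea: a single
// (paged) Instances.List per zone replaces one Instances.Get per node.
// Illustrative only; not an existing GCECloud method.
func listInstancesByName(svc *compute.Service, project, zone string) (map[string]*compute.Instance, error) {
	byName := make(map[string]*compute.Instance)
	call := svc.Instances.List(project, zone)
	for {
		page, err := call.Do()
		if err != nil {
			return nil, err
		}
		for _, inst := range page.Items {
			byName[inst.Name] = inst
		}
		if page.NextPageToken == "" {
			return byName, nil
		}
		call = call.PageToken(page.NextPageToken)
	}
}
```

The route-creation loop could then consult the returned map instead of issuing a synchronous Instances.Get per node.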

@roberthbailey
Contributor

Have you tested since #22099 was merged? Prior to that PR every node was making calls to the GCE API on startup.

@alex-mohr
Contributor Author

@roberthbailey Yes, it was sync from Monday evening, so did include #22099.

@roberthbailey
Contributor

Bummer.

@alex-mohr
Contributor Author

@lavalamp Perhaps you'd be interested in extending the charter of the (Google) CSI team to include the GCE cloudprovider bits? Seems like there's a lot of overlap in terms of approach between preventing k8s clients from swamping k8s master and preventing k8s from swamping GCE?

@justinsb
Member

justinsb commented Mar 2, 2016

Did you see the shared-rate limiter we put in for AWS? It's not the bestest, but maybe we could use it for GCE also! Maybe with some improvements from getting more eyes on it!

https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/retry_handler.go
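
On the GCE side, a comparable shared limiter on operation polling might look roughly like the sketch below, using golang.org/x/time/rate for illustration. The AWS retry_handler linked above implements its own backoff/delay logic; this is not that code, and the type here is hypothetical.

```go
package gce

import (
	"context"
	"time"

	"golang.org/x/time/rate"
	compute "google.golang.org/api/compute/v1"
)

// operationPoller shares one token bucket across every goroutine that polls a
// GCE operation, so N concurrent route creations generate at most ~qps polling
// calls in aggregate. Hypothetical sketch, not the actual cloud-provider code.
type operationPoller struct {
	svc     *compute.Service
	limiter *rate.Limiter // e.g. rate.NewLimiter(5, 5) for ~5 qps with a small burst
}

// waitForGlobalOperation polls a global operation (routes are global resources)
// until it reports DONE, pacing each poll through the shared limiter.
func (p *operationPoller) waitForGlobalOperation(ctx context.Context, project, opName string) error {
	for {
		if err := p.limiter.Wait(ctx); err != nil { // blocks until a token is available
			return err
		}
		op, err := p.svc.GlobalOperations.Get(project, opName).Do()
		if err != nil {
			return err
		}
		if op.Status == "DONE" {
			return nil
		}
		time.Sleep(time.Second) // simple fixed pause between polls; backoff could go here
	}
}
```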

@davidopp
Member

davidopp commented Mar 2, 2016

I think it makes sense to keep all of the cloud provider stuff as part of control-plane team for now, and I agree with Justin we should look at these problems holistically since the same themes will probably crop up on every cloud provider.

Whether anyone (on any team) has time to make the improvements in time for 1.2 is another story....

@lavalamp
Member

lavalamp commented Mar 2, 2016

agree w/ @davidopp. Happy to consult though, if needed/desired.

@davidopp davidopp added priority/backlog and removed priority/important-soon labels Mar 7, 2016
@davidopp
Member

Moving to next-candidate, upgrading to P1.

@davidopp davidopp modified the milestones: next-candidate, v1.2 Mar 15, 2016
@davidopp davidopp added priority/important-soon and removed priority/backlog labels Mar 15, 2016
@k8s-github-robot k8s-github-robot added the needs-sig label May 31, 2017
@spiffxp
Member

spiffxp commented Jun 15, 2017

/sig cluster-lifecycle
/area platform/gce
since we lack a sig-gcp at the moment

@k8s-ci-robot k8s-ci-robot added sig/cluster-lifecycle and area/platform/gce labels Jun 15, 2017
@k8s-github-robot k8s-github-robot removed the needs-sig label Jun 15, 2017
@spiffxp
Member

spiffxp commented Jun 15, 2017

/assign
actually, I think this was fixed by the linked PR

@spiffxp
Member

spiffxp commented Jun 15, 2017

/close
