1000 node cluster: GCE API rate limit exceeded #21563
Comments
I don't think we have metrics for it. I'll try to add this on Monday.
@gmarek found a way to see method calls: go to console, API Manager, Enabled APIs, Google Compute Engine, Usage, and toggle the "Response Codes" dropdown to be "Methods".
... and why are there so many compute.instances.get calls? If it's the master, can it get by with a compute.instances.list? And if the kubelet is making them, perhaps we can either throttle that rate down or change the logic to rely on the metadata server in the VM or the k8s master instead?
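(For illustration, a minimal sketch of the list-instead-of-get idea using the google.golang.org/api/compute/v1 client; the project and zone strings are placeholders, and this isn't the actual cloudprovider code.)

```go
package main

import (
	"context"
	"log"

	compute "google.golang.org/api/compute/v1"
)

// listInstancesOnce fetches every instance in a zone with a single
// compute.instances.list call (plus pagination), instead of issuing one
// compute.instances.get per node.
func listInstancesOnce(ctx context.Context, project, zone string) (map[string]*compute.Instance, error) {
	svc, err := compute.NewService(ctx)
	if err != nil {
		return nil, err
	}
	byName := make(map[string]*compute.Instance)
	// Pages walks all result pages, so a 1000-node cluster costs a handful
	// of API calls rather than 1000.
	err = svc.Instances.List(project, zone).Pages(ctx, func(page *compute.InstanceList) error {
		for _, inst := range page.Items {
			byName[inst.Name] = inst
		}
		return nil
	})
	if err != nil {
		return nil, err
	}
	return byName, nil
}

func main() {
	instances, err := listInstancesOnce(context.Background(), "my-project", "us-central1-b")
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("fetched %d instances in one list call", len(instances))
}
```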
I think some of it may be NodeController, which queries the cloud provider every sync period, but it seems way too much for NodeController itself. Does Kubelet communicate with the cloud provider in any way? @dchen1107
FYI, I'm in the middle of turning up a new 1k-node cluster using today's ~9am cut.
cc/ @roberthbailey
AFAICT NodeController only calls into the cloud provider from one place.
I think I found a problem in the route controller (see #21561). It would also explain the bazillion calls to compute.instances.get, since GCECloud.CreateRoute() calls GCECloud.getInstanceByName(), which calls gce.service.Instances.Get(). I looked through the Kubelet code for calls to the cloud provider and didn't find anything worrisome.
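(One hedged sketch of a fix direction for that amplification: memoize the instance lookups so a burst of route creations doesn't turn into one compute.instances.get per node. The cache type and fetch hook below are made up, and a real version would need expiry/invalidation so routes aren't created against stale instance data.)

```go
package gce

import (
	"sync"

	compute "google.golang.org/api/compute/v1"
)

// instanceCache is a hypothetical memoization layer in front of
// gce.service.Instances.Get, so repeated CreateRoute calls for the same
// nodes hit the API at most once per instance name.
type instanceCache struct {
	mu     sync.Mutex
	byName map[string]*compute.Instance
	fetch  func(name string) (*compute.Instance, error) // wraps Instances.Get
}

func (c *instanceCache) getInstanceByName(name string) (*compute.Instance, error) {
	c.mu.Lock()
	if inst, ok := c.byName[name]; ok {
		c.mu.Unlock()
		return inst, nil // cache hit: no API call
	}
	c.mu.Unlock()
	inst, err := c.fetch(name) // cache miss: one compute.instances.get
	if err != nil {
		return nil, err
	}
	c.mu.Lock()
	c.byName[name] = inst
	c.mu.Unlock()
	return inst, nil
}
```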
Everything looks merged on #21561. Anything remaining there or here?
I'm doing one last test to verify now that #21561 has merged, but I'll mark this closed and re-open if any other issues come up.
@alex-mohr did you mean to reopen this with your last comment?
I think he did, but I think the general opinion (not yet verified AFAIK) was that the more recent problem he was having was because his master size was too small. So maybe he can verify that and then close if it is the case. We should probably add an optional flag to
Sorry re: last comment. There are still a bunch of rate-limit-exceeded errors when creating a cluster using an n1-standard-32 master, but it seems to eventually come up, and the rate-limit errors eventually stop. It's not great, but it does eventually result in a functional cluster, so that may be enough to unblock 1.2 and pick improvements into 1.2.x? FWIW, I suspect the current triggers for rate limits are some of the following:
Have you tested since #22099 was merged? Prior to that PR every node was making calls to the GCE API on startup.
@roberthbailey Yes, it was a sync from Monday evening, so it did include #22099.
Bummer.
@lavalamp Perhaps you'd be interested in extending the charter of the (Google) CSI team to include the GCE cloudprovider bits? Seems like there's a lot of overlap in terms of approach between preventing k8s clients from swamping the k8s master and preventing k8s from swamping GCE?
Did you see the shared rate limiter we put in for AWS? It's not the bestest, but maybe we could use it for GCE also! Maybe with some improvements from getting more eyes on it!
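(For reference, a rough sketch of the shared-limiter idea using golang.org/x/time/rate rather than the actual AWS helper; the 10 QPS / burst-20 numbers are invented for illustration.)

```go
package gce

import (
	"context"

	compute "google.golang.org/api/compute/v1"
	"golang.org/x/time/rate"
)

// rateLimitedService shares one token bucket across every controller that
// talks to GCE, in the same spirit as the AWS limiter mentioned above.
type rateLimitedService struct {
	limiter *rate.Limiter
	svc     *compute.Service
}

func newRateLimitedService(svc *compute.Service) *rateLimitedService {
	// 10 QPS with a burst of 20 -- illustrative numbers only.
	return &rateLimitedService{limiter: rate.NewLimiter(rate.Limit(10), 20), svc: svc}
}

// getInstance blocks until the shared bucket has a token, then issues the
// call, so N controllers can't collectively exceed the configured QPS.
func (s *rateLimitedService) getInstance(ctx context.Context, project, zone, name string) (*compute.Instance, error) {
	if err := s.limiter.Wait(ctx); err != nil {
		return nil, err
	}
	return s.svc.Instances.Get(project, zone, name).Context(ctx).Do()
}
```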
I think it makes sense to keep all of the cloud provider stuff as part of control-plane team for now, and I agree with Justin we should look at these problems holistically since the same themes will probably crop up on every cloud provider. Whether anyone (on any team) has time to make the improvements in time for 1.2 is another story...
agree w/ @davidopp. Happy to consult though, if needed/desired.
Moving to next-candidate, upgrading to P1.
/sig cluster-lifecycle
/assign
/close
I set up a 1000 node cluster on GKE using standard GCE advanced route networking. It wasn't possible for @dchen1107 to ssh to a node because gcloud was unable to add her ssh key to the project -- rate limit exceeded. Note that some other calls like list instances and resizing the MIG were not impacted, so it might be a particular category of API call that's getting rate limited?
We need to make sure 1000-node k8s clusters don't trigger GCE's rate limiter and leave the user unable to use parts of the GCE API for their project.
Is there a vaguely sane way to understand how much GCE API traffic each of our N controller objects is sending? Do we e.g. have Prometheus metrics of sent calls (and perhaps the result code for success/error)?
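(If we did add such metrics, a hypothetical instrumentation sketch might look like the following; the metric name and wrapper are made up, not existing cloudprovider code.)

```go
package gce

import (
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
	"google.golang.org/api/googleapi"
)

// gceAPIRequests counts GCE API calls; per-method and per-code labels would
// answer exactly this question: which controller's calls dominate, and how
// many come back rate-limited.
var gceAPIRequests = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "gce_api_requests_total",
		Help: "GCE API requests by method and HTTP response code.",
	},
	[]string{"method", "code"},
)

func init() {
	prometheus.MustRegister(gceAPIRequests)
}

// recordCall classifies the outcome of a GCE API call and bumps the counter.
func recordCall(method string, err error) {
	code := "200"
	if apiErr, ok := err.(*googleapi.Error); ok {
		code = strconv.Itoa(apiErr.Code) // e.g. 403 for rateLimitExceeded
	} else if err != nil {
		code = "error"
	}
	gceAPIRequests.WithLabelValues(method, code).Inc()
}
```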
Note this may share an underlying cause with #21561, where large numbers of GCE calls result in API errors.