Avoid using tag filters for EC2 API where possible #76749
Conversation
```go
baseFilters := []*ec2.Filter{newEc2Filter("instance-state-name", "running")}
filters := c.tagging.addFilters(baseFilters)
di, err := c.describeInstances(filters)
```
Can you please explain in more detail why this is going to help?
I think AWS rate limits by number of calls, and it looks like this is going to double the DescribeInstances call count when provisioning a volume.
There are two rate limits at play here: the rate limit for DescribeInstances itself, and an overload-protection rate limit for some downstream services used by DescribeInstances. The previous tag query was particularly inefficient and triggered the downstream overload-protection limit, which cannot be raised. It is significantly more efficient to make two DescribeInstances calls, each with a simpler tag filter, than one call with the combined filter. It should be possible to raise the DescribeInstances rate limit if a user still hits that limit with this update.
Thanks @mcrute
I double-checked the code. Since this function is only used in volume creation, and it is just looking for the set of zones that have running instances from the cluster, would using the kube client (aws.Cloud already has this) to fetch that information be enough? That would not need to make a cloud provider call at all. We can start with a List from the API server directly and look at optimizing by listing from the cache.
It is true that there could be a race between a node dying and the node actually disappearing from the API server, but under such conditions more higher-level reconciliation will be triggered, so we are good.
I want to see what other people think about it.
Looks like getting the zone from Kubernetes works. We tried #76976 (same logic but slightly different function calls, in the version we use, 1.12.3). Before, we were only able to create 1 PVC every ~5 seconds; now, with the reduced calls, we can create 50 PVCs all at once and get all volumes bound within 30 seconds, with all pods running within 90 seconds.
```diff
- // There are no "or" filters by key, so we look for both the legacy and new key, and then we have to post-filter
- f := newEc2Filter("tag-key", TagNameKubernetesClusterLegacy, t.clusterTagKey())
+ f := newEc2Filter("tag-key", t.clusterTagKey())
```
Out of curiosity, why do we use "tag-key" as the filter name here, but "tag:{name}" in addLegacyFilters?
I have this backwards... will swap the lines and update.
```go
// 1.5 -> 1.6 clusters and exists for backwards compatibility
//
// This lets us run multiple k8s clusters in a single EC2 AZ
func (t *awsTagging) addLegacyFilters(filters []*ec2.Filter) []*ec2.Filter {
```
I see you are removing quite a few legacy filters from various places, including the load balancer, routes, and subnets. Is this going to change behavior and cause backward-compatibility issues?
Removing these filters may result in a few more items being returned from the API calls, but those extra items will be filtered out in the existing filtering loops below each change in this PR. The trade-off is that these queries are much more efficient within the EC2 API and are less likely to trigger internal rate limits, as was the case with DescribeInstances. Given that we're already doing a second pass of filtering, this code is entirely backwards compatible.
Also worth noting: many of the existing filters are incredibly specific and should return very few items, so the additional tag filtering within the API was redundant.
I see, looks good for recent versions. @justinsb could you please check this as well, to see if clusters provisioned by legacy kube-up (not sure if we still care about it) and old versions of kops are fine?
/approve
/lgtm
Since there are no objections from the community, shall we proceed and merge it in? @mcrute @micahhausler
For very large clusters these tag filters are not efficient within the EC2 API and will result in rate limiting. Most of these queries have filters that are targeted narrowly enough that the elimination of the tags filter will not return significantly more data but will be executed more efficiently by the EC2 API. Additionally, some API wrappers did not support pagination despite the underlying API calls being paginated. This change adds pagination to prevent truncating the returned results.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mcrute, micahhausler The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@mcrute: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Copy-pasted accidentally in kubernetes#76749
…-upstream-release-1.14 Automated cherry pick of #76749: Avoid using tag filters for EC2 API where possible
What type of PR is this?
/kind bug
What this PR does / why we need it:
For very large clusters these tag filters are not efficient within the EC2 API and will result in rate limiting. Most of these queries have filters that are targeted narrowly enough that the elimination of the tags filter will not return significantly more data but will be executed more efficiently by the EC2 API and reduce the chance of throttling.
Additionally, some API wrappers did not support pagination despite the underlying API calls being paginated. This change adds pagination to prevent accidentally truncating the returned results.
This change, once approved, will also need to be back-ported to 1.13 and 1.14.
Which issue(s) this PR fixes:
Fixes #76747
Does this PR introduce a user-facing change?:
/sig aws
/sig cloud-provider
/priority critical-urgent
/assign @mcrute @justinsb