Avoid using tag filters for EC2 API where possible #76749
Conversation
```go
baseFilters := []*ec2.Filter{newEc2Filter("instance-state-name", "running")}
filters := c.tagging.addFilters(baseFilters)
di, err := c.describeInstances(filters)
```
Can you please explain in more detail why this is going to help?
I think AWS rate limits by number of calls, and it looks like this is going to double the DescribeInstances call count when provisioning a volume.
There are two rate limits at play here: the rate limit for DescribeInstances itself, and an overload-protection rate limit for some downstream services used by DescribeInstances. The previous tag query was particularly inefficient and triggered the downstream overload-protection limit, which cannot be raised. It is significantly more efficient to make two DescribeInstances calls, each with a simpler tag filter, than one call with the combined filter. It should be possible to raise the DescribeInstances rate limit if a user still hits that limit with this update.
Thanks @mcrute
I double-checked the code. Since this function is only used in volume creation, and it is just looking for the set of zones that have running instances from the cluster, would using the kube client (aws.Cloud already has this) to fetch that information be enough? That would not need to make a cloud provider call at all. We can start with a List from the API server directly and look at optimizing by listing from the cache.
It is true that there could be a race between a node dying and the node actually disappearing from the API server, but under such conditions more higher-level reconciliation will be triggered, so we are good.
I want to see what other people think about it.
Looks like getting the zone from Kubernetes works. We tried #76976 (same logic but slightly different function calls, in the version we use, 1.12.3). Before, we were only able to create 1 PVC every ~5 seconds; now, with the reduced calls, we can create 50 PVCs all at once and get all volumes bound within 30 seconds, with all pods running within 90 seconds.
```diff
- // There are no "or" filters by key, so we look for both the legacy and new key, and then we have to post-filter
- f := newEc2Filter("tag-key", TagNameKubernetesClusterLegacy, t.clusterTagKey())
+ f := newEc2Filter("tag-key", t.clusterTagKey())
```
Out of curiosity, why do we use "tag-key" as the filter name here, but "tag:{name}" in addLegacyFilters?
I have this backwards... will swap the lines and update.
```go
// 1.5 -> 1.6 clusters and exists for backwards compatibility
//
// This lets us run multiple k8s clusters in a single EC2 AZ
func (t *awsTagging) addLegacyFilters(filters []*ec2.Filter) []*ec2.Filter {
```
I see you are removing quite a few legacy filters from various places, including the load balancer, routes, and subnets. Is this going to change behavior and cause backward-compatibility issues?
Removing these filters may result in a few more items being returned from the API calls, but those extra items will be filtered out in the existing filtering loops below each change in this PR. The trade-off is that these queries are much more efficient within the EC2 API and are less likely to trigger internal rate limits, as was the case with DescribeInstances. Given that we're already doing a second pass of filtering, this code is entirely backwards compatible.
Also worth noting: many of the existing filters are incredibly specific and should return very few items, so the additional tag filtering within the API was redundant.
I see, looks good for recent versions. @justinsb could you please check this as well, to see if clusters provisioned by legacy kube-up (not sure if we still care about it) and old versions of kops are fine?
/approve
/lgtm
Since there are no objections from the community, shall we proceed and merge it in? @mcrute @micahhausler
For very large clusters these tag filters are not efficient within the EC2 API and will result in rate limiting. Most of these queries have filters that are targeted narrowly enough that the elimination of the tags filter will not return significantly more data but will be executed more efficiently by the EC2 API. Additionally, some API wrappers did not support pagination despite the underlying API calls being paginated. This change adds pagination to prevent truncating the returned results.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mcrute, micahhausler The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@mcrute: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Copy-pasted accidentally in kubernetes#76749
…-upstream-release-1.14 Automated cherry pick of #76749: Avoid using tag filters for EC2 API where possible
What type of PR is this?
/kind bug
What this PR does / why we need it:
For very large clusters these tag filters are not efficient within the EC2 API and will result in rate limiting. Most of these queries have filters that are targeted narrowly enough that the elimination of the tags filter will not return significantly more data but will be executed more efficiently by the EC2 API and reduce the chance of throttling.
Additionally, some API wrappers did not support pagination despite the underlying API calls being paginated. This change adds pagination to prevent accidentally truncating the returned results.
This change, once approved, will also need to be back-ported to 1.13 and 1.14.
Which issue(s) this PR fixes:
Fixes #76747
Does this PR introduce a user-facing change?:
/sig aws
/sig cloud-provider
/priority critical-urgent
/assign @mcrute @justinsb