Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cloud provider AWS library should query instance by ID when possible #78140

Merged
merged 1 commit into from
Jul 12, 2019

Conversation

zhan849
Copy link
Contributor

@zhan849 zhan849 commented May 20, 2019

What type of PR is this?
/kind feature
/sig aws
/sig cloud-provider
/sig scalability

What this PR does / why we need it:
This PR changed the implementation of getInstanceByNodeName() in aws library to query instance from provider using instance id when possible, since it is much more efficient to get instance by ID in aws than run a filter. This PR checks if it is able to get provider id from v1.Node, if possible, it will get instance by node id, or it will fall back to running the query

We tested internally wit batch PVC provisioning benchmark and here are the improvements shown that makes k8s more scalable in aws environment:

  • Reduced more than 90% DescribeInstance calls (1638 calls including lots of retries -> 116 calls all successful with batch size 50)
  • Latency of each DescribeInstance call also got reduced
  • Before this change, peak DescribeInstances QPS grows linearly and ~= batch size, with this change, peak DescribeInstances QPS stays stably at ~10 calls/sec

Which issue(s) this PR fixes:
Fixes #78136

This is also the first step towards fixing #78409 , by introduce informer-fed node cache to aws library.

Special notes for your reviewer:
Regarding the tradeoff of offloading the workload to kubernetes:

  1. Node querying has limited frequency, mostly during node controller reconciliation, and the only bursts are during batch infrastructure provisioning (i.e. create a batch of 100 PVC with ebs, aggressive autoscaling by adding 100 nodes to cluster, etc). Get() calls are also relatively cheap. Therefore, load added to api server is limited, and can be optimized further by introducing informer interface to aws cloud provider

  2. Usually in production, each aws account would have multiple Kubernetes clusters and other non-k8s clusters, but aws related api call throttles are account-wide, so that is to say, if one kubernetes cluster is making inefficient call to cloud provider, other k8s/non-k8s clusters will be affected as well. so it worth the tradeoff and affects of burst in k8s calls can be limited to k8s cluster level, with proper retry mechanism (which we already have) control plane can make progress eventually

Does this PR introduce a user-facing change?:

none

Release note:

Optimize EC2 DescribeInstances API calls in aws cloud provider library by querying instance ID instead of EC2 filters when possible

/assign @zhan849 @mcrute @micahhausler
/cc @justinsb @jsafrane

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. sig/aws do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. labels May 20, 2019
@k8s-ci-robot k8s-ci-robot requested a review from jsafrane May 20, 2019 23:38
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 20, 2019
@k8s-ci-robot k8s-ci-robot requested a review from justinsb May 20, 2019 23:38
@k8s-ci-robot k8s-ci-robot added sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels May 20, 2019
@k8s-ci-robot
Copy link
Contributor

Hi @zhan849. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. area/cloudprovider labels May 20, 2019
@zhan849 zhan849 force-pushed the aws-get-instance-by-id branch from f506114 to 20c2c4f Compare May 21, 2019 02:54
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 21, 2019
@mcrute
Copy link
Contributor

mcrute commented May 21, 2019

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 21, 2019
@mcrute
Copy link
Contributor

mcrute commented May 21, 2019

@zhan849 could you please add some release notes as well... something along the lines of (feel free to re-word this):

reduce the number of EC2 DescribeInstances API calls by querying the instance ID from the API server

@mcrute
Copy link
Contributor

mcrute commented May 21, 2019

/priority important-soon

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels May 21, 2019
@zhan849 zhan849 force-pushed the aws-get-instance-by-id branch from 20c2c4f to 08b4390 Compare May 21, 2019 14:49
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels May 21, 2019
@k8s-ci-robot k8s-ci-robot added this to the v1.16 milestone Jun 28, 2019
@zhan849 zhan849 force-pushed the aws-get-instance-by-id branch from 631e65b to 37b59d2 Compare June 28, 2019 17:38
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 28, 2019
@zhan849 zhan849 force-pushed the aws-get-instance-by-id branch from 37b59d2 to 1c94382 Compare June 28, 2019 19:50
@zhan849
Copy link
Contributor Author

zhan849 commented Jun 28, 2019

/retest

@zhan849
Copy link
Contributor Author

zhan849 commented Jun 28, 2019

added informer-fed node cache to this PR, added tests. all checks passed.
@jsafrane @justinsb @mcrute

@jsafrane
Copy link
Member

jsafrane commented Jul 1, 2019

lgtm-ish, but I'd prefer second pair of yes on this. @justinsb @gnufied @liggitt

}

node, err := c.nodeInformer.Lister().Get(string(nodeName))
if err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this require a RBAC policy change so as AWS cloudprovider can list nodes?

Copy link
Contributor Author

@zhan849 zhan849 Jul 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question! I checked that after v1.15, we have deprecated auto-provisioned cloud provider role, and have user to create them #66635

I will modify the release note about the face that we need "list" permission now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, @justinsb we might want to modify kops in this case as well? kubernetes/kops#6899

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not remember if we need a RBAC rule if we are using a shared informer. If we do not - we are probably fine. We need to confirm that...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually @justinsb I did some tests and verified that we do not need additional RBAC rules for system:aws-cloud-provider to use shared informer. Our clusters are provisioned using kops and all roles contain default values

@zhan849 zhan849 force-pushed the aws-get-instance-by-id branch from 1c94382 to 6ae7620 Compare July 1, 2019 16:14
@gnufied
Copy link
Member

gnufied commented Jul 1, 2019

Two more questions:

fix kubelet log flushing issue in azure disk

Why is this line in PR description?

Reduced more than 90% DescribeInstance calls (1638 calls including lots of retries -> 116 calls all successful with batch size 50)

I understand fetching by instance-id is more efficient but how does this PR reduces number of DescribeInstance calls? Can you explain more?

@zhan849
Copy link
Contributor Author

zhan849 commented Jul 1, 2019

Two more questions:

fix kubelet log flushing issue in azure disk

Why is this line in PR description?

Reduced more than 90% DescribeInstance calls (1638 calls including lots of retries -> 116 calls all successful with batch size 50)

I understand fetching by instance-id is more efficient but how does this PR reduces number of DescribeInstance calls? Can you explain more?

thanks @gnufied
for question 1 - fixed, might have been some mis-copied stuff.
for question 2 - it's mostly about retries, in practice, making describing call with filters instead of ids at a high frequency are more likely to trigger throttling or 500+ return code, and result in long tail of backoff and retries (lots of them in our practice), but actual amount of gain would depend on other activity in the same aws account

@gnufied
Copy link
Member

gnufied commented Jul 1, 2019

making describing call with filters instead of ids at a high frequency are more likely to trigger throttling or 500+ return code, and result in long tail of backoff and retries (lots of them in our practice).

Is there any AWS documentation for this?

@zhan849
Copy link
Contributor Author

zhan849 commented Jul 1, 2019

making describing call with filters instead of ids at a high frequency are more likely to trigger throttling or 500+ return code, and result in long tail of backoff and retries (lots of them in our practice).

Is there any AWS documentation for this?

@gnufied I cannot find any - higher failure rate for filter is just our observation and is likely to be account specific

@gnufied
Copy link
Member

gnufied commented Jul 1, 2019

I cannot find any - higher failure rate for filter is just our observation and is likely to be account specific

Okay, I think the change is reasonable. I would like us to confirm RBAC rule change, whether we need it or not.

@zhan849
Copy link
Contributor Author

zhan849 commented Jul 2, 2019

I cannot find any - higher failure rate for filter is just our observation and is likely to be account specific

Okay, I think the change is reasonable. I would like us to confirm RBAC rule change, whether we need it or not.

@gnufied I did some test and verified that we do not need additional RBAC for aws:cloud-provider to shared informer

@gnufied
Copy link
Member

gnufied commented Jul 2, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 2, 2019
@gnufied
Copy link
Member

gnufied commented Jul 12, 2019

I think we had enough opportunity for other approves to take a look at it.

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gnufied, zhan849

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 12, 2019
@zhan849
Copy link
Contributor Author

zhan849 commented Jul 12, 2019

/retest

1 similar comment
@zhan849
Copy link
Contributor Author

zhan849 commented Jul 12, 2019

/retest

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cloudprovider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cloudprovider AWS should query instance by ID when possible
8 participants