Cloud provider AWS library should query instance by ID when possible #78140

zhan849 · 2019-05-20T23:38:29Z

What type of PR is this?
/kind feature
/sig aws
/sig cloud-provider
/sig scalability

What this PR does / why we need it:
This PR changed the implementation of getInstanceByNodeName() in aws library to query instance from provider using instance id when possible, since it is much more efficient to get instance by ID in aws than run a filter. This PR checks if it is able to get provider id from v1.Node, if possible, it will get instance by node id, or it will fall back to running the query

We tested internally wit batch PVC provisioning benchmark and here are the improvements shown that makes k8s more scalable in aws environment:

Reduced more than 90% DescribeInstance calls (1638 calls including lots of retries -> 116 calls all successful with batch size 50)
Latency of each DescribeInstance call also got reduced
Before this change, peak DescribeInstances QPS grows linearly and ~= batch size, with this change, peak DescribeInstances QPS stays stably at ~10 calls/sec

Which issue(s) this PR fixes:
Fixes #78136

This is also the first step towards fixing #78409 , by introduce informer-fed node cache to aws library.

Special notes for your reviewer:
Regarding the tradeoff of offloading the workload to kubernetes:

Node querying has limited frequency, mostly during node controller reconciliation, and the only bursts are during batch infrastructure provisioning (i.e. create a batch of 100 PVC with ebs, aggressive autoscaling by adding 100 nodes to cluster, etc). Get() calls are also relatively cheap. Therefore, load added to api server is limited, and can be optimized further by introducing informer interface to aws cloud provider
Usually in production, each aws account would have multiple Kubernetes clusters and other non-k8s clusters, but aws related api call throttles are account-wide, so that is to say, if one kubernetes cluster is making inefficient call to cloud provider, other k8s/non-k8s clusters will be affected as well. so it worth the tradeoff and affects of burst in k8s calls can be limited to k8s cluster level, with proper retry mechanism (which we already have) control plane can make progress eventually

Does this PR introduce a user-facing change?:

none

Release note:

Optimize EC2 DescribeInstances API calls in aws cloud provider library by querying instance ID instead of EC2 filters when possible

/assign @zhan849 @mcrute @micahhausler
/cc @justinsb @jsafrane

k8s-ci-robot · 2019-05-20T23:38:36Z

Hi @zhan849. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

mcrute · 2019-05-21T09:29:36Z

/ok-to-test

mcrute · 2019-05-21T09:32:09Z

@zhan849 could you please add some release notes as well... something along the lines of (feel free to re-word this):

reduce the number of EC2 DescribeInstances API calls by querying the instance ID from the API server

mcrute · 2019-05-21T09:33:36Z

/priority important-soon

zhan849 · 2019-06-28T23:08:03Z

/retest

zhan849 · 2019-06-28T23:27:40Z

added informer-fed node cache to this PR, added tests. all checks passed.
@jsafrane @justinsb @mcrute

jsafrane · 2019-07-01T09:12:33Z

lgtm-ish, but I'd prefer second pair of yes on this. @justinsb @gnufied @liggitt

staging/src/k8s.io/legacy-cloud-providers/aws/aws.go

gnufied · 2019-07-01T14:48:53Z

staging/src/k8s.io/legacy-cloud-providers/aws/aws.go

+	}
+
+	node, err := c.nodeInformer.Lister().Get(string(nodeName))
+	if err != nil {


Does this require a RBAC policy change so as AWS cloudprovider can list nodes?

Good question! I checked that after v1.15, we have deprecated auto-provisioned cloud provider role, and have user to create them #66635

I will modify the release note about the face that we need "list" permission now.

Also, @justinsb we might want to modify kops in this case as well? kubernetes/kops#6899

I do not remember if we need a RBAC rule if we are using a shared informer. If we do not - we are probably fine. We need to confirm that...

actually @justinsb I did some tests and verified that we do not need additional RBAC rules for system:aws-cloud-provider to use shared informer. Our clusters are provisioned using kops and all roles contain default values

gnufied · 2019-07-01T20:05:03Z

Two more questions:

fix kubelet log flushing issue in azure disk

Why is this line in PR description?

Reduced more than 90% DescribeInstance calls (1638 calls including lots of retries -> 116 calls all successful with batch size 50)

I understand fetching by instance-id is more efficient but how does this PR reduces number of DescribeInstance calls? Can you explain more?

zhan849 · 2019-07-01T20:46:21Z

Two more questions:

fix kubelet log flushing issue in azure disk

Why is this line in PR description?

Reduced more than 90% DescribeInstance calls (1638 calls including lots of retries -> 116 calls all successful with batch size 50)

I understand fetching by instance-id is more efficient but how does this PR reduces number of DescribeInstance calls? Can you explain more?

thanks @gnufied
for question 1 - fixed, might have been some mis-copied stuff.
for question 2 - it's mostly about retries, in practice, making describing call with filters instead of ids at a high frequency are more likely to trigger throttling or 500+ return code, and result in long tail of backoff and retries (lots of them in our practice), but actual amount of gain would depend on other activity in the same aws account

gnufied · 2019-07-01T20:57:14Z

making describing call with filters instead of ids at a high frequency are more likely to trigger throttling or 500+ return code, and result in long tail of backoff and retries (lots of them in our practice).

Is there any AWS documentation for this?

zhan849 · 2019-07-01T21:07:52Z

making describing call with filters instead of ids at a high frequency are more likely to trigger throttling or 500+ return code, and result in long tail of backoff and retries (lots of them in our practice).

Is there any AWS documentation for this?

@gnufied I cannot find any - higher failure rate for filter is just our observation and is likely to be account specific

gnufied · 2019-07-01T21:43:34Z

I cannot find any - higher failure rate for filter is just our observation and is likely to be account specific

Okay, I think the change is reasonable. I would like us to confirm RBAC rule change, whether we need it or not.

zhan849 · 2019-07-02T03:42:34Z

I cannot find any - higher failure rate for filter is just our observation and is likely to be account specific

Okay, I think the change is reasonable. I would like us to confirm RBAC rule change, whether we need it or not.

@gnufied I did some test and verified that we do not need additional RBAC for aws:cloud-provider to shared informer

gnufied · 2019-07-02T15:05:03Z

/lgtm

gnufied · 2019-07-12T13:59:20Z

I think we had enough opportunity for other approves to take a look at it.

/approve

k8s-ci-robot · 2019-07-12T13:59:58Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gnufied, zhan849

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~staging/src/k8s.io/legacy-cloud-providers/aws/OWNERS~~ [gnufied]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

zhan849 · 2019-07-12T15:37:23Z

/retest

zhan849 · 2019-07-12T16:14:12Z

/retest

k8s-ci-robot assigned mcrute, micahhausler and zhan849 May 20, 2019

k8s-ci-robot requested a review from jsafrane May 20, 2019 23:38

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 20, 2019

k8s-ci-robot requested a review from justinsb May 20, 2019 23:38

k8s-ci-robot added sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels May 20, 2019

k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. area/cloudprovider labels May 20, 2019

zhan849 force-pushed the aws-get-instance-by-id branch from f506114 to 20c2c4f Compare May 21, 2019 02:54

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 21, 2019

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 21, 2019

k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels May 21, 2019

zhan849 force-pushed the aws-get-instance-by-id branch from 20c2c4f to 08b4390 Compare May 21, 2019 14:49

k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels May 21, 2019

zhan849 mentioned this pull request May 22, 2019

AWS EBS provisioner should get node zone info from k8s #76976

Merged

k8s-ci-robot added this to the v1.16 milestone Jun 28, 2019

zhan849 force-pushed the aws-get-instance-by-id branch from 631e65b to 37b59d2 Compare June 28, 2019 17:38

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 28, 2019

zhan849 force-pushed the aws-get-instance-by-id branch from 37b59d2 to 1c94382 Compare June 28, 2019 19:50

gnufied reviewed Jul 1, 2019

View reviewed changes

staging/src/k8s.io/legacy-cloud-providers/aws/aws.go Outdated Show resolved Hide resolved

gnufied reviewed Jul 1, 2019

View reviewed changes

staging/src/k8s.io/legacy-cloud-providers/aws/aws.go Outdated Show resolved Hide resolved

gnufied reviewed Jul 1, 2019

View reviewed changes

Cloud provider AWS library should query instance by ID when possible

6ae7620

zhan849 force-pushed the aws-get-instance-by-id branch from 1c94382 to 6ae7620 Compare July 1, 2019 16:14

k8s-ci-robot assigned gnufied Jul 2, 2019

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 2, 2019

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 12, 2019

k8s-ci-robot merged commit 416a717 into kubernetes:master Jul 12, 2019

huffmanca mentioned this pull request Jul 18, 2019

Bug 1704356: Backport AWS cloud-provider improvements openshift/origin#23424

Merged

wongma7 mentioned this pull request Oct 14, 2020

instances: initial implementation of instancesV2 interface kubernetes/cloud-provider-aws#131

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cloud provider AWS library should query instance by ID when possible #78140

Cloud provider AWS library should query instance by ID when possible #78140

zhan849 commented May 20, 2019 •

edited

Loading

k8s-ci-robot commented May 20, 2019

mcrute commented May 21, 2019

mcrute commented May 21, 2019

mcrute commented May 21, 2019

zhan849 commented Jun 28, 2019

zhan849 commented Jun 28, 2019

jsafrane commented Jul 1, 2019

gnufied Jul 1, 2019

zhan849 Jul 1, 2019 •

edited

Loading

zhan849 Jul 1, 2019

gnufied Jul 1, 2019

zhan849 Jul 2, 2019

gnufied commented Jul 1, 2019

zhan849 commented Jul 1, 2019

gnufied commented Jul 1, 2019

zhan849 commented Jul 1, 2019 •

edited

Loading

gnufied commented Jul 1, 2019

zhan849 commented Jul 2, 2019

gnufied commented Jul 2, 2019

gnufied commented Jul 12, 2019

k8s-ci-robot commented Jul 12, 2019

zhan849 commented Jul 12, 2019

zhan849 commented Jul 12, 2019

Cloud provider AWS library should query instance by ID when possible #78140

Cloud provider AWS library should query instance by ID when possible #78140

Conversation

zhan849 commented May 20, 2019 • edited Loading

k8s-ci-robot commented May 20, 2019

mcrute commented May 21, 2019

mcrute commented May 21, 2019

mcrute commented May 21, 2019

zhan849 commented Jun 28, 2019

zhan849 commented Jun 28, 2019

jsafrane commented Jul 1, 2019

gnufied Jul 1, 2019

Choose a reason for hiding this comment

zhan849 Jul 1, 2019 • edited Loading

Choose a reason for hiding this comment

zhan849 Jul 1, 2019

Choose a reason for hiding this comment

gnufied Jul 1, 2019

Choose a reason for hiding this comment

zhan849 Jul 2, 2019

Choose a reason for hiding this comment

gnufied commented Jul 1, 2019

zhan849 commented Jul 1, 2019

gnufied commented Jul 1, 2019

zhan849 commented Jul 1, 2019 • edited Loading

gnufied commented Jul 1, 2019

zhan849 commented Jul 2, 2019

gnufied commented Jul 2, 2019

gnufied commented Jul 12, 2019

k8s-ci-robot commented Jul 12, 2019

zhan849 commented Jul 12, 2019

zhan849 commented Jul 12, 2019

zhan849 commented May 20, 2019 •

edited

Loading

zhan849 Jul 1, 2019 •

edited

Loading

zhan849 commented Jul 1, 2019 •

edited

Loading