AWS EBS provisioner should get node zone info from k8s #76976

zhan849 · 2019-04-24T01:12:28Z

What type of PR is this?
/kind bug

What this PR does / why we need it:
This PR changed the implementation of GetCandidateZonesForDynamicVolume() in aws library to get information from k8s master instead of cloud provider. This eliminates all calls to cloud provider and allows us to provision volume much faster - without this change, we were only able to create 1 PVC every 2~5 sec to avoid too many Server.Unavailable from ec2, but with this change, we are able to create 50 PVCs at a time, all PVC got bound within 30sec, and all related Pods started running within 90sec.

Which issue(s) this PR fixes:
Fixes #76975

Special notes for your reviewer:

Please provide insights about whether we should fall back to querying cloud provider if we don't find zone labels.
Semantic wide I think since this call is no longer related with aws, and could be moved to aws_ebs.go, but since we are already adding k8s api exposure in aws library, and this also includes minimum change, I think just changing GetCandidateZonesForDynamicVolume() could be the best fix.

Release note:

`aws-cloud-provider` service account in the `kube-system` namespace need to be granted with list node permission with this optimization

/assign @zhan849 @mcrute @justinsb @micahhausler

k8s-ci-robot · 2019-04-24T01:12:35Z

Hi @zhan849. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

mcrute · 2019-04-24T19:37:01Z

/ok-to-test

zhan849 · 2019-04-24T21:19:13Z

/retest

zhan849 · 2019-04-25T00:12:32Z

@mcrute are these 2 failed tests known flaky ones?

/retest

liggitt · 2019-05-22T03:28:59Z

This PR changed the implementation of GetCandidateZonesForDynamicVolume() in aws library to get information from k8s master instead of cloud provider.

What are the implications of this for a kubelet that accidentally or maliciously records incorrect information in its node object? Is there a check against the source of truth (cloud provider) at any point in the process?

liggitt · 2019-05-22T03:31:14Z

staging/src/k8s.io/legacy-cloud-providers/aws/aws.go

+	labelKeyNodeRole    = "kubernetes.io/role"
+	nodeMasterRole      = "master"
+	nodeMinionRole      = "node"
+	labelKeyNodeMaster  = "node-role.kubernetes.io/master"


what behavior is being attached to these labels in this PR? Since these are ad hoc, it is not expected these labels will implicitly drive behavior of kubernetes components (the only existing example of LB routing not including nodes with a master label is considered a bug)

Yeah, components of kube may not rely on role labels for proper function. Node role labels are informational only and optional

liggitt · 2019-05-22T03:34:10Z

staging/src/k8s.io/legacy-cloud-providers/aws/aws.go

-		master := false
-		for _, tag := range instance.Tags {
-			tagKey := aws.StringValue(tag.Key)
-			if awsTagNameMasterRoles.Has(tagKey) {


checking against metadata was probably more reasonable if that was outside the node's control. does checking against these label keys imply all aws clusters must label node objects with these labels now? I see this PR is being cherry picked to 1.14. are all installations making use of the aws cloud provider aware they must now label nodes this way?

liggitt · 2019-05-22T03:36:00Z

staging/src/k8s.io/legacy-cloud-providers/aws/aws.go

+}
+
+// isMasterNode checks if the node is labeled as master
+func (c *Cloud) isMasterNode(node *v1.Node) bool {


this is elevating an ad hoc label to implicitly drive behavior. these labels are not uniformly applied, and are not maintained by kubernetes components (unlike the os/arch labels), so I would not expect them to be used this way

zhan849 · 2019-05-22T03:40:18Z

This PR changed the implementation of GetCandidateZonesForDynamicVolume() in aws library to get information from k8s master instead of cloud provider.

What are the implications of this for a kubelet that accidentally or maliciously records incorrect information in its node object? Is there a check against the source of truth (cloud provider) at any point in the process?

Zone information is actually maintained by node controller and is retrieved directly from cloud provider and thus trustworthy: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/cloud/node_controller.go#L296

smarterclayton · 2019-05-22T03:42:24Z

staging/src/k8s.io/legacy-cloud-providers/aws/aws.go

-	klog.V(2).Infof("Found instances in zones %s", zones)
-	return zones, nil
+// isMinionNode checks if the node is labeled as minion
+func (c *Cloud) isMinionNode(node *v1.Node) bool {


This check is not safe. Nothing guarantees these labels and a conformant cluster is not required to set them. This would break backwards compatibility with existing clusters on upgrade as well.

liggitt · 2019-05-22T03:45:52Z

zone labels are also mutable by kubelets.

I'm sympathetic to the issue being fixed, but I don't think this use of node role labels is acceptable. If we want to start using those labels like this, we need guarantees that they will be set and maintained, and we don't have those now.

I think this needs to be reverted to remove this label use, and alternate methods of solving the PV issue explored (I could see using info from Node objects as an efficient check, and verifying node zone or master tag against metadata the first time we encounter the node, or if what is recorded in the node disagrees with our last check against metadata)

smarterclayton · 2019-05-22T03:46:46Z

Agree with Jordan. Please open a revert and we can discuss how this could be implemented differently.

zhan849 · 2019-05-22T04:19:26Z

@liggitt @smarterclayton thanks for the feedbacks. so to summarize, there are 2 debatable issues:

Is zone info in Node label trustworthy? - my answer is yes, since currently the zone labels (beta version) are maintained by node controller, the information is retrieved from cloud provider directly and is always updated during node reconciliation, it can be trusted. Even if kubelet can be misconfigured by wrong zone information, node controller is going to override them. Once we have the GA version of zone labels pushed out, we should update node controller to maintain both zone labels.
Are the master/minion labels trustworthy? - my answer is "semantic wide they are equivalent to checking node labels from cloud provider" - just as people are not required to set master/minion labels in v1.Node, they are not required to set such labels in cloud providers as well. Therefore besides mainstream cluster provisioners, which maintains these sets of labels in both Kubernetes and cloud provider, the function's behavior will remain unchanged: the original impl skips nodes that are explicitly labeled with well-known master labels, same here.

I will open another ticket to move the conversation over. We can hold off cherry-picking to release-1.14 for not, but I think it's too early to determine whether this PR would create regression and should be reverted

liggitt · 2019-05-22T04:29:43Z

just as people are not required to set master/minion labels in v1.Node, they are not required to set such labels in cloud providers as well

cloud providers have more leeway to dictate how metadata tags must be configured in order to interoperate with the kubernetes cloud provider code. I'm concerned about trading the existing behavior of metadata tags for labels that current users of the aws cloud provider are not guaranteed to have set.

We can hold off cherry-picking to release-1.14 for not, but I think it's too early to determine whether this PR would create regression and should be reverted

Please go ahead and revert this PR, it breaks compatibility for existing AWS cloud provider users, and bases behavior of core kubernetes components on the node role labels, which is not an intended use. Code freeze is approaching, and we should not leave this use in place.

zhan849 · 2019-05-22T04:38:40Z

@liggitt thanks! will send out the PR shortly, lets follow up the discussion in #78199

zhan849 · 2019-05-22T06:19:54Z

@liggitt ping #78200

Manually revert #76976

liggitt · 2019-05-22T13:31:46Z

staging/src/k8s.io/legacy-cloud-providers/aws/aws.go


-	filters := c.tagging.addFilters(baseFilters)
-	di, err := c.describeInstances(filters)
+	// TODO: list from cache?


yes, this should list from an informer-fed cache. rapidly repeatedly listing all nodes (in the example scenario of provisioning 50 PVs in 30 seconds) in a 5000 node cluster is problematic.

👍 I do have plan to optimize that part further(#78140 is another example that can benefit from informer), as cloud provider is now out of tree, making such change would be easier.

The intention is to move away unnecessary “list” calls from cloud provider, which has degree of inefficiency even compared with listing from apiserver and would cause cross cluster effect.

a useful example could be the informer-fed node zone cache GCE maintains (see https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/legacy-cloud-providers/gce/gce.go#L683-L745)

AWS has the extra wrinkle of excluding masters, but I wonder if that could be done as a much smaller metadata query that is marked as needing to be run when nodes are added/removed

Ah seems GCE has been doing it for a while, thanks a lot Jordan! tbh having a dedicated zone for master is probably not a common, or even recommended practice, and for non-topology-aware storage class, it's a good check here though as in such case, having volume provisioner to provision a volume in a zone that is more likely to get pod scheduled can reduce operational overhead

I'd recommend talking with @msau42 about how best to filter the available zones (and whether there's additional info in the CreateVolume call that could be used to narrow down which nodes' zones should be considered)

Thanks Jordan. Pinged Michelle for inputs from him.

liggitt · 2019-05-22T13:39:09Z

staging/src/k8s.io/legacy-cloud-providers/aws/aws.go

-		if err != nil {
-			return nil, err
+	for _, n := range nodes.Items {
+		if !c.isNodeReady(&n) {


node readiness is not the same as the running condition in the aws query... will being more restrictive here cause problems?

I think the intention of this function is to prevent the case that we create a volume in a region that the pod cannot even be scheduled on to. From cloud provider info, “running” is the most we can do but “NodeReady” brings more confidence.

I agree with @zhan849 the more correct status is that the node is Ready according to the API server vs simply running according to the EC2 control plane. It's possible that a node is running but has not yet joined the cluster or may never join for a variety of reasons.

zhan849 · 2019-05-24T04:53:07Z

@liggitt @smarterclayton I'm proposing an alternative optimization that leaves the code path for volumes that need to bind immediately, but use the zone info from selected node for topo-aware volume provisioning, which still achieved significant reduction in cloud provider footprint.

I pinged sig-storage in the PR but also wanna get feedbacks from you guys:
#78276

thanks for the time!

k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Apr 24, 2019

k8s-ci-robot assigned justinsb, mcrute, micahhausler and zhan849 Apr 24, 2019

k8s-ci-robot requested review from jsafrane and zmerlynn April 24, 2019 01:12

zhan849 changed the title ~~get node zone info from k8s~~ AWS EBS provisioner should get node zone info from k8s Apr 24, 2019

zhan849 force-pushed the aws-get-zone branch 2 times, most recently from e68ab3f to 64157fe Compare April 24, 2019 02:13

zhan849 mentioned this pull request Apr 24, 2019

Avoid using tag filters for EC2 API where possible #76749

Merged

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 24, 2019

zhan849 force-pushed the aws-get-zone branch 2 times, most recently from 8b6166d to 15dd4c2 Compare April 24, 2019 21:18

zhan849 force-pushed the aws-get-zone branch from 15dd4c2 to db9047e Compare April 25, 2019 03:26

k8s-ci-robot removed the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 25, 2019

liggitt reviewed May 22, 2019

View reviewed changes

smarterclayton reviewed May 22, 2019

View reviewed changes

zhan849 mentioned this pull request May 22, 2019

Discuss the feasibility of aws-ebs provisioner to get node zone info from k8s instead of cloud provider #78199

Closed

zhan849 added a commit to zhan849/kubernetes that referenced this pull request May 22, 2019

Manually revert kubernetes#76976

47b8db4

zhan849 mentioned this pull request May 22, 2019

Manually revert #76976 #78200

Merged

k8s-ci-robot added a commit that referenced this pull request May 22, 2019

Merge pull request #78200 from zhan849/revert-76976

a27d3ae

Manually revert #76976

liggitt reviewed May 22, 2019

View reviewed changes

zhan849 mentioned this pull request May 24, 2019

Use zone from node for topology aware aws-ebs volume creation #78276

Merged

jsafrane mentioned this pull request May 27, 2019

Cloud provider AWS library should query instance by ID when possible #78140

Merged

superlime pushed a commit to superlime/kubernetes that referenced this pull request Jun 3, 2019

Manually revert kubernetes#76976

4d00f07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AWS EBS provisioner should get node zone info from k8s #76976

AWS EBS provisioner should get node zone info from k8s #76976

zhan849 commented Apr 24, 2019 •

edited

Loading

k8s-ci-robot commented Apr 24, 2019

mcrute commented Apr 24, 2019

zhan849 commented Apr 24, 2019

zhan849 commented Apr 25, 2019

liggitt commented May 22, 2019

liggitt May 22, 2019

smarterclayton May 22, 2019

liggitt May 22, 2019

liggitt May 22, 2019

zhan849 commented May 22, 2019

smarterclayton May 22, 2019

liggitt commented May 22, 2019

smarterclayton commented May 22, 2019

zhan849 commented May 22, 2019 •

edited

Loading

liggitt commented May 22, 2019

zhan849 commented May 22, 2019

zhan849 commented May 22, 2019

liggitt May 22, 2019

zhan849 May 22, 2019 •

edited

Loading

liggitt May 22, 2019

zhan849 May 22, 2019

liggitt May 22, 2019

zhan849 May 22, 2019

liggitt May 22, 2019

zhan849 May 22, 2019

mcrute May 22, 2019

zhan849 commented May 24, 2019

AWS EBS provisioner should get node zone info from k8s #76976

AWS EBS provisioner should get node zone info from k8s #76976

Conversation

zhan849 commented Apr 24, 2019 • edited Loading

k8s-ci-robot commented Apr 24, 2019

mcrute commented Apr 24, 2019

zhan849 commented Apr 24, 2019

zhan849 commented Apr 25, 2019

liggitt commented May 22, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zhan849 commented May 22, 2019

Choose a reason for hiding this comment

liggitt commented May 22, 2019

smarterclayton commented May 22, 2019

zhan849 commented May 22, 2019 • edited Loading

liggitt commented May 22, 2019

zhan849 commented May 22, 2019

zhan849 commented May 22, 2019

Choose a reason for hiding this comment

zhan849 May 22, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zhan849 commented May 24, 2019

zhan849 commented Apr 24, 2019 •

edited

Loading

zhan849 commented May 22, 2019 •

edited

Loading

zhan849 May 22, 2019 •

edited

Loading