Cloud Controller Manager now sets Node.Spec.ProviderID #50730

andrewsykim · 2017-08-16T03:38:03Z

What this PR does / why we need it:
Cloud Controller Manager now sets Node.Spec.ProviderID.

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #49836.

Special notes for your reviewer:

As part of an effort to move cloud controller manager into beta; see Cloud Controller Manager Beta Requirements #48690.

jhorwit2 · 2017-08-16T22:38:04Z

pkg/controller/cloud/node_controller.go

+			// we should attempt to set providerID on curNode, but
+			// we can continue if we fail since we will attempt to set
+			// node addresses given the node name in getNodeAddressesByProviderIDOrName
+			glog.Errorf("%v", err)


This is changing logic from the kubelet. If the kubelet encounters an error getting the ProviderID it won't continue fetching subsequent information from the cloud provider.

kubernetes/pkg/kubelet/kubelet_node_status.go

Line 295 in 02b520f

return nil, err

Actually, the kubelet will just keep trying until it succeeds or the kubelet stops. If I'm not mistaken, that's not something the CCM does since it only listens to OnAdd.

andrewsykim · 2017-08-17

I also opted to not return here since some functions will try using the node name if the provider id fails. Example here. To me it makes sense to fail loudly if you fail to get a provider id, but it shouldn't prevent the cloud taint from being removed and leave nodes unschedulable.

I don't like how we use a combination of provider ID and names for functions. It does give us some flexibility like in this case but it's so inconsistent. I don't think we want to be refactoring that right now though 😛

jhorwit2 · 2017-08-17

I agree with your reasoning for not making it required since the node name fallback is supported in every current provider as far as I know.

My concern is why ProviderID was originally required before the kubelet made a node schedulable, and whether it's OK not to follow that pattern here.

edit: I can't seem to find any original proposals. Do you know where those might be?

prydie · 2017-08-17

> I also opted to not return here since some functions will try using the node name if the provider id fails. Example here. To me it makes sense to fail loudly if you fail to get a provider id, but it shouldn't prevent the cloud taint from being removed and leave nodes unschedulable.

Seems logical for now. A retry mechanism for AddCloudNode is probably not a bad idea but is out of scope for #49836 really.
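
The fallback discussed in this review thread (use the provider ID when it is available, otherwise look the node up by name) can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `instances` interface, `fakeCloud`, and `addressesByProviderIDOrName` are hypothetical stand-ins loosely modelled on the `getNodeAddressesByProviderIDOrName` helper named in the diff comment.

```go
package main

import (
	"errors"
	"fmt"
)

// instances is a hypothetical, trimmed-down stand-in for a cloud provider's
// instance API. It is NOT the real cloudprovider.Instances interface.
type instances interface {
	AddressesByProviderID(providerID string) ([]string, error)
	AddressesByName(nodeName string) ([]string, error)
}

// fakeCloud only knows nodes by name, so provider ID lookups always fail.
type fakeCloud struct {
	byName map[string][]string
}

func (f fakeCloud) AddressesByProviderID(providerID string) ([]string, error) {
	return nil, errors.New("lookup by provider ID not supported")
}

func (f fakeCloud) AddressesByName(nodeName string) ([]string, error) {
	addrs, ok := f.byName[nodeName]
	if !ok {
		return nil, fmt.Errorf("node %q not found", nodeName)
	}
	return addrs, nil
}

// addressesByProviderIDOrName tries the provider ID first and falls back to
// the node name when the ID is empty or the ID-based lookup fails.
func addressesByProviderIDOrName(c instances, providerID, nodeName string) ([]string, error) {
	if providerID != "" {
		if addrs, err := c.AddressesByProviderID(providerID); err == nil {
			return addrs, nil
		}
	}
	return c.AddressesByName(nodeName)
}

func main() {
	cloud := fakeCloud{byName: map[string][]string{"node-a": {"10.0.0.4"}}}
	addrs, err := addressesByProviderIDOrName(cloud, "", "node-a")
	fmt.Println(addrs, err)
}
```

With a fallback like this in place, a failed provider ID lookup degrades to a name-based lookup instead of blocking node initialization, which is the behaviour argued for above.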

luxas · 2017-08-17T15:51:52Z

PTAL @wlan0 @thockin

wlan0 · 2017-08-18T21:16:40Z

@luxas @prydie @thockin @jhorwit2 @andrewsykim

The provider id should be set in the kubelet and not by the CCM. The CCM only uses it to fetch metadata about the node, and to uniquely identify the node. This is why we have this parameter in the kubelet - https://github.com/kubernetes/kubernetes/blob/master/cmd/kubelet/app/options/options.go#L216

If the nodename can be used to uniquely get information about the node (such as to fetch the providerID like here), then we wouldn't have relied on providerID in the first place. This pattern of setting providerID should not be propagated further since it does not work for all cloudproviders, and it's going to create inconsistencies.

wlan0 · 2017-08-18T21:16:57Z

/assign wlan0

andrewsykim · 2017-08-18T21:58:46Z

@wlan0 I had no idea we had the --provider-id flag in the kubelet.

My only issue with setting providerID in the kubelet is that it leaves room for error and inconsistencies. We rely on cluster admins to correctly set --provider-id (aws://i-abc123, digitalocean://1234567, etc). If we opted to set providerID in CCM then we can ensure that it will be in the same format across the board.

I guess this isn't a big deal since we would fall back to using node name, but seems like being able to set the providerID from CCM could be beneficial here since some cloud providers have to do some extra work to get a node only based on its name.

I personally am not favouring one over the other at the moment, will have to think through this a bit more.
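
To make the format concern concrete, here is a small sketch of checking that a provider ID has the `<provider>://<instance>` shape used in the examples above (`aws://i-abc123`, `digitalocean://1234567`). The helper name `splitProviderID` is hypothetical, and the single-format assumption is a simplification: real providers may accept other forms.

```go
package main

import (
	"fmt"
	"strings"
)

// splitProviderID splits "<provider>://<instance>" into its two parts.
// The format is an assumption based on the examples in this thread.
func splitProviderID(id string) (provider, instance string, err error) {
	parts := strings.SplitN(id, "://", 2)
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", "", fmt.Errorf("provider ID %q is not of the form <provider>://<instance>", id)
	}
	return parts[0], parts[1], nil
}

func main() {
	for _, id := range []string{"aws://i-abc123", "digitalocean://1234567", "i-abc123"} {
		provider, instance, err := splitProviderID(id)
		fmt.Println(provider, instance, err)
	}
}
```

A check of this kind would catch a bare instance ID passed to `--provider-id`, which is the sort of admin error described above.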

jhorwit2 · 2017-08-18T22:29:58Z

I didn't know about that flag either :). The refactoring proposal seems to omit the part about how refactoring the provider id out of the kubelet will be done.

So now it's up to the cluster admin / cloud provider / whoever to know the format of the existing provider id and update any installation scripts accordingly. That seems like it's going to be error prone when certain cloud providers accept multiple formats for the provider id; like aws. At the very least I feel we would want to provide docs with the expected format for each provider.

What are your thoughts, @wlan0, on allowing the CCM to optionally set the provider id based on the node name? If a known "unimplemented" error is returned, skip setting the provider id; for any other error, retry getting the provider id just like the kubelet did. GCE already has support for generating the provider ID from the node name as well as from the metadata server. I didn't check whether any others do.
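
The proposal above could look something like the sketch below. Everything here is an assumption made for illustration: `errNotImplemented`, `errTransient`, `lookupFunc`, and `providerIDWithRetry` are hypothetical names, and a real controller would presumably back off between attempts rather than retry in a tight loop.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotImplemented is a hypothetical sentinel that a provider would return
// when it cannot derive a provider ID from a node name.
var errNotImplemented = errors.New("not implemented")

// errTransient is a hypothetical retryable error used in the demo below.
var errTransient = errors.New("transient cloud API error")

// lookupFunc is a hypothetical signature for resolving a provider ID by name.
type lookupFunc func(nodeName string) (string, error)

// providerIDWithRetry returns ("", nil) when the provider does not support
// the lookup, retries any other error up to maxAttempts times, and returns
// the last error if every attempt fails.
func providerIDWithRetry(lookup lookupFunc, nodeName string, maxAttempts int) (string, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		id, err := lookup(nodeName)
		if err == nil {
			return id, nil
		}
		if errors.Is(err, errNotImplemented) {
			// The provider opted out, so leave the provider ID unset.
			return "", nil
		}
		lastErr = err
	}
	return "", fmt.Errorf("getting provider ID for %q: %w", nodeName, lastErr)
}

func main() {
	calls := 0
	flaky := func(nodeName string) (string, error) {
		calls++
		if calls < 3 {
			return "", errTransient
		}
		return "fake://" + nodeName, nil
	}
	id, err := providerIDWithRetry(flaky, "node-a", 5)
	fmt.Println(id, err, calls)
}
```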

wlan0 · 2017-08-19T00:06:52Z

The documentation definitely could be improved. I apologise for that. This decision unfortunately is hidden in some comment somewhere in one of the PRs made for this change.

To address @jhorwitz' comment -

It's not going to be error prone if the administrator need not learn any new format. The administrator needs to set the provider ID in the way the cloud understands it.

The kubelet has the responsibility to convert that provider ID into a kubernetes specific format. If this hasn't already been addressed, then it ought to be.

Secondly, there are some clouds whose providerID cannot be inferred from a remote location even within the cloud (e.g. openstack). This is why we wanted to have the ability to set a unique id while starting the kubelet.

It is going to lead to inconsistent cloud controllers, which will be hard to change and maintain if we decide to optionally allow providerId to be set from the CCM. We will not be able to reason about when this is set, and therefore, we won't be able to rely on a consistent behaviour to build systems around it, or test it easily.

andrewsykim · 2017-08-19T00:46:43Z

I think you pinged @jhorwitz instead of @jhorwit2 😛

@wlan0 on that note, do you know what the direction of auto detecting cloud provider is? It seems like we call updateCloudProviderFromMachineInfo in the case of auto detecting cloud providers which already has some assumptions of what the providerID should look like.

wlan0 · 2017-08-19T02:42:08Z

Sorry, I was typing from my phone and mistyped.

cc @jhorwit2

> @wlan0 on that note, do you know what the direction of auto detecting cloud provider is? It seems like we call updateCloudProviderFromMachineInfo in the case of auto detecting cloud providers which already has some assumptions of what the providerID should look like.

That's a great question. The external cloud controller will eventually (in two releases, more or less) be taken out of the kubernetes core repository, and each of the clouds (the companies) will host a cloud controller repository for their own cloud - we can call this Stage II of our two stage plan to remove cloud specific code from the kubernetes repository.

So, by the end of stage II, we'd end up with separate controllers for aws, gce, and so on.... Since the admin will be running a controller specific to their cloud, there won't be any automatic detection required.

In the interim Stage I, while the core still contains the cloud-controller-manager with all the cloud provider integrations vendored in, if auto detection is required, then this feature is up for discussion. What do you think is the best way to do this, @andrewsykim?

@thockin @luxas What do you think about auto detecting cloud providers in Stage I?

andrewsykim · 2017-08-19T15:50:10Z

@wlan0 thanks for clarifying and providing context.

Personally, I don't think the auto-detect feature is going in the right direction and we should deprecate it (or just remove it cause it's an alpha feature anyways).

From a quick glance, it seems like it uses cAdvisor to gather machine info, which just moves the cloud provider data into some other monolithic source. It'll also make migrating CCMs into their own respective repos more difficult if people have been depending on --cloud-provider=auto-detect.

If admins have to make the effort to run cloud controller manager as a separate process, we should try to reduce the chances of having to change its configuration over time. The switch from the cloud-controller-manager image in the core repo to the cloud provider specific repo should be seamless. Having to switch from kube-controller-manager --cloud-provider=aws -> cloud-controller-manager --cloud-provider=auto-detect -> aws-cloud-controller-manager --cloud-provider=aws seems unnecessary.

Let me know what you think, we should probably open a separate issue for this and follow up there.

jhorwit2 · 2017-08-19T15:53:53Z

@andrewsykim another problem I saw with the cAdvisor strategy is that you are now coupled to the kubelet and cAdvisor release cycles in order to add support for a new cloud provider that wants to use that feature. Currently, only aws, gce and azure are supported in cAdvisor.

wlan0 · 2017-08-19T22:11:05Z

@andrewsykim Thanks for that thorough follow up

> Personally, I don't think the auto-detect feature is going in the right direction and we should deprecate it (or just remove it cause it's an alpha feature anyways).

I tend to agree. This is a good chance to cull it, and we should do it.

> If admins have to make the effort to run cloud controller manager as a separate process, we should try to reduce the chances of having to change its configuration over time. The switch from the cloud-controller-manager image in the core repo to the cloud provider specific repo should be seamless. Having to switch from kube-controller-manager --cloud-provider=aws -> cloud-controller-manager --cloud-provider=auto-detect -> aws-cloud-controller-manager --cloud-provider=aws seems unnecessary.

Just one nit - with aws-cloud-controller-manager, there won't be a --cloud-provider flag, but the overarching idea that you're discussing is something I completely agree with.

> @andrewsykim another problem I saw with the cAdvisor strategy is that you are now coupled to the kubelet and cAdvisor release cycles in order to add support for a new cloud provider that wants to use that feature. Currently, only aws, gce and azure are supported in cAdvisor.

Another reason to deprecate auto-detect.

prydie · 2017-08-23T18:30:27Z

I think there is confusion over the auto-detection here. If kubelet has an in-tree cloudprovider it sets node.Spec.ProviderID using it:

		// TODO: We can't assume that the node has credentials to talk to the
		// cloudprovider from arbitrary nodes. At most, we should talk to a
		// local metadata server here.
		if node.Spec.ProviderID == "" {
			node.Spec.ProviderID, err = cloudprovider.GetInstanceProviderID(kl.cloud, kl.nodeName)
			if err != nil {
				return nil, err
			}
		}

Source: pkg/kubelet/kubelet_node_status.go

What #49836 seeks to address is the fact this does not happen for out-of-tree cloud providers.
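
For out-of-tree providers, the equivalent step on the cloud controller manager side might look roughly like the sketch below: set the provider ID only when it is empty, and log rather than abort when the lookup fails, as discussed earlier in the review. The `node` struct, `providerIDLookup` type, and `ensureProviderID` function are hypothetical simplifications and are not taken from the PR's diff.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errLookupFailed is a hypothetical error used only for the demo below.
var errLookupFailed = errors.New("cloud lookup failed")

// node is a hypothetical stand-in for the fields of v1.Node used here.
type node struct {
	Name       string
	ProviderID string
}

// providerIDLookup is a hypothetical hook for the cloud provider call.
type providerIDLookup func(nodeName string) (string, error)

// ensureProviderID fills in ProviderID only when it is empty, mirroring the
// kubelet check quoted above. On a lookup error it logs and leaves the node
// untouched so later initialization steps can still run. It reports whether
// the node was modified.
func ensureProviderID(n *node, lookup providerIDLookup) bool {
	if n.ProviderID != "" {
		return false
	}
	id, err := lookup(n.Name)
	if err != nil {
		log.Printf("failed to set provider ID for node %s: %v", n.Name, err)
		return false
	}
	n.ProviderID = id
	return id != ""
}

func main() {
	byName := func(name string) (string, error) { return "fake://" + name, nil }
	failing := func(string) (string, error) { return "", errLookupFailed }

	a := &node{Name: "node-a"}
	fmt.Println(ensureProviderID(a, byName), a.ProviderID)

	b := &node{Name: "node-b", ProviderID: "fake://preset"}
	fmt.Println(ensureProviderID(b, byName), b.ProviderID)

	c := &node{Name: "node-c"}
	fmt.Println(ensureProviderID(c, failing), c.ProviderID)
}
```

Leaving an already-set value alone means a provider ID supplied through the kubelet's `--provider-id` flag is not overwritten.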

deads2k · 2017-08-24T13:31:37Z

@rrati fyi

andrewsykim · 2017-08-25T02:57:08Z

@prydie you are right. I think we should stick to the same logic as the kubelet where possible, so I changed the code here to check that provider id is "" before setting it.

wlan0 · 2017-08-25T12:37:28Z

LGTM. Requesting a second pair of eyes to take a look - @luxas

prydie · 2017-08-25T13:29:35Z

LGTM

luxas

/lgtm

We should document this behavior SUPER clearly though :)

cc @thockin for final approval

luxas · 2017-08-25T17:14:46Z

/release-note

k8s-github-robot · 2017-08-25T17:15:23Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andrewsykim, luxas

Associated issue: 49836

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

~~pkg/controller/cloud/OWNERS~~ [luxas]

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

luxas · 2017-08-25T17:19:39Z

hehe, forgot that I'm an approver of this code ;)
I'll let this merge, it will take some time to get to the queue, hopefully @thockin has had time to look at it by then. In any case, four involved persons have already approved this, so...

k8s-github-robot · 2017-08-26T03:43:36Z

Automatic merge from submit-queue (batch tested with PRs 49850, 47782, 50595, 50730, 51341)

Automatic merge from submit-queue (batch tested with PRs 51819, 51706, 51761, 51818, 51500)

Fix providerID update validation

**What this PR does / why we need it**:
Cloud controller manager supports updating providerID in #50730, but the node updating was blocked by validation rule. This is to propose a fix for updating the validation rule by allowing altering spec.providerID if not set. Please check #51596 for detail

**Which issue this PR fixes**
fixes #51596

**Special notes for your reviewer**:

**Release note**:
```release-note
```
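
The relaxed rule described in that follow-up (allow spec.providerID to change only if it was not previously set) can be sketched as below. This is an illustration of the rule as stated, not the actual Kubernetes validation code; `validateProviderIDUpdate` is a hypothetical name.

```go
package main

import "fmt"

// validateProviderIDUpdate permits an update when the old value was unset
// or when the value is unchanged, and rejects any other change.
func validateProviderIDUpdate(oldID, newID string) error {
	if oldID == "" || oldID == newID {
		return nil
	}
	return fmt.Errorf("spec.providerID is immutable once set: cannot change %q to %q", oldID, newID)
}

func main() {
	fmt.Println(validateProviderIDUpdate("", "fake://node-a"))
	fmt.Println(validateProviderIDUpdate("fake://node-a", "fake://node-a"))
	fmt.Println(validateProviderIDUpdate("fake://node-a", "fake://node-b"))
}
```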

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 16, 2017

k8s-github-robot assigned deads2k Aug 16, 2017

k8s-github-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. release-note-label-needed labels Aug 16, 2017

jhorwit2 reviewed Aug 16, 2017

k8s-ci-robot assigned wlan0 Aug 18, 2017

jhorwit2 mentioned this pull request Aug 20, 2017

[CCM] Cloud node controller will not remove nodes that no longer exists for cloud providers that require ProviderID #50985

Closed

andrewsykim mentioned this pull request Aug 20, 2017

Remove --cloud-provider=auto-detect #50986

Closed

3 tasks

luxas added area/cloudprovider sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. labels Aug 23, 2017

andrewsykim force-pushed the 49836 branch from 76c436b to 3d3a733 Compare August 25, 2017 02:56

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 25, 2017

Cloud Controller Manager now sets Node.Spec.ProviderID

2cc3c33

andrewsykim force-pushed the 49836 branch from 3d3a733 to 2cc3c33 Compare August 25, 2017 03:00

luxas approved these changes Aug 25, 2017

k8s-ci-robot assigned luxas Aug 25, 2017

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 25, 2017

k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-label-needed labels Aug 25, 2017

luxas added this to the v1.8 milestone Aug 25, 2017

k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 25, 2017

k8s-github-robot merged commit 21aa8ca into kubernetes:master Aug 26, 2017

This was referenced Aug 30, 2017

Cloud Controller Manager updating ProviderID blocked by validation rule #51596

Closed

Fix providerID update validation #51761

Merged

jhorwit2 mentioned this pull request Sep 5, 2017

Automated cherry pick of #51761 #51954

Merged

karataliu mentioned this pull request Oct 18, 2017

Add cloud-controller-manager support for Kubernetes cluster Azure/acs-engine#1584

Merged
