Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cloud Controller Manager now sets Node.Spec.ProviderID #50730

Merged
merged 1 commit into from
Aug 26, 2017

Conversation

andrewsykim
Copy link
Member

What this PR does / why we need it:
Cloud Controller Manager now sets Node.Spec.ProviderID.

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #
#49836.

Special notes for your reviewer:

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 16, 2017
@k8s-github-robot k8s-github-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. release-note-label-needed labels Aug 16, 2017
// we should attempt to set providerID on curNode, but
// we can continue if we fail since we will attempt to set
// node addresses given the node name in getNodeAddressesByProviderIDOrName
glog.Errorf("%v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is changing logic from the kubelet. If the kubelet encounters an error getting the ProviderID it won't continue fetching subsequent information from the cloud provider.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the kubelet will just keep trying until it succeeds or the kubelet stops. If i'm not mistaken, that's not something the CCM does since it only listens to OnAdd.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also opted to not return here since some functions will try using the node name if the provider id fails. Example here. To me it makes sense to fail loudly if you fail to get a provider id, but it shouldn't prevent nodes from removing the cloud taint and being left un-scheduable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like how we use a combination of provider ID and names for functions. It does give us some flexibility like in this case but it's so inconsistent. I don't think we want to be refactoring that right now though 😛

Copy link
Contributor

@jhorwit2 jhorwit2 Aug 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with your reasoning for not making it required since the node name fallback is supported in every current provider as far as I know.

My concern is why was ProviderID originally something required prior to making a node schedulable in the kubelet and if it's OK to not follow that pattern here.

edit: I can't seem to find any original proposals. Do you know where those might be?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also opted to not return here since some functions will try using the node name if the provider id fails. Example here. To me it makes sense to fail loudly if you fail to get a provider id, but it shouldn't prevent nodes from removing the cloud taint and being left un-scheduable.

Seems logical for now. A retry mechanism for AddCloudNode is probably not a bad idea but is out of scope for #49836 really.

@luxas
Copy link
Member

luxas commented Aug 17, 2017

PTAL @wlan0 @thockin

@wlan0
Copy link
Member

wlan0 commented Aug 18, 2017

@luxas @prydie @thockin @jhorwit2 @andrewsykim

The provider id should be set in the kubelet and not by the CCM. The CCM only uses it to fetch metadata about the node, and to uniquely identify the node. This is why we have this parameter in the kubelet - https://github.com/kubernetes/kubernetes/blob/master/cmd/kubelet/app/options/options.go#L216

If the nodename can be used to uniquely get information about the node (such as to fetch the providerID like here), then we wouldn't have relied on providerID in the first place. This pattern of setting providerID should not be propagated further since it does not work for all cloudproviders, and it's going to create inconsistencies.

@wlan0
Copy link
Member

wlan0 commented Aug 18, 2017

/assign wlan0

@andrewsykim
Copy link
Member Author

@wlan0 I had no idea we had the --provider-id flag in the kubelet.

My only issue with setting providerID in the kubelet is that it leaves room for error and inconsistencies. We rely on cluster admins to correctly set --provider-id (aws://i-abc123, digitalocean://1234567, etc). If we opted to set providerID in CCM then we can ensure that it will be in the same format across the board.

I guess this isn't a big deal since we would fall back to using node name, but seems like being able to set the providerID from CCM could be beneficial here since some cloud providers have to do some extra work to get a node only based on its name.

I personally am not favouring one over the other at the moment, will have to think through this a bit more.

@jhorwit2
Copy link
Contributor

I didn't know about that flag either :). The refactoring proposal seems to omit the part about how refactoring the provider id out of the kubelet will be done.

So now it's up to the cluster admin / cloud provider / whoever to know the format of the existing provider id and update any installation scripts accordingly. That seems like it's going to be error prone when certain cloud providers accept multiple formats for the provider id; like aws. At the very least I feel we would want to provide docs with the expected format for each provider.

What are your thoughts @wlan0 on allowing the CCM to optionally set the provider id based on the node name. If a known unimplemented error is returned then ignore setting provider id, but if the error isn't that then retry to get the provider id just like the kubelet did. GCE already has support to generate the provider ID by the node name as well as the metadata server. I didn't check any others to see if they did.

@wlan0
Copy link
Member

wlan0 commented Aug 19, 2017

The documentation definitely could be improved. I apologise for that. This decision unfortunately is hidden in some comment somewhere in one of the PRs made for this change.

To address @jhorwitz' comment -

It's not going to be error prone if the administrator need not learn any new format. The administrator needs to set the provider ID in the way the cloud understands it.

The kubelet has the responsibility to convert that provider ID into a kubernetes specific format. If this hasn't already been addressed, then it ought to be.

Secondly, there are some clouds whose providerID cannot be inferred from a remote location even within the cloud (e.g. openstack). This is why we wanted to have the ability to set a unique id while starting the kubelet.

It is going to lead to inconsistent cloud controllers, which will be hard to change and maintain if we decide to optionally allow providerId to be set from the CCM. We will not be able to reason about when this is set, and therefore, we won't be able to rely on a consistent behaviour to build systems around it, or test it easily.

@andrewsykim
Copy link
Member Author

I think you pinged @jhorwitz instead of @jhorwit2 😛

@wlan0 on that note, do you know what the direction of auto detecting cloud provider is? It seems like we call updateCloudProviderFromMachineInfo in the case of auto detecting cloud providers which already has some assumptions of what the providerID should look like.

@wlan0
Copy link
Member

wlan0 commented Aug 19, 2017

Sorry, I was typing from my phone and mistyped.

cc @jhorwit2

@wlan0 on that note, do you know what the direction of auto detecting cloud provider is? It seems like we call updateCloudProviderFromMachineInfo in the case of auto detecting cloud providers which already has some assumptions of what the providerID should look like.

That's a great question. The external cloud controller will eventually (in two releases, more or less) be taken out of the kubernetes core repository, and each of the clouds (the companies) will host a cloud controller repository for their own cloud - we can call this Stage II of our two stage plan to remove cloud specific code from the kubernetes repository.

So, by the end of stage II, we'd end up with separate controllers for aws, gce, and so on.... Since the admin will be running a controller specific to their cloud, there won't be any automatic detection required.

In the interim Stage I, while the core still contains the cloud-controller-manager with all the cloud provider integrations vendored in, If auto detection is required, then this feature is up for discussion. What do you think is the best way to do this @andrewsykim?

@thockin @luxas What do you think about auto detecting cloud providers in Stage I?

@andrewsykim
Copy link
Member Author

andrewsykim commented Aug 19, 2017

@wlan0 thanks for clarifying and providing context.

Personally, I don't think the auto-detect feature is going in the right direction and we should deprecate it (or just remove it cause it's an alpha feature anyways).

From a quick glance, it seems like it uses cAdvisor to gather machine info which is just moving the cloud provider data into some other monolithic source. It'll also make migrating CCM's into their own respective repos more difficult if people have been depending on --cloud-provider=auto-detect.

If admins have to make the effort to run cloud controller manager as a separate process, we should try to reduce the chances of having to change its configuration over time. The switch between the cloud-controller-manager image in the core repo and to the cloud provider specific repo should be seamless. Having to switch from kube-controller-manager --cloud-provider=aws -> cloud-controller-manager --cloud-provider=auto-detect -> aws-cloud-controller-manager --cloud-provider=aws seems unnecessary.

Let me know what you think, we should probably open a separate issue for this and follow up there.

@jhorwit2
Copy link
Contributor

jhorwit2 commented Aug 19, 2017

@andrewsykim another problem i saw with cadvisor strategy is that now you are coupled to the kubelet & cadvisor release cycles in order to add support for a new cloud provider that wants to use that feature. Currently, only aws, gce and azure are supported in cadvisor.

@wlan0
Copy link
Member

wlan0 commented Aug 19, 2017

@andrewsykim Thanks for that thorough follow up

Personally, I don't think the auto-detect feature is going in the right direction and we should deprecate it (or just remove it cause it's an alpha feature anyways).

I tend to agree. This is a good chance to cull it, and we should do it.

If admins have to make the effort to run cloud controller manager as a separate process, we should try to reduce the chances of having to change its configuration over time. The switch between the cloud-controller-manager image in the core repo and to the cloud provider specific repo should be seamless. Having to switch from kube-controller-manager --cloud-provider=aws -> cloud-controller-manager --cloud-provider=auto-detect -> aws-cloud-controller-manager --cloud-provider=aws seems unnecessary.

Just one nit - with aws-cloud-controller-manager, there won't be a --cloud-provider flag, but the overarching idea that you're discussing is something I completely agree with.

@andrewsykim another problem i saw with cadvisor strategy is that now you are coupled to the kubelet & cadvisor release cycles in order to add support for a new cloud provider that wants to use that feature. Currently, only aws, gce and azure are supported in cadvisor.

Another reason to deprecate auto-detect.

@prydie
Copy link
Contributor

prydie commented Aug 23, 2017

I think there is confusion over the auto-detection here. If kubelet has an in-tree cloudprovider it sets node.Spec.ProviderID using it:

		// TODO: We can't assume that the node has credentials to talk to the
		// cloudprovider from arbitrary nodes. At most, we should talk to a
		// local metadata server here.
		if node.Spec.ProviderID == "" {
			node.Spec.ProviderID, err = cloudprovider.GetInstanceProviderID(kl.cloud, kl.nodeName)
			if err != nil {
				return nil, err
			}
		}

Source: pkg/kubelet/kubelet_node_status.go

What #49836 seeks to address is the fact this does not happen for out-of-tree cloud providers.

@luxas luxas added area/cloudprovider sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. labels Aug 23, 2017
@deads2k
Copy link
Contributor

deads2k commented Aug 24, 2017

@rrati fyi

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 25, 2017
@andrewsykim
Copy link
Member Author

@prydie you are right, I think we should stick to the same logic as the kubelet where possible, I changed the code here so we check that provider id is "" before setting it

@wlan0
Copy link
Member

wlan0 commented Aug 25, 2017

LGTM. Requesting a second pair of eyes to take a look - @luxas

@prydie
Copy link
Contributor

prydie commented Aug 25, 2017

LGTM

Copy link
Member

@luxas luxas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

We should document this behavior SUPER clearly though :)

cc @thockin for final approval

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 25, 2017
@luxas
Copy link
Member

luxas commented Aug 25, 2017

/release-note

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-label-needed labels Aug 25, 2017
@luxas luxas added this to the v1.8 milestone Aug 25, 2017
@k8s-github-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andrewsykim, luxas

Associated issue: 49836

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 25, 2017
@luxas
Copy link
Member

luxas commented Aug 25, 2017

hehe, forgot that I'm an approver of this code ;)
I'll let this merge, it will take some time to get to the queue, hopefully @thockin has had time to look at it by then. In any case, four involved persons have already approved this, so...

@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 49850, 47782, 50595, 50730, 51341)

@k8s-github-robot k8s-github-robot merged commit 21aa8ca into kubernetes:master Aug 26, 2017
k8s-github-robot pushed a commit that referenced this pull request Sep 3, 2017
Automatic merge from submit-queue (batch tested with PRs 51819, 51706, 51761, 51818, 51500)

Fix providerID update validation

**What this PR does / why we need it**:
Cloud controller manager supports updating providerID in #50730, but the node updating was blocked by 
validation rule.

This is to propose a fix for updating the validation rule by allowing altering spec.providerID if not set.

Please check #51596 for detail

**Which issue this PR fixes**
fixes #51596

**Special notes for your reviewer**:

**Release note**:
```release-note
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cloudprovider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants