-
Notifications
You must be signed in to change notification settings - Fork 40k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cloud Controller Manager now sets Node.Spec.ProviderID #50730
Conversation
// we should attempt to set providerID on curNode, but | ||
// we can continue if we fail since we will attempt to set | ||
// node addresses given the node name in getNodeAddressesByProviderIDOrName | ||
glog.Errorf("%v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is changing logic from the kubelet. If the kubelet encounters an error getting the ProviderID it won't continue fetching subsequent information from the cloud provider.
kubernetes/pkg/kubelet/kubelet_node_status.go
Line 295 in 02b520f
return nil, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, the kubelet will just keep trying until it succeeds or the kubelet stops. If i'm not mistaken, that's not something the CCM does since it only listens to OnAdd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also opted to not return here since some functions will try using the node name if the provider id fails. Example here. To me it makes sense to fail loudly if you fail to get a provider id, but it shouldn't prevent nodes from removing the cloud taint and being left un-scheduable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like how we use a combination of provider ID and names for functions. It does give us some flexibility like in this case but it's so inconsistent. I don't think we want to be refactoring that right now though 😛
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with your reasoning for not making it required since the node name fallback is supported in every current provider as far as I know.
My concern is why was ProviderID originally something required prior to making a node schedulable in the kubelet and if it's OK to not follow that pattern here.
edit: I can't seem to find any original proposals. Do you know where those might be?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also opted to not return here since some functions will try using the node name if the provider id fails. Example here. To me it makes sense to fail loudly if you fail to get a provider id, but it shouldn't prevent nodes from removing the cloud taint and being left un-scheduable.
Seems logical for now. A retry mechanism for AddCloudNode
is probably not a bad idea but is out of scope for #49836 really.
@luxas @prydie @thockin @jhorwit2 @andrewsykim The provider id should be set in the kubelet and not by the CCM. The CCM only uses it to fetch metadata about the node, and to uniquely identify the node. This is why we have this parameter in the kubelet - https://github.com/kubernetes/kubernetes/blob/master/cmd/kubelet/app/options/options.go#L216 If the nodename can be used to uniquely get information about the node (such as to fetch the providerID like here), then we wouldn't have relied on providerID in the first place. This pattern of setting providerID should not be propagated further since it does not work for all cloudproviders, and it's going to create inconsistencies. |
/assign wlan0 |
@wlan0 I had no idea we had the My only issue with setting providerID in the kubelet is that it leaves room for error and inconsistencies. We rely on cluster admins to correctly set I guess this isn't a big deal since we would fall back to using node name, but seems like being able to set the providerID from CCM could be beneficial here since some cloud providers have to do some extra work to get a node only based on its name. I personally am not favouring one over the other at the moment, will have to think through this a bit more. |
I didn't know about that flag either :). The refactoring proposal seems to omit the part about how refactoring the provider id out of the kubelet will be done. So now it's up to the cluster admin / cloud provider / whoever to know the format of the existing provider id and update any installation scripts accordingly. That seems like it's going to be error prone when certain cloud providers accept multiple formats for the provider id; like aws. At the very least I feel we would want to provide docs with the expected format for each provider. What are your thoughts @wlan0 on allowing the CCM to optionally set the provider id based on the node name. If a known unimplemented error is returned then ignore setting provider id, but if the error isn't that then retry to get the provider id just like the kubelet did. GCE already has support to generate the provider ID by the node name as well as the metadata server. I didn't check any others to see if they did. |
The documentation definitely could be improved. I apologise for that. This decision unfortunately is hidden in some comment somewhere in one of the PRs made for this change. To address @jhorwitz' comment - It's not going to be error prone if the administrator need not learn any new format. The administrator needs to set the provider ID in the way the cloud understands it. The kubelet has the responsibility to convert that provider ID into a kubernetes specific format. If this hasn't already been addressed, then it ought to be. Secondly, there are some clouds whose providerID cannot be inferred from a remote location even within the cloud (e.g. openstack). This is why we wanted to have the ability to set a unique id while starting the kubelet. It is going to lead to inconsistent cloud controllers, which will be hard to change and maintain if we decide to optionally allow providerId to be set from the CCM. We will not be able to reason about when this is set, and therefore, we won't be able to rely on a consistent behaviour to build systems around it, or test it easily. |
I think you pinged @jhorwitz instead of @jhorwit2 😛 @wlan0 on that note, do you know what the direction of auto detecting cloud provider is? It seems like we call updateCloudProviderFromMachineInfo in the case of auto detecting cloud providers which already has some assumptions of what the providerID should look like. |
Sorry, I was typing from my phone and mistyped. cc @jhorwit2
That's a great question. The external cloud controller will eventually (in two releases, more or less) be taken out of the kubernetes core repository, and each of the clouds (the companies) will host a cloud controller repository for their own cloud - we can call this Stage II of our two stage plan to remove cloud specific code from the kubernetes repository. So, by the end of stage II, we'd end up with separate controllers for aws, gce, and so on.... Since the admin will be running a controller specific to their cloud, there won't be any automatic detection required. In the interim Stage I, while the core still contains the cloud-controller-manager with all the cloud provider integrations vendored in, If auto detection is required, then this feature is up for discussion. What do you think is the best way to do this @andrewsykim? @thockin @luxas What do you think about auto detecting cloud providers in Stage I? |
@wlan0 thanks for clarifying and providing context. Personally, I don't think the auto-detect feature is going in the right direction and we should deprecate it (or just remove it cause it's an alpha feature anyways). From a quick glance, it seems like it uses cAdvisor to gather machine info which is just moving the cloud provider data into some other monolithic source. It'll also make migrating CCM's into their own respective repos more difficult if people have been depending on If admins have to make the effort to run cloud controller manager as a separate process, we should try to reduce the chances of having to change its configuration over time. The switch between the Let me know what you think, we should probably open a separate issue for this and follow up there. |
@andrewsykim another problem i saw with cadvisor strategy is that now you are coupled to the kubelet & cadvisor release cycles in order to add support for a new cloud provider that wants to use that feature. Currently, only aws, gce and azure are supported in cadvisor. |
@andrewsykim Thanks for that thorough follow up
I tend to agree. This is a good chance to cull it, and we should do it.
Just one nit - with aws-cloud-controller-manager, there won't be a
Another reason to deprecate auto-detect. |
I think there is confusion over the auto-detection here. If kubelet has an in-tree cloudprovider it sets // TODO: We can't assume that the node has credentials to talk to the
// cloudprovider from arbitrary nodes. At most, we should talk to a
// local metadata server here.
if node.Spec.ProviderID == "" {
node.Spec.ProviderID, err = cloudprovider.GetInstanceProviderID(kl.cloud, kl.nodeName)
if err != nil {
return nil, err
}
} Source: What #49836 seeks to address is the fact this does not happen for out-of-tree cloud providers. |
@rrati fyi |
@prydie you are right, I think we should stick to the same logic as the kubelet where possible, I changed the code here so we check that provider id is |
LGTM. Requesting a second pair of eyes to take a look - @luxas |
LGTM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/release-note |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andrewsykim, luxas Associated issue: 49836 The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing |
hehe, forgot that I'm an approver of this code ;) |
Automatic merge from submit-queue (batch tested with PRs 49850, 47782, 50595, 50730, 51341) |
Automatic merge from submit-queue (batch tested with PRs 51819, 51706, 51761, 51818, 51500) Fix providerID update validation **What this PR does / why we need it**: Cloud controller manager supports updating providerID in #50730, but the node updating was blocked by validation rule. This is to propose a fix for updating the validation rule by allowing altering spec.providerID if not set. Please check #51596 for detail **Which issue this PR fixes** fixes #51596 **Special notes for your reviewer**: **Release note**: ```release-note ```
What this PR does / why we need it:
Cloud Controller Manager now sets
Node.Spec.ProviderID
.Which issue this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close that issue when PR gets merged): fixes ##49836.
Special notes for your reviewer: