Use NoSchedule taint in Node controller instead of filter node in scheduler #42406
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. The following people have approved this PR: resouer. Needs approval from an approver in each of these OWNERS files. We suggest the following people:
@kubernetes/sig-scheduling-pr-reviews
Please squash commits. I won't have time to review this until KubeCon (i.e. end of March), as I have a planned vacation :). We can wait until then, or ask @davidopp to do the review once 1.7 opens.
@gmarek Both work for me. I'm fixing the tests now; there are test cases that need to be updated since we changed the scheduler's basic assumption. I'll squash after fixing them all.
@kubernetes/sig-scheduling-pr-reviews
// TODO(harry) move this to metav1
// TaintNodeNoSchedule would be automatically added by node controller when
// node runs into an unschedulable condition and removed when node becomes schedulable.
TaintNodeNoSchedule = "node.alpha.kubernetes.io/noSchedule"
I think we should move it to well_known_labels.go.
Yes, I will do it after this, since it's a vendor update.
// - NodeReady condition status is ConditionTrue,
// - NodeOutOfDisk condition status is ConditionFalse,
// - NodeNetworkUnavailable condition status is ConditionFalse.
func nodeSchedulable(cond v1.NodeCondition) bool {
I cannot find anywhere we use it; it seems to duplicate the unSchedulable func, or did I miss something?
Oh yes, this should be removed
To be honest I'm not sure about this change. We never had a discussion on how we want to move to a Taint-based model, nor on what's going to be responsible for adding those taints. E.g. it's not clear to me that Kubelet shouldn't be responsible for adding NoSchedule taints, and NC only for NoExecute ones. @davidopp @dchen1107 @kubernetes/sig-node-pr-reviews
return changedToSchedulable, changedToUnSchedulable
}

func unSchedulable(cond *v1.NodeCondition) bool {
I don't like the negation here. Can you make it isSchedulable?
@@ -1121,6 +1212,20 @@ func (nc *NodeController) markNodeForTainting(node *v1.Node) bool {
return nc.zoneNotReadyOrUnreachableTainer[utilnode.GetZoneKey(node)].Add(node.Name, string(node.UID))
}

func (nc *NodeController) taintNodeNoSchedule(node *v1.Node) error {
There's no need for those helper functions.
}

func unSchedulable(cond *v1.NodeCondition) bool {
// We consider the node for scheduling only when its:
I guess we could do
// isSchedulable is a flight check for node readiness, disk, and network.
as the header comment and delete the extra comments below since they're redundant to the implementation.
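For illustration, a renamed isSchedulable helper carrying that header comment might look roughly like this; the types below are simplified stand-ins for v1.NodeCondition, and the condition set is taken from the comment quoted earlier in the diff, so treat it as a sketch rather than the actual patch:

```go
package node

// Simplified stand-ins for the vendored v1.NodeCondition types.
type ConditionStatus string

const (
	ConditionTrue  ConditionStatus = "True"
	ConditionFalse ConditionStatus = "False"
)

type NodeConditionType string

const (
	NodeReady              NodeConditionType = "Ready"
	NodeOutOfDisk          NodeConditionType = "OutOfDisk"
	NodeNetworkUnavailable NodeConditionType = "NetworkUnavailable"
)

type NodeCondition struct {
	Type   NodeConditionType
	Status ConditionStatus
}

// isSchedulable is a flight check for node readiness, disk, and network.
func isSchedulable(cond *NodeCondition) bool {
	switch cond.Type {
	case NodeReady:
		return cond.Status == ConditionTrue
	case NodeOutOfDisk, NodeNetworkUnavailable:
		return cond.Status == ConditionFalse
	default:
		// Conditions the controller does not taint on never block scheduling here.
		return true
	}
}
```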
// Create a new schedulable node, since we're first going to apply
// the unschedulable condition and verify that pods aren't scheduled.
// Note that scheduler only filter out nodes with unschedulable=true spec, so no need to check node conditions anymore.
The "anymore" part is confusing - not sure comments in the code should reference past versions of the code?
I think we should delete it. It is just for reviewers actually.
In #43815 (comment) it was suggested to do it in kubelet, so I agree this hasn't really been decided yet.
@davidopp I'm generally not against doing it in the kubelet, since that's where the status change happens. But I'm not sure if there's any difference between these two approaches?
@resouer @davidopp @gmarek @vishh With Kubelet setting taints, how would we deal with version skew? In the short term, older kubelets won't support taints. In the long term, older kubelets may not support the same sets of taints. What's the plan for configuring node labels, opaque resources, dedicated nodes, etc.?
I'm not sure those are all the same category, but to answer:
- node labels: kubelet publishes some and cluster admin can add them manually.
- opaque resources: DaemonSet on node patches NodeStatus to include these resources.
- dedicated nodes: today, cluster admin manually adds taints, and user is responsible for setting tolerations. In the "some day if we ever get around to it" future, some command-line tool would take a high-level description of the dedicated node groups and set the taints and configure an admission controller to add tolerations.
Regarding your first question, assuming we only need to support one version of skew: 1.7 kubelet can start publishing taints. This will work with 1.6 master or 1.7 master. (1.6 master already understands taints, e.g. API server supports it and scheduler uses it in predicate.) If you use 1.7 kubelet (with either 1.6 master or 1.7 master) the taints will be redundant since scheduler still uses node condition to block scheduling. 1.8 scheduler can remove the logic that blocks scheduling based on node conditions. Since both 1.7 kubelet and 1.8 kubelet will be publishing taints, this is safe (1.8 master will work with 1.8 kubelet or 1.7 kubelet). Regarding your second question: Adding new taints that should block scheduling shouldn't be a problem. If users want to tolerate them, they can add the tolerations before upgrading kubelet. But I don't think there's any skew issue.
Management of node labels is still an open problem. There exists no notion of ownership today for node labels and kubelet overwrites node labels. In general any new scheduling feature that depends on the nodes should (ideally) not be turned on at the master level until all the nodes can support that feature. As @davidopp mentioned, NodeConditions have to be respected by the scheduler until the least supported version of kubelet has switched to taints. If we support two older versions, then scheduler needs to support node conditions until v1.9, assuming >=v1.7 kubelets will switch to taints. Any changes to existing taints at the kubelet level should also respect the two older releases support policy.
We need to support 2 older releases of Kubelets, so 1.9. The reason that I asked about the other features is because ideally taints would work similarly to other scheduling-related attributes, such as node conditions (used for scheduling today), node labels, and node resources. I also realize that may not help much, though, since multiple agents set those properties: kubelet, node controller, node problem detector, node feature discovery, cluster bootstrap tools, etc. If Kubelet does not set taints, then it will need to convey equivalent information through other means, such as conditions. As mentioned elsewhere, if we wanted to eliminate conditions, we'd have to at least add reason and message to taints, and would need to create a NoEffect taint, since we have a number of "non-critical" conditions reported in Borg that I'd like to be able to report. Note also that node conditions replaced phase #7856, taints wouldn't make as much sense as conditions for other resource types (e.g., Pods, ReplicaSets, Services), and we do need some consistent way of reasoning about orchestration-relevant "states" across resource types (#34363). Given the version-skew challenges, the likelihood of needing to customize taints in different cluster environments, and the unlikelihood of eliminating conditions, for backward compatibility if nothing else, I'm inclined to favor the flexibility of controlling taint policy outside of Kubelet. But it doesn't seem like an irreversible decision.
I don't have a well-thought-through opinion on generic taints, but I do have one about taints that correspond to conditions. I think that they should be set by the same entity that sets the condition, for consistent user experience (no additional lag or new problems when something breaks - behavior would be exactly as it is today).
I think @gmarek brings up a good point about setting condition and taint in the same place for consistency. OTOH I think @bgrant0607's point that we avoid the version skew issue if we only set taints in the master, is a very good one. And setting all taints in the same place has an advantage from a code understandability perspective, even if that means it is sometimes a different place than where the corresponding NodeCondition was set. So I'm kinda leaning towards favoring doing it in the NodeController (like this PR).
@@ -77,6 +77,16 @@ var (
Key: metav1.TaintNodeNotReady,
Effect: v1.TaintEffectNoExecute,
}

// TODO(resouer) will move this to metav1 well_known_labels.go.
Why not do it now?
metav1 is in a separate repo now; will do it after this.
// TODO(resouer) will move this to metav1 well_known_labels.go.
// TaintNodeNoSchedule would be automatically added by node controller when
// node runs into an unschedulable condition and removed when node becomes schedulable.
TaintNodeNoSchedule = "node.alpha.kubernetes.io/noSchedule"
Isn't it already beta?
@@ -601,6 +611,80 @@ func (nc *NodeController) Run() {
}()
}

// addOrRemoveNoScheduleTaintByCondition adds or removes the NoSchedule taint on this node based on its condition change.
Explain what return values mean.
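For example, the header comment could spell out the return values along these lines (the exact semantics are inferred from the surrounding diff, so this is only a guess at the intended wording):

```go
// addOrRemoveNoScheduleTaintByCondition reconciles the NoSchedule taint on a
// node with its observed conditions.
//
// Return values (assumed semantics):
//   - changedToSchedulable:   the conditions that previously made the node
//     unschedulable have all cleared, so the NoSchedule taint should be removed.
//   - changedToUnSchedulable: at least one observed condition now makes the
//     node unschedulable, so the NoSchedule taint should be added.
//
// Both are false when nothing relevant changed since the last saved status.
```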
changedToSchedulable = true
}
}
// if condition == nil && savedCondition != nil, we do nothing
Remove
} else {
// Otherwise, there is no saved node status; it's always worth checking conditions
for _, condition := range observedMap {
if condition != nil && !isSchedulable(condition) {
wrap this in a function
There's a break in this part, maybe leave it as is?
You can return bool and break depending on the value. Please don't duplicate nontrivial code.
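For illustration, reusing the simplified NodeCondition and isSchedulable from the sketch above, the de-duplicated shape being suggested could look like this (the helper names and the map key type are hypothetical):

```go
// blocksScheduling wraps the nil-and-schedulability check so that both the
// "saved status" and "no saved status" branches can share it.
func blocksScheduling(cond *NodeCondition) bool {
	return cond != nil && !isSchedulable(cond)
}

// anyBlocksScheduling shows the "return bool and break" idea: the caller sets
// changedToUnSchedulable as soon as this returns true, instead of duplicating
// the loop body in both branches.
func anyBlocksScheduling(observed map[NodeConditionType]*NodeCondition) bool {
	for _, cond := range observed {
		if blocksScheduling(cond) {
			return true
		}
	}
	return false
}
```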
}

if changedToUnSchedulable {
if err := controller.AddOrUpdateTaintOnNode(nc.kubeClient, node.Name, NoScheduleTaintTemplate); err != nil {
As you're going to update NodeStatus either way, it's better to use v1.AddOrUpdateTaint, as it won't issue a call to the API server. Same for removal. This will require a change in tryUpdateNodeStatus, so it'll always(-ish) check if there's a reason to update the NodeStatus, not only when the Ready condition is updated.
But tryUpdateNodeStatus will not update the taint here (which is in node spec). Am I understanding right?
Right, I was thinking that maybe we should just start updating whole Nodes, but you're probably right that we shouldn't... We should start batching Spec updates at some point.
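To illustrate the trade-off being discussed here - one API call per taint change versus mutating the in-memory copy and folding it into a single spec update - a simplified, self-contained sketch (the types and the ensureTaint helper are stand-ins, not the real v1 or controller helpers):

```go
package node

// Simplified stand-ins for v1.Taint / v1.Node.
type Taint struct {
	Key    string
	Effect string
}

type NodeSpec struct{ Taints []Taint }

type Node struct {
	Name string
	Spec NodeSpec
}

// ensureTaint mutates the in-memory copy of the node and reports whether
// anything changed, so the caller can fold the change into one update/patch
// instead of issuing a separate API call for every taint it adds or removes.
func ensureTaint(node *Node, taint Taint, want bool) bool {
	idx := -1
	for i, t := range node.Spec.Taints {
		if t.Key == taint.Key && t.Effect == taint.Effect {
			idx = i
			break
		}
	}
	switch {
	case want && idx == -1:
		node.Spec.Taints = append(node.Spec.Taints, taint)
		return true
	case !want && idx != -1:
		node.Spec.Taints = append(node.Spec.Taints[:idx], node.Spec.Taints[idx+1:]...)
		return true
	default:
		return false
	}
}
```

Whether that single write happens in tryUpdateNodeStatus or in a separate spec-batching path is exactly the open question in this thread.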
@@ -471,23 +471,6 @@ func (f *ConfigFactory) ResponsibleForPod(pod *v1.Pod) bool {

func getNodeConditionPredicate() corelisters.NodeConditionPredicate {
return func(node *v1.Node) bool {
for i := range node.Status.Conditions {
This is turning on a beta feature which substitutes for a very important GA one. We need to have a flag gate to allow users to disable this change if it proves problematic.
Sure. Where should we add the flag gate? Scheduler or Controller?
Scheduler. Controller may create those taints no matter what.
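A minimal sketch of what a scheduler-side gate could look like; the flag name and the wiring are hypothetical (in practice it would go through the scheduler's existing options or a feature gate), and the condition check only mirrors the spirit of the predicate being removed in this PR:

```go
package scheduler

// Hypothetical switch; a real implementation would use a scheduler option or
// feature gate rather than a package-level variable.
var useTaintBasedNodeFiltering = false

// Simplified stand-ins for the node types seen by the node lister predicate.
type Condition struct {
	Type   string
	Status string
}

type Node struct{ Conditions []Condition }

// nodeConditionPredicate keeps the old condition-based filtering unless the
// gate is on; when it is on, nodes are never filtered here and the NoSchedule
// taint added by the node controller takes over.
func nodeConditionPredicate(node *Node) bool {
	if useTaintBasedNodeFiltering {
		return true
	}
	for _, c := range node.Conditions {
		if c.Type == "Ready" && c.Status != "True" {
			return false
		}
	}
	return true
}
```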
@jamiehannaford - I'm sorry, but I really need to drop everything that's not already in the last rounds of reviews.
To do it properly, @k82cn is working on a proposal on how to map Conditions to Taints.
Yes :). I'll try to draft a doc for it this week.
Please ping me back when the doc is ready.
Here's the design doc, please help to review :). https://docs.google.com/document/d/15Mg2GgxumUh9wQjc3LgalDeTUsVsBpfjQaBK4QB66-k/edit?usp=sharing
Force-pushed from 11c404f to b938099:
- Use patch instead of update
- Use tainter instead for actions
- Add test to taint node noschedule
- Remove condition check in integration test
- Add safe gate to controller based taint
- Update gen files
- Change condition format
@resouer PR needs rebase
Dead PR, just close it.
What this PR does / why we need it:
Use NoSchedule taint in Node controller instead of filter node in scheduler.

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #42001

cc @gmarek

Release note: