
Use NoSchedule taint in Node controller instead of filter node in scheduler #42406

Closed
wants to merge 1 commit

Conversation

@resouer (Contributor) commented Mar 2, 2017

What this PR does / why we need it:
Use NoSchedule taint in Node controller instead of filter node in scheduler

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #42001

cc @gmarek

Release note:

NONE

@resouer resouer requested a review from gmarek March 2, 2017 09:01
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 2, 2017
@k8s-reviewable: This change is Reviewable

@k8s-github-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

The following people have approved this PR: resouer

Needs approval from an approver in each of these OWNERS Files:

We suggest the following people:
cc @wojtek-t
You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note-none Denotes a PR that doesn't merit a release note. labels Mar 2, 2017
@aveshagarwal (Member):

@kubernetes/sig-scheduling-pr-reviews

@gmarek (Contributor) commented Mar 2, 2017

Please squash the commits. I won't have time to review this until KubeCon (i.e. the end of March), as I have a planned vacation :) We can wait until then, or ask @davidopp to do the review once 1.7 opens.

@resouer (Contributor, Author) commented Mar 3, 2017

@gmarek Both work for me.

I am now fixing the tests; there are test cases that need to be updated since we changed a basic assumption of the scheduler. I will squash after fixing them all.

@davidopp (Member) commented Mar 3, 2017

@kubernetes/sig-scheduling-pr-reviews

// TODO(harry) move this to metav1
// TaintNodeNoSchedule would be automatically added by the node controller when
// a node runs into an unschedulable condition, and removed when the node becomes schedulable.
TaintNodeNoSchedule = "node.alpha.kubernetes.io/noSchedule"
Member:

I think we should move it to well_known_labels.go.

Contributor Author:

Yes, I will do it after this, since it's a vendor update.

// - NodeReady condition status is ConditionTrue,
// - NodeOutOfDisk condition status is ConditionFalse,
// - NodeNetworkUnavailable condition status is ConditionFalse.
func nodeSchedulable(cond v1.NodeCondition) bool {
Contributor:

I cannot find anywhere we use it; it seems to duplicate the unSchedulable func, or have I missed something?

Contributor Author:

Oh yes, this should be removed.

@gmarek (Contributor) left a comment

To be honest, I'm not sure about this change. We never had a discussion on how we want to move to a taint-based model, nor on what's going to be responsible for adding those taints. E.g. it's not clear to me that the Kubelet shouldn't be responsible for adding NoSchedule taints, and the NodeController only for NoExecute ones. @davidopp @dchen1107 @kubernetes/sig-node-pr-reviews

return changedToSchedulable, changedToUnSchedulable
}

func unSchedulable(cond *v1.NodeCondition) bool {
@gmarek (Contributor) commented Apr 3, 2017

I don't like the negation here. Can you make it isSchedulable?

@@ -1121,6 +1212,20 @@ func (nc *NodeController) markNodeForTainting(node *v1.Node) bool {
return nc.zoneNotReadyOrUnreachableTainer[utilnode.GetZoneKey(node)].Add(node.Name, string(node.UID))
}

func (nc *NodeController) taintNodeNoSchedule(node *v1.Node) error {
Contributor:

There's no need for those helper functions.

}

func unSchedulable(cond *v1.NodeCondition) bool {
// We consider the node for scheduling only when its:
Member:

I guess we could do

// isSchedulable is a flight check for node readiness, disk, and network.

as the header comment, and delete the extra comments below since they're redundant with the implementation.
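
A minimal sketch of what the renamed helper could look like, assuming only the three conditions quoted above are relevant (1.7-era import path; the actual PR code may differ):

```go
import v1 "k8s.io/kubernetes/pkg/api/v1"

// isSchedulable is a flight check for node readiness, disk, and network:
// NodeReady must be True; NodeOutOfDisk and NodeNetworkUnavailable must be False.
func isSchedulable(cond *v1.NodeCondition) bool {
	switch cond.Type {
	case v1.NodeReady:
		return cond.Status == v1.ConditionTrue
	case v1.NodeOutOfDisk, v1.NodeNetworkUnavailable:
		return cond.Status == v1.ConditionFalse
	default:
		// Other condition types do not affect schedulability here.
		return true
	}
}
```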

// Create a new schedulable node, since we're first going to apply
// the unschedulable condition and verify that pods aren't scheduled.
// Note that the scheduler only filters out nodes with unschedulable=true in the spec, so there is no need to check node conditions anymore.
Member:

The "anymore" part is confusing. I'm not sure comments in the code should reference past versions of the code?

Contributor Author:

I think we should delete it. It is just for reviewers actually.

@davidopp (Member) commented Apr 5, 2017

To be honest, I'm not sure about this change. We never had a discussion on how we want to move to a taint-based model, nor on what's going to be responsible for adding those taints. E.g. it's not clear to me that the Kubelet shouldn't be responsible for adding NoSchedule taints, and the NodeController only for NoExecute ones.

In #43815 (comment) it was suggested to do it in kubelet, so I agree this hasn't really been decided yet.

@resouer (Contributor, Author) commented Apr 5, 2017

@davidopp I'm generally not against the kubelet, since it's where the status change happens. But I'm not sure whether there's any real difference between these two approaches?

@bgrant0607 (Member) commented Apr 5, 2017

@resouer @davidopp @gmarek @vishh

With Kubelet setting taints, how would we deal with version skew?

In the short term, older kubelets won't support taints.

In the long term, older kubelets may not support the same sets of taints.

What's the plan for configuring node labels, opaque resources, dedicated nodes, etc.?

@davidopp (Member) commented Apr 5, 2017

What's the plan for configuring node labels, opaque resources, dedicated nodes, etc.?

I'm not sure those are all in the same category, but to answer:

node labels: kubelet publishes some and cluster admin can add them manually.

opaque resources: a DaemonSet on the node patches NodeStatus to include these resources.

dedicated nodes: today, cluster admin manually adds taints, and user is responsible for setting tolerations. In the "some day if we ever get around to it" future, some command-line tool would take a high-level description of the dedicated node groups and set the taints and configure an admission controller to add tolerations.

With Kubelet setting taints, how would we deal with version skew?

In the short term, older kubelets won't support taints.

In the long term, older kubelets may not support the same sets of taints.

Regarding your first question, assuming we only need to support one version of skew:

1.7 kubelet can start publishing taints. This will work with a 1.6 master or a 1.7 master. (The 1.6 master already understands taints: the API server supports them and the scheduler uses them in a predicate.) If you use a 1.7 kubelet (with either a 1.6 master or a 1.7 master), the taints will be redundant, since the scheduler still uses node conditions to block scheduling.

1.8 scheduler can remove the logic that blocks scheduling based on node conditions. Since both 1.7 kubelet and 1.8 kubelet will be publishing taints, this is safe (1.8 master will work with 1.8 kubelet or 1.7 kubelet).

Regarding your second question:

Adding new taints that should block scheduling shouldn't be a problem. If users want to tolerate them, they can add the tolerations before upgrading kubelet. But I don't think there's any skew issue.

@vishh (Contributor) commented Apr 6, 2017

Management of node labels is still an open problem. There is no notion of ownership for node labels today, and the kubelet overwrites node labels.

In general any new scheduling feature that depends on the nodes should (ideally) not be turned on at the master level until all the nodes can support that feature.

As @davidopp mentioned, NodeConditions have to be respected by the scheduler until the oldest supported version of the kubelet has switched to taints. If we support two older versions, then the scheduler needs to support node conditions until v1.9, assuming >=v1.7 kubelets switch to taints.

Any changes to existing taints at the kubelet level should also respect the two older releases support policy.

@bgrant0607 (Member):

We need to support 2 older releases of Kubelets, so 1.9.

https://github.com/kubernetes/community/blob/master/contributors/design-proposals/versioning.md#supported-releases-and-component-skew

The reason I asked about the other features is that, ideally, taints would work similarly to other scheduling-related attributes, such as node conditions (used for scheduling today), node labels, and node resources. I also realize that may not help much, though, since multiple agents set those properties: kubelet, node controller, node problem detector, node feature discovery, cluster bootstrap tools, etc.

If the Kubelet does not set taints, then it will need to convey equivalent information through other means, such as conditions. As mentioned elsewhere, if we wanted to eliminate conditions, we'd have to at least add reason and message to taints, and would need to create a NoEffect taint, since we have a number of "non-critical" conditions reported in Borg that I'd like to be able to report. Note also that node conditions replaced phase (#7856), that taints wouldn't make as much sense as conditions for other resource types (e.g., Pods, ReplicaSets, Services), and that we do need some consistent way of reasoning about orchestration-relevant "states" across resource types (#34363).

Given the version-skew challenges, the likelihood of needing to customize taints in different cluster environments, and the unlikelihood of eliminating conditions, for backward compatibility if nothing else, I'm inclined to favor the flexibility of controlling taint policy outside of Kubelet. But it doesn't seem like an irreversible decision.

@gmarek (Contributor) commented Apr 6, 2017

I don't have a well-thought-through opinion on generic taints, but I do have one about taints that correspond to conditions. I think they should be set by the same entity that sets the condition, for a consistent user experience (no additional lag or new problems when something breaks - behavior would be exactly as it is today).

@davidopp (Member):

I think @gmarek brings up a good point about setting condition and taint in the same place for consistency.

OTOH I think @bgrant0607's point, that we avoid the version skew issue if we only set taints in the master, is a very good one. And setting all taints in the same place has an advantage from a code-understandability perspective, even if that means it is sometimes a different place than where the corresponding NodeCondition was set.

So I'm kinda leaning towards favoring doing it in the NodeController (like this PR).

@gmarek gmarek self-assigned this Apr 14, 2017
@@ -77,6 +77,16 @@ var (
Key: metav1.TaintNodeNotReady,
Effect: v1.TaintEffectNoExecute,
}

// TODO(resouer) will move this to metav1 well_known_labels.go.
Contributor:

Why not do it now?

Contributor Author:

metav1 is in a separate repo now; I will do it after this.

// TODO(resouer) will move this to metav1 well_known_labels.go.
// TaintNodeNoSchedule would be automatically added by the node controller when
// a node runs into an unschedulable condition, and removed when the node becomes schedulable.
TaintNodeNoSchedule = "node.alpha.kubernetes.io/noSchedule"
Contributor:

Isn't it already beta?
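
For context, the NoScheduleTaintTemplate used at a call site further down would presumably mirror the NotReady template quoted above; a sketch, not the PR's actual code:

```go
// NoScheduleTaintTemplate is the taint the node controller would apply to
// unschedulable nodes (sketch; key taken from the constant quoted above).
NoScheduleTaintTemplate = &v1.Taint{
	Key:    TaintNodeNoSchedule, // "node.alpha.kubernetes.io/noSchedule"
	Effect: v1.TaintEffectNoSchedule,
}
```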

@@ -601,6 +611,80 @@ func (nc *NodeController) Run() {
}()
}

// addOrRemoveNoScheduleTaintByCondition taints or removes the NoSchedule taint on this node based on its condition change.
Contributor:

Explain what return values mean.
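
One way to address this, sketched as a doc comment; the tuple names come from the return statement quoted earlier, and the exact semantics are assumed from those names:

```go
// addOrRemoveNoScheduleTaintByCondition taints or removes the NoSchedule
// taint on this node based on its condition change. It returns
// (changedToSchedulable, changedToUnSchedulable): the first is true when the
// node transitioned from unschedulable to schedulable, the second when it
// transitioned the other way; both are false when nothing relevant changed.
```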

changedToSchedulable = true
}
}
// if condition == nil && savedCondition != nil, we do nothing
Contributor:

Remove

} else {
// Otherwise, there is no saved node status; it is always worth checking the conditions.
for _, condition := range observedMap {
if condition != nil && !isSchedulable(condition) {
Contributor:

wrap this in a function

Contributor Author:

There's a break in this part, maybe leave it as is?

Contributor:

You can return bool and break depending on the value. Please don't duplicate nontrivial code.
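
A minimal sketch of the suggested refactor, using a hypothetical helper name; the caller can then break (or taint) based on the returned bool instead of duplicating the loop body:

```go
// hasUnschedulableCondition reports whether any observed condition marks the
// node as unschedulable. (Hypothetical name; observedMap is assumed to map
// condition types to possibly-nil *v1.NodeCondition values.)
func hasUnschedulableCondition(observedMap map[v1.NodeConditionType]*v1.NodeCondition) bool {
	for _, condition := range observedMap {
		if condition != nil && !isSchedulable(condition) {
			return true
		}
	}
	return false
}
```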

}

if changedToUnSchedulable {
if err := controller.AddOrUpdateTaintOnNode(nc.kubeClient, node.Name, NoScheduleTaintTemplate); err != nil {
Contributor:

As you're going to update the NodeStatus either way, it's better to use v1.AddOrUpdateTaint, as it won't issue a call to the API server. Same for removal.

This will require a change in tryUpdateNodeStatus, so that it will always(-ish) check whether there's a reason to update the NodeStatus, not only when the Ready condition is updated.

@resouer (Contributor, Author) commented Apr 27, 2017

But tryUpdateNodeStatus will not update the taint here (which is in the node spec). Am I understanding that right?

Contributor:

Right, I was thinking that maybe we should just start updating whole Nodes, but you're probably right that we shouldn't... We should start batching Spec updates at some point.
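
For readers following the thread, the trade-off under discussion, sketched in comments (the v1.AddOrUpdateTaint signature is assumed, not verified against the 1.7 tree):

```go
// Per-change API call, as in the PR: one request to the API server per taint
// update (this call is quoted verbatim from the diff above).
//   err := controller.AddOrUpdateTaintOnNode(nc.kubeClient, node.Name, NoScheduleTaintTemplate)
//
// In-memory mutation, as gmarek suggests: update the local object and let a
// later write carry it; no immediate API call (assumed signature).
//   newNode, changed, err := v1.AddOrUpdateTaint(node, NoScheduleTaintTemplate)
//
// The catch the thread lands on: taints live in node.Spec, while
// tryUpdateNodeStatus only writes node.Status, so batching spec updates
// would need separate plumbing.
```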

@@ -471,23 +471,6 @@ func (f *ConfigFactory) ResponsibleForPod(pod *v1.Pod) bool {

func getNodeConditionPredicate() corelisters.NodeConditionPredicate {
return func(node *v1.Node) bool {
for i := range node.Status.Conditions {
Contributor:

This is turning on a beta feature which substitutes for a very important GA one. We need a flag gate to allow users to disable this change if it proves problematic.

Contributor Author:

Sure. Where should we add the flag gate? Scheduler or Controller?

Contributor:

Scheduler. Controller may create those taints no matter what.
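
A minimal sketch of what a scheduler-side gate could look like, assuming a hypothetical gate name (a TaintNodesByCondition feature gate did land in later releases, but it is not part of this PR as written):

```go
import (
	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/kubernetes/pkg/features"
)

func getNodeConditionPredicate() corelisters.NodeConditionPredicate {
	return func(node *v1.Node) bool {
		// Hypothetical gate: when enabled, rely on the NoSchedule taint set by
		// the node controller and skip the condition check entirely.
		if utilfeature.DefaultFeatureGate.Enabled(features.TaintNodesByCondition) {
			return true
		}
		// Otherwise keep the GA behavior: filter out nodes that are not Ready.
		for i := range node.Status.Conditions {
			cond := &node.Status.Conditions[i]
			if cond.Type == v1.NodeReady && cond.Status != v1.ConditionTrue {
				return false
			}
		}
		return true
	}
}
```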

@jamiehannaford (Contributor):

@gmarek That's fair enough. If somebody could review/merge this PR for 1.7, there's a chance we could also fix #45717, which would be great since it'd help stabilise self-hosted. But I totally get it if folks are overloaded.

@gmarek (Contributor) commented May 22, 2017

@jamiehannaford - I'm sorry, but I really need to drop everything that's not already in last rounds of reviews.

@resouer resouer force-pushed the node-controller branch from dcc8e0e to 37c0076 on May 23, 2017 01:17
@gmarek gmarek modified the milestones: next-candidate, v1.7 May 30, 2017
@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 30, 2017
@resouer resouer force-pushed the node-controller branch from 37c0076 to f16fc29 on June 4, 2017 08:19
@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 4, 2017
@resouer resouer force-pushed the node-controller branch from f16fc29 to 7f7201c on June 4, 2017 09:06
@k8s-github-robot k8s-github-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 4, 2017
@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 22, 2017
@luxas (Member) commented Jul 7, 2017

Is there gonna be a push on this PR during v1.8, @davidopp @gmarek @resouer @k82cn?

@gmarek (Contributor) commented Jul 7, 2017

To do it properly, @k82cn is working on a proposal for how to map Conditions to Taints.

@k82cn (Member) commented Jul 9, 2017

Yes :). I'll try to draft a doc for it this week.

@jamiehannaford (Contributor):

@k82cn @gmarek Great, feel free to link it here once we have a proposal. I can also help out with implementing this.

@resouer (Contributor, Author) commented Jul 12, 2017

Please ping me back when the doc is ready.

@k82cn (Member) commented Jul 17, 2017

Here's the design doc, please help review :).

https://docs.google.com/document/d/15Mg2GgxumUh9wQjc3LgalDeTUsVsBpfjQaBK4QB66-k/edit?usp=sharing

@resouer resouer force-pushed the node-controller branch 2 times, most recently from 11c404f to b938099 on July 18, 2017 03:59
@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 18, 2017
Use patch instead of update

Use tainter instead for actions

Add test to taint node noschedule

Remove condition check in integration test

Add safe gate to controller based taint

Update gen files

Change condition format
@k8s-github-robot

@resouer PR needs rebase

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 25, 2017
@resouer (Contributor, Author) commented Jul 31, 2017

Dead PR, just close it.

@resouer resouer closed this Jul 31, 2017
@resouer resouer deleted the node-controller branch September 13, 2017 08:13
Labels
cncf-cla: yes · needs-rebase · priority/critical-urgent · release-note-none · sig/scheduling · size/XL
Development

Successfully merging this pull request may close these issues.

NodeController should add NoSchedule taints and we should get rid of getNodeConditionPredicate()