Use NoSchedule taint in Node controller instead of filter node in scheduler #42406
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. The following people have approved this PR: resouer. Needs approval from an approver in each of these OWNERS files. We suggest the following people:
@kubernetes/sig-scheduling-pr-reviews
Please squash commits. I won't have time to review this until KubeCon (i.e. end of March), as I have a planned vacation :). We can wait until then, or ask @davidopp to do the review once 1.7 opens.
@gmarek Both work for me. I'm fixing the tests now; there are test cases that need to be updated since we changed the scheduler's basic assumption. I'll squash after fixing them all.
@kubernetes/sig-scheduling-pr-reviews
// TODO(harry) move this to metav1
// TaintNodeNoSchedule would be automatically added by node controller when
// node runs into an unschedulable condition and removed when node becomes schedulable.
TaintNodeNoSchedule = "node.alpha.kubernetes.io/noSchedule"
I think we should move it to well_known_labels.go.
Yes, I will do it after this, since it's a vendor update.
// - NodeReady condition status is ConditionTrue,
// - NodeOutOfDisk condition status is ConditionFalse,
// - NodeNetworkUnavailable condition status is ConditionFalse.
func nodeSchedulable(cond v1.NodeCondition) bool {
I cannot find anywhere we use it; it seems to duplicate the unSchedulable func, or did I miss something?
Oh yes, this should be removed
To be honest I'm not sure about this change. We never had a discussion on how we want to move to a Taint-based model, nor on what's going to be responsible for adding those taints. E.g. it's not clear to me that Kubelet shouldn't be responsible for adding NoSchedule taints, and NC only for NoExecute ones. @davidopp @dchen1107 @kubernetes/sig-node-pr-reviews
return changedToSchedulable, changedToUnSchedulable
}

func unSchedulable(cond *v1.NodeCondition) bool {
I don't like the negation here. Can you make it isSchedulable?
@@ -1121,6 +1212,20 @@ func (nc *NodeController) markNodeForTainting(node *v1.Node) bool {
return nc.zoneNotReadyOrUnreachableTainer[utilnode.GetZoneKey(node)].Add(node.Name, string(node.UID))
}

func (nc *NodeController) taintNodeNoSchedule(node *v1.Node) error {
There's no need for those helper functions.
}

func unSchedulable(cond *v1.NodeCondition) bool {
// We consider the node for scheduling only when its:
I guess we could do
// isSchedulable is a flight check for node readiness, disk, and network.
as the header comment and delete the extra comments below since they're redundant to the implementation.
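For illustration, a renamed isSchedulable helper carrying that header comment might look roughly like this; the types below are simplified stand-ins for v1.NodeCondition, and the condition set is taken from the comment quoted earlier in the diff, so treat it as a sketch rather than the actual patch:

```go
package node

// Simplified stand-ins for the vendored v1.NodeCondition types.
type ConditionStatus string

const (
	ConditionTrue  ConditionStatus = "True"
	ConditionFalse ConditionStatus = "False"
)

type NodeConditionType string

const (
	NodeReady              NodeConditionType = "Ready"
	NodeOutOfDisk          NodeConditionType = "OutOfDisk"
	NodeNetworkUnavailable NodeConditionType = "NetworkUnavailable"
)

type NodeCondition struct {
	Type   NodeConditionType
	Status ConditionStatus
}

// isSchedulable is a flight check for node readiness, disk, and network.
func isSchedulable(cond *NodeCondition) bool {
	switch cond.Type {
	case NodeReady:
		return cond.Status == ConditionTrue
	case NodeOutOfDisk, NodeNetworkUnavailable:
		return cond.Status == ConditionFalse
	default:
		// Conditions the controller does not taint on never block scheduling here.
		return true
	}
}
```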
// Create a new schedulable node, since we're first going to apply
// the unschedulable condition and verify that pods aren't scheduled.
// Note that scheduler only filter out nodes with unschedulable=true spec, so no need to check node conditions anymore.
The "anymore" part is confusing - not sure comments in the code should reference past versions of the code?
I think we should delete it. It is just for reviewers actually.
In #43815 (comment) it was suggested to do it in kubelet, so I agree this hasn't really been decided yet.
@davidopp I'm generally not against doing it in the kubelet, since that's where the status change happens. But I'm not sure if there's any difference between these two approaches?
@resouer @davidopp @gmarek @vishh With Kubelet setting taints, how would we deal with version skew? In the short term, older kubelets won't support taints. In the long term, older kubelets may not support the same sets of taints. What's the plan for configuring node labels, opaque resources, dedicated nodes, etc.?
I'm not sure those are all the same category, but to answer:
- node labels: kubelet publishes some and cluster admin can add them manually.
- opaque resources: DaemonSet on node patches NodeStatus to include these resources.
- dedicated nodes: today, cluster admin manually adds taints, and user is responsible for setting tolerations. In the "some day if we ever get around to it" future, some command-line tool would take a high-level description of the dedicated node groups and set the taints and configure an admission controller to add tolerations.
Regarding your first question, assuming we only need to support one version of skew: 1.7 kubelet can start publishing taints. This will work with 1.6 master or 1.7 master. (1.6 master already understands taints, e.g. API server supports it and scheduler uses it in predicate.) If you use 1.7 kubelet (with either 1.6 master or 1.7 master) the taints will be redundant since scheduler still uses node condition to block scheduling. 1.8 scheduler can remove the logic that blocks scheduling based on node conditions. Since both 1.7 kubelet and 1.8 kubelet will be publishing taints, this is safe (1.8 master will work with 1.8 kubelet or 1.7 kubelet). Regarding your second question: Adding new taints that should block scheduling shouldn't be a problem. If users want to tolerate them, they can add the tolerations before upgrading kubelet. But I don't think there's any skew issue.
Management of node labels is still an open problem. There exists no notion of ownership today for node labels and kubelet overwrites node labels. In general any new scheduling feature that depends on the nodes should (ideally) not be turned on at the master level until all the nodes can support that feature. As @davidopp mentioned, NodeConditions have to be respected by the scheduler until the least supported version of kubelet has switched to taints. If we support two older versions, then scheduler needs to support node conditions until v1.9, assuming >=v1.7 kubelets will switch to taints. Any changes to existing taints at the kubelet level should also respect the two older releases support policy.
We need to support 2 older releases of Kubelets, so 1.9. The reason that I asked about the other features is because ideally taints would work similarly to other scheduling-related attributes, such as node conditions (used for scheduling today), node labels, and node resources. I also realize that may not help much, though, since multiple agents set those properties: kubelet, node controller, node problem detector, node feature discovery, cluster bootstrap tools, etc. If Kubelet does not set taints, then it will need to convey equivalent information through other means, such as conditions. As mentioned elsewhere, if we wanted to eliminate conditions, we'd have to at least add reason and message to taints, and would need to create a NoEffect taint, since we have a number of "non-critical" conditions reported in Borg that I'd like to be able to report. Note also that node conditions replaced phase #7856, taints wouldn't make as much sense as conditions for other resource types (e.g., Pods, ReplicaSets, Services), and we do need some consistent way of reasoning about orchestration-relevant "states" across resource types (#34363). Given the version-skew challenges, the likelihood of needing to customize taints in different cluster environments, and the unlikelihood of eliminating conditions, for backward compatibility if nothing else, I'm inclined to favor the flexibility of controlling taint policy outside of Kubelet. But it doesn't seem like an irreversible decision.
I don't have a well-thought-through opinion on generic taints, but I do have one about taints that correspond to conditions. I think that they should be set by the same entity that sets the condition, for consistent user experience (no additional lag or new problems when something breaks - behavior would be exactly as it is today).
I think @gmarek brings up a good point about setting condition and taint in the same place for consistency. OTOH I think @bgrant0607's point that we avoid the version skew issue if we only set taints in the master, is a very good one. And setting all taints in the same place has an advantage from a code understandability perspective, even if that means it is sometimes a different place than where the corresponding NodeCondition was set. So I'm kinda leaning towards favoring doing it in the NodeController (like this PR).
@@ -77,6 +77,16 @@ var (
Key: metav1.TaintNodeNotReady,
Effect: v1.TaintEffectNoExecute,
}

// TODO(resouer) will move this to metav1 well_known_labels.go.
Why not do it now?
metav1 is in a separate repo now; will do it after this.
// TODO(resouer) will move this to metav1 well_known_labels.go.
// TaintNodeNoSchedule would be automatically added by node controller when
// node runs into an unschedulable condition and removed when node becomes schedulable.
TaintNodeNoSchedule = "node.alpha.kubernetes.io/noSchedule"
Isn't it already beta?
@@ -601,6 +611,80 @@ func (nc *NodeController) Run() {
}()
}

// addOrRemoveNoScheduleTaintByCondition adds or removes the NoSchedule taint on this node based on its condition change.
Explain what return values mean.
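For example, the header comment could spell out the return values along these lines (the exact semantics are inferred from the surrounding diff, so this is only a guess at the intended wording):

```go
// addOrRemoveNoScheduleTaintByCondition reconciles the NoSchedule taint on a
// node with its observed conditions.
//
// Return values (assumed semantics):
//   - changedToSchedulable:   the conditions that previously made the node
//     unschedulable have all cleared, so the NoSchedule taint should be removed.
//   - changedToUnSchedulable: at least one observed condition now makes the
//     node unschedulable, so the NoSchedule taint should be added.
//
// Both are false when nothing relevant changed since the last saved status.
```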
changedToSchedulable = true
}
}
// if condition == nil && savedCondition != nil, we do nothing
Remove
} else {
// Otherwise, there is no saved node status; it's always worth checking conditions
for _, condition := range observedMap {
if condition != nil && !isSchedulable(condition) {
wrap this in a function
There's a break in this part, maybe leave it as is?
You can return bool and break depending on the value. Please don't duplicate nontrivial code.
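For illustration, reusing the simplified NodeCondition and isSchedulable from the sketch above, the de-duplicated shape being suggested could look like this (the helper names and the map key type are hypothetical):

```go
// blocksScheduling wraps the nil-and-schedulability check so that both the
// "saved status" and "no saved status" branches can share it.
func blocksScheduling(cond *NodeCondition) bool {
	return cond != nil && !isSchedulable(cond)
}

// anyBlocksScheduling shows the "return bool and break" idea: the caller sets
// changedToUnSchedulable as soon as this returns true, instead of duplicating
// the loop body in both branches.
func anyBlocksScheduling(observed map[NodeConditionType]*NodeCondition) bool {
	for _, cond := range observed {
		if blocksScheduling(cond) {
			return true
		}
	}
	return false
}
```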
}

if changedToUnSchedulable {
if err := controller.AddOrUpdateTaintOnNode(nc.kubeClient, node.Name, NoScheduleTaintTemplate); err != nil {
As you're going to update NodeStatus either way, it's better to use v1.AddOrUpdateTaint, as it won't issue a call to the API server. Same for removal. This will require a change in tryUpdateNodeStatus, so it'll always(-ish) check if there's a reason to update the NodeStatus, not only when the Ready condition is updated.
But tryUpdateNodeStatus will not update the taint here (which is in node spec). Am I understanding right?
Right, I was thinking that maybe we should just start updating whole Nodes, but you're probably right that we shouldn't... We should start batching Spec updates at some point.
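To illustrate the trade-off being discussed here - one API call per taint change versus mutating the in-memory copy and folding it into a single spec update - a simplified, self-contained sketch (the types and the ensureTaint helper are stand-ins, not the real v1 or controller helpers):

```go
package node

// Simplified stand-ins for v1.Taint / v1.Node.
type Taint struct {
	Key    string
	Effect string
}

type NodeSpec struct{ Taints []Taint }

type Node struct {
	Name string
	Spec NodeSpec
}

// ensureTaint mutates the in-memory copy of the node and reports whether
// anything changed, so the caller can fold the change into one update/patch
// instead of issuing a separate API call for every taint it adds or removes.
func ensureTaint(node *Node, taint Taint, want bool) bool {
	idx := -1
	for i, t := range node.Spec.Taints {
		if t.Key == taint.Key && t.Effect == taint.Effect {
			idx = i
			break
		}
	}
	switch {
	case want && idx == -1:
		node.Spec.Taints = append(node.Spec.Taints, taint)
		return true
	case !want && idx != -1:
		node.Spec.Taints = append(node.Spec.Taints[:idx], node.Spec.Taints[idx+1:]...)
		return true
	default:
		return false
	}
}
```

Whether that single write happens in tryUpdateNodeStatus or in a separate spec-batching path is exactly the open question in this thread.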
@@ -471,23 +471,6 @@ func (f *ConfigFactory) ResponsibleForPod(pod *v1.Pod) bool {

func getNodeConditionPredicate() corelisters.NodeConditionPredicate {
return func(node *v1.Node) bool {
for i := range node.Status.Conditions {
This is turning on a beta feature which substitutes for a very important GA one. We need to have a flag gate to allow users to disable this change if it proves problematic.
Sure. Where should we add the flag gate? Scheduler or Controller?
Scheduler. Controller may create those taints no matter what.
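A minimal sketch of what a scheduler-side gate could look like; the flag name and the wiring are hypothetical (in practice it would go through the scheduler's existing options or a feature gate), and the condition check only mirrors the spirit of the predicate being removed in this PR:

```go
package scheduler

// Hypothetical switch; a real implementation would use a scheduler option or
// feature gate rather than a package-level variable.
var useTaintBasedNodeFiltering = false

// Simplified stand-ins for the node types seen by the node lister predicate.
type Condition struct {
	Type   string
	Status string
}

type Node struct{ Conditions []Condition }

// nodeConditionPredicate keeps the old condition-based filtering unless the
// gate is on; when it is on, nodes are never filtered here and the NoSchedule
// taint added by the node controller takes over.
func nodeConditionPredicate(node *Node) bool {
	if useTaintBasedNodeFiltering {
		return true
	}
	for _, c := range node.Conditions {
		if c.Type == "Ready" && c.Status != "True" {
			return false
		}
	}
	return true
}
```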
@jamiehannaford - I'm sorry, but I really need to drop everything that's not already in the last rounds of reviews.
To do it properly, @k82cn is working on a proposal on how to map Conditions to Taints.
Yes :). I'll try to draft a doc for it this week.
Please ping me back when the doc is ready.
Here's the design doc, please help to review :). https://docs.google.com/document/d/15Mg2GgxumUh9wQjc3LgalDeTUsVsBpfjQaBK4QB66-k/edit?usp=sharing
Force-pushed from 11c404f to b938099:
- Use patch instead of update
- Use tainter instead for actions
- Add test to taint node noschedule
- Remove condition check in integration test
- Add safe gate to controller based taint
- Update gen files
- Change condition format
@resouer PR needs rebase
Dead PR, just close it.
What this PR does / why we need it:
Use NoSchedule taint in Node controller instead of filter node in scheduler.

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #42001

cc @gmarek

Release note: