Allow admin user to explicitly unschedule the node #4585
Conversation
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project, in which case you'll need to sign a Contributor License Agreement (CLA) at https://cla.developers.google.com/. If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check the information on your CLA or see this help article on setting the email on your git commits. Once you've done that, please reply here to let us know. If you signed the CLA as a corporation, please let us know the company's name.
Related discussion in #3885.
Covered under the Red Hat corporate CLA.
We shouldn't have this custom logic in the apiserver. Also, as @bgrant0607 mentioned, status should be reconstructable and observable by k8s components; I don't think we can have users update status. An alternative proposal is to update the node spec, have the node controller check the spec, and update status accordingly.
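A minimal sketch of that alternative, with invented names (`Unschedulable` in the spec, `Schedulable` in the status) rather than the actual API types: the user writes only the spec, and the node controller reconstructs status from it.

```
package main

import "fmt"

// Hypothetical shapes for illustration only; not the real API types.
type NodeSpec struct {
	Unschedulable bool // written by the admin via the apiserver
}

type NodeStatus struct {
	Schedulable bool // derived by the node controller, never set by users
}

type Node struct {
	Spec   NodeSpec
	Status NodeStatus
}

// reconcile is what the node controller would run on each sync: observe
// the spec and reflect it into status, keeping status reconstructable.
func reconcile(n *Node) {
	n.Status.Schedulable = !n.Spec.Unschedulable
}

func main() {
	n := &Node{Spec: NodeSpec{Unschedulable: true}}
	reconcile(n)
	fmt.Println("schedulable:", n.Status.Schedulable) // prints: schedulable: false
}
```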
Discussed w/ @ddysher today. There should be 2 fields in NodeSpec: Schedulable and Runnable. These reflect disabling of scheduling new pods and execution of any user pods, respectively, by administrative action, independent of unplanned failures. There should also be 2 new NodeConditions, Schedulable and Runnable. The spec fields and other conditions (e.g., Ready) would be combined to provide the status values. The scheduler would then only need to look at the Schedulable condition, in the normal case.
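A rough sketch of that combination, under my simplifying assumption that conditions are plain booleans (the real API would use richer condition structs); the point is that the scheduler consults only the derived Schedulable condition:

```
package node

// Illustrative types only.
type NodeSpec struct {
	Schedulable bool // admin intent: allow scheduling of new pods
	Runnable    bool // admin intent: allow execution of user pods
}

type ConditionKind string

const (
	Ready       ConditionKind = "Ready"
	Schedulable ConditionKind = "Schedulable"
	Runnable    ConditionKind = "Runnable"
)

// deriveConditions combines administrative intent from the spec with the
// observed Ready condition, so consumers like the scheduler only need to
// look at a single condition.
func deriveConditions(spec NodeSpec, ready bool) map[ConditionKind]bool {
	return map[ConditionKind]bool{
		Ready:       ready,
		Schedulable: ready && spec.Schedulable,
		Runnable:    ready && spec.Runnable,
	}
}
```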
Should these be label selectors rather than binary values? Internally we have a number of use cases where the administrator wants to make some subset of pods un-{schedulable,runnable}, as opposed to it being a binary decision.
I'm not completely opposed to this, but it seems a little weird to put the logic for determining whether a node is schedulable into a controller rather than in the scheduler. Runnable seems less weird, since the node controller is already making decisions based on health checks.
Yes, this model was deliberately simpler than we know it will eventually need to be. We don't have any notion of priority yet, nor forgiveness, nor disruption SLOs, nor pre-stop hook timeouts (graceful termination deadlines). I also thought about making nodes fit into the model for graceful termination of all objects: #1535. That's intended as preparation for object deletion, which wouldn't normally be the case for maintenance, but we could do something similar. We could add a stop time goal, expected duration, and reason to the Spec, with no new conditions, and leave it entirely to the scheduler to assess schedulability and to the controller to determine runnability for any particular pod. That would provide enough information for forgiveness, disruption SLOs, and respecting graceful termination expectations. The scenarios where one wants to selectively disable certain classes of workloads in the cluster or on individual machines for other reasons are rarer. However, I'm concerned that a more sophisticated model would be harder for users to understand, and we don't yet have any of the features that would make it necessary. If we wanted the scheduler to only need to look at the node's status and not the spec, we could simply reflect the administrative bits into the conditions, without combining with other conditions, which is what I had previously proposed in the original NodeCondition PR.
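To make that richer alternative concrete, here is a purely speculative sketch of what such spec fields might look like; none of these fields exist, and every name below is invented for illustration:

```
package node

import "time"

// Speculative maintenance fields for NodeSpec, per the idea above. The
// scheduler would assess schedulability, and the controller runnability,
// per pod from these fields, with no new conditions added.
type MaintenanceSpec struct {
	StopTimeGoal     time.Time     // when the node should be quiesced by
	ExpectedDuration time.Duration // how long the maintenance is expected to last
	Reason           string        // input for forgiveness and disruption-SLO decisions
}
```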
@pravisankar Some summary and background: First, we need two new spec fields: Runnable and Schedulable. We can start with simple bools, but eventually we'll add more information to support pod forgiveness (#1574), etc. Second, we have three conditions: Runnable, Schedulable, and Ready (we'll remove Reachable for now). Kubelet will periodically push Ready status; the node controller will periodically check the spec and Ready status to populate the other two conditions. For the moment, the node controller will also do pod eviction if Ready status is not updated for a long time. Third, the scheduler will only schedule pods to schedulable nodes. As a precaution, it's up to the scheduler to check the other conditions too. As @bgrant0607 mentioned, there are a lot of policies omitted, but let's make it fancier later. The change involves the kubelet, scheduler, and node controller. I'll start on the kubelet POSTing status shortly; let me know how I can help on your side.

@davidopp Runnable and schedulable are notions of a node; by "make subset of pods un-schedulable/runnable", I guess you mean making a node unschedulable for certain types of pods? I don't think we have enough features to do this, and even if we did, I think we should still use NodeSpec. Labels are for identifying information, whereas schedulable/runnable is status (a reflection of the specification). A label can be anything, but we need a clear spec for the model.
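A hedged sketch of that controller loop, with made-up names and an assumed grace period (the real values and types would differ): the kubelet POSTs Ready heartbeats, the controller derives the other two conditions from the spec, and eviction fires when heartbeats go stale.

```
package controller

import "time"

// Illustrative types; the grace period and all names are assumptions.
type NodeSpec struct {
	Schedulable bool
	Runnable    bool
}

type Node struct {
	Spec          NodeSpec
	LastReadyPost time.Time       // updated by kubelet's periodic status POST
	Conditions    map[string]bool // populated by the node controller
}

const readyGracePeriod = 40 * time.Second // assumed value

// syncNode is one pass of the node controller: derive conditions from
// the spec and heartbeat freshness, and trigger pod eviction if the node
// has not reported Ready for too long.
func syncNode(n *Node, now time.Time, evictPods func(node *Node)) {
	ready := now.Sub(n.LastReadyPost) < readyGracePeriod
	n.Conditions = map[string]bool{
		"Ready":       ready,
		"Schedulable": ready && n.Spec.Schedulable,
		"Runnable":    ready && n.Spec.Runnable,
	}
	if !ready {
		evictPods(n) // eviction when Ready is stale
	}
}
```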
will not affect any existing pods on the node but it will disable creation of
any new pods on the node. Node deactivate example:
```
kubectl update nodes 10.1.2.3 --patch='{"apiVersion": "v1beta1", "deactivate": true}'
```
Since it only blocks scheduling, maybe call it "stopScheduling" or something like that? "deactivate" seems ambiguous, and might be interpreted as also killing the running pods.
As per the feedback,
Capacity ResourceList `json:"capacity,omitempty"`

// Unschedule controls node schedulability of new pods. By default node is schedulable.
Unschedule bool `json:"unschedule,omitempty"`
Please call this MakeUnschedulable (and make a similar change in the versioned structs)
@@ -704,6 +706,8 @@ type Minion struct {
NodeResources NodeResources `json:"resources,omitempty" description:"characterization of node resources"`
// Pod IP range assigned to the node
PodCIDR string `json:"cidr,omitempty" description:"IP range assigned to the node"`
// Unschedulable controls node schedulability of new pods. By default node is schedulable.
Unschedulable bool `json:"unschedulable,omitempty" description:"enable or disable pod scheduling on the node"`
Actually, true => disable, so I'd omit "enable or".
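For illustration, a minimal sketch (not the actual scheduler code) of how a scheduler could honor this field, with true meaning scheduling is disabled; the node shape is stripped down to just the field under discussion:

```
package scheduler

// Illustrative node shape; only the relevant field is shown.
type Node struct {
	Name          string
	Unschedulable bool
}

// schedulableNodes filters out nodes an admin has marked unschedulable.
// Pods already running on filtered nodes are left untouched.
func schedulableNodes(nodes []Node) []Node {
	var out []Node
	for _, n := range nodes {
		if !n.Unschedulable {
			out = append(out, n)
		}
	}
	return out
}
```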
@pravisankar Thanks! A few minor comments.
Needs rebase.
Rebased, PTAL.
Needs rebase. LGTM.
Rebased and updated the swagger spec using hack/update-swagger-spec.sh.
/cc @nikhiljindal Can you take a look at the swagger spec?
@ddysher Unfortunately, the swagger contents are rearranged every time that script is run, so it's hard to inspect them carefully. Hopefully we'll fix that soon. For now, we should just merge if we're happy with the changes.
Unfortunately, it needs to be rebased again, though, and the swagger probably needs to be re-updated.
Setting Unschedulable on the node will not touch any existing pods on the node but will block scheduling of new pods on the node.
Rebased (updated the swagger spec one more time).
LGTM, thanks for your work!
Allow admin user to explicitly unschedule the node
Node deactivation will block scheduling of new pods on the node.