
Allow admin user to explicitly unschedule the node #4585

Merged: 2 commits merged into kubernetes:master from the deactivate-node branch on Mar 12, 2015

Conversation

@pravisankar (Author)

Node deactivation will block scheduling of new pods on the node.

@googlebot

Thanks for your pull request.

It looks like this may be your first contribution to a Google open source project, in which case you'll need to sign a Contributor License Agreement (CLA) at https://cla.developers.google.com/.

If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check the information on your CLA or see this help article on setting the email on your git commits.

Once you've done that, please reply here to let us know. If you signed the CLA as a corporation, please let us know the company's name.

@pravisankar (Author)

Related discussion on #3885
@bgrant0607 @ddysher @smarterclayton

@pravisankar (Author)

Covered under Red Hat corporate CLA

@ddysher (Contributor) commented Feb 20, 2015

We shouldn't have this custom logic in the apiserver. Also, as @bgrant0607 mentioned, status should be reconstructable and observable by k8s components; I don't think we can have users update status.

An alternative proposal is to update the node spec and have the node controller check the spec and update the status accordingly.

@bgrant0607 (Member)

Discussed w/ @ddysher today.

There should be 2 fields in NodeSpec: Schedulable and Runnable. These reflect disabling of scheduling new pods and execution of any user pods, respectively, by administrative action, independent of unplanned failures.

There should also be 2 new NodeConditions, Schedulable and Runnable. The spec fields and other conditions (e.g., Ready) would be combined to provide the status values. The scheduler would then only need to look at the Schedulable condition, in the normal case.
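For concreteness, a rough sketch of what those proposed fields could look like; the names and JSON tags here are illustrative only, not the final API:

```
// Two administrative knobs on NodeSpec, as proposed above. The node controller
// would mirror them (combined with the Ready condition) into two new
// NodeConditions, "Schedulable" and "Runnable".
type NodeSpec struct {
	// Schedulable: when false, no new pods should be scheduled onto the node.
	Schedulable bool `json:"schedulable,omitempty"`
	// Runnable: when false, no user pods should execute on the node.
	Runnable bool `json:"runnable,omitempty"`
}
```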

@davidopp (Member)

> There should be 2 fields in NodeSpec: Schedulable and Runnable. These reflect disabling of scheduling new pods and execution of any user pods, respectively, by administrative action, independent of unplanned failures.

Should these be label selectors rather than binary values (sketched below)? Internally we have a number of use cases where the administrator wants to make some subset of pods un-{schedulable,runnable}, as opposed to it being a binary decision.

> There should also be 2 new NodeConditions, Schedulable and Runnable

I'm not completely opposed to this, but it seems a little weird to put the logic for determining whether a node is schedulable into a controller rather than into the scheduler. Runnable seems less weird, since the node controller is already making decisions based on health checks.
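A hypothetical illustration of the label-selector variant suggested above; nothing below is proposed API, and the field names are invented:

```
// Instead of booleans, selectors would let an administrator block only a
// subset of pods from being scheduled or run on the node.
type NodeAdmissionSpec struct {
	// UnschedulableSelector: pods whose labels match must not be scheduled here.
	UnschedulableSelector map[string]string `json:"unschedulableSelector,omitempty"`
	// UnrunnableSelector: pods whose labels match must not run here.
	UnrunnableSelector map[string]string `json:"unrunnableSelector,omitempty"`
}
```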

@bgrant0607 (Member)

Yes, this model was deliberately simpler than we know it will eventually need to be. We don't have any notion of priority yet, nor forgiveness, nor disruption SLOs, nor pre-stop hook timeouts (graceful termination deadlines).

I also thought about making nodes fit into the model for graceful termination of all objects: #1535. That's intended for preparation for object deletion, which wouldn't normally be the case for maintenance, but we could do something similar.

We could add a stop time goal, expected duration, and reason to the Spec, and no new conditions, and leave it entirely to the scheduler to assess schedulability and to the controller to determine runnability for any particular pod. That would provide enough information for forgiveness, disruption SLOs, and respecting graceful termination expectations. The scenarios where one wants to selectively disable certain classes of workloads in the cluster or on individual machines for other reasons are more rare.

However, I'm concerned that a more sophisticated model would be harder for users to understand, and we don't yet have any of the features that would make it necessary.

If we wanted the scheduler to only need to look at the node's status and not the spec, we could simply reflect the administrative bits into the conditions, without combining with other conditions, which is what I had previously proposed in the original NodeCondition PR.
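A hypothetical shape for that spec-only alternative; every field name below is invented for illustration and none of this is in the PR:

```
// Maintenance intent carried in the spec, with no new conditions; the scheduler
// and controllers would interpret it for schedulability, forgiveness and
// disruption SLOs.
type NodeMaintenanceSpec struct {
	// StopTimeGoal: when user pods should be off the node (RFC 3339 timestamp).
	StopTimeGoal string `json:"stopTimeGoal,omitempty"`
	// ExpectedDuration: how long the node is expected to be out of service.
	ExpectedDuration string `json:"expectedDuration,omitempty"`
	// Reason: human-readable reason for the maintenance, e.g. "kernel upgrade".
	Reason string `json:"reason,omitempty"`
}
```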

@ddysher (Contributor) commented Feb 21, 2015

@pravisankar Some summary and background: First, we need two new spec fields: Runnable and Schedulable. We can start with simple bools, but eventually we'll add more information to support pod forgiveness (#1574), etc. Second, we have three conditions: Runnable, Schedulable and Ready (we'll remove Reachable for now). The kubelet will periodically push Ready status; the node controller will periodically check the spec and the Ready status to populate the other two conditions. At the moment, the node controller will also do pod eviction if the Ready status is not updated for a long time. Third, the scheduler will only schedule pods to schedulable nodes. As a precaution, it's up to the scheduler to check the other conditions too.

As @bgrant0607 mentioned, there are a lot of policies omitted, but let's make it fancier later. The change involves the kubelet, scheduler and node controller. I'll start on having the kubelet POST status shortly; let me know how I can help on your side.

@davidopp Runnable and Schedulable are notions of a node; by "make a subset of pods un-schedulable/runnable", I guess you mean making a node unschedulable for certain types of pods? I don't think we have enough features to do this, and even if we did, I think we should still use the node spec. Labels are for identifying information, whereas schedulable/runnable is status (a reflection of the specification). Labels can be anything, but we need a clear spec for the model.
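A toy sketch of the flow just described; this is not the real controller or scheduler code, and the types and names are simplified for illustration:

```
// Condition is a simplified stand-in for a NodeCondition.
type Condition struct {
	Kind   string // "Ready", "Schedulable", "Runnable"
	Status bool
}

// deriveConditions shows how the node controller could combine the admin-set
// spec bits with the kubelet-reported Ready status to populate the other two
// conditions.
func deriveConditions(specSchedulable, specRunnable, kubeletReady bool) []Condition {
	return []Condition{
		{Kind: "Ready", Status: kubeletReady},
		{Kind: "Schedulable", Status: specSchedulable && kubeletReady},
		{Kind: "Runnable", Status: specRunnable && kubeletReady},
	}
}

// isSchedulable is the single check the scheduler would make in the normal case.
func isSchedulable(conditions []Condition) bool {
	for _, c := range conditions {
		if c.Kind == "Schedulable" {
			return c.Status
		}
	}
	return false
}
```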

Review context (from the docs change in this PR):

Deactivating a node will not affect any existing pods on the node but it will disable creation of
any new pods on the node. Node deactivate example:
```
kubectl update nodes 10.1.2.3 --patch='{"apiVersion": "v1beta1", "deactivate": true}'
```
Review comment (Member):
since it only blocks scheduling, maybe call it "stopScheduling" or something like that? deactivate seems ambiguous, and might be interpreted as also killing the running pods

ddysher added the cla: yes label and removed the cla: no label on Feb 23, 2015
pravisankar changed the title from "Support node deactivation" to "Allow admin user to explicitly unschedule the node" on Feb 23, 2015
@pravisankar (Author)

As per the feedback:
- Renamed the node spec field 'deactivate' to 'unschedule'.
- Moved the node unschedule logic from the apiserver to the node controller.
- The Runnable condition will be introduced in a separate pull request.

@ddysher please review the changes

```
Capacity ResourceList `json:"capacity,omitempty"`

// Unschedule controls node schedulability of new pods. By default node is schedulable.
Unschedule bool `json:"unschedule,omitempty"`
```
Review comment (Member):

Please call this MakeUnschedulable (and make a similar change in the versioned structs)

pravisankar force-pushed the deactivate-node branch 2 times, most recently from 500a50d to a33081b, on February 23, 2015 22:35
@pravisankar (Author)

@davidopp @ddysher renamed 'unschedule' to 'makeUnschedulable'.

```
@@ -704,6 +706,8 @@ type Minion struct {
NodeResources NodeResources `json:"resources,omitempty" description:"characterization of node resources"`
// Pod IP range assigned to the node
PodCIDR string `json:"cidr,omitempty" description:"IP range assigned to the node"`
// Unschedulable controls node schedulability of new pods. By default node is schedulable.
Unschedulable bool `json:"unschedulable,omitempty" description:"enable or disable pod scheduling on the node"`
```
Review comment (Member):

Actually, true => disable, so I'd omit "enable or".
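A minimal sketch, assuming the v1beta1 Minion struct from the hunk above, of how a component (the node controller or the scheduler) might honor this field; this is illustrative only, not the PR's actual implementation:

```
// filterUnschedulable drops nodes marked Unschedulable from the candidate list,
// so existing pods are untouched but no new pods are placed on them.
func filterUnschedulable(minions []Minion) []Minion {
	schedulable := make([]Minion, 0, len(minions))
	for _, m := range minions {
		if !m.Unschedulable {
			schedulable = append(schedulable, m)
		}
	}
	return schedulable
}
```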

@ddysher (Contributor) commented Mar 6, 2015

@pravisankar Thanks! A few minor comments.

@pravisankar (Author)

@ddysher @davidopp updated

@erictune (Member) commented Mar 9, 2015

Needs rebase.

@pravisankar (Author)

rebased PTAL

@ddysher (Contributor) commented Mar 10, 2015

Needs rebase. LGTM

@pravisankar (Author)

rebased and updated swagger spec using hack/update-swagger-spec.sh

pravisankar force-pushed the deactivate-node branch 2 times, most recently from 34bbf66 to c348f63, on March 11, 2015 23:04
@ddysher (Contributor) commented Mar 12, 2015

/cc @nikhiljindal Can you take a look at the swagger spec?

@bgrant0607 (Member)

@ddysher Unfortunately, the swagger contents are rearranged every time that script is run, so it's hard to inspect carefully. Hopefully we'll fix that soon. For now, we should just merge if we're happy with the changes.

@bgrant0607 (Member)

Unfortunately it needs to be rebased again, though, and probably needs to re-update the swagger.

Ravi Sankar Penta added 2 commits on March 12, 2015 14:27:
Setting Unschedulable on the node will not touch any existing pods on the node but will block scheduling of new pods on the node.
@pravisankar (Author)

rebased (updated swagger spec one more time)

@ddysher (Contributor) commented Mar 12, 2015

LGTM, thanks for your work!

ddysher added the lgtm ("Looks good to me", indicates that a PR is ready to be merged) and cla: yes labels and removed the cla: no label on Mar 12, 2015
ddysher added a commit that referenced this pull request on Mar 12, 2015: Allow admin user to explicitly unschedule the node
ddysher merged commit ae162fd into kubernetes:master on Mar 12, 2015