Mark node to be decommissioned and act accordingly #3885
Comments
The pods would have to have replication controllers for that to work.
Yes, having replication controllers is always the recommended way, but since Kubernetes allows for manual pod scheduling (with a pod descriptor instead of a replication controller descriptor) I think it would also be nice to support something like:
We are not there yet. The best way is to use replication controllers for now. After PR #3733 lands, I'll work on node lifecycle management and decommissioning. Node has a For the second point, it'll be a little subtle to have all the other components understand node phase, especially the interactions between the node controller, replication controller, and scheduler. I think the best way is to have the node controller mark all pods as unscheduled, which triggers the scheduler to re-schedule them. From the perspective of the replication controller, the pod never fails, so we don't have to distinguish manual pods vs. controller-managed pods. There is probably more involved here, like restart policy, volumes, etc.; I'll think about it more.
I don't think we want to do the second point. Pods should not come back to life on a different node after they have been stopped on one node. We want the phase state machine to be a DAG, not a general graph.
I personally think there's a reasonably big usability gain from allowing pods to declare a restart policy that includes node failure, and little downside. We already allow pods that are scheduled on a machine to restart as new containers if they crash, so the phase state machine already has cycles. And it seems strange to allow restarts only if containers on a node fail, but ban them when they'd start on a new node. As a not-so-strawman example, setting "onNodeFailure: reschedule" seems like a reasonable user request.
Do we have cycles in the current state machine? Restarting containers seems to be a cycle, but containers are not a primitive in k8s. We do create Pods on failure from the replication controller, but the newly created Pods are not the same entities as the previous ones; they have different UIDs at least. What the second point (re-schedule) really sounds like is migrating the Pods from the failed node to a new node. This is tricky, not to mention volumes and sessions. I have no objections to the DAG approach, but I agree with @alex-mohr that we need to do something here for usability. To do this, we'll at least need to distinguish manually-created Pods from replication-created Pods, maybe via reverse label lookup? I don't know if that would ever work.
We need to be able to drain machines. We should use an approach consistent with other resources. See #1535 re. graceful termination.
@alex-mohr We've discussed this before. Pods are replaced by replication controllers. They are not rescheduled. In fact, I'd like the replication controller to be able to replace pods ahead of killing the ones they are replacing in the case of a planned move. Let's not derail this issue with the "forever pod" discussion. At the moment, the system doesn't have reasonable behavior in the cases of either planned or unplanned node outages. Let's fix that -- #1366 covers the unplanned case. This issue can cover the planned case.
@ddysher Why do you want reverse label lookup? The node controller needn't be aware of pod controllers -- replication controller, job controller, etc.
It would be great to have something in kubectl that allowed you to mark a node to be pulled out of rotation (and then have it show up as such). I ran into an issue like this the other day where I needed to rotate out some hosts for maintenance and had to manually remove pods after stopping the servers.
@bgrant0607 The reason I'm trying to do so is the possible overlap in function between the node controller and the replication or job controller. If all pods are started with a replication/job controller, then the node controller just needs to remove the pods. But in cases like this issue, where pods are started without any controller, the node controller would have to be responsible for removing them and recreating them elsewhere. The restart part seems to be a duplicated function, i.e. both the node controller and the replication controller would try to create a new pod. If the node controller just removes the pod, then this seems to break our restart policy: a user would want a pod with RestartAlways to always restart, even in case of node failure. Did I interpret it correctly? I can't think of any component that would claim 'ownership' of those pods.
@ddysher No, the node controller should never recreate pods elsewhere. That's not its job. Users that want that behavior need to run the pods under a pod controller, such as the replication controller. No, it doesn't invalidate restart policy -- separation of concerns. Individual pods are vulnerable to node failure -- that's reality and the model. See #3949 for more details.
The model makes the node controller much simpler, and that's definitely a good thing. But from a user's perspective (not from how we design/simplify the system), node failure without Pod restart is really confusing. We haven't stressed pod controllers enough; even our classic example creates a naked pod, as you mentioned in #1603. I'm not saying we should recreate pods here, just bringing up a potential issue if we don't do so :)
Yes, we should fix our broken examples.
I think you should remove the notion of a pod without a replication controller.
@pires Been there, done that. Pod needs to be an available primitive. One reason is that we plan to support multiple types of controllers.
I think the ability to mark a node as deactivated/decommissioned using kubectl gives the user/admin the flexibility to do node upgrades (security patches, software upgrades) and node evacuation/custom pod migration. Node Activation: I'm planning to implement this feature; let me know if you see any issues with this approach.
@pravisankar Discussion on #1535, #2315, and #2726 is relevant. Status must be completely reconstructable based on observations. In order to express that the desired state is "deactivated", there would need to be a field in NodeSpec that indicates this. There can additionally be a NodeCondition that reflects the status. I imagine we'll eventually want several flavors of this:
Note that there's some amount of nuance in some of the above. We will eventually want to differentiate between different types of workloads, such as daemons vs. user pods. I eventually want a custom
It cordons (marks unschedulable) the given node, and then deletes every pod on it, optionally using a grace period. Pods that are managed by neither a ReplicationController nor a DaemonSet will not be deleted unless --force is used. It also adds cordon/uncordon, which just toggle node schedulability.
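For readers coming to this later, here is a rough sketch of that cordon-then-delete flow. It is not the actual kubectl implementation: it is written against a recent client-go (context-taking calls, ownerReferences) that postdates the 2015 code, and the function name drainNode is invented.

```go
package drain

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// drainNode sketches the flow described above: cordon the node, then delete
// every pod bound to it, refusing to touch unmanaged pods unless force is set.
func drainNode(ctx context.Context, cs kubernetes.Interface, nodeName string, gracePeriod int64, force bool) error {
	// Cordon: mark the node unschedulable so nothing new lands on it.
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Spec.Unschedulable = true
	if _, err := cs.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
		return err
	}

	// List every pod currently bound to this node.
	pods, err := cs.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return err
	}
	for i := range pods.Items {
		pod := &pods.Items[i]
		// A pod with no managing controller will not be recreated anywhere
		// else, so it is only deleted when the caller explicitly forces it.
		if metav1.GetControllerOf(pod) == nil && !force {
			fmt.Printf("skipping unmanaged pod %s/%s (force required)\n", pod.Namespace, pod.Name)
			continue
		}
		if err := cs.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{
			GracePeriodSeconds: &gracePeriod,
		}); err != nil {
			return err
		}
	}
	return nil
}
```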
For the next iteration, it would be nice to have a way to provide specific resources to drain. In the discussion about GPUs in #19049, I mentioned specific kernel driver ABIs. Draining a whole machine works, but if all we need is kicking out GPU users, maybe we could evict just the pods using the resource. Similar reasoning if you wanted to reformat attached SSDs, etc.
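Sketching what such a resource-scoped drain could look like, under the same client-go assumptions as above; podsUsingResource is a made-up helper, and nvidia.com/gpu is just one example of an extended-resource name.

```go
package drain

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// podsUsingResource lists the pods on a node whose containers request the
// given resource (for example "nvidia.com/gpu"), so a partial drain could
// evict only those pods instead of everything on the machine.
func podsUsingResource(ctx context.Context, cs kubernetes.Interface, nodeName string, resource corev1.ResourceName) ([]corev1.Pod, error) {
	pods, err := cs.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return nil, err
	}
	var matched []corev1.Pod
	for _, pod := range pods.Items {
		for _, c := range pod.Spec.Containers {
			// A pod counts as "using" the resource if any container requests it.
			if qty, ok := c.Resources.Requests[resource]; ok && !qty.IsZero() {
				matched = append(matched, pod)
				break
			}
		}
	}
	return matched, nil
}
```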
That could be achieved with taints and a rescheduler. At least taints will probably make it into 1.3. I think what we planned for 1.2 is complete.
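For context, a minimal sketch of tainting a node for maintenance so the scheduler repels new pods, using the NodeSpec.Taints field from current versions of the API; the taint key is made up and taintForMaintenance is an invented name.

```go
package drain

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// taintForMaintenance adds a NoSchedule taint so the scheduler stops placing
// new pods on the node; existing pods are left alone until something (for
// example a rescheduler or a drain) evicts them.
func taintForMaintenance(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    "example.com/maintenance", // illustrative key, not a real convention
		Value:  "true",
		Effect: corev1.TaintEffectNoSchedule,
	})
	_, err = cs.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```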
ref #22217
When considering the maintenance mode use case, it'd be good to account for the ability to schedule the node drain. In this way, administrators may set a predefined maintenance window for specific nodes. Are Jobs a good candidate to orchestrate a maintenance window once node drain is implemented?
Your analogy is reasonable, in that scheduling node drains is similar to scheduling a workflow of run-to-completion Jobs. But I don't think the Job abstraction can be directly used for the former. More generally, I'm not sure how much sophistication for scheduling maintenance workflows we want to build into core Kubernetes vs. suggest people build on top. My initial thought is that we want to support simple server-side drains (something that marches through the cell at a specified rate while respecting disruption budgets), but complicated maintenance workflow scheduling shouldn't be part of core Kubernetes.
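As a pointer for anyone following along, disruption budgets are expressed as PodDisruptionBudget objects that eviction-based drains respect. A minimal sketch against the policy/v1 API in a recent client-go; the label selector, budget name, and minAvailable value are arbitrary.

```go
package drain

import (
	"context"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
)

// createBudget declares that at least two pods matching app=frontend must
// stay up; eviction-based drains refuse to go below this floor.
func createBudget(ctx context.Context, cs kubernetes.Interface, namespace string) error {
	minAvailable := intstr.FromInt(2)
	pdb := &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "frontend-pdb"},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MinAvailable: &minAvailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "frontend"},
			},
		},
	}
	_, err := cs.PolicyV1().PodDisruptionBudgets(namespace).Create(ctx, pdb, metav1.CreateOptions{})
	return err
}
```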
Is this covered by
Where are we on Docker checkpoint and restore? It would be super ideal if we could combine the auto drain with a pod migration procedure.
I am closing this as fixed. We can easily search for and refer to this issue for ideas later. |
The property was added in c6e9ad0 (Initial node drain implementation for kubernetes#3885, 2015-10-30, kubernetes#16698), but beb5ea6 (remove mapper dependency - PrintSuccess, 2018-02-01, kubernetes#59227) removed the only initializer.
When it initially landed in kubernetes/kubernetes@c6e9ad066e (Initial node drain implementation for kubernetes#3885, 2015-10-30, kubernetes#16698), the drain logic looked in a created-by annotation for recognized kinds [1], so listing the set of recognized kinds was a clear approach. Sometime later, the source moved into ownerReferences, but the hard-coded set of recognized controller kinds remained. When kubernetes/kubernetes@2f1108451f (Remove hard-coded pod-controller check, 2017-12-05, kubernetes#56864) removed the hard-coded set of recognized controller kinds, it should have also updated these messages to remove stale references to the previous hard-coded values. This commit catches the message strings up with that commit. [1]: kubernetes@c6e9ad0#diff-211259b8a8ec42f105264c10897dad48029badb538684e60e43eaead68c3d219R216
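For anyone reading this without the diffs handy, the post-kubernetes#56864 style of check boils down to asking whether any controller owns the pod, rather than matching a hard-coded list of kinds. A tiny sketch with apimachinery's helper; the function name isManaged is invented.

```go
package drain

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// isManaged reports whether any controller (ReplicationController, DaemonSet,
// ReplicaSet, StatefulSet, Job, something custom, ...) owns the pod, without
// enumerating recognized kinds.
func isManaged(pod *corev1.Pod) bool {
	return metav1.GetControllerOf(pod) != nil
}
```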
I haven't found a way of pausing/decommissioning a node and having all its containers stopped and recreated elsewhere in the cluster.
This would be great for node upgrades (hardware, OS, etc.).
Obviously, the node would have to be blacklisted so that no new containers are scheduled to it.
/cc @jmreicha