Mark node to be decommissioned and act accordingly #3885

Closed
pires opened this issue Jan 28, 2015 · 55 comments
Labels
area/api Indicates an issue on api area. area/node-lifecycle Issues or PRs related to Node lifecycle area/nodecontroller priority/backlog Higher priority than priority/awaiting-more-evidence. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.

Comments

@pires
Contributor

pires commented Jan 28, 2015

I haven't found a way of pausing/decommissioning a node, have all its containers stopped and recreated elsewhere in the cluster.

This would be great for node upgrades (hardware, OS, etc.).

Obviously, the node would have to be blacklisted so that no new containers are scheduled to it.

/cc @jmreicha

@erictune
Member

The pods would have to have replication controllers for that to work.
@ddysher can say if decommissioning is currently possible, or what is planned, and how that would interact with stopping of pods.

@pires
Contributor Author

pires commented Jan 28, 2015

Yes, having replication controllers is always the recommended way, but since Kubernetes allows for manual pod scheduling (with a pod descriptor instead of a replication controller descriptor), I think it would also be nice to support something like:

  • Node is marked as scheduled for decommission
  • Scheduler (?) is informed about this and reschedules pod(s) to other node(s)

@ddysher
Contributor

ddysher commented Jan 28, 2015

We are not there yet. The best way is to use replication controllers for now.

After PR #3733 lands, I'll work on node lifecycle management and decommission. Node has a Terminated phase, which is when k8s tries to re-schedule pods. This covers your first point.

For the second point, it'll be a little subtle to have all the other components understand node phase, especially the interactions between node controller, replication controller, and scheduler. I think the best way is to have the node controller mark all pods as unscheduled, which triggers the scheduler to re-schedule them. From the perspective of the replication controller, the pod never fails, so we don't have to distinguish manual pods from controller-managed pods. There is probably more involved here, like restart policy, volumes, etc.; I'll think about it more.

@erictune
Member

I don't think we want to do the second point. Pods should not come back to life on a different node after they have been stopped on one node. We want the phase state machine to be a DAG not a general graph.

@alex-mohr alex-mohr added area/api Indicates an issue on api area. priority/backlog Higher priority than priority/awaiting-more-evidence. labels Jan 28, 2015
@alex-mohr
Contributor

I personally think there's a reasonably-big usability gain from allowing pods to declare a restart policy that includes node failure and little downside. We already allow pods that are scheduled on a machine to restart as new containers if they crash, so the phase state machine already has cycles. And it seems strange to allow restarts only if containers on a node fail, but ban them when they'd start on a new node.

As a not-so-strawman example, setting "onNodeFailure: reschedule" seems like a reasonable user request.
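
For illustration only, such a request might look like the following pod snippet; onNodeFailure is purely hypothetical and not an actual API field:

    kind: Pod
    apiVersion: v1
    metadata:
      name: example
    spec:
      restartPolicy: Always
      onNodeFailure: reschedule   # hypothetical field, not part of the API
      containers:
      - name: app
        image: nginx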

@ddysher
Contributor

ddysher commented Jan 29, 2015

Do we have cycles in the current state machine? Restarting containers seems to be a cycle, but containers are not a primitive in k8s. We do create Pods on failure from the replication controller, but the newly created Pods are not the same entities as the previous ones; they have different UIDs at least.

What the second point (re-schedule) really sounds like is migrating the Pods from the failed node to a new node. This is tricky, not to mention volumes and sessions. I have no objections to the DAG approach, but I agree with @alex-mohr that we need to do something here for usability. To do this, we'll at least need to distinguish manually-created Pods from replication-created Pods, maybe via reverse label lookup? I don't know if that would ever work.

@bgrant0607
Member

We need to be able to drain machines. We should use an approach consistent with other resources. See #1535 re. graceful termination.

@bgrant0607
Member

@alex-mohr We've discussed this before. Pods are replaced by replication controllers. They are not rescheduled. In fact, I'd like the replication controller to be able to replace pods ahead of killing the ones they are replacing in the case of a planned move. Let's not derail this issue with the "forever pod" discussion.

At the moment, the system doesn't have reasonable behavior in the cases of either planned or unplanned node outages. Let's fix that -- #1366 covers the unplanned case. This issue can cover the planned case.

@bgrant0607
Member

@ddysher Why do you want reverse label lookup? The node controller needn't be aware of pod controllers -- replication controller, job controller, etc.

@jmreicha

It would be great to have something in kubectl that allowed you to mark a node to be pulled out of rotation (and have that show up in kubectl get minion) and then drain it after it has been marked for deletion or maintenance.

I ran into an issue like this the other day where I needed to rotate out some hosts for maintenance and had to manually remove pods after stopping the servers.

@ddysher
Contributor

ddysher commented Feb 2, 2015

@bgrant0607 The reason I'm trying to do so is the possible overlap in function between the node controller and the replication or job controller.

If all pods are started with a replication/job controller, then the node controller just needs to remove the pods. But in cases like this issue, where pods are started without any controller, the node controller would have to be responsible for removing them and recreating them elsewhere. The restart part seems to be a duplicated function, i.e. the node controller and the replication controller would both try to create a new pod.

If the node controller just removes the pod, then this seems to break our restart policy. A user would want a pod with RestartAlways to always restart, even in case of node failure.

Did I interpret it correctly? I can't recall of any component that would claim 'ownership' of those pods.

@bgrant0607
Member

@ddysher No, the node controller should never recreate pods elsewhere. That's not its job. Users that want that behavior need to run the pods under a pod controller, such as the replication controller. No, it doesn't invalidate restart policy -- separation of concerns. Individual pods are vulnerable to node failure -- that's reality and the model. See #3949 for more details.

@ddysher
Contributor

ddysher commented Feb 4, 2015

The model makes the node controller much simpler, and that's definitely a good thing. But from a user's perspective (not from how we design/simplify the system), node failure without Pod restart is really confusing. We haven't stressed pod controllers enough; even our classic example creates a naked pod, as you mentioned in #1603.

Here I'm not saying we should recreate pods, just bringing up a potential issue if we don't do so :)

@bgrant0607
Member

Yes, we should fix our broken examples.

@bgrant0607 bgrant0607 added priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. and removed area/usability priority/backlog Higher priority than priority/awaiting-more-evidence. labels Feb 5, 2015
@pires
Contributor Author

pires commented Feb 5, 2015

I think you should remove the notion of a pod without a replication controller. It would simplify the possible scenarios.

@bgrant0607
Member

@pires Been there, done that. Pod needs to be an available primitive. One reason is that we plan to support multiple types of controllers.

@pravisankar

I think the ability to mark a node as deactivated/decommissioned using kubectl gives the user/admin flexibility to do node upgrades (security patches, software upgrades) and node evacuation/custom pod migration.
To support this use case, maybe we can add a new condition on the node, say 'NodeDeactivate'. When the 'NodeDeactivate' status is set (Full), irrespective of the NodeReady/NodeReachable condition status, the scheduler can ignore this node for new pod creation.
CLI can be:
Node Deactivation:
kubectl update nodes --patch={'apiversion': , 'status': {'conditions': [{'kind': 'Deactivate', 'status': 'Full'}]}}

Node Activation:
kubectl update nodes --patch={'apiversion': , 'status': {'conditions': [{'kind': 'Deactivate', 'status': 'None'}]}}

I'm planning to implement this feature; let me know if you see any issues with this approach.
@bgrant0607 @smarterclayton @ddysher @alex-mohr

@bgrant0607
Member

@pravisankar Discussion on #1535, #2315, and #2726 is relevant.

Status must be completely reconstructable based on observations. In order to express that the desired state is "deactivated", there would need to be a field in NodeSpec that indicates this. There can additionally be a NodeCondition that reflects the status.

I imagine we'll eventually want several flavors of this:

  • stop: gracefully terminate and remove the node -- we need to define what that means, but I'd like some reasonably consistent, useful definition of this for every object; we'll at least want to wait for pod pre-stop hooks to complete
  • unschedulable: simply don't schedule new pods/volumes, but don't delete existing ones
  • uninhabitable: evict existing pods/volumes
  • shutdown: actually do something physical to the machine, like invoking shutdown

Note that there's some amount of nuance in some of the above. We will eventually want to differentiate between different types of workloads, such as daemons vs. user pods.

I eventually want a custom control subresource to set whatever such fields we add, as discussed in #2726, but patch would work initially.
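
As a rough sketch only: the NodeSpec field that eventually carried this desired state is spec.unschedulable, with patch as the initial mechanism. The Schedulable condition below is illustrative, not an actual NodeCondition type:

    kind: Node
    spec:
      unschedulable: true          # desired state: no new pods scheduled here
    status:
      conditions:
      - type: Schedulable          # illustrative condition reflecting the spec
        status: "False"

and, via patch:

    kubectl patch node my-node -p '{"spec": {"unschedulable": true}}'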

@bgrant0607 bgrant0607 added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Feb 28, 2015
@bgrant0607 bgrant0607 added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Jun 20, 2015
mml added a commit to mml/kubernetes that referenced this issue Jan 7, 2016
It cordons (marks unschedulable) the given node, and then deletes every
pod on it, optionally using a grace period. It will not delete pods that
are managed by neither a ReplicationController nor a DaemonSet unless
--force is used.

Also add cordon/uncordon, which just toggle node schedulability.
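
For reference, typical invocations of the commands described above (my-node is a placeholder):

    kubectl cordon my-node                    # mark the node unschedulable
    kubectl drain my-node --grace-period=60   # cordon, then evict/delete its pods
    kubectl drain my-node --force             # also delete unmanaged pods
    kubectl uncordon my-node                  # make the node schedulable again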
@therc
Member

therc commented Jan 10, 2016

For the next iteration, it would be nice to have a way to provide specific resources to drain. In the discussion about GPUs in #19049, I mentioned specific kernel driver ABIs. Draining a whole machine works, but if all we need is kicking out GPU users, maybe we could evict just the pods using the resource. Similar reasoning if you wanted to reformat attached SSDs, etc.

@mikedanese
Member

That could be achieved with taints and a rescheduler. At least taints will probably make it into 1.3. I think what we planned for 1.2 is complete.

#17190
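
A later-era sketch of the taint approach (the NoExecute effect, which also evicts running pods, shipped after this comment; names are placeholders):

    kubectl taint nodes my-node maintenance=true:NoSchedule   # repel new pods
    kubectl taint nodes my-node maintenance=true:NoExecute    # evict pods without a matching toleration
    kubectl taint nodes my-node maintenance:NoSchedule-       # remove the taints afterwards
    kubectl taint nodes my-node maintenance:NoExecute-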

@davidopp
Member

davidopp commented Mar 1, 2016

ref #22217

@leecalcote
Contributor

When considering the maintenance mode use case, it'd be good to account for the ability to schedule the node drain. In this way, administrators may set a predefined maintenance window for specific nodes.

Are Jobs a good candidate to orchestrate a maintenance window once node drain is implemented?
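
One build-it-on-top illustration of that idea is a CronJob that runs kubectl drain during the maintenance window; the image, node name, and service account below are all assumptions, and the service account would need RBAC permissions to evict pods and patch nodes:

    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: drain-node-a
    spec:
      schedule: "0 2 * * 6"                         # Saturdays at 02:00
      jobTemplate:
        spec:
          template:
            spec:
              serviceAccountName: node-drainer      # assumed to exist with drain permissions
              restartPolicy: Never
              containers:
              - name: drain
                image: bitnami/kubectl              # illustrative image providing kubectl
                args: ["drain", "node-a", "--ignore-daemonsets", "--force"]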

@davidopp
Member

davidopp commented Mar 9, 2016

Your analogy is reasonable: scheduling node drains is similar to scheduling a workflow of run-to-completion Jobs. But I don't think the Job abstraction can be directly used for the former. More generally, I'm not sure how much sophistication for scheduling maintenance workflows we want to build into core Kubernetes, vs. suggest people build it on top. My initial thought is that we want to support simple server-side drains (something that marches through the cell at a specified rate while respecting disruption budgets), but complicated maintenance workflow scheduling shouldn't be part of core Kubernetes.
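
For reference, the disruption-budget piece later took shape as the PodDisruptionBudget API; a minimal sketch (names and values are placeholders):

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: my-app-pdb
    spec:
      minAvailable: 2
      selector:
        matchLabels:
          app: my-app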

@0xmichalis
Contributor

Is this covered by kubectl drain?

@paralin
Contributor

paralin commented Mar 20, 2017

Where are we on Docker checkpoint and restore? It would be super ideal if we could combine the auto drain with a pod migration procedure.

@davidopp
Member

kubectl drain does a lot of this, but there are some ideas in this issue that were not really ever implemented and we might want to refer back to.

@bgrant0607 bgrant0607 added the area/node-lifecycle Issues or PRs related to Node lifecycle label Mar 23, 2017
@mml
Contributor

mml commented Jun 2, 2017

I am closing this as fixed. We can easily search for and refer to this issue for ideas later.

@mml mml closed this as completed Jun 2, 2017
wking added a commit to wking/kubernetes that referenced this issue Jul 17, 2018
The property was added in c6e9ad0 (Initial node drain implementation
for kubernetes#3885, 2015-10-30, kubernetes#16698), but beb5ea6 (remove mapper dependency
- PrintSuccess, 2018-02-01, kubernetes#59227) removed the only initializer.
vithati pushed a commit to vithati/kubernetes that referenced this issue Oct 25, 2018
The property was added in c6e9ad0 (Initial node drain implementation
for kubernetes#3885, 2015-10-30, kubernetes#16698), but beb5ea6 (remove mapper dependency
- PrintSuccess, 2018-02-01, kubernetes#59227) removed the only initializer.
wking added a commit to wking/kubernetes that referenced this issue Feb 24, 2022
When it initially landed in kubernetes/kubernetes@c6e9ad066e (Initial
node drain implementation for kubernetes#3885, 2015-08-30,
kubernetes#16698), the drain logic looked in a created-by
annotation for recognized kinds [1], so listing the set of recognized
kinds was a clear approach.

Sometime later, the source moved into ownerReferences, but the
hard-coded set of recognized controller kinds remained.

When kubernetes/kubernetes@2f1108451f (Remove hard-coded
pod-controller check, 2017-12-05, kubernetes#56864) removed
the hard-coded set of recognized controller kinds, it should have also
updated these messages to remove stale references to the previous
hard-coded values.  This commit catches the message strings up with
that commit.

[1]: kubernetes@c6e9ad0#diff-211259b8a8ec42f105264c10897dad48029badb538684e60e43eaead68c3d219R216