Mark node to be decommissioned and act accordingly #3885
Comments
The pods would have to have replication controllers for that to work.
Yes, having replication controllers is always the recommended way, but since Kubernetes allows for manual pod scheduling (with a pod descriptor instead of a replication controller descriptor) I think it would also be nice to support something like:
We are not there yet. The best way is to use replication controllers for now. After PR #3733 lands, I'll work on node lifecycle management and decommissioning. Node has a For the second point, it'll be a little subtle to have all the other components understand node phase, especially the interactions between the node controller, replication controller, and scheduler. I think the best way is to have the node controller mark all pods as unscheduled, which triggers the scheduler to re-schedule them. From the perspective of the replication controller, the pod never fails, so we don't have to distinguish manual pods vs. controller-managed pods. There is probably more involved here, like restart policy, volumes, etc.; I'll think about it more.
I don't think we want to do the second point. Pods should not come back to life on a different node after they have been stopped on one node. We want the phase state machine to be a DAG, not a general graph.
I personally think there's a reasonably big usability gain from allowing pods to declare a restart policy that includes node failure, and little downside. We already allow pods that are scheduled on a machine to restart as new containers if they crash, so the phase state machine already has cycles. And it seems strange to allow restarts only if containers on a node fail, but ban them when they'd start on a new node. As a not-so-strawman example, setting "onNodeFailure: reschedule" seems like a reasonable user request.
Do we have cycles in the current state machine? Restarting containers seems to be a cycle, but containers are not a primitive in k8s. We do create Pods on failure from the replication controller, but the newly created Pods are not the same entities as the previous ones; they have different UIDs at least. What the second point (re-schedule) really sounds like is migrating the Pods from the failed node to a new node. This is tricky, not to mention volumes and sessions. I have no objections to the DAG approach, but I agree with @alex-mohr that we need to do something here for usability. To do this, we'll at least need to distinguish manually-created Pods from replication-created Pods, maybe via reverse label lookup? I don't know if that would ever work.
We need to be able to drain machines. We should use an approach consistent with other resources. See #1535 re. graceful termination.
@alex-mohr We've discussed this before. Pods are replaced by replication controllers. They are not rescheduled. In fact, I'd like the replication controller to be able to replace pods ahead of killing the ones they are replacing in the case of a planned move. Let's not derail this issue with the "forever pod" discussion. At the moment, the system doesn't have reasonable behavior in the cases of either planned or unplanned node outages. Let's fix that -- #1366 covers the unplanned case. This issue can cover the planned case.
@ddysher Why do you want reverse label lookup? The node controller needn't be aware of pod controllers -- replication controller, job controller, etc.
It would be great to have something in kubectl that allowed you to mark a node to be pulled out of rotation (and then have it show up as such). I ran into an issue like this the other day where I needed to rotate out some hosts for maintenance and had to manually remove pods after stopping the servers.
@bgrant0607 The reason I'm trying to do so is the possible overlap in function between the node controller and the replication or job controller. If all pods are started with a replication/job controller, then the node controller just needs to remove the pods. But in cases like this issue, where pods are started without any controller, the node controller would have to be responsible for removing them and recreating them elsewhere. The restart part seems to be a duplicated function, i.e. both the node controller and the replication controller would try to create a new pod. If the node controller just removes the pod, then this seems to break our restart policy: a user would want a pod with RestartAlways to always restart, even in case of node failure. Did I interpret it correctly? I can't think of any component that would claim 'ownership' of those pods.
@ddysher No, the node controller should never recreate pods elsewhere. That's not its job. Users that want that behavior need to run the pods under a pod controller, such as the replication controller. No, it doesn't invalidate restart policy -- separation of concerns. Individual pods are vulnerable to node failure -- that's reality and the model. See #3949 for more details.
The model makes the node controller much simpler, and that's definitely a good thing. But from a user's perspective (not from how we design/simplify the system), node failure without Pod restart is really confusing. We haven't stressed pod controllers enough; even our classic example creates a naked pod, as you mentioned in #1603. I'm not saying we should recreate pods here, just bringing up a potential issue if we don't do so :)
Yes, we should fix our broken examples.
I think you should remove the notion of a pod without a replication controller.
@pires Been there, done that. Pod needs to be an available primitive. One reason is that we plan to support multiple types of controllers.
I think the ability to mark a node as deactivated/decommissioned using kubectl gives the user/admin the flexibility to do node upgrades (security patches, software upgrades) and node evacuation/custom pod migration. Node Activation: I'm planning to implement this feature; let me know if you see any issues with this approach.
@pravisankar Discussion on #1535, #2315, and #2726 is relevant. Status must be completely reconstructable based on observations. In order to express that the desired state is "deactivated", there would need to be a field in NodeSpec that indicates this. There can additionally be a NodeCondition that reflects the status. I imagine we'll eventually want several flavors of this:
Note that there's some amount of nuance in some of the above. We will eventually want to differentiate between different types of workloads, such as daemons vs. user pods. I eventually want a custom
It cordons (marks unschedulable) the given node, and then deletes every pod on it, optionally using a grace period. Pods that are managed by neither a ReplicationController nor a DaemonSet will not be deleted unless --force is used. It also adds cordon/uncordon, which just toggle node schedulability.
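For readers coming to this later, here is a rough sketch of that cordon-then-delete flow. It is not the actual kubectl implementation: it is written against a recent client-go (context-taking calls, ownerReferences) that postdates the 2015 code, and the function name drainNode is invented.

```go
package drain

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// drainNode sketches the flow described above: cordon the node, then delete
// every pod bound to it, refusing to touch unmanaged pods unless force is set.
func drainNode(ctx context.Context, cs kubernetes.Interface, nodeName string, gracePeriod int64, force bool) error {
	// Cordon: mark the node unschedulable so nothing new lands on it.
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Spec.Unschedulable = true
	if _, err := cs.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
		return err
	}

	// List every pod currently bound to this node.
	pods, err := cs.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return err
	}
	for i := range pods.Items {
		pod := &pods.Items[i]
		// A pod with no managing controller will not be recreated anywhere
		// else, so it is only deleted when the caller explicitly forces it.
		if metav1.GetControllerOf(pod) == nil && !force {
			fmt.Printf("skipping unmanaged pod %s/%s (force required)\n", pod.Namespace, pod.Name)
			continue
		}
		if err := cs.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{
			GracePeriodSeconds: &gracePeriod,
		}); err != nil {
			return err
		}
	}
	return nil
}
```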
For the next iteration, it would be nice to have a way to provide specific resources to drain. In the discussion about GPUs in #19049, I mentioned specific kernel driver ABIs. Draining a whole machine works, but if all we need is kicking out GPU users, maybe we could evict just the pods using the resource. Similar reasoning if you wanted to reformat attached SSDs, etc.
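Sketching what such a resource-scoped drain could look like, under the same client-go assumptions as above; podsUsingResource is a made-up helper, and nvidia.com/gpu is just one example of an extended-resource name.

```go
package drain

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// podsUsingResource lists the pods on a node whose containers request the
// given resource (for example "nvidia.com/gpu"), so a partial drain could
// evict only those pods instead of everything on the machine.
func podsUsingResource(ctx context.Context, cs kubernetes.Interface, nodeName string, resource corev1.ResourceName) ([]corev1.Pod, error) {
	pods, err := cs.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return nil, err
	}
	var matched []corev1.Pod
	for _, pod := range pods.Items {
		for _, c := range pod.Spec.Containers {
			// A pod counts as "using" the resource if any container requests it.
			if qty, ok := c.Resources.Requests[resource]; ok && !qty.IsZero() {
				matched = append(matched, pod)
				break
			}
		}
	}
	return matched, nil
}
```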
That could be achieved with taints and a rescheduler. At least taints will probably make it into 1.3. I think what we planned for 1.2 is complete.
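For context, a minimal sketch of tainting a node for maintenance so the scheduler repels new pods, using the NodeSpec.Taints field from current versions of the API; the taint key is made up and taintForMaintenance is an invented name.

```go
package drain

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// taintForMaintenance adds a NoSchedule taint so the scheduler stops placing
// new pods on the node; existing pods are left alone until something (for
// example a rescheduler or a drain) evicts them.
func taintForMaintenance(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    "example.com/maintenance", // illustrative key, not a real convention
		Value:  "true",
		Effect: corev1.TaintEffectNoSchedule,
	})
	_, err = cs.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```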
ref #22217
When considering the maintenance mode use case, it'd be good to account for the ability to schedule the node drain. In this way, administrators may set a predefined maintenance window for specific nodes. Are Jobs a good candidate to orchestrate a maintenance window once node drain is implemented?
Your analogy is reasonable, in that scheduling node drains is similar to scheduling a workflow of run-to-completion Jobs. But I don't think the Job abstraction can be directly used for the former. More generally, I'm not sure how much sophistication for scheduling maintenance workflows we want to build into core Kubernetes vs. suggest people build on top. My initial thought is that we want to support simple server-side drains (something that marches through the cell at a specified rate while respecting disruption budgets), but complicated maintenance workflow scheduling shouldn't be part of core Kubernetes.
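As a pointer for anyone following along, disruption budgets are expressed as PodDisruptionBudget objects that eviction-based drains respect. A minimal sketch against the policy/v1 API in a recent client-go; the label selector, budget name, and minAvailable value are arbitrary.

```go
package drain

import (
	"context"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
)

// createBudget declares that at least two pods matching app=frontend must
// stay up; eviction-based drains refuse to go below this floor.
func createBudget(ctx context.Context, cs kubernetes.Interface, namespace string) error {
	minAvailable := intstr.FromInt(2)
	pdb := &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "frontend-pdb"},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MinAvailable: &minAvailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "frontend"},
			},
		},
	}
	_, err := cs.PolicyV1().PodDisruptionBudgets(namespace).Create(ctx, pdb, metav1.CreateOptions{})
	return err
}
```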
Is this covered by
Where are we on Docker checkpoint and restore? It would be super ideal if we could combine the auto drain with a pod migration procedure.
I am closing this as fixed. We can easily search for and refer to this issue for ideas later. |
The property was added in c6e9ad0 (Initial node drain implementation for kubernetes#3885, 2015-10-30, kubernetes#16698), but beb5ea6 (remove mapper dependency - PrintSuccess, 2018-02-01, kubernetes#59227) removed the only initializer.
When it initially landed in kubernetes/kubernetes@c6e9ad066e (Initial node drain implementation for kubernetes#3885, 2015-10-30, kubernetes#16698), the drain logic looked in a created-by annotation for recognized kinds [1], so listing the set of recognized kinds was a clear approach. Sometime later, the source moved into ownerReferences, but the hard-coded set of recognized controller kinds remained. When kubernetes/kubernetes@2f1108451f (Remove hard-coded pod-controller check, 2017-12-05, kubernetes#56864) removed the hard-coded set of recognized controller kinds, it should have also updated these messages to remove stale references to the previous hard-coded values. This commit catches the message strings up with that commit. [1]: kubernetes@c6e9ad0#diff-211259b8a8ec42f105264c10897dad48029badb538684e60e43eaead68c3d219R216
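For anyone reading this without the diffs handy, the post-kubernetes#56864 style of check boils down to asking whether any controller owns the pod, rather than matching a hard-coded list of kinds. A tiny sketch with apimachinery's helper; the function name isManaged is invented.

```go
package drain

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// isManaged reports whether any controller (ReplicationController, DaemonSet,
// ReplicaSet, StatefulSet, Job, something custom, ...) owns the pod, without
// enumerating recognized kinds.
func isManaged(pod *corev1.Pod) bool {
	return metav1.GetControllerOf(pod) != nil
}
```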
I haven't found a way of pausing/decommissioning a node and having all its containers stopped and recreated elsewhere in the cluster.
This would be great for node upgrades (hardware, OS, etc.).
Obviously, the node would have to be blacklisted so that no new containers are scheduled to it.
/cc @jmreicha