
KEP-4212: Declarative Node Maintenance #4213

Open: wants to merge 3 commits into master from improve-node-maintenance
Conversation

@atiratree (Member) commented Sep 15, 2023:

  • One-line PR description: introduce a NodeMaintenance API to solve various node drain issues
  • Other comments:

TODO:

  • explore permission model for NodeMaintenance, Node, Pods (RBAC, ValidatingAdmissionPolicy) and all of the actors
  • explore whether NodeMaintenance can be used as a replacement or superset of the Graceful Node Shutdown feature
  • explore possible interactions with Kubelet
  • explore the descheduling aspects of this feature
  • explore the DaemonSets and StatefulSets use case
  • figure out the best format/implementation for the EvacuationRequest

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 15, 2023
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: atiratree
Once this PR has been reviewed and has the lgtm label, please assign janetkuo for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/apps Categorizes an issue or PR as relevant to SIG Apps. labels Sep 15, 2023
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Sep 15, 2023
@atiratree atiratree marked this pull request as draft September 15, 2023 10:46
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 15, 2023
@atiratree atiratree mentioned this pull request Sep 15, 2023
@sftim (Contributor) left a comment:

Thank you very much for opening the issue and PR!

I know this is draft; I've got some feedback already, and I hope it's helpful.

19 resolved review comment threads on keps/sig-apps/4212-improve-node-maintenance/README.md (outdated)
@atiratree atiratree force-pushed the improve-node-maintenance branch 7 times, most recently from 0ed835a to 81b4589 Compare September 22, 2023 18:10
@atiratree atiratree force-pushed the improve-node-maintenance branch 2 times, most recently from 6194058 to 00dc7c5 Compare September 25, 2023 14:09
@atiratree atiratree changed the title KEP-4212: Improve Node Maintenance KEP-4212: Declarative Node Maintenance Sep 25, 2023
@atiratree atiratree force-pushed the improve-node-maintenance branch from 00dc7c5 to 686227c Compare September 26, 2023 15:59
@sftim (Contributor) left a comment:

🤔 should this be SIG Node rather than SIG Apps?

@atiratree (Member, Author) commented Oct 13, 2023:

🤔 should this be SIG Node rather than SIG Apps?

Not sure yet; the API itself is SIG Node, I guess, but it has a lot of implications for SIG Apps. Let's delay this decision until after I redo the KEP and we have additional rounds of discussions.

@tuibeovince (Contributor) commented:

@atiratree
I'm not sure if it was indicated anywhere, but I'll ask anyway. With NodeMaintenance, what will the draining behavior be for a single-node cluster? Thanks.

@atiratree (Member, Author) replied:

@tuibeovince it depends on how your control plane is deployed. Ideally it should have the highest priority and pod type, so that all your workloads drain first and the control plane last. You can also pull the plug earlier if you do not need the control plane to shut down gracefully.

Do you have a special case in mind that you would like to see solved for a single node cluster?

@atiratree atiratree force-pushed the improve-node-maintenance branch from d537832 to 4db6894 Compare June 14, 2024 11:47
@atiratree (Member, Author) commented:

  • Updated goals.
  • Moved partial node maintenance statuses to each node to make the NodeMaintenance status more lightweight.
  • Introduced a common .status.drainStatus to the NodeMaintenance for all nodes.

@tuibeovince (Contributor) replied:

@tuibeovince it depends on how your control plane is deployed. Ideally it should have the highest priority and pod type, so that all your workloads drain first and the control plane last. You can also pull the plug earlier if you do not need the control plane to shut down gracefully.

Do you have a special case in mind that you would like to see solved for a single node cluster?

@atiratree Thank you! And sorry if I might still be getting lost in the concepts (such as evacuation) being introduced within the scope of DNM and the Evacuation API in general. In my understanding, given a cluster with multiple worker nodes, much like drain, DNM will instigate an evacuation of the pods on a node to be maintained and have them evacuated to an available node.

I guess for now what I wish to know is the intended behavior when the cluster has a single worker node and that node gets maintained. How are the pods handled before, during, and after the maintenance is completed in this case? Are they simply shut down, are they killed gracefully, and so on?

Thank you in advance.

@k8s-triage-robot commented:

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 15, 2024
@thockin (Member) commented Sep 16, 2024:

What's the status on this KEP? Is it coming back for 1.32?

@SergeyKanzhelev (Member) replied:

What's the status on this KEP? Is it coming back for 1.32?

It is in the scope-definition stage at the moment. There are many open questions that need to be answered before moving forward on this.

// PodSelector selects pods according to their labels.
// This can help to select which pods of the same priority should be evacuated first.
// +optional
PodSelector *metav1.LabelSelector
A Member commented:

Should it also be possible to select by namespace?

A Contributor replied:

IMO: not for alpha!

When you maintain a node, you expect to take all the Pods offline or put them at risk. Let's ship something useful and then see if we want to iterate.

A Member commented:

This being a global CRD, the lack of a namespace selector seems like a relevant gap for adoption.

@atiratree (Member, Author) commented Sep 30, 2024:

What is the intended use case and benefit for the namespace selector?

The main differentiator should be the pod priority. I have added the label selector mainly to differentiate between components running at, for example, the system-cluster-critical or the system-node-critical priority. For example, if you are running some special etcd/apiserver configuration, you could control how the instances downscale during an upgrade. So, basically, to have an additional ordering on top of just the priority. But in most cases the priority should be enough, and downscaling by namespaces could slow down the process considerably.
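
For illustration, a hedged sketch of how such same-priority ordering could be expressed with a podSelector (field names follow the fragments quoted in this review; the schema, label, and priority values are assumptions and may differ from the final API):

```yaml
# Sketch only: at the same priority and podType, the entry with a concrete
# selector is executed first, giving an extra ordering on top of priority.
drainPlan:
- podPriority: 2000000000              # system-cluster-critical
  podType: Default
  podSelector:
    matchLabels:
      component: apiserver             # assumed label; these pods drain first
- podPriority: 2000000000              # system-cluster-critical
  podType: Default                     # remaining pods at this priority
```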

A Member replied:

To me it seems unrealistic that ordering for Pod drain will be configured by setting Pod priorities on all Pods in a cluster, especially if no single team has control over the entire cluster and can set Pod priorities on all relevant Pods.

We had this case here in Cluster API, where folks were using Portworx, which uses an operator to manage Pods: kubernetes-sigs/cluster-api#11024 (not a DaemonSet).

I assume that if there is a use case for a PodSelector, the same probably also applies for a namespace selector (I think outside of very limited use cases it's not safe to assume Pod labels are unique across namespaces).


When a `Cordon` or `Drain` stage is detected on the NodeMaintenance object, the controller
will set (and reconcile) `.spec.Unschedulable` to `true` on all nodes that satisfy
`.spec.nodeSelector`. It should alert via events if too many occur appear and a race to change
@sbueringer (Member) commented Sep 24, 2024:

nit: Is there a word missing after many?

#### Pod Selection

The pods for evacuation would first be selected by node (`.spec.nodeSelector`). NodeMaintenance
should eventually remove all the pods from each node. To do this in a graceful manner, the
A Member commented:

eventually remove all the pods from each node

A few questions about this, as it has a few implications.

Compared to kubectl drain, this proposal also proposes to terminate DaemonSet and static Pods.
I'm not entirely sure if the order is configurable (see questions above), but my current assumption is that we always drain Default, then DaemonSet, then static Pods.

This means that e.g. monitoring agents (logging, metrics, ...) that are usually deployed via DaemonSets will be terminated before static pods. Which in turn means that we won't have visibility during the last stages of the drain. I think this would make it hard to detect / troubleshoot the last parts of the drain.

I'm also wondering in general if it would make sense to allow to skip drain for certain Pods:

  • As already mentioned, it might be desirable to keep monitoring agents running until the Node shuts down entirely (this is possible today with kubectl drain by deploying monitoring agents via DaemonSets)
  • What about Pods that tolerate all taints (see the sketch after this list)? I think this is a valid use case, as sometimes folks want to enforce that Pods run on all Nodes independent of whatever taints might be on that Node (for vendors it's impossible to tell which custom taints folks might use).
    • I think in that scenario a Node drain will be just stuck indefinitely. We are able to evict/evacuate these Pods, but the kube-scheduler will just reschedule them on the same Node.
    • Because there can be different personas involved it's also not always possible for the cluster-admin to simply modify tolerations of "third-party" Pods
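
For reference, the kind of Pod described in the second bullet, one that tolerates every taint, is a standard Kubernetes construct; the names and image below are placeholders:

```yaml
# Sketch: a Pod (or DaemonSet pod template) that tolerates all taints. As
# described above, such a pod can be rescheduled onto a node even while it
# is cordoned/tainted for maintenance, which can leave a drain stuck.
apiVersion: v1
kind: Pod
metadata:
  name: runs-everywhere                           # hypothetical
spec:
  tolerations:
  - operator: Exists                              # empty key + Exists tolerates every taint
  containers:
  - name: agent
    image: registry.example.com/agent:latest      # hypothetical
```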

@fabriziopandini (Member) commented Sep 25, 2024:

+💯 to adding support for rules that allow intentionally skipping drain for a set of Pods; it is a request that has popped up several times in Cluster API, mostly due to folks deploying stuff tolerating all taints, or just the ones applied by drain.

Notably, it is also worth noting that users don't always have the chance to change Pods, e.g. because they are managed by some operator, or because the app is installed by something that keeps the spec in sync, etc.
In those cases, skip drain could provide a nice exit path for operators trying to unblock node deletion.

@atiratree (Member, Author) replied:

I'm not entirely sure if the order is configurable (see questions above),
It is not.

This means that e.g. monitoring agents (logging, metrics, ...) that are usually deployed via DaemonSets will be terminated before static pods. Which in turn means that we won't have visibility during the last stages of the drain. I think this would make it hard to detect / troubleshoot the last parts of the drain.

This is similar to today: DaemonSets are also terminated before static pods.

I'm also wondering in general if it would make sense to allow to skip drain for certain Pods:

+💯 to add support for rules allowing to intentionally skip drain for a set of Pods, it is a request that popped up several time in Cluster API, mostly due to folks deploying stuff tolerating all taints or just the ones applied by drain.

I think the presumption is to better tune the types of workloads and their priorities. The goal of the NodeMaintenance is to remove all pods from a node and most likely to take the node offline.

Notably, it is also worth to notice that users don't always have the chance to change Pods, e.g. because they are managed by some operator, because the app is installed by something that keep the spec in sync etc etc.
In those cases, skip drain could provide a nice exit path for operators trying to unblock node deletion.

What about Pods that are tolerating all taints. I think this is a valid use case as sometimes folks want to enforce that Pods are running on all Nodes independent of whatever taints might be on that Node (for vendors it's impossible to tell which custom taints folks might use).

  • I think in that scenario a Node drain will be just stuck indefinitely. We are able to evict/evacuate these Pods, but the kube-scheduler will just reschedule them on the same Node.
  • Because there can be different personas involved it's also not always possible for the cluster-admin to simply modify tolerations of "third-party" Pods

These are the problems that the NodeMaintenance + Evacuation API tries to solve. Operators and any other actors should be aware of NodeMaintenance and Evacuation objects, and they should offer the exit path automatically rather than blocking the drain and waiting for manual intervention.

A Member replied:

This is similar to today: DaemonSets are also terminated before static pods.

Is this a part of the kubelet shutdown feature?

A misconfigured .spec.nodeSelector could select all the nodes (or just all master nodes) in the
cluster. This can cause the cluster to get into a degraded and unrecoverable state.

An admission plugin ([NodeMaintenance Admission](#nodemaintenance-admission)) is introduced to
A Member commented:

Static Pods don't stop during a node drain, right? So even if you do mark every node as under maintenance, you may still have a control plane. Perhaps no working cluster network and no DNS, but you still have a control plane.

kubectl drain today, yes. But you're proposing to also shut down static Pods. I assume this includes the API server, etcd, ... (at least for clusters set up via kubeadm).

I think because of the Kubernetes skew policy for kubelet <=> kube-apiserver, kubelets on control plane nodes have to use the local apiserver (more context: kubeadm: kubernetes/kubeadm#2271). Thus, as soon as we start terminating apiserver and etcd the kubelet won't be able to communicate with the apiserver anymore. Which basically means it's only partially functional at this point.

command. However, we will deprecate it along with all the library functions. We can print a
deprecation warning when this command is used, and promote the NodeMaintenance. Additionally, pods
that support evacuation and have `evacuation.coordination.k8s.io/priority_${EVACUATOR_CLASS}`
annotations will block eviction requests.
A Member commented:

Additionally, pods that support evacuation and have evacuation.coordination.k8s.io/priority_${EVACUATOR_CLASS} annotations will block eviction requests.

Isn't this a breaking change for the eviction API?

I think all tools built on top of the eviction API won't be able to drain Nodes anymore as soon as a single Pod opts into evacuation.

A Member commented:

TBH, I think that this overlap between eviction and evacuation could be very confusing for users.
If someone defines a PDB for an application, IMO they are entitled to expect that the application is protected during drain.

@atiratree (Member, Author) replied:

I think all tools built on top of the eviction API won't be able to drain Nodes anymore as soon as a single Pod opts into evacuation.

Not true; as the current design of the Evacuation API stands, the eviction API will still fully work as before and will be fully supported.

But you can opt into using the NodeMaintenance + Evacuation API for more graceful pod terminations and node drain.
We could easily enable this, for example, in the cluster autoscaler to observe the benefit of this feature. It would also use eviction for pods that do not have evacuators assigned.

@sbueringer (Member) replied Oct 19, 2024:

My understanding is that as soon as a Pod opts into evacuation, eviction won't work anymore for this Pod ("Additionally, pods that support evacuation and have evacuation.coordination.k8s.io/priority_${EVACUATOR_CLASS} annotations will block eviction requests").

This means as soon as a single Pod in a cluster opts into evacuation, all tools that have been built on top of eviction today (Cluster API, kubectl drain, many others) won't be able to drain the Node with this Pod anymore.

The consequence is that all of these tools will have to implement the evacuation API, as the eviction API is no longer sufficient to drain Nodes. It's not in the hands of these tools whether end users will opt into the Evacuation API when they eventually deploy workloads on clusters managed by these tools.

// DrainPlanEntry priority fields should be in ascending order for each podType.
// If the priority and podType are the same, concrete selectors are executed first.
//
// The following entries are injected into the drainPlan on the NodeMaintenance admission:
A Member commented:

If a user sets drainPlan, how is the user value merged with this?

(Related to the question right above; of course merging would be straightforward if we always enforce a certain order, i.e. Default -> DaemonSet -> Static, podPriority low -> high, and entries with podSelector first.)

@atiratree (Member, Author) replied:

Yes: by podType, then by podPriority, and then by podSelector, as also seen in the type documentation.

// podType: "Static"
// - podPriority: 2000000000 // system-cluster-critical priority class
// podType: "Static"
// - podPriority: 2000001000 // system-node-critical priority class
A Member commented:

If I'm interpreting this correctly, this is the stage where kube-apiserver and etcd will be terminated.

This means that at this point kubelets that use the local kube-apiserver (more context: kubeadm: kubernetes/kubeadm#2271) won't be able to communicate with the apiserver anymore. Are we still able to complete the NodeMaintenance if the communication between kubelet and kube-apiserver is broken at this stage?

@atiratree (Member, Author) replied:

#4213 (comment)

I would expect the apiserver to terminate last; then the kubelet will lose the connection. Nevertheless, this is a kubeadm issue, because the kubelet is not connected to the load balancer. I do not see how NodeMaintenance can help here?

@sbueringer (Member) replied Oct 19, 2024:

Nevertheless, this is a kubeadm issue, because the kubelet is not connected to the load balancer.

I think the kubelet cannot communicate via the load balancer with the kube-apiservers without violating the Kubernetes skew policy (example: kubelet 1.31 is not allowed to communicate with 1.30 apiservers). The only option kubeadm has to not violate the Kubernetes skew policy during upgrades is to communicate with the local apiserver (which has the same version as the kubelet).

podType: Default
...
---
```
A Member commented:

Reading through these examples I wonder how easy it will be for users to understand why Node maintenances / drains are not making progress.

I think a significant amount of complexity comes from the m:n mapping between NodeMaintenances and Nodes.

@fabriziopandini (Member) replied Sep 25, 2024:

+1
It could really help to have an example that shows how a user can figure out why a node drain is blocked when multiple node maintenances apply.
(Also considering all the moving parts 😅, or at least Evacuation, Eviction, the DaemonSet controller, and the kubelet.)

@atiratree (Member, Author) replied:

Each Node will have its own status, and the NodeMaintenance should be the aggregate. Please take into account that these are not final designs and may differ considerably from the final API. There is still no consensus on whether there should be an m:n mapping, for example.

We can add the debugging parts once we move into the PRR phase.

drainTargets:
- podPriority: 1000
  podType: Default
- podPriority: 1000
A Member commented:

This example is a bit confusing:

  • Could it be that this does not match the types above (anymore)? (I couldn't find nodeStatuses)
  • nit: this target doesn't exist in the drainPlan above

@atiratree (Member, Author) replied:

We have moved nodeStatuses to each node (#4213 (comment)). These places require an update to reflect that.

...
```

To fulfil the Evacuation API, the DaemonSet controller should register itself as a controller
A Member commented:

Suggested change (nit):
- To fulfil the Evacuation API, the DaemonSet controller should register itself as a controller
+ To fulfil the Evacuation API, the kubelet controller should register itself as a controller

// The default value is Idle.
Stage NodeMaintenanceStage

// DrainPlan is executed from the first entry to the last entry during the Drain stage.
A Member commented:

A few questions about the DrainPlan ordering and admission

executed from the first entry to the last entry during the Drain stage

This implies that the drainPlan is executed in order of the array

DrainPlanEntry podType fields should be in the following order: nil -> DaemonSet -> Static

This implies the DrainPlanEntries should be in this order. Is this actually a must?
(There are a few places further down in the doc that seem to imply that the order is always nil/Default -> DaemonSet -> Static.)

If this is a must, what if there are dependencies between Pods that don't follow this order? (e.g. a static Pod depending on a DaemonSet, or a DaemonSet depending on a Default Pod)

DrainPlanEntry priority fields should be in ascending order for each podType.

Similar question. Is this just a should or will the ordering based on priority per podType be enforced?

If the priority and podType are the same, concrete selectors are executed first.

This implies that the stages are executed based on this order and not the order of the entries. (But this would make sense if the order of the entries are enforced to have this ordering)

@atiratree (Member, Author) replied:

If this is a must, what if there are dependencies between Pods that don't follow this order? (e.g. a static Pod depending on a DaemonSet, or a DaemonSet depending on a Default Pod)

It is a must. And I think basically all clusters follow this ordering today. Do you have an example where this is not the case?

Similar question. Is this just a should or will the ordering based on priority per podType be enforced?

Enforced.

This implies that the stages are executed based on this order and not the order of the entries. (But this would make sense if the order of the entries are enforced to have this ordering)

The order of the entries should conform to the type and priority ordering. A lower-index entry with a podSelector and the same priority will be drained first; see the sketch after this comment.
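
As a concrete illustration of these constraints, a drainPlan that satisfies the enforced ordering might look like the sketch below (the schema is still under discussion in this KEP, so field names, the label, and the exact priority values are indicative only):

```yaml
# Sketch: entries ordered by podType (Default -> DaemonSet -> Static),
# by ascending podPriority within each podType, and with a podSelector
# entry placed before the selector-less entry at the same priority.
drainPlan:
- podPriority: 0
  podType: Default
- podPriority: 1000
  podType: Default
  podSelector:
    matchLabels:
      app: batch-workers           # assumed label; drained before the entry below
- podPriority: 1000
  podType: Default
- podPriority: 2000000000          # system-cluster-critical
  podType: DaemonSet
- podPriority: 2000001000          # system-node-critical
  podType: Static
```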

A Member replied:

It is a must. And I think basically all clusters follow this ordering today. Do you have an example where this is not the case?

Sorry, no concrete example. I was only wondering if it's safe to assume that the order in which Pods should be drained can simply be determined by the way they are deployed (Deployments/operator/... vs. DaemonSet). The way Pods are deployed and the drain order seem orthogonal to me.

Comment on lines 1525 to 1529
For the above node maintenance, the controller should not react to Evacuations of DaemonSet pods
with a priority greater than 5000. This state should not normally occur, as Evacuation requests
should be coordinated with NodeMaintenance. If it does occur, we should not encourage this flow by
A Member commented:

Just to confirm. Does this mean that for DaemonSet Pods the evacuation API cannot be used without also using the NodeMaintenance API? (Just wondering because I think it's different for the eviction API today)

@atiratree (Member, Author) replied:

It can be used if you implement a custom Evacuator. I think it does not make sense for the DaemonSet controller to implement it separately without the NodeMaintenance ATM.

@sbueringer (Member) replied Oct 19, 2024:

It would be nice if we were able to evict DaemonSet Pods independently of the NodeMaintenance API when the NodeMaintenance API cannot be used.

For example, I could imagine that the NodeMaintenance API will not be compatible with what we want to be able to do in Cluster API, so it would be nice to be able to use it without the NodeMaintenance API.

(Context: https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20240930-machine-drain-rules.md)

Just to clarify: we're aiming to provide folks the means to configure drain order in a very flexible way (and also a way to skip draining Pods). The problem that we would run into with the current state of the NodeMaintenance API is that it also takes care of draining Nodes, but in a way that is less flexible than what we want to achieve.

If NodeMaintenance were only about declaring the maintenance, while still allowing other components to take over the drain / drain order / the decision of which Pods to drain and which not to drain, we would be good :)

Static PodType = "Static"
)

type DrainPlanEntry struct {
@fabriziopandini (Member) commented Sep 25, 2024:

I have a question/doubt about this API.

  1. If I got it right, in order to define a rule that targets a specific set of Pods, it is required that those Pods have an identifying PodPriority. Is this correct?

If true, what if a user needs to define a drain rule for an application where they cannot define a pod priority? (Users are not always allowed to change pods in applications, e.g. if the Pods are managed by some operator.)

Also, is it correct to introduce a dependency between PodPriority and drain? What if users want to define a drain rule for an application, but they don't want that application to trigger preemption of other apps when scheduled? Are we mixing up two orthogonal problems?

  2. It also seems that each rule must be specifically defined for a single PodType. Is this correct?

If true, I'm worried that we are surfacing in the API some internal details of the system: the fact that different K8s components take care of different types of pods.

As a user, I would find it really useful to define a specific drain rule for my application, but when I try to imagine such a rule I think of something like "I want to evict Portworx after evicting all the other pods (match labels app=portworx, drain priority=x)".

I would also expect it to be a responsibility of the system to figure out what kinds of Pods my application has (in most cases a mix of Default pods and DaemonSets), and to do the right thing.

But if instead we expect users to dig into applications, usually designed by other teams that might not even be in the same company, figure out what kinds of pods the application has, and then map this to all the intricacies of Pod types and how K8s components handle node drain, this could really impact the usability of this feature.

Going one step further, ideally it would be really nice if drain rules were something that developers can craft and ship with the application (like ConfigMaps, Deployments, PDBs, etc.), instead of being something in the realm of the cluster operator.

A Contributor replied:

Good feedback. I'd be very happy to see an initial alpha API (either in-tree or a CRD) where we can practice triggering declarative draining, and encourage early adopters to give us their thoughts whilst the new API is still plastic (mouldable, fixable).

@sftim (Contributor) commented Sep 26, 2024:

To me, declarative node maintenance is / should be much more about:

  • I intend to make disruptive changes to this node / this rack / everything with this storage attached to it
    (“disruptive change” can include taking it out of the datacenter and sending it for recycling)
  • here's when it will happen, and maybe here's a set of taints that are likely to get applied to the Node
    • might be good to know in case your app intends to tolerate these
    • node autoscalers can note this information without doing anything at Pod level
  • controllers are welcome to handle this gracefully before k-c-m (I assume) starts telling the kubelet to delete Pods

and out of all of that, the most important detail to capture is that it is an intent to maintain, rather than a verb masquerading as a noun (in other words, we aren't making a NodeDrainRequest or a PodRemovalSchedule).

@fabriziopandini (Member) replied Sep 26, 2024:

Makes sense.
Maybe the best way to bridge the two worlds is to think of a two-step process / two CRDs:

  • Allow application developers to declare Drain rules for their applications, like they already do for PDBs
  • Allow cluster operators to declare node maintenance intent, which will then collect and use Drain rules if necessary

Also, FYI, in Cluster API we are thinking about something to make our machine drain process more flexible, and due to a few constraints we need it in the short term; however, I can easily imagine how Cluster API users might benefit from something similar to what we are discussing in this thread (if and when it becomes available).

@atiratree (Member, Author) replied:

I think the point of the DrainPlan is not to define rules on how to gracefully terminate an application. This is an admin-level tool that is more coarse-grained and basically just looks at different classes of applications (e.g. user vs. system), and can differentiate between different control-plane components.

The user does not have to tinker with the priorities to achieve the desired result. They also have no guarantee that the NodeMaintenance will be the same in every cluster (e.g. all of the user app pods can fall into the same termination bucket).

To solve this, users can simply run all of their pods at e.g. priority 0 and utilize the Evacuation API logic. Users can add an evacuator to their app pods and downscale the app according to their needs. For example, the evacuator can keep a pod that provides a service alive until a pod that depends on that service terminates.

NodeMaintenance and Evacuation are intended to be extensible and for users to be able to depend on the presence of these APIs in the cluster (similar to eviction + PDB). This is the main reason why it is hard to introduce this as a CRD.
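
As a rough illustration of the opt-in mentioned above, a Pod registering a custom evacuator might carry an annotation following the key pattern quoted earlier in this review; the evacuator class name, the annotation value, and the image are assumptions for illustration only, not part of the proposal:

```yaml
# Sketch: an application Pod that opts into the Evacuation API by declaring
# an evacuator; the application's own controller then handles graceful drain.
apiVersion: v1
kind: Pod
metadata:
  name: my-app-0                                               # hypothetical
  annotations:
    # Key follows the evacuation.coordination.k8s.io/priority_${EVACUATOR_CLASS}
    # pattern from the KEP text; the class name and value are assumed.
    evacuation.coordination.k8s.io/priority_my-app-operator: "10000"
spec:
  # No priorityClassName: the pod runs at the default priority (0), as in the
  # comment above, and relies on its evacuator for graceful downscaling.
  containers:
  - name: app
    image: registry.example.com/my-app:latest                  # hypothetical
```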

@fabriziopandini (Member) replied Oct 1, 2024:

I agree that declaring node maintenance is an admin operation.
But if node maintenance will only be about coarse configuration (user vs. system, or DaemonSet vs. static pods vs. everything else), IMO it will fail to address the friction that exists between cluster admins and application teams when a node must be drained.

WRT the evacuation API, my personal take is that this is more a solution for advanced use cases like operators. The majority of folks need something more generic, and DrainRules have the potential to fill this gap, but in order to do so they must be decoupled from NodeMaintenance.

@atiratree (Member, Author) replied:

Yes, there is friction between the cluster admin and application teams. But it is quite a difficult task to offer a strictly declarative API to support all possible graceful draining scenarios. There could be a lot of dependencies and edge cases. I am not convinced we want to support that.

The Evacuation API allows you to do anything, but I agree that it is quite advanced/a handful for a normal user. That is why we will try to offer clever draining behaviors for some workloads (e.g. Deployment, HPA, configurable StatefulSet, etc.). Also, the Evacuation API can be used as a building block, so other components in the ecosystem can implement DrainRules and offer them as an Operator/CRD that normal users can use for a subset of applications.

@fabriziopandini (Member) replied Oct 2, 2024:

Thank you for the explanation; I appreciate your patience in following up on my questions.

If what we need to solve this problem for both cluster admins and application teams is to eventually build yet another layer on top of the evacuation API, frankly speaking IMO the complexity is just too much (it is hard to figure out how the m:n rules work in this proposal; let's imagine another layer of rules on top).

If I combine this with proposed breaking changes like draining DaemonSets and static pods, and a few other threads of discussion, it is hard for me to figure out what the final state would look like with a cohesive solution that helps both cluster admins and application teams, and also when this will be available.

Considering this, I'm starting to lean towards the opinion, already expressed by other contributors providing feedback on this proposal, that NodeMaintenance in its current form should be incubated as an external CRD and then moved to k/k when it is mature. But this also depends on how the other threads are addressed.

In the meantime, I kindly ask you to add the ideas discussed in this thread as alternatives in the proposal:

  • decoupling the drain rules definition from the node maintenance initiation (the split is what enables multiple personas to play a part in the solution of this problem)
  • making drain rules more application-friendly (also avoiding surfacing K8s internals in the user-facing API)

cc @vincepri @neolit123 @sbueringer @chrischdi given that this popped up in a few SIG Cluster lifecycle related discussions

@atiratree (Member, Author) replied:

If what we need to solve this problem for both cluster admins and application teams is to eventually build yet another layer on top of the evacuation API, frankly speaking IMO the complexity is just too much.

Another layer (whatever it is) above the Evacuation API is necessary, as the Evacuation API serves as a base and implements graceful drain, which is mostly not possible today.

If I combine this with proposed breaking changes like draining DaemonSets and static pods,

Can you please let me know which part you consider a breaking change?

decoupling the drain rules definition from the node maintenance initiation (the split is what enables multiple personas to play a part in the solution of this problem)

I think it makes sense to decouple the rules. We probably still want to keep the admission logic. We could also introduce a default DrainPlan that would be selected if none was supplied to the NodeMaintenance, as I would expect it might be hard to select the correct one for some drainers.

making drain rules more application-friendly (also avoiding surfacing K8s internals in the user-facing API)

An application-friendly, user-facing API could be a nice follow-up if the community finds it useful.

@atiratree (Member, Author) commented:

What's the status on this KEP? Is it coming back for 1.32?

It is in the scope-definition stage at the moment. There are many open questions that need to be answered before moving forward on this.

Agree, we need to build a stronger consensus first.

@sftim (Contributor) commented Oct 10, 2024

@evrardjp commented:

kured (https://kured.dev/ or https://github.com/kubereboot/kured) would also benefit from this, especially for the StatefulSet PDB case. Following this. If there is any way I can help, don't hesitate to ask; I just subscribed to be notified on this topic.

@sftim (Contributor) commented Oct 20, 2024:

@evrardjp would you be willing to prototype a schema (as in CustomResourceDefinition) for an out-of-tree prototype, of the basic alpha API?

I think that would help, even with the KEP still in its early form. What you learn from starting that endeavor could give us good feedback on how to progress this KEP.


## Motivation

The goal of this KEP is to analyze and improve node maintenance in Kubernetes.

A reviewer commented:

I don't think "node maintenance" is descriptive enough for the motivation. I would agree that most of the time when admins are doing maintenance it is for a Node. However, a Node is just a Kubernetes API, like a Pod. So what is to stop me from saying, "I want to do maintenance on some hardware that these 3 Pods are using"? The same use case applies: I want to do maintenance, and Kubernetes API objects x, y, and z are blocking me from doing it (I don't want to impact users).

The motivation should be Kubernetes API agnostic.

I think a better perspective is that we want to describe maintenance in Kubernetes. For example, there are numerous Kubernetes APIs to enable replication. Where these APIs come up short is when you have 1 replica. This is because there's a collision between replication requirements and maintenance requirements. Maintenance is ideal to do when you have a low replica count (off-peak), but replication always wants to defend high availability. Therefore, the missing piece is that we need to teach the replication APIs, and the end user, about a new desired state: maintenance.


The goal of this KEP is to analyze and improve node maintenance in Kubernetes.

Node maintenance is a request from a cluster administrator to remove all pods from a node(s) so

A reviewer commented:

Node maintenance is a request from a cluster administrator to remove all pods from a node(s)

I don't agree with this. We don't actually know the cluster administrator's perspective. We only know this cluster admin declared the Node API should be in "Maintenance", whatever that means. We also don't know how the admin defines "draining" the Node.

Another reviewer replied:

I agree with @rthallisey here, and I believe this is key.

- delay the shutdown pending graceful termination of remaining pods
- terminate remaining pods in reverse priority order (see [pod-priority-graceful-node-shutdown](https://kubernetes.io/docs/concepts/architecture/nodes/#pod-priority-graceful-node-shutdown))

The main problem is that the current approach tries to solve this in an application agnostic way

A reviewer commented:

I'm going to disagree here too. I don't think kubectl drain being "application agnostic" is a problem. I think the problem is that kubectl drain is another Kubernetes API that doesn't understand what Maintenance is, and the admin has no way to express what Maintenance is.

A Contributor replied:

kubectl drain is not an API @rthallisey

Draining is largely client side / isn't directly exposed as intent via the API. Hence this KEP.

Another reviewer replied:

Splitting hairs a bit: arguably kubectl drain is an API, just a pretty horrible one. Applications do code against it. :/

The goal, I think, is to make a "proper" API for it.
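
To make the contrast concrete, declaring that intent with the proposed API could look roughly like the sketch below; the apiVersion, selector shape, label, and stage value are assumptions based on the fragments discussed in this PR, not the final schema:

```yaml
# Sketch: instead of running the client-side `kubectl drain`, an admin (or an
# autoscaler) records the intent to maintain a set of nodes as an API object.
apiVersion: nodemaintenance.k8s.io/v1alpha1      # assumed group/version
kind: NodeMaintenance
metadata:
  name: rack-17-reboot                           # hypothetical
spec:
  nodeSelector:                                  # selector shape is assumed
    matchLabels:
      topology.example.com/rack: rack-17         # assumed label
  stage: Drain                                   # stages discussed: Idle, Cordon, Drain
```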

@evrardjp left a comment:

I didn't finish reading this KEP from end to end, and I realise the best approach would be a call with the current authors of the KEP to make progress faster.

As requested, I started to draft not one but multiple CRDs to match this with something tangible.

Yet, I believe my opinion is just that: an opinion. I would prefer bouncing ideas around before sharing them, to make sure I got it right.

Who can help me, and what would the medium be? Maybe a chat on Slack with you, @sftim and @atiratree?

- https://github.com/medik8s/node-maintenance-operator uses a declarative approach that tries to
mimic `kubectl drain` (and uses kubectl implementation under the hood).
- https://github.com/kubereboot/kured performs automatic node reboots and relies on `kubectl drain`
implementation to achieve that.

A reviewer commented:

Kured doesn't simply drain and call it a day: taints, annotations, blocking mechanisms, ... are included. kubectl rollout restart was also on the table on top of that. Yet, the architecture is overall simply a node-based one, with a lock (as of today; working on a change, though).

Yet, I believe your mention is accurate and complete.

What I mean is that I can expand, if you wish.

@atiratree (Member, Author) replied:

@evrardjp sure, all contributions are welcome! We can add a more detailed description.

- The kubelet controls the shutdown process using Dbus and systemd, and can delay (but not entirely
block) it using the systemd inhibitor. However, if Dbus or the kubelet is restarted during the
node shutdown, the shutdown might not be registered again, and pods might be terminated
ungracefully. Also, new workloads can get scheduled on the node while the node is shutting down.

A reviewer commented:

I don't think it's a simple "node" vs. other object/API/... story. The problem has two sides: the general maintenance of node(s) and its impact on a node.

There are side benefits to storing the pre-maintenance information of the node in the node itself: it cleans itself up on autoscaling without the need for ownerReferences, it does not change any API, etc. But this won't cover everything a node maintenance really needs, as mentioned. The sentence could be improved.

However, I would say let's not focus on the solution in this part of the KEP...

maintenance phases (Planning, Cordon, Drain, Drain Complete, Maintenance Complete). I also want to
observe the node drain via the API and check on its progress. I also want to be able to discover
workloads that are blocking the node drain.

A reviewer commented:

Do we want flexibility over what can block on the node, or only pods? Let's say someone wants to block when a file is present on the node. I assume this is not covered. Okay for everyone?

Asking because it has a large impact on the design. One could be almost pure API; the other forces the involvement of multiple components.

A misconfigured .spec.nodeSelector could select all the nodes (or just all master nodes) in the
cluster. This can cause the cluster to get into a degraded and unrecoverable state.

An admission plugin ([NodeMaintenance Admission](#nodemaintenance-admission)) is introduced to

A reviewer commented:

Maybe I am stupid, but I am not understanding the deep focus on static pods. If we provide a high-level API which helps composition, couldn't another operator take that information and react appropriately to it?

Consider including folks who also work outside the SIG or subproject.
-->

A misconfigured .spec.nodeSelector could select all the nodes (or just all master nodes) in the

A reviewer commented:

Wouldn't this be solved by an exception in the cluster-level controller? If a node is a master (or has a predetermined annotation/label, TBD) and other masters are also scheduled for maintenance, do not trigger the maintenances simultaneously and wait for the other masters to come back online first?

`kubectl cordon` and `kubectl uncordon` commands will be enhanced with a warning that will warn
the user if a node is made un/schedulable, and it collides with an existing NodeMaintenance object.
As a consequence the node maintenance controller will reconcile the node back to the old value.
Because of this we can make these commands noop when the node is under the NodeMaintenance.

A reviewer commented:

Why do we want that?
Use case: as an admin, I detect that some node needs temporary cordoning (hosted on bad bare-metal hardware, let's say). So if someone/something then decides to start a node maintenance, I want both to be recorded. The simple way would be to have a "state before node maintenance action" recorded in the status. After the maintenance, the controller inspecting the state ("completed") could then restore the previous state into spec.unschedulable? A noop isn't necessarily the result.

Because of this we can make these commands noop when the node is under the NodeMaintenance.

### NodeMaintenance API

A reviewer commented:

Can someone clarify for me why someone would want to declare the state of the node, instead of observing it after making clear what was requested?

Here, what's requested is to have 0 pods, not what the actions are?

Or maybe I am being pedantic and we mean the same thing.

I am confused; maybe I am looking at it differently due to the wording.

@atiratree atiratree force-pushed the improve-node-maintenance branch from 4db6894 to 783a510 Compare December 4, 2024 23:11
@atiratree (Member, Author) commented:

There has been a major update to the underlying Evacuation API KEP: #4565 (comment)

The API has been renamed to EvictionRequest, so I have updated the NodeMaintenance KEP to reflect that.

I see a lot of feedback on similar issues. I will get back to them, but first I need to update the KEP, mainly to add new sections to explain/analyze these common issues. I will then respond and add relevant references.

@atiratree atiratree force-pushed the improve-node-maintenance branch from 783a510 to 7644ae3 Compare December 4, 2024 23:24

Cluster or node autoscalers that take on the role of `kubectl drain` want to signal the intent to
drain a node using the same API and provide a similar experience to the CLI counterpart.

A reviewer commented:

Suggested change:

#### Story 4
Some DaemonSet pods can only be safely recreated after the node is drained.
There are the updateStrategy values OnDelete and RollingUpdate, but an OnNodeEvacuation strategy would be really useful for some workloads, and could be enabled on top of this API.
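
If such a strategy were ever added (it is only a suggestion in this comment, not part of the proposal), its usage might look roughly like the sketch below; the OnNodeEvacuation value, names, and image are hypothetical:

```yaml
# Sketch: a DaemonSet whose pods would only be recreated once the node they
# ran on has been fully drained; the updateStrategy value is hypothetical.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-storage-agent                 # hypothetical
spec:
  selector:
    matchLabels:
      app: node-storage-agent
  updateStrategy:
    type: OnNodeEvacuation                 # hypothetical strategy from this suggestion
  template:
    metadata:
      labels:
        app: node-storage-agent
    spec:
      containers:
      - name: agent
        image: registry.example.com/storage-agent:latest   # hypothetical
```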

Labels: cncf-cla: yes, kind/kep, lifecycle/stale, sig/apps, size/XXL
Project status: Needs Triage