KEP-4563: EvictionRequest API (fka Evacuation) #4565

Open · wants to merge 11 commits into master

Conversation

atiratree
Member

@atiratree atiratree commented Mar 28, 2024

  • One-line PR description: introduce EvictionRequest API to allow managed graceful pod removal

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Mar 28, 2024
@k8s-ci-robot k8s-ci-robot requested review from kow3ns and soltysh March 28, 2024 15:05
@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Mar 28, 2024
Comment on lines 228 to 245
We will introduce a new term called evacuation. This is a contract between the evacuation instigator,
the evacuee, and the evacuator. The contract is enforced by the API and an evacuation controller.
We can think of evacuation as a managed and safer alternative to eviction.
Contributor

Watch out for the risk of confusing end users.

We already have preemption and eviction and people confuse the two. Or three, because there are two kinds of eviction. And there's disruption in the mix.

Do we want to rename Scheduling, Preemption and Eviction to Scheduling, Preemption, Evacuation and Eviction?

Member Author

Good idea, I added a mention of which kind of eviction I mean here.

Do we want to rename Scheduling, Preemption and Eviction to Scheduling, Preemption, Evacuation and Eviction?

Yes, I think we want to add a new concept there and generally update the docs once we have an alpha.

Member

I'm with Tim here.

Preemption vs Eviction is already quite confusing. And TBH, I couldn't fully understand what the "evacuation" is supposed to solve by reading the summary or motivation.

From Goals:

Changes to the eviction API to support the evacuation process.

If this is already going to be part of the Eviction API, maybe it should be named as a form of eviction. Something like "cooperative eviction" or "eviction with ack" or something along those lines?

Contributor

I'm all for framing it as another type of eviction; we already have two, so the extra cognitive load for users is not so much a problem.

Member Author — @atiratree, May 8, 2024

@alculquicondor I have updated the summary and goals, I hope it makes more sense now.

I think the name should make the most sense to the person creating the Evacuation (Evacuation Instigator ATM). So CooperativeEviction or EvictionWithAck is a bit misleading IMO, because from that person's perspective there is no additional step required of them. Only the evacuators and the evacuation controller implement the cooperative evacuation process, but this is hidden from the normal user.

My suggestions:

  • GracefulEviction (might confuse people if it is associated with graceful pod termination, which it is not)
  • SafeEviction (*safer than the API-initiated one for some pods)
  • Or just call it Eviction? And tell people to use it instead of the Eviction API endpoint? This might be a bit confusing (at least in the beginning)

Contributor

We can bikeshed names for the API kind; I'd throw a few of my own into the hat:

  • EvictionRequest
  • PodEvictionRequest

Member Author — @atiratree, Dec 3, 2024

I have renamed the API to EvictionRequest to make the term recognizable. A minor disadvantage is that we have to clarify which type of eviction we mean when we say evict (API-initiated eviction, or EvictionRequest).

The rest of the renames are as follows:

Evacuation (noun) -> EvictionRequest / Eviction Process
evacuation (verb) -> request an eviction / terminate / evict / process eviction
Evacuator -> Interceptor
Evacuee -> Pod
Evacuator Class -> Interceptor Class
Evacuation Instigator -> Eviction Requester
Evacuation Controller -> Eviction Request Controller
ActiveEvacuatorClass -> ActiveInterceptorClass
ActiveEvacuatorCompleted -> ActiveInterceptorCompleted
EvacuationProgressTimestamp -> ProgressTimestamp
ExpectedEvacuationFinishTime -> ExpectedInterceptorFinishTime
EvacuationCancellationPolicy -> EvictionRequestCancellationPolicy
FailedEvictionCounter -> FailedAPIEvictionCounter

Comment on lines 1130 to 1302
<!--
What other approaches did you consider, and why did you rule them out? These do
not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->
Contributor

We have a number of examples of having a SomethingRequest or SomethingClaim API that then causes a something (certificate signing, node provisioning, etc).
Think of TokenRequest (a subresource for a ServiceAccount), or CertificateSigningRequest.

I would confidently and strongly prefer to have an EvictionRequest or PodEvictionRequest API, rather than an Evacuation API kind.

It's easy to teach that we have evictions and that an EvictionRequest asks for one to happen; it's hard to teach the difference between an eviction and an evacuation.

As a side effect, this makes the feature gate easier to name (eg PodEvictionRequests).

Member Author

As you have mentioned, we already have different kinds of eviction. So I think it would be good to use a completely new term to distinguish it from the others.

Also, Evacuation does not always result in eviction (and PDB consultation). It depends on the controller/workload. For some workloads like DaemonSets and static pods, API eviction has never worked before. This could also be very confusing if we name it the same way.

I think Evacuation fits this better because

  1. The name is shorter. If we go with EvacuationRequest then the evacuation will become just an abstract term and less recognizable.
  2. It seems it will have quite a lot of functionality included (state synchronization between multiple instigators and multiple evacuators, state of the evacuee and evacuation). TokenRequest and CertificateSigningRequest are simpler and not involved in a complex process.

Contributor

I suggested EvictionRequest so that we don't have to have a section with the (too long) title: Scheduling, Preemption, Evacuation and Eviction. Not EvacuationRequest.

Adding another term doesn't scale so well: it means helping n people understand the difference between evacuation and eviction. It's a scaling challenge where n is not only large, it probably includes quite a few Kubernetes maintainers.

Contributor

As for CertificateSigningRequest being simple: I don't buy it. There are three controllers, custom signers, an integration with trust bundles, the horrors of ASN.1 and X.509… trust me, it's complicated enough.

Member Author

I understand that it will be confusing for people, but that will happen regardless of which term we use.

My main issue is that evacuation does not directly translate to eviction. Therefore, I think it would be preferable to choose a new term (not necessarily evacuation).

I would like to get additional opinions from people about this. And we will definitely have to come back to this in the API review.

Member Author

Should be resolved now: #4565 (comment)

@atiratree atiratree force-pushed the evacuation-api branch 8 times, most recently from 13611ce to 2d15b79 Compare March 28, 2024 20:48
@atiratree
Member Author

  • I have updated the KEP to include support for multiple evacuators.
  • Evacuators can now advertise which pods they are able to evacuate, even before the evacuation. The advantage of this approach is that we can trigger an eviction immediately, without a delay (formerly known as acceptDeadlineSeconds), if we do not find an evacuator. I have added a number of restrictions to ensure the API cannot be misused.
  • Clarified how the Evacuation objects should be named and how the admission should work in general (also for pods). This will ensure a 1:1 mapping between pods and Evacuations.
  • Removed the ability to add a full reference of the evacuator, because it would be a hassle to synchronize considering the evacuator leader election and multiple evacuators in play.


Example evacuation triggers:
- Node maintenance controller: node maintenance triggered by an admin.
- Descheduler: descheduling triggered by a descheduling rule.
Contributor

If the descheduler requests an eviction, what thing is being evacuated?

(the node maintenance and cluster autoscaler examples are easier: you're evacuating an entire node)

Member Author

A single pod or multiple pods. The descheduler can use it as a new mechanism instead of eviction.

Contributor

OK, so people typically think of “evacuate” as a near synonym of “drain” - you drain a node, you evacuate a rack or zone full of servers. Saying that you can evacuate a Pod might make people think its containers all get stopped, or just confuse readers. We do need to smooth out how we can teach this.

Member Author

It seems it can be used in both scenarios https://www.merriam-webster.com/grammar/can-you-evacuate-people.
Evacuation of containers doesn't make sense because they are tied to the pod lifecycle. But, I guess it could be confusing if we do not make it explicitly clear what we are targeting.

Contributor — @sftim, Jun 3, 2024

Thing is, Kubernetes users typically - before we accept this KEP - use “evacuate” as a synonym for drain.

I'm (still) fine with the API idea, and still concerned about the naming.

Contributor

Just to +1 the potential confusion of the term "evacuation".
Is it okay to have a glossary of terms for "evacuation", "eviction", and "drain" (or any other potentially confusing terms) added somewhere in this KEP?

Member Author

I can include it in the KEP. And yes, we are going to change the name to something more suitable.

Member

"To evacuate a person" implies "get them out of trouble, to safety" as opposed to "to empty" (as in vacuum). It's not ENTRIELY wrong in this context, but it's not entirely right either.

Member Author

I will change the name to EvictionRequest. It was originally chosen to distinguish it from eviction, but there is value in making it familiar to existing concepts.

Member Author

The API is renamed to EvictionRequest now, see #4565 (comment) for more details

@atiratree atiratree force-pushed the evacuation-api branch 6 times, most recently from 1900020 to 085672a Compare April 10, 2024 21:47
Member Author — @atiratree left a comment

@thockin Thanks a lot for the feedback. I see there are still quite a lot of places to improve.

Can we pin down some very concrete examples of what REAL evacuators might really do, and how the lifecycle works, especially WRT cancellation of an eviction?

I guess the user stories are still pretty abstract and it is best to show the flow with real examples.

I will work on the updates soon.




The major issues are:

1. Without extra manual effort, an application running with a single replica has to settle for
Member Author

Unfortunately, we made it possible and people do it. They also utilize validating webhooks/controllers to implement their desired behavior.

- After a partial cleanup (e.g. storage migrated, notification sent) and if the application is
still in an available state, the eviction API can be used to respect PodDisruptionBudgets.

In the end, the evacuation should always end with a pod being deleted (evict or delete call) by one
Member Author

Right, we should make it clear that the pod might not end up deleted/terminated. I also want to add a mechanism that would result in pod termination, but not deletion.

There can be many evacuation instigators for a single Evacuation.

When an instigator decides that a pod needs to be evacuated, it should create an Evacuation:
- With a predictable and consistent name in the following format: `${POD_UID}-${POD_NAME_PREFIX}`.
Member Author

The name of the Evacuation object must be unique and predictable for each pod to prevent the creation of multiple Evacuations for the same pod. We do not expect evacuators to support interaction with multiple Evacuations for a single pod.

I was trying to combine ease of use (POD_NAME) and uniqueness of each pod instance (POD_UID).

  • Having name only: we can solve the uniqueness by an admission plugin and checking .spec.podRef.
  • Having ID only: we can let clients (kubectl) print out the name from .spec.podRef.

Even though we do not expect a large number of instigators, I agree that ${POD_UID}-${POD_NAME_PREFIX} is overly complicated for the end user, especially the character count limit.

I (maybe wrongly) presume there's a cleanup loop that removes evacuations which have no matching pod?

Yes, there is a GC, but that can have a delay. Nevertheless, I am mainly concerned about having the 1:1 mapping, to make the implementation easier for all the parties (mainly evacuators).
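
For illustration, here is a minimal Go sketch of the ${POD_UID}-${POD_NAME_PREFIX} scheme under discussion; the helper name and the 253-character object-name budget are assumptions for the sketch, not part of the KEP:

```go
package main

import "fmt"

// evacuationName derives a unique, predictable Evacuation name from a pod,
// following the ${POD_UID}-${POD_NAME_PREFIX} format: the full pod UID for
// uniqueness, plus as much of the pod name as fits for readability.
func evacuationName(podUID, podName string) string {
	const maxNameLen = 253                 // assumed budget for a DNS subdomain object name
	budget := maxNameLen - len(podUID) - 1 // leave room for the "-" separator
	if len(podName) > budget {
		podName = podName[:budget] // POD_NAME_PREFIX: truncate to fit
	}
	return fmt.Sprintf("%s-%s", podUID, podName)
}

func main() {
	fmt.Println(evacuationName("0b4136d8-3a44-4e1c-8bd8-171fc724a2d0", "my-app-7d9c545b5d-x2k9q"))
}
```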

after the previous pod is removed.
- `.spec.progressDeadlineSeconds` should be set to a reasonable value. It is recommended to leave
it at the default value of 1800 (30m) to allow for potential evacuator disruptions.
- `.spec.evacuators` value will be resolved on the Evacuation admission from the pod and should
Member Author

I will go over the whole KEP and try to make it more accessible.

characters (`63 - len("priority_")`)
- `PRIORITY` and `ROLE`
- `controller` should always set a `PRIORITY=10000` and `ROLE=controller`
- other evacuators should set `PRIORITY` according to their own needs (minimum value is 0,
Member Author

0 is "lowest priority"

What does this priority govern?

The order of Evacuator evaluation. The highest priority evacuators are evaluated first.

Can two of them have the same prio?

I think they can, but then their order is not defined in the API, and they would most likely just be ordered alphabetically by their class.

One exception is 10000/controller which should be unique and set by the controller.
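
A sketch of that evaluation order, assuming a hypothetical struct for the parsed evacuator annotations (highest PRIORITY first, ties broken alphabetically by class, since the API leaves tie order undefined):

```go
package main

import (
	"fmt"
	"sort"
)

// evacuatorEntry is a hypothetical representation of a parsed evacuator
// annotation: its class plus the PRIORITY it advertised.
type evacuatorEntry struct {
	Class    string
	Priority int // 0 is the lowest; 10000 with ROLE=controller is reserved
}

// orderEvacuators sorts entries into evaluation order: highest priority
// first, ties broken alphabetically by class.
func orderEvacuators(evs []evacuatorEntry) {
	sort.Slice(evs, func(i, j int) bool {
		if evs[i].Priority != evs[j].Priority {
			return evs[i].Priority > evs[j].Priority
		}
		return evs[i].Class < evs[j].Class
	})
}

func main() {
	evs := []evacuatorEntry{{"descheduler", 500}, {"controller", 10000}, {"backup-agent", 500}}
	orderEvacuators(evs)
	fmt.Println(evs) // [{controller 10000} {backup-agent 500} {descheduler 500}]
}
```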

It should start the evacuation only if it observes `.status.activeEvacuatorClass` that matches the
`EVACUATOR_CLASS` it previously set in the annotation.

If the evacuator is not interested in evacuating the pod, it should set
Member Author

Elsewhere you talked about an evacuation not being necessary and being cancelled.

I am sorry, I will make this part more clear.

Can it be cancelled once evacuators have started running? What if they have done something irreversible? Are they all expected to be reversible? Idempotent? Do they get called to reverse an evacuation?

As @sftim has said, one of the consumers/inspirations is NodeMaintenance or other projects that might want to stop the maintenance.

The cancellation should be under the control of the current Evacuator. Evacuators should offer the ability to cancel if possible, but they can forbid it if the termination is irreversible. Please see .status.EvacuationCancellationPolicy.
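
A sketch of the gating and cancellation rules described in this thread; the Go types and the "Forbid" value are assumptions mirroring the .status.activeEvacuatorClass and EvacuationCancellationPolicy fields:

```go
package evacuation

// evacuationStatus is a hypothetical, trimmed-down view of the Evacuation
// status fields relevant to this thread.
type evacuationStatus struct {
	ActiveEvacuatorClass string
	CancellationPolicy   string // e.g. "Allow" or "Forbid" (assumed values)
}

// mayStart reports whether an evacuator of the given class is allowed to act:
// it must observe itself as the active evacuator class first.
func mayStart(s evacuationStatus, myClass string) bool {
	return s.ActiveEvacuatorClass == myClass
}

// mayCancel reports whether the evacuation can still be cancelled; an
// evacuator that has started irreversible termination can forbid it.
func mayCancel(s evacuationStatus) bool {
	return s.CancellationPolicy != "Forbid"
}
```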

`.spec.progressDeadlineSeconds` and make sure to update the status periodically in less than that
duration. For example, if `.spec.progressDeadlineSeconds` is set to the default value of 1800 (30m),
it may update the status every 3 minutes. The status updates should look as follows:
- Check that `.status.activeEvacuatorClass` still matches the `EVACUATOR_CLASS` previously set in
Member Author

.status.activeEvacuatorClass is only reconciled/coordinated by the evacuation controller (single instance).

The reaction time can be tweaked with .spec.progressDeadlineSeconds.

I think the following values might be reasonable, but welcome any other suggestions.

	// The minimum value is 600 (10m) and the maximum value is 21600 (6h).
	// The default value is 1800 (30m).

FYI, this mechanism is very quick if there is no evacuator advertised on the pod. It will instantly fall back to eviction.
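
A hypothetical interceptor heartbeat loop matching the cadence above; the update callback stands in for a real status write, and all names are illustrative:

```go
package evacuation

import (
	"context"
	"time"
)

// runHeartbeat updates the Evacuation status periodically, well within
// progressDeadlineSeconds, so the evacuation controller keeps treating this
// evacuator as making progress. A tenth of the deadline matches the example
// above (3m ticks for the default 30m deadline).
func runHeartbeat(ctx context.Context, progressDeadline time.Duration, update func(time.Time) error) {
	t := time.NewTicker(progressDeadline / 10)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case now := <-t.C:
			// A transient failure is tolerable: the forgiving minimum
			// deadline (10m) absorbs short evacuator disruptions.
			_ = update(now)
		}
	}
}
```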

needs high availability either, due to a preference for a simpler deployment model, lack of
application support for HA, or to minimize compute costs. Also, any automated solution needs
to edit the PDB to account for the additional pod that needs to be spun to move the workload
from one node to another. This has been discussed in issue [kubernetes/kubernetes#66811](https://github.com/kubernetes/kubernetes/issues/66811)
Contributor

As far as I have read, the Evacuation API seems to always respect PDBs.
But I would like to know: will the Evacuation API be able to disable or ignore PDBs to help the case where the operator expects zero living nodes in the cluster (i.e. shutting down the cluster during weekends/holidays for maintenance and starting it again afterwards)?
The above case requires all nodes to shut down, and having PDBs with minAvailable: 1 or maxUnavailable: 0 will definitely block draining the nodes.

Member Author

The feature is not limited by PDBs; workloads can use them, but they can also bypass them if there is a controller that can make more advanced decisions.

Contributor

But for the k8s built-in controllers like Deployments, StatefulSets, etc.: if a PDB is configured, how will the Evacuation API be able to bypass the disruption budget and allow node termination below the PDB-configured values?

Member Author

I will update the KEP soon. But the mechanism of optionally bypassing the PDB should not change, and it requires action from the evacuator. PTAL at the KEP or just the attached diagram: https://github.com/kubernetes/enhancements/pull/4565/files#diff-560381a7190e130fded9e8758d53d2e5c4b69c39838485fc7a3fb13be096ae47

Contributor

Looking at the diagram, in the innermost loop it says the evacuator updates the Evacuation status. I just want to clarify the meaning of the part where it says:

"make sure the Evacuation status is up to date, evict the pod if it is not".

Does it mean that if the evacuator fails to update the Evacuation status within a specific time, the pod will be terminated regardless?

Member Author

Yup, this is implemented via .spec.progressDeadlineSeconds, but the min/default values should be forgiving enough to allow for evacuator disruptions.

Contributor

Oh, thank you. As I see it, does .spec.progressDeadlineSeconds double as a polling duration, counting from the most recent evacuator update, while waiting for other potential instigators?

Member Author

No, .spec.progressDeadlineSeconds is a static value that cannot be changed. Every interceptor/evacuator abides by the same deadline to produce an update tick. By making the value immutable, we make it easier for the interceptors not to miss the deadline.

- move ${POD_UID}-${POD_NAME_PREFIX} format to alternatives
- better describe Evacuator responsibilities
- add new PDB issues
- add a follow-up to the beta graduation to track evacuations in pods
- Update the followups.
- Add proposal summary.
- Add practical use cases to the proposal summary and to the followups.
  - Deployment Pod Surge Example
  - HorizontalPodAutoscaler Pod Surge Example
  - Descheduling and Downscaling

All renames for reference:

Evacuation (noun) -> EvictionRequest / Eviction Process
evacuation (verb) -> request an eviction / terminate / evict / process eviction
Evacuator -> Interceptor
Evacuee -> Pod
Evacuator Class -> Interceptor Class
Evacuation Instigator -> Eviction Requester
Evacuation Controller -> Eviction Request Controller
ActiveEvacuatorClass -> ActiveInterceptorClass
ActiveEvacuatorCompleted -> ActiveInterceptorCompleted
EvacuationProgressTimestamp -> ProgressTimestamp
ExpectedEvacuationFinishTime -> ExpectedInterceptorFinishTime
EvacuationCancellationPolicy -> EvictionRequestCancellationPolicy
FailedEvictionCounter -> FailedAPIEvictionCounter
@atiratree atiratree changed the title KEP-4563: Evacuation API KEP-4563: EvictionRequest API (fka Evacuation API) Dec 3, 2024
@atiratree atiratree changed the title KEP-4563: EvictionRequest API (fka Evacuation API) KEP-4563: EvictionRequest API (fka Evacuation) Dec 3, 2024
@atiratree
Member Author

KEP Update; notable changes:

  • Evacuation has been renamed to EvictionRequest and Evacuator to Interceptor.
  • The EvictionRequest is considered complete when the pod terminates (phase Succeeded or Failed). The pod does not need to be deleted if the restartPolicy allows it.
  • Improved proposal summary + many small improvements.
  • EvictionRequest name should be equal to the pod UID.
  • Increased constraints for pod interceptor annotations.
  • Updated issues.

Added practical use cases/example scenarios:

@atiratree
Member Author

atiratree commented Dec 3, 2024

I am out of time for now, but I have to admit I am worried about how this will play out. Coordinating between N arbitrary actors via the API is not something we really do anywhere else (AFAIK "classic" DRA was the one place and that is being removed).

Can we pin down some very concrete examples of what REAL evacuators might really do, and how the lifecycle works, especially WRT cancellation of an eviction? You can lose the details and just tell a story. This is significantly complicated and I don't feel able to defend it right now.

@thockin, in the workloads space there are cross-cutting concern issues where such coordination is necessary (PTAL at examples above).

  • For example, in point 3 of the Motivation section we see the need to coordinate the ReplicaSet controller (most likely also the Deployment controller), the HPA and the descheduler (or similar component) just on the interceptor side. And this does not take into account other custom interceptors that users may want to implement.
  • In the EvictionRequest Cancellation Examples section, we can also see the need for coordination between eviction requesters.

Could this be positioned like "readiness gates"? That sort of feels similar.

Such deletion gates could work in a system with well-meaning and diligent actors, but that is difficult to achieve in practice:

  • One of the main goals is not just to help application developers and application-oriented components implement their use cases, but also to help cluster administrators detect misbehaving actors and have a well-defined path to achieve pod termination. This is why we require a proof of work (status updates) to prevent people from using it as just another PDB that blocks cluster upgrades.
  • As above, it is difficult to achieve coordination between multiple actors through the use of gates.

Contributor — @soltysh left a comment

I've read through the initial requirements and assumptions, but I haven't read carefully through the eviction protocol and the contract described in the latter part of this document.

to edit the PDB to account for the additional pod that needs to be spun to move the workload
from one node to another. This has been discussed in issue [kubernetes/kubernetes#66811](https://github.com/kubernetes/kubernetes/issues/66811)
and in issue [kubernetes/kubernetes#114877](https://github.com/kubernetes/kubernetes/issues/114877).
2. Similar to the first point, it is difficult to use PDBs for applications that can have a variable
Contributor

This case seems identical to the first one. In both cases you're covering a single-replica application. How it got there (always has been one, or was scaled down) doesn't matter, IMO. So I'd advise squashing the two together.


HPAs are fun here though, because they can heal in ways that are otherwise left to the Deployment/StatefulSet controllers. So maybe it should be positioned as a consumer of this api?

Member Author

Yes, it is similar in that it covers scaling of a single-replica application, but the impact and follow-ups are different. HPA covers 1-n replicas by default, so I think it is okay to keep it separate.

So maybe it should be positioned as a consumer of this api?

HPA is interesting because it can be both a requester (e.g. during scaling down) and an interceptor (e.g. during node drain). As is shown in the examples in the KEP.


We should discourage the creation of preventive EvictionRequests, so that they do not end up as
another PDB. So we should design the API appropriately and also not allow behaviors that do not
conform to the eviction request contract.
Contributor

You can't enforce it in any way, and the API will get used in various, unpredictable ways (see PDB use cases, which you nicely outlined in the sections above). I understand the sentiment, but I don't think it will help here in any way.

Member Author

answered in #4565 (comment)


Its responsibility is to observe eviction requests from requesters and periodically check that
interceptors are making progress in evicting/terminating pods. It is important to see a consistent
effort by the interceptors to reconcile the progress of the eviction request. This is important to
Contributor

How are you planning to enforce progress here? How can you tell whether progress that has been stuck for several minutes is good or bad? Every application will define that differently, so it's hard to define a one-size-fits-all. Even if you require an update every X amount of time, that can be overused by dummy updates, just to block longer/indefinitely.

Member Author

We do not need to enforce progress. This is modelled after leader election. If an interceptor fails to update the EvictionRequest status to report progress, we will choose another interceptor or evict the pod.

Leader election works fine with much lower timeouts. I think we can be more forgiving here, but the 10m mark (allowed minimum) should be sufficient.

that can be overused by dummy updates, just to block longer/indefinitely.

The same could be said about the leader election, which could be used to hijack an existing controller.

The idea behind this is that we should make it very unpleasant to block eviction (proof of work instead of just creating a PDB). And if someone implements this, they might as well implement graceful termination.

Another benefit of this approach is that an administrator can take a look at a very long-running EvictionRequest, see which InterceptorClass is set on it, and easily track down the misbehaving actor. We can also hook into this with metrics/alerts.

Finally, since this is a normal k8s object, we can use RBAC to restrict and observe access to the EvictionRequest.
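
To make the "proof of work" concrete, here is a sketch of how the controller side might decide, under the assumptions discussed above; the function and its return values are hypothetical:

```go
package evacuation

import "time"

// nextAction decides what to do with an eviction request: wait while the
// active interceptor keeps proving progress, hand off to the next interceptor
// class once it goes stale, and fall back to API-initiated eviction when no
// interceptor is left.
func nextAction(lastProgress time.Time, deadline time.Duration, remainingClasses []string, now time.Time) string {
	if now.Sub(lastProgress) <= deadline {
		return "wait" // progress reported within progressDeadlineSeconds
	}
	if len(remainingClasses) > 0 {
		return "activate:" + remainingClasses[0] // next interceptor takes over
	}
	return "evict" // no interceptor left; API-initiated eviction
}
```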


Tracking the eviction status in a pod would be feasible, but it would be hard to coordinate between
the actors as mentioned. So it would be better to decouple the eviction request process from the pod
as well.
Contributor

Have you considered improving PDB to provide more insightful status about eviction, which would allow external controllers to watch that status and react accordingly (either by scaling the app up, or doing the actions you're describing above)? This way the whole process could be tracked and synchronized outside of the core resources, and k8s would only expose sufficient information to provide that kind of automation.

Member Author

  • The main problems are cardinality and synchronization: we have a set of pods guarded by a single PDB, and there is a group of actors (requesters and interceptors) interested in the lifecycle of these pods. Tracking the state of the pods in the status would not scale well, and an aggregate status would be less useful and would still run into update conflicts.
  • High API server load: as this status would only be used on eviction, it would have outdated information. Actors interacting with a PDB will want to have the latest information about the pods (e.g. recalculating descheduling rules according to CPU/memory utilization) and might do frequent status updates, congesting the API server.
  • Not all applications use PDBs. The EvictionRequest API tries to serve as a replacement for eviction and as the main descheduling primitive. This has use cases in cluster and pod autoscaling. For example, when used with HPA, we would like to be able to downscale any application/set of pods.

See some practical use cases for this feature:
1. Ability to upscale first before terminating the pods with a Deployment: [Deployment Pod Surge Example](#deployment-pod-surge-example)
based on the [EvictionRequest Process](#evictionrequest-process).
2. Ability to upscale first before terminating the pods with HPA: [HorizontalPodAutoscaler Pod Surge Example](#horizontalpodautoscaler-pod-surge-example)


I guess dragging along the PDB to keep minAvailable == HPA's current target is pretty hacky, thus the need for a non-PDB-blocked signal.

Member Author

Yes, there are a bunch of problems here: user friendliness, atomicity of actions, and responding to any descheduling (e.g. node drain).

@txomon commented Dec 10, 2024

Not sure if it's relevant to the conversation; however, I would expect that after this change, user scenario 2 is solved without the user doing anything, as the maxSurge parameter and the fact that it's a rolling deployment are already specified in the Deployment spec.

@atiratree
Member Author

Yes, the implementation of the follow-ups is only outlined here, and there should be additional KEPs for each improvement. The majority of the things proposed should result in an immediate benefit without any user action.
