KEP-4563: EvictionRequest API (fka Evacuation) #4565
Conversation
atiratree commented on Mar 28, 2024 (edited)
- One-line PR description: introduce EvictionRequest API to allow managed graceful pod removal
- Issue link: EvictionRequest API #4563
- Other comments: part of the functionality has been split from KEP-4212: Declarative Node Maintenance #4213 into this KEP
We will introduce a new term called evacuation. This is a contract between the evacuation instigator, the evacuee, and the evacuator. The contract is enforced by the API and an evacuation controller. We can think of evacuation as a managed and safer alternative to eviction.
---
Watch out for the risk of confusing end users.
We already have preemption and eviction and people confuse the two. Or three, because there are two kinds of eviction. And there's disruption in the mix.
Do we want to rename Scheduling, Preemption and Eviction to Scheduling, Preemption, Evacuation and Eviction?
---
Good idea, I added a mention of which kind of eviction I mean here.
Do we want to rename Scheduling, Preemption and Eviction to Scheduling, Preemption, Evacuation and Eviction?
Yes, I think we want to add a new concept there and generally update the docs once we have an alpha.
---
I'm with Tim here.
Preemption vs Eviction is already quite confusing. And TBH, I couldn't fully understand what the "evacuation" is supposed to solve by reading the summary or motivation.
From Goals:
Changes to the eviction API to support the evacuation process.
If this is already going to be part of the Eviction API, maybe it should be named as a form of eviction. Something like "cooperative eviction" or "eviction with ack" or something along those lines?
---
I'm all for framing it as another type of eviction; we already have two, so the extra cognitive load for users is not so much a problem.
---
@alculquicondor I have updated the summary and goals, I hope it makes more sense now.
I think the name should make the most sense to the person creating the Evacuation (the Evacuation Instigator ATM). So CooperativeEviction or EvictionWithAck is a bit misleading IMO, because from that person's perspective there is no additional step required of them. Only the evacuators and the evacuation controller implement the cooperative evacuation process, and this is hidden from the normal user.
My suggestions:
- GracefulEviction (might confuse people if it is associated with graceful pod termination, which it is not)
- SafeEviction (safer than the API-initiated one for some pods)
- Or just call it Eviction and tell people to use it instead of the Eviction API endpoint? This might be a bit confusing (at least in the beginning).
---
We can bikeshed names for the API kind; I'd throw a few of my own into the hat:
- EvictionRequest
- PodEvictionRequest
---
I have renamed the API to EvictionRequest to make the term recognizable. A minor disadvantage is that we have to clarify what type of eviction we mean when we say evict (API-initiated eviction, or EvictionRequest).
The rest of the renames are as follows:
Evacuation (noun) -> EvictionRequest / Eviction Process
evacuation (verb) -> request an eviction / terminate / evict / process eviction
Evacuator -> Interceptor
Evacuee -> Pod
Evacuator Class -> Interceptor Class
Evacuation Instigator -> Eviction Requester
Evacuation Controller -> Eviction Request Controller
ActiveEvacuatorClass -> ActiveInterceptorClass
ActiveEvacuatorCompleted -> ActiveInterceptorCompleted
EvacuationProgressTimestamp -> ProgressTimestamp
ExpectedEvacuationFinishTime -> ExpectedInterceptorFinishTime
EvacuationCancellationPolicy -> EvictionRequestCancellationPolicy
FailedEvictionCounter -> FailedAPIEvictionCounter
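To make the rename table above concrete, here is a hypothetical sketch of how the renamed fields might hang together as Go types. The field shapes, types, and the example class name are illustrative guesses, not the KEP's final schema.

```go
package main

import (
	"fmt"
	"time"
)

// EvictionRequestSpec sketches the spec side; only fields mentioned in the
// discussion are shown, and their types are assumptions.
type EvictionRequestSpec struct {
	// Immutable deadline for interceptor progress updates.
	ProgressDeadlineSeconds int32
	// Interceptors (formerly "evacuators") resolved at admission time.
	Interceptors []Interceptor
}

type Interceptor struct {
	Class    string // Interceptor Class (formerly Evacuator Class)
	Priority int32  // 10000 is reserved for the controlling workload controller
	Role     string // e.g. "controller"
}

// EvictionRequestStatus mirrors the renamed status fields.
type EvictionRequestStatus struct {
	ActiveInterceptorClass     string     // formerly ActiveEvacuatorClass
	ActiveInterceptorCompleted bool       // formerly ActiveEvacuatorCompleted
	ProgressTimestamp          *time.Time // formerly EvacuationProgressTimestamp
	FailedAPIEvictionCounter   int32      // formerly FailedEvictionCounter
}

func main() {
	// "example.com/surge" is a made-up interceptor class for illustration.
	s := EvictionRequestStatus{ActiveInterceptorClass: "example.com/surge"}
	fmt.Println(s.ActiveInterceptorClass)
}
```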
<!--
What other approaches did you consider, and why did you rule them out? These do
not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->
---
We have a number of examples of having a SomethingRequest or SomethingClaim API that then causes a something (certificate signing, node provisioning, etc).
Think of TokenRequest (a subresource for a ServiceAccount), or CertificateSigningRequest.
I would confidently and strongly prefer to have an EvictionRequest or PodEvictionRequest API, rather than an Evacuation API kind.
It's easy to teach that we have evictions and that an EvictionRequest is asking for one to happen; it's hard to teach the difference between an eviction and an evacuation.
As a side effect, this makes the feature gate easier to name (e.g. PodEvictionRequests).
---
As you have mentioned, we already have different kinds of eviction. So I think it would be good to use a completely new term to distinguish it from the others.
Also, Evacuation does not always result in eviction (and PDB consultation). It depends on the controller/workload. For some workloads like DaemonSets and static pods, API eviction has never worked before. This could also be very confusing if we name it the same way.
I think Evacuation fits this better because
- The name is shorter. If we go with EvacuationRequest then the evacuation will become just an abstract term and less recognizable.
- It seems it will have quite a lot of functionality included (state synchronization between multiple instigators and multiple evacuators, state of the evacuee and evacuation). TokenRequest and CertificateSigningRequest are simpler and not involved in a complex process.
---
I suggested EvictionRequest so that we don't have to have a section with the (too long) title: Scheduling, Preemption, Evacuation and Eviction. Not EvacuationRequest.
Adding another term doesn't scale so well: it means helping n people understand the difference between evacuation and eviction. It's a scaling challenge where n is not only large, it probably includes quite a few Kubernetes maintainers.
---
As for CertificateSigningRequest being simple: I don't buy it. There are three controllers, custom signers, an integration with trust bundles, the horrors of ASN.1 and X.509… trust me, it's complicated enough.
---
I understand that it will be confusing for people, but that will happen regardless of what term we will use.
My main issue is that evacuation does not directly translate to eviction. Therefore, I think it would be preferable to choose a new term (not necessarily evacuation).
I would like to get additional opinions from people about this. And we will definitely have to come back to this in the API review.
---
Should be resolved now: #4565 (comment)
Example evacuation triggers:
- Node maintenance controller: node maintenance triggered by an admin.
- Descheduler: descheduling triggered by a descheduling rule.
---
If the descheduler requests an eviction, what thing is being evacuated?
(the node maintenance and cluster autoscaler examples are easier: you're evacuating an entire node)
---
A single pod or multiple pods. The descheduler can use it as a new mechanism instead of eviction.
---
OK, so people typically think of “evacuate” as a near synonym of “drain” - you drain a node, you evacuate a rack or zone full of servers. Saying that you can evacuate a Pod might make people think its containers all get stopped, or just confuse readers. We do need to smooth out how we can teach this.
---
It seems it can be used in both scenarios https://www.merriam-webster.com/grammar/can-you-evacuate-people.
Evacuation of containers doesn't make sense because they are tied to the pod lifecycle. But, I guess it could be confusing if we do not make it explicitly clear what we are targeting.
---
Thing is, Kubernetes users typically - before we accept this KEP - use “evacuate” as a synonym for drain.
I'm (still) fine with the API idea, and still concerned about the naming.
---
Just to +1 the potential confusion of the term "evacuation".
Is it okay to have a glossary of terms for "evacuation", "eviction", and "drain" (or any other potentially confusing terms) added somewhere in this KEP?
---
I can include it in the KEP. And yes, we are going to change the name to something more suitable.
---
"To evacuate a person" implies "get them out of trouble, to safety" as opposed to "to empty" (as in vacuum). It's not ENTIRELY wrong in this context, but it's not entirely right either.
---
I will change the name to EvictionRequest. It was originally chosen to distinguish it from eviction, but there is value in making it familiar to existing concepts.
---
The API is renamed to EvictionRequest now, see #4565 (comment) for more details
---
@thockin Thanks a lot for the feedback. I see there are still quite a lot of places to improve.
Can we pin down some very concrete examples of what REAL evacuators might really do, and how the lifecycle works, especially WRT cancellation of an eviction?
I guess the user stories are still pretty abstract and it is best to show the flow with real examples.
I will work on the updates soon.
The major issues are:
1. Without extra manual effort, an application running with a single replica has to settle for
---
Unfortunately, we made it possible and people do it. And also utilize validation webhooks/controllers to implement their desired behavior.
- After a partial cleanup (e.g. storage migrated, notification sent) and if the application is still in an available state, the eviction API can be used to respect PodDisruptionBudgets.

In the end, the evacuation should always end with a pod being deleted (evict or delete call) by one
---
Right, we should make it clear that the pod might not end up deleted/terminated. I also want to add a mechanism that would result in the pod termination, but not deletion.
There can be many evacuation instigators for a single Evacuation.

When an instigator decides that a pod needs to be evacuated, it should create an Evacuation:
- With a predictable and consistent name in the following format: `${POD_UID}-${POD_NAME_PREFIX}`.
---
Name of the Evacuation object must be unique and predictable for each pod to prevent the creation of multiple Evacuations for the same pod. We do not expect the evacuators to support interaction with multiple Evacuations for a pod.
I was trying to combine ease of use (POD_NAME) and uniqueness of each pod instance (POD_UID).
- Having name only: we can solve the uniqueness by an admission plugin checking `.spec.podRef`.
- Having ID only: we can let clients (kubectl) print out the name from `.spec.podRef`.

Even though we do not expect a large number of instigators, I agree that `${POD_UID}-${POD_NAME_PREFIX}` is overly complicated for the end user, especially with the limit on the number of characters.
I (maybe wrongly) presume there's a cleanup loop that removes evacuations which have no matching pod?
Yes there is a GC, but that can have a delay. Nevertheless I am mainly concerned about having the 1:1 mapping, to make the implementation easier for all the parties (mainly Evacuators)
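A minimal sketch of how the proposed `${POD_UID}-${POD_NAME_PREFIX}` name could be built. The 253-character object-name limit is the standard Kubernetes constraint; the exact truncation rule here is my assumption, not something the KEP settles.

```go
package main

import "fmt"

// evictionRequestName sketches the proposed ${POD_UID}-${POD_NAME_PREFIX}
// scheme: a name that is unique per pod instance (via the UID) while staying
// recognizable (via a prefix of the pod name).
func evictionRequestName(podUID, podName string) string {
	const maxNameLen = 253
	budget := maxNameLen - len(podUID) - 1 // reserve room for the UID and the dash
	if len(podName) > budget {
		podName = podName[:budget] // keep only a prefix of the pod name
	}
	return podUID + "-" + podName
}

func main() {
	// A UID is 36 characters, so almost any pod name fits untruncated.
	fmt.Println(evictionRequestName(
		"d9607e19-f88f-11e6-a518-42010a800195", "nginx-7cbb8c69bb-x2z4q"))
}
```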
after the previous pod is removed.
- `.spec.progressDeadlineSeconds` should be set to a reasonable value. It is recommended to leave it at the default value of 1800 (30m) to allow for potential evacuator disruptions.
- `.spec.evacuators` value will be resolved on the Evacuation admission from the pod and should
---
I will go over the whole KEP and try to make it more accessible.
characters (`63 - len("priority_")`)
- `PRIORITY` and `ROLE`
- `controller` should always set a `PRIORITY=10000` and `ROLE=controller`
- other evacuators should set `PRIORITY` according to their own needs (minimum value is 0,
---
> 0 is "lowest priority". What does this priority govern?

The order of Evacuator evaluation; the highest-priority evacuators are evaluated first.

> Can two of them have the same prio?

I think they can, but then their order is not defined in the API and they would most likely just be ordered alphabetically by their class. One exception is `10000/controller`, which should be unique and set by the controller.
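The ordering described above can be sketched in a few lines. The alphabetical tie-break is the behavior suggested in this discussion, not a settled part of the KEP, and the `Evacuator` type and class names are made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Evacuator is a minimal stand-in for an entry in `.spec.evacuators`;
// only the fields relevant to ordering are sketched here.
type Evacuator struct {
	Class    string
	Priority int
}

// orderEvacuators sorts highest priority first, breaking ties
// alphabetically by class.
func orderEvacuators(evs []Evacuator) {
	sort.Slice(evs, func(i, j int) bool {
		if evs[i].Priority != evs[j].Priority {
			return evs[i].Priority > evs[j].Priority // highest priority first
		}
		return evs[i].Class < evs[j].Class // tie-break alphabetically
	})
}

func main() {
	evs := []Evacuator{
		{Class: "example.com/drainer", Priority: 100},
		{Class: "example.com/controller", Priority: 10000}, // the workload controller
		{Class: "example.com/backup", Priority: 100},
	}
	orderEvacuators(evs)
	for _, e := range evs {
		fmt.Println(e.Priority, e.Class)
	}
}
```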
It should start the evacuation only if it observes `.status.activeEvacuatorClass` that matches the `EVACUATOR_CLASS` it previously set in the annotation.

If the evacuator is not interested in evacuating the pod, it should set
---
> Elsewhere you talked about an evacuation not being necessary and being cancelled.

I am sorry, I will make this part more clear.

> Can it be cancelled once evacuators have started running? What if they have done something irreversible? Are they all expected to be reversible? Idempotent? Do they get called to reverse an evacuation?

As @sftim has said, one of the consumers/inspirations is NodeMaintenance, and other projects might also want to stop the maintenance.
The cancellation should be under the control of the current Evacuator. Evacuators should offer the ability to cancel if possible, but they can forbid it if the termination is irreversible. Please see `.status.EvacuationCancellationPolicy`.
`.spec.progressDeadlineSeconds` and make sure to update the status periodically in less than that duration. For example, if `.spec.progressDeadlineSeconds` is set to the default value of 1800 (30m), it may update the status every 3 minutes. The status updates should look as follows:
- Check that `.status.activeEvacuatorClass` still matches the `EVACUATOR_CLASS` previously set in
---
`.status.activeEvacuatorClass` is only reconciled/coordinated by the evacuation controller (single instance).
The reaction time can be tweaked with `.spec.progressDeadlineSeconds`.
I think the following values might be reasonable, but welcome any other suggestions.
// The minimum value is 600 (10m) and the maximum value is 21600 (6h).
// The default value is 1800 (30m).
FYI, this mechanism is very quick if there is no evacuator advertised on the pod. It will instantly fall back to eviction.
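The proposed bounds can be sketched as a defaulting/validation helper. The bounds are quoted from the comment above; the helper itself (its name and where admission would call it) is just an illustrative assumption.

```go
package main

import "fmt"

// Proposed bounds for .spec.progressDeadlineSeconds, from the discussion.
const (
	minProgressDeadline     = 600   // 10m
	defaultProgressDeadline = 1800  // 30m
	maxProgressDeadline     = 21600 // 6h
)

// defaultAndValidate sketches how admission might default an unset value
// and reject an out-of-range one.
func defaultAndValidate(deadline *int32) (int32, error) {
	if deadline == nil {
		return defaultProgressDeadline, nil
	}
	if *deadline < minProgressDeadline || *deadline > maxProgressDeadline {
		return 0, fmt.Errorf("progressDeadlineSeconds must be in [%d, %d], got %d",
			minProgressDeadline, maxProgressDeadline, *deadline)
	}
	return *deadline, nil
}

func main() {
	v, _ := defaultAndValidate(nil) // unset falls back to the default
	fmt.Println(v)

	tooLow := int32(60)
	if _, err := defaultAndValidate(&tooLow); err != nil {
		fmt.Println(err)
	}
}
```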
needs high availability either, due to a preference for a simpler deployment model, lack of application support for HA, or to minimize compute costs. Also, any automated solution needs to edit the PDB to account for the additional pod that needs to be spun to move the workload from one node to another. This has been discussed in issue [kubernetes/kubernetes#66811](https://github.com/kubernetes/kubernetes/issues/66811)
---
As far as I have read about the Evacuation API, it seems to always respect PDBs.
But I wish to know: will the Evacuation API be able to disable or ignore PDBs to help the case where the operator expects zero living nodes in the cluster (i.e. shutting down the cluster during weekends/holidays for maintenance and starting it again afterwards)?
The above case requires all nodes to shut down, and having PDBs with `minAvailable: 1` or `maxUnavailable: 0` will definitely block draining the nodes.
---
The feature is not limited by PDBs; workloads can use them, but they can also be bypassed if there is a controller that can make more advanced decisions.
---
But for the k8s built-in controllers like deployments, statefulsets, etc; if a PDB was configured, how will EvacuationAPI be able to bypass the disruption budget and allow node termination below the PDB configured values?
---
I will update the KEP soon. But the mechanism of optionally bypassing the PDB should not change, and it requires action from the evacuator. PTAL at the KEP or just the attached diagram: https://github.com/kubernetes/enhancements/pull/4565/files#diff-560381a7190e130fded9e8758d53d2e5c4b69c39838485fc7a3fb13be096ae47
---
looking at the diagram, where in the innermost loop, it says the Evacuator updates the Evacuation status. I just want to clarify the meaning of the part where it says
"make sure the Evacuation status is up to date, evict the pod if it is not".
Does it mean if the evacuator failed to update the Evacuation status for a specific time, the pod will be terminated regardless?
---
Yup, this is implemented via the `.spec.progressDeadlineSeconds`, but the min/default values should be forgiving to allow for evacuator disruptions.
---
Oh, thank you. As I see it, does `.spec.progressDeadlineSeconds` double as the polling duration, counting from the most recent evacuator update, waiting for other potential instigators?
---
No, the `.spec.progressDeadlineSeconds` is a static value that cannot be changed. Every interceptor/evacuator abides by the same deadline to produce an update tick. By making the value immutable, we make it easier for the interceptors not to miss the deadline.
- indicate that the pod Evacuation is complete if the pod `restartPolicy` allows it
- move `${POD_UID}-${POD_NAME_PREFIX}` format to alternatives
- better describe Evacuator responsibilities
- add new PDB issues
- add a follow-up to the beta graduation to track evacuations in pods
- Update the followups.
- Add proposal summary.
- Add practical use cases to the proposal summary and to the followups: Deployment Pod Surge Example, HorizontalPodAutoscaler Pod Surge Example, Descheduling and Downscaling.
- Apply all renames (see the rename list above for reference).
KEP Update; notable changes:
Added practical use cases/example scenarios:
@thockin, in the workloads space there are cross-cutting concern issues where such coordination is necessary (PTAL at examples above).
Such deletion gates could work in a system with well-meaning and diligent actors, but that is difficult to achieve in practice:
---
I've read through the initial requirements and assumptions, I haven't read carefully through the Eviction protocol and the contract described in the latter part of this document.
to edit the PDB to account for the additional pod that needs to be spun to move the workload from one node to another. This has been discussed in issue [kubernetes/kubernetes#66811](https://github.com/kubernetes/kubernetes/issues/66811) and in issue [kubernetes/kubernetes#114877](https://github.com/kubernetes/kubernetes/issues/114877).
2. Similar to the first point, it is difficult to use PDBs for applications that can have a variable
---
This case seems identical to the first one. In both cases you're covering a single-replica application. How it got there (always has been, or was scaled down) doesn't matter, imo. So I'd advise squashing the two together.
---
HPAs are fun here though, because they can heal in ways that are otherwise left to the deployment/statefulset controllers. So maybe it should be positioned as a consumer of this API?
---
Yes, it is similar in that it covers scaling of a single-replica application, but the impact and follow-ups are different. HPA covers 1-n replicas by default. So I think it is okay to keep it separate.
So maybe it should be positioned as a consumer of this api?
HPA is interesting because it can be both a requester (e.g. during scaling down) and an interceptor (e.g. during node drain). As is shown in the examples in the KEP.
We should discourage the creation of preventive EvictionRequests, so that they do not end up as another PDB. So we should design the API appropriately and also not allow behaviors that do not conform to the eviction request contract.
---
You can't enforce it in any way, and the API will get used in various, unpredictable ways (see PDB use cases, which you nicely outlined in the sections above). I understand the sentiment, but I don't think it will help here in any way.
---
answered in #4565 (comment)
Its responsibility is to observe eviction requests from requesters and periodically check that interceptors are making progress in evicting/terminating pods. It is important to see a consistent effort by the interceptors to reconcile the progress of the eviction request. This is important to
---
How are you planning to enforce progress here? How can you tell whether progress that has been stuck for several minutes is good or bad? Every application will define that differently, so it's hard to define one size fits all. Even if you require an update every X amount of time, that can be abused with dummy updates, just to block longer or indefinitely.
---
We do not need to enforce the progress. This is modelled after leader election. If an interceptor fails to update the EvictionRequest status to report the progress we will choose another interceptor or evict the pod.
Leader election works fine with much lower timeouts. I think we can be more forgiving here, but the 10m mark (allowed minimum) should be sufficient.
that can be overused by dummy updates, just to block longer/indefinitely.
The same could be said about the leader election, which could be used to hijack an existing controller.
The idea behind this is that we should make it very unpleasant to block eviction (proof of work instead of just creating a PDB). And if someone implements this, they might as well implement graceful termination.
Another benefit of this approach is that an administrator can take a look at a very long EvictionRequest and see what InterceptorClass is set on it and can easily track down the misbehaving actor. We can also hook into this with metrics/alerts.
Finally, since this is a normal k8s object, we can use RBAC to restrict and observe access to the EvictionRequest.
Tracking the eviction status in a pod would be feasible, but it would be hard to coordinate between the actors as mentioned. So it would be better to decouple the eviction request process from the pod as well.
---
Have you considered improving PDB to provide more insightful status about eviction, which would allow external controllers to watch that status and react accordingly (either by scaling the app up, or doing the actions you're describing above)? This way the whole process could be tracked and synchronized outside of the core resources, and k8s would only expose sufficient information to provide that kind of automation.
---
- The main problems are cardinality and synchronization, we have a set of pods guarded by a single PDB and there is a group of actors (requesters and interceptors) interested in the lifecycle of these pods. So tracking the state of the pods in the status would not scale well. An aggregate status would be less useful and would still run into update conflicts.
- High API server load: as this status would only be used on eviction, it will have outdated information. Actors interacting with a PDB will want to have the latest information about the pods (e.g. recalculating descheduling rules according to the cpu/mem utilization) and might do frequent status updates, congesting the API server.
- Not all applications use PDBs. The EvictionRequest API tries to serve as a replacement for eviction and as the main descheduling primitive. This has use cases in cluster and pod autoscaling. For example when used with HPA, we would like to downscale any application/set of pods.
See some practical use cases for this feature:
1. Ability to upscale first before terminating the pods with a Deployment: [Deployment Pod Surge Example](#deployment-pod-surge-example) based on the [EvictionRequest Process](#evictionrequest-process).
2. Ability to upscale first before terminating the pods with HPA: [HorizontalPodAutoscaler Pod Surge Example](#horizontalpodautoscaler-pod-surge-example)
---
I guess dragging along the PDB to have minAvailable == HPA's current target is pretty hacky, thus the need for a non-PDB-blocked signal.
---
Yes, there are a bunch of problems here, user friendliness, atomicity of actions, and responding to any descheduling (e.g. node drain).
Not sure if it's relevant to the conversation; however, I would expect that after this change, user scenario 2 is solved without the user doing anything, as the maxSurge parameter and the fact that it's a rolling-update deployment are already specified in the deployment spec.

Yes, the implementation of the follow-ups is only outlined here, and there should be additional KEPs for each improvement. The majority of the things proposed should result in an immediate benefit without any user action.