
Replicaset scaling down takes inter-pods scheduling constraints into consideration #124306

Open
sanposhiho opened this issue Apr 14, 2024 · 52 comments
Assignees
Labels
kind/design Categorizes issue or PR as related to design. kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@sanposhiho
Member

sanposhiho commented Apr 14, 2024

/sig apps
/sig scheduling
/kind feature
/assign

I'd like to get feedback on this proposal. I can proceed to the KEP process later if everyone finds it worthwhile.

Summary

Refactor the ReplicaSet scale-down implementation for better extensibility,
and take inter-pod scheduling constraints into account during scale-down.

Motivation

When a scale-down happens, the ReplicaSet controller determines which Pod(s) to remove based on start time etc. (ref), but not on scheduling constraints.

Inter-pod scheduling constraints (PodAffinity/PodTopologySpread) depend on other Pods' placement, but they are honored only at scheduling time.
It would be great if we could, at least, keep spreading or affinity to some extent during scale-down in vanilla Kubernetes, that is, without installing the descheduler etc.
We've seen many issues asking for this feature; the oldest one I could find is #4301!

Goal

  • Refactor the ReplicaSet controller implementation around scale-down and make it easier to implement any scale-down behavior.
    • Introduce the ScaleDownScorePlugin/ScaleDownNormalizeScorePlugin interfaces so that each scale-down factor is implemented as a plugin.
  • Take inter-pod scheduling constraints into account during ReplicaSet scale-down.
    • For now, inter-pod scheduling constraints means only PodAffinity and PodTopologySpread.

Non Goal

This proposal does not aim to:

  • Always honor those scheduling constraints during scaling down.
    • It'll just be best-effort.
  • Take ALL Pods in a cluster into consideration when calculating the best Pod to remove.
    • Doing so would incur a huge calculation cost.
  • Allow custom plugins.
    • Technically, it'd be possible to allow custom ScaleDownScorePlugin/ScaleDownNormalizeScorePlugin implementations from users, like the scheduling framework does. But that's out of scope for this proposal.

(An elaborated explanation of these Non Goal: #124306 (comment))

Proposal

We’ll introduce new interfaces named ScaleDownScorePlugin and ScaleDownNormalizeScorePlugin.

type ScaleDownScorePlugin interface {
  // Score scores Pods; the higher the score a Pod gets, the more preferable it is to remove that Pod, from this plugin's point of view.
  // candidatesToDelete holds the Pods from the ReplicaSet that the controller is scaling down.
  // relatedPods holds all Pods that are owned by any ReplicaSet owned by this ReplicaSet's owner.
  // ref: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/replicaset/replica_set.go#L796 
  //
  // Each plugin’s Score() is executed concurrently.
  Score(ctx context.Context, candidatesToDelete []*v1.Pod, relatedPods []*v1.Pod) ([]*PodWithScore, error)
}

type ScaleDownNormalizeScorePlugin interface {
  // NormalizeScore is called after all ScaleDownScorePlugin Score() are called. 
  // A successful run of NormalizeScore will update the scores list.
  // Each plugin's NormalizeScore() is executed one by one.
  // candidatesToDelete is the aggregated result from all Score plugins.
  NormalizeScore(ctx context.Context, candidatesToDelete []*PodWithScore) error
}

type PodWithScore struct {
  Pod *v1.Pod
  TotalScore int64
}

The scale-down flow would be as follows (a rough sketch is given below):

  1. Calculate how many Pods should be removed from the ReplicaSet being scaled down.
  2. Call Score() of each plugin concurrently.
  3. Call NormalizeScore() of each plugin one by one.
  4. Remove the Pods with the highest scores.
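
For concreteness, here is a minimal sketch of how the controller could drive these plugins. It is illustrative only; the orchestrator type, helper names, and error handling are assumptions, not part of the proposal.

import (
  "context"
  "sort"
  "sync"

  v1 "k8s.io/api/core/v1"
  "k8s.io/apimachinery/pkg/types"
)

// scaleDownOrchestrator is a hypothetical helper, not part of the proposal itself.
type scaleDownOrchestrator struct {
  scorePlugins     []ScaleDownScorePlugin
  normalizePlugins []ScaleDownNormalizeScorePlugin
}

// podsToDelete returns the `diff` highest-scored candidates (diff = current replicas - desired replicas).
func (o *scaleDownOrchestrator) podsToDelete(ctx context.Context, candidatesToDelete, relatedPods []*v1.Pod, diff int) ([]*v1.Pod, error) {
  // Aggregate per-Pod totals keyed by UID.
  totals := make(map[types.UID]*PodWithScore, len(candidatesToDelete))
  for _, p := range candidatesToDelete {
    totals[p.UID] = &PodWithScore{Pod: p}
  }

  // Step 2: call Score() of each plugin concurrently and sum the results.
  var (
    mu       sync.Mutex
    wg       sync.WaitGroup
    scoreErr error
  )
  for _, plugin := range o.scorePlugins {
    wg.Add(1)
    go func(pl ScaleDownScorePlugin) {
      defer wg.Done()
      scored, err := pl.Score(ctx, candidatesToDelete, relatedPods)
      mu.Lock()
      defer mu.Unlock()
      if err != nil {
        scoreErr = err
        return
      }
      for _, s := range scored {
        if agg, ok := totals[s.Pod.UID]; ok {
          agg.TotalScore += s.TotalScore
        }
      }
    }(plugin)
  }
  wg.Wait()
  if scoreErr != nil {
    return nil, scoreErr
  }

  // Step 3: call NormalizeScore() of each plugin one by one; it mutates the aggregated list.
  aggregated := make([]*PodWithScore, 0, len(totals))
  for _, pws := range totals {
    aggregated = append(aggregated, pws)
  }
  for _, plugin := range o.normalizePlugins {
    if err := plugin.NormalizeScore(ctx, aggregated); err != nil {
      return nil, err
    }
  }

  // Step 4: remove the `diff` Pods with the highest total score.
  sort.Slice(aggregated, func(i, j int) bool { return aggregated[i].TotalScore > aggregated[j].TotalScore })
  if diff > len(aggregated) {
    diff = len(aggregated)
  }
  picked := make([]*v1.Pod, 0, diff)
  for _, pws := range aggregated[:diff] {
    picked = append(picked, pws.Pod)
  }
  return picked, nil
}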

How each plugin is implemented

We’ll implement plugins for PodAffinity, PodTopologySpread, and the existing logic that determines which Pods to delete.

Each implementation would be like the following:

PodAffinity

It'll be implemented as a ScaleDownScorePlugin.
It gives a high score to a candidate Pod if the Node that the candidate Pod is running on doesn't have any Pod matching the candidate Pod's PodAffinity.
To lower the calculation cost, it's probably better to reuse the scheduler's snapshot in the ReplicaSet controller too.
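
For illustration, a much-simplified version of such a plugin could look like the sketch below. The nodeNameToPods snapshot, the fixed score value, and the restriction to required affinity terms with a node-level topology are all simplifying assumptions; a real implementation would reuse the scheduler's snapshot and handle topologyKey and namespaces properly.

import (
  "context"

  v1 "k8s.io/api/core/v1"
  metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
  "k8s.io/apimachinery/pkg/labels"
)

// podAffinityScaleDown is a hypothetical, simplified ScaleDownScorePlugin.
type podAffinityScaleDown struct {
  nodeNameToPods map[string][]*v1.Pod // assumed snapshot: Node name -> Pods on that Node
}

func (pl *podAffinityScaleDown) Score(ctx context.Context, candidatesToDelete, relatedPods []*v1.Pod) ([]*PodWithScore, error) {
  const unsatisfiedAffinityScore = 100 // prefer removing Pods whose affinity is no longer satisfied
  scores := make([]*PodWithScore, 0, len(candidatesToDelete))
  for _, pod := range candidatesToDelete {
    var score int64
    if !affinitySatisfiedOnNode(pod, pl.nodeNameToPods[pod.Spec.NodeName]) {
      score = unsatisfiedAffinityScore
    }
    scores = append(scores, &PodWithScore{Pod: pod, TotalScore: score})
  }
  return scores, nil
}

// affinitySatisfiedOnNode reports whether every required PodAffinity term of pod matches
// at least one other Pod on the same Node (topologyKey is simplified to "same Node" here).
func affinitySatisfiedOnNode(pod *v1.Pod, podsOnNode []*v1.Pod) bool {
  if pod.Spec.Affinity == nil || pod.Spec.Affinity.PodAffinity == nil {
    return true
  }
  for _, term := range pod.Spec.Affinity.PodAffinity.RequiredDuringSchedulingIgnoredDuringExecution {
    sel, err := metav1.LabelSelectorAsSelector(term.LabelSelector)
    if err != nil {
      continue // ignore malformed selectors in this sketch
    }
    matched := false
    for _, other := range podsOnNode {
      if other.UID != pod.UID && sel.Matches(labels.Set(other.Labels)) {
        matched = true
        break
      }
    }
    if !matched {
      return false
    }
  }
  return true
}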

PodTopologySpread

It'll be implemented as a ScaleDownNormalizeScorePlugin.

  1. Group relatedPods by the domain of the topology spread.
  2. Calculate each domain's total by summing the TotalScore of all Pods in that domain.
  3. Normalize the scores so that each domain's total score becomes similar.

Example:

Before PodTopologySpread NormalizeScore():

  Domain-a (total: 600):  PodA: 100,  PodB: 200,  PodC: 300
  Domain-b (total: 300):  PodD: 100,  PodE: 100,  PodF: 100
  Domain-c (total: 3000): PodG: 500,  PodH: 1000, PodI: 1500

After PodTopologySpread NormalizeScore():

  Domain-a (total: 300): PodA: 50,  PodB: 100, PodC: 150
  Domain-b (total: 300): PodD: 100, PodE: 100, PodF: 100
  Domain-c (total: 300): PodG: 50,  PodH: 100, PodI: 150

Before the PodTopologySpread normalization, all Pods from Domain-c could be removed; afterwards, it's unlikely that all Pods from the same domain are removed.

When actually implementing it, we can consider more; for example, in the above case, if the controller is trying to remove 5 Pods, all Pods in Domain-b could still be removed at the same time. So maybe we want to avoid "all Pods have the same score" after the normalization (e.g., change the scores to PodD: 99, PodE: 100, PodF: 101).
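
To make the normalization concrete, here is a rough sketch that reproduces the numbers in the example above. The single topology key and the domainOf helper are assumptions; a real plugin would read each Pod's topologySpreadConstraints, and could also apply the tie-breaking tweak mentioned above.

import (
  "context"
  "math"

  v1 "k8s.io/api/core/v1"
)

// topologySpreadNormalize is a hypothetical ScaleDownNormalizeScorePlugin.
type topologySpreadNormalize struct {
  domainOf func(*v1.Pod) string // assumed helper: returns the topology domain (e.g. zone) of a Pod
}

func (pl *topologySpreadNormalize) NormalizeScore(ctx context.Context, candidatesToDelete []*PodWithScore) error {
  // 1. Group candidates by domain and compute each domain's total score.
  byDomain := map[string][]*PodWithScore{}
  domainTotal := map[string]int64{}
  for _, pws := range candidatesToDelete {
    d := pl.domainOf(pws.Pod)
    byDomain[d] = append(byDomain[d], pws)
    domainTotal[d] += pws.TotalScore
  }
  if len(byDomain) == 0 {
    return nil
  }

  // 2. Pick a common target total; here, the smallest domain total (300 in the example above).
  var target int64 = math.MaxInt64
  for _, t := range domainTotal {
    if t < target {
      target = t
    }
  }

  // 3. Scale every Pod's score so each domain's total becomes roughly the target,
  //    e.g. Domain-c's 500/1000/1500 becomes 50/100/150.
  for d, pods := range byDomain {
    if domainTotal[d] == 0 {
      continue
    }
    for _, pws := range pods {
      pws.TotalScore = pws.TotalScore * target / domainTotal[d]
    }
  }
  return nil
}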

Existing sorting logic

Note also that the existing logic to determine which Pod(s) to remove (start time, deletion cost, etc.) can also be implemented as a ScaleDownScorePlugin, as sketched below.
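
As a rough illustration only, such a plugin might look like the following. The concrete score values and the way the criteria are combined are assumptions for this sketch; the ranking used today (unscheduled first, then pending, then not-ready, pod-deletion-cost, colocation, readiness time, restart count, creation time) is more involved.

import (
  "context"
  "math"
  "strconv"
  "time"

  v1 "k8s.io/api/core/v1"
)

// existingCriteriaScore is a hypothetical plugin approximating today's deletion ranking.
type existingCriteriaScore struct{}

func (existingCriteriaScore) Score(ctx context.Context, candidatesToDelete, relatedPods []*v1.Pod) ([]*PodWithScore, error) {
  scores := make([]*PodWithScore, 0, len(candidatesToDelete))
  for _, pod := range candidatesToDelete {
    var score int64
    switch {
    case pod.Spec.NodeName == "":
      score = 10000 // not scheduled yet: remove first
    case pod.Status.Phase != v1.PodRunning:
      score = 5000 // pending/unknown: remove next
    default:
      // Lower controller.kubernetes.io/pod-deletion-cost => prefer removal.
      if v, ok := pod.Annotations["controller.kubernetes.io/pod-deletion-cost"]; ok {
        if cost, err := strconv.ParseInt(v, 10, 64); err == nil {
          score -= cost
        }
      }
      // Newer Pods get a slightly higher removal score.
      if pod.Status.StartTime != nil {
        ageMinutes := time.Since(pod.Status.StartTime.Time).Minutes()
        score += int64(math.Max(0, 1000-ageMinutes))
      }
    }
    scores = append(scores, &PodWithScore{Pod: pod, TotalScore: score})
  }
  return scores, nil
}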

Risk(Downside) and Mitigation

Negative impact to throughput of replicaset controller

Scaling down would be slower than the current implementation because of an additional calculation.

I believe keeping the current scale-down throughput is not a hard/critical requirement.
But, given the controller has workers that run concurrently and are shared between scale-up and scale-down processing, we cannot say there'd be zero impact on scale-up throughput either.

We have to verify that the throughput (both scale-up and scale-down) remains acceptable after the implementation change.

Backwards compatibility

#124306 (comment)

How will the new scores interact with pod-deletion-cost?

What about backwards-compatibility? What if users are currently relying on the fact that the first Pod to be deleted is roughly the newest one?

If we just implement the proposal and make it the default behavior, we would break the existing scale-down behavior.

If we are concerned about such breakage from this proposal, we can, for example:

  1. Add a new field (or annotation) somewhere to make this behaviour non-default.
  2. Simplify PodTopologySpread's NormalizeScore() and make the scores returned from PodAffinity smaller than those of the existing plugins.
    • For example, simplify NormalizeScore() to modify only the score of one Pod per domain so that at least one Pod remains per domain.

I'd prefer (1). If we choose (2), the benefit from this proposal would be small.

Alternative

Somehow use controller.kubernetes.io/pod-deletion-cost. But in any case, we'd likely end up having to update Pods frequently, so I didn't choose that approach.

Users deploy controllers that update the annotation frequently causing a significant load on the api server.
https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/2255-pod-cost#risks-and-mitigations

(An elaborated explanation why I didn't choose deletion-cost idea: #124306 (comment))

Reference

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. kind/feature Categorizes issue or PR as related to a new feature. labels Apr 14, 2024
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Apps Apr 14, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Apr 14, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@toVersus
Contributor

toVersus commented Apr 14, 2024

We’ll implement ScaleDownPlugins for PodAffinity / PodTopologySpread ( / existing logic to determine which Pods to delete).

I think in the existing deletion process, various aspects such as startup time, Pod status, and restart count are prioritized and checked sequentially. If you were to break the existing process down into finer plugins, it seems like you would need to assign priorities or weights to each plugin. How do you plan to handle this aspect?

Always honor those scheduling constraints during scaling down.

In what scenarios would the scheduling constraints not be considered when deciding which Pods to scale down? What exactly does "best-effort" mean here in this context?

Take ALL Pods in a cluster into consideration when calculating the best Pod to remove.

Does this mean targeting only Pods belonging to a specific ReplicaSet?

We'll configure PodTopologySpread's Score() to be executed finally.

If PodTopologySpread were to support the uneven Pod spreading as proposed in #124306, it would mean that the PodTopologySpread plugin in ReplicaSet controller would start considering uneven Pod placement when deleting Pods. Sounds great!

Scaling down would be slower than the current implementation because of an additional calculation.

The ReplicaSet controller doesn't perform Pod creation and deletion processes concurrently, so wouldn't a delay in the scale-down process lead to a delay in the start of the scale-out process? I feel like this would only have an impact in special cases where HPA behavior settings and Pod loads spike at short intervals.

@sanposhiho
Member Author

sanposhiho commented Apr 15, 2024

it seems like you would need to assign priorities or weights to each plugin. How do you plan to handle this aspect?

We can just, for example, return very high scores if a Pod is not scheduled or not in running state.
I don't think we need a weight configuration per plugin (for now).

In what scenarios would the scheduling constraints not be considered when deciding which Pods to scale down?

There are many scenarios that would be missed by the implementation of this proposal, e.g.,

  1. Replicaset-A has affinity for Pods from Replicaset-B. But Pods from Replicaset-B don't have any affinity. In this case, Pods from Replicaset-B would be removed without any consideration to PodAffinity of Replicaset-A.
  2. One topology spread is composed of Pods from various replicasets. In this case, in each replicaset's scaling down, topology spread is honored (best-effort), but the controller doesn't look at overall situation of topology spreading. Consequently, supposing each replicaset has one Pod per domain, the controller may end up deleting all Pods from domain-c during scaling down of those replicasets.

Does this mean targeting only Pods belonging to a specific ReplicaSet?

It means the controller doesn't try to keep the scheduling constraint of Pods that aren't owned by the Replicaset being scaled down.
Taking example (1) above, during scale-down of Replicaset-B, the PodAffinity of Replicaset-A isn't considered at all. If users want the controller to consider it, they have to add PodAffinity to Replicaset-B too.

If PodTopologySpread were to support ....

Yes exactly. We can easily support that by changing the way of normalizing in PodTopologySpread ScaleDownPlugin.

The ReplicaSet controller doesn't perform Pod creation and deletion processes concurrently, so wouldn't a delay in the scale-down process lead to a delay in the start of the scale-out process?

Actually, the controller has workers that run concurrently. But your concern is valid either way: given the workers are shared between scale-up and scale-down processing, we cannot say there'd be zero impact on scale-up throughput. I'm not sure whether the impact would be big enough for people to worry about, though.
If some people found it critical, they'd have to increase ComponentConfig.ReplicaSetController.ConcurrentRSSyncs and maybe the resource of controller-manager too.

@nayihz
Contributor

nayihz commented Apr 16, 2024

Nice proposal.

Before the PodTopologySpread normalization, all Pods from Domain-c could be removed,

Can you elaborate more on the example? Why would all pods from Domain-c be removed?

Do we need to break the existing process (start time/ pod status/ restart count) down into individual plugins? If so, what is the correct order for executing these plugins? If there are multiple plugins, can they be executed concurrently?

@sanposhiho
Member Author

sanposhiho commented Apr 16, 2024

Can you elaborate more on the example? Why would all pods from Domain-c be removed?

The higher the score a Pod gets, the more preferable it is to remove that Pod.
In the example, if the ReplicaSet controller is trying to remove three (or more) Pods, then all Pods from Domain-c would be selected (if we didn't implement a plugin for topology spread).

Do we need to break the existing process (start time/ pod status/ restart count) down into individual plugins? If so, what is the correct order for executing these plugins?

Technically, we don't need to. We can have a separate plugin for each piece of existing logic, which I think would look cleaner, but if we want, we can have one plugin for all of the existing logic.
The order doesn't matter, I think.

If there are multiple plugins, can they be executed concurrently?

I updated the interface design a bit while thinking about the answer to this part.
So, in the latest design, all Score() calls can run concurrently, while NormalizeScore() is called one by one because it mutates the list. (We'll have only one NormalizeScore plugin for now, so performance isn't a concern here.)

@toVersus
Contributor

So, in the latest design, all Score() calls can run concurrently

You've made changes to call the Score plugin concurrently, so now candidatesToDelete has been changed to []*v1.Pod from the results of the Score plugin execution ([]*PodWithScore). Will you pass a list of excluded Pods, such as those without a DeletionTimestamp set, as candidatesToDelete, just like in the current implementation? Will there be additional processing at the end to aggregate the results of the Score plugin?

@sanposhiho
Member Author

Will you pass a list of excluded Pods

Yes.

Will there be additional processing at the end to aggregate the results of the Score plugin?

No. It'd be just: call all Score() plugins → aggregate all scores from them → call NormalizeScore() plugins → pick the higher-scored Pods to delete.

Any particular reason you asked this?

@toVersus
Contributor

toVersus commented Apr 18, 2024

call all Score() plugins → aggregate all scores from them → call NormalizeScore() plugins → pick the higher-scored Pods to delete.

Thanks! My words may have been clumsy, but the processing flow was just as I had imagined.

Any particular reason you asked this?

If each Score plugin is executed in parallel, we'll hold a list of PodWithScore (pointer of Pod object and its score) in memory for each Score plugin, right? This is necessary to aggregate the lists of PodWithScore scored by each plugin before calling NormalizeScore plugins. I was a little concerned about whether memory usage would be high due to the lists of PodWithScore if there are many ReplicaSets to scale down and a large number of Pods associated with those ReplicaSets.

@sanposhiho
Member Author

I was a little concerned about whether memory usage would be high due to the lists of PodWithScore if there are many ReplicaSets to scale down and a large number of Pods associated with those ReplicaSets.

Yes, memory consumption would likely be higher.

@sanposhiho
Member Author

sanposhiho commented Apr 20, 2024

@kubernetes/sig-apps-proposals @kubernetes/sig-scheduling-leads

Can anyone give input on this proposal?

@k8s-ci-robot k8s-ci-robot added the kind/design Categorizes issue or PR as related to design. label Apr 20, 2024
@alculquicondor
Member

There have been a number of proposals in this area. Have you reviewed them?
#124149
#123541

@sanposhiho
Member Author

sanposhiho commented Apr 23, 2024

Yes, both proposals are similar in that they try to use pod-deletion-cost. But I believe that would be a hard path.

First, for scalability, we have to implement an algorithm that does not update a deletion-cost value every time another Pod under the same replicaset is added/removed. But every Pod addition/removal directly changes the situation of each domain, and thus possibly the desired order of deleting Pods under the same ReplicaSet. I couldn't come up with a good algorithm myself, and I don't see a clear solution in either issue.

Furthermore, if we bring in PodAffinity on top of that (or possibly more factors in the future), it's nearly impossible to calculate the best deletion-cost for each Pod without frequent updates to the cost value.

A general problem is that, from outside the controller, we don't know when a scale-down will happen, so we have to keep updating deletion-costs to represent the desired order of Pod deletion at all times. It's much more straightforward to calculate the scores in the ReplicaSet controller when the scale-down actually happens. That is the idea this proposal came from.

@tskillian

I think something like this would be really helpful. My use case is that I don't want the replica set controller to prioritize removing pods that are colocated - it makes sense to me why it's the safest default, but I want to optimize for the # of pods packed on each node during scale down.

This solution would definitely allow me to do that. I'd also be open to helping with the implementation of whatever solution is chosen if needed.

@alculquicondor
Member

It gives a high score to a candidate Pod if the Node that the candidate Pod is running on doesn't have any Pod matching the candidate Pod's PodAffinity.

What if there are multiple pods in the same topology? Which one will be picked? Will we recalculate every time we delete one pod? IIUC, HPA might remove multiple pods in one iteration, but I could be wrong.

In general, the proposed solution is very narrow, so not problematic from an API point-of-view. But this also implies that it wouldn't solve other use cases where the "deletion cost" depends on what's running on the Pod.

So two questions:

  • How will the new scores interact with pod-deletion-cost?
  • What about backwards-compatibility? What if users are currently relying on the fact that the first Pod to be deleted is roughly the newest one?

@alculquicondor
Member

cc @atiratree for a SIG Apps perspective.

@sanposhiho
Member Author

sanposhiho commented Apr 26, 2024

@alculquicondor

It gives a high score to a candidate Pod if the Node that the candidate Pod is running on doesn't have any Pod matching the candidate Pod's PodAffinity.

What if there are multiple pods in the same topology? Which one will be picked?

Supposing we're going to remove only one Pod and two candidate Pods are on the same Node.
They'd just get the same score from PodAffinity plugin. And, then other plugins (like starting time etc) would probably give them different scores and the one with a higher total score will be picked eventually.

Will we recalculate every time we delete one pod? IIUC, HPA might remove multiple pods in one iteration, but I could be wrong.

Yes, HPA controller could. And, we wouldn't recalculate every time we delete one Pod.
We remove top-X Pods with calculated scores at one iteration. (X: <current pod num> - <desired pod num> )

How will the new scores interact with pod-deletion-cost?

What about backwards-compatibility? What if users are currently relying on the fact that the first Pod to be deleted is roughly the newest one?

pod-deletion-cost will be implemented as a plugin, and start time will be too. So those two questions share a similar point: how we integrate the new factors (pod affinity / topology spreading) with the existing factors.

Regarding the PodAffinity plugin, it depends on how high a score we give based on pod-deletion-cost/start time relative to the PodAffinity plugin's score; i.e., if we want to weigh pod-deletion-cost (or start time) more than PodAffinity, we can have the pod-deletion-cost (start time) plugin return higher scores than the PodAffinity plugin does.

But regarding the PodTopologySpread plugin, its NormalizeScore() could result in, for example, one newer Pod in zone-a being removed instead of one old Pod in zone-b when zone-a has more Pods than zone-b.

If we are concerned about such breakage from the PodTopologySpread plugin, we can, for example:

  • add a new field (or annotation) somewhere to make this behaviour non-default.
  • simplify PodTopologySpread's NormalizeScore(); e.g., modify only scores of one Pod per domain so that at least one Pod will remain per domain.

@alculquicondor
Member

We remove top-X Pods with calculated scores at one iteration.

Let's say you have 8 pods in one topology, and 7 in the other. Now we want to remove 4 pods. Assuming that the remaining scores are equal, wouldn't this result in (4, 7) pods in each topology?

add a new field (or annotation) somewhere to make this behaviour non-default.

I think something along those lines might be necessary. Maybe it can be in the HPA object.

But all those things can be discussed in a KEP :)

I'm more interested in hearing from people whether the overall proposal is enough for most needs.

@alculquicondor
Member

/sig autoscaling

Actually, I'm starting to think that sig-scheduling doesn't have much ownership on this topic.

@k8s-ci-robot k8s-ci-robot added the sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. label Apr 26, 2024
@sanposhiho
Member Author

sanposhiho commented Apr 27, 2024

Let's say you have 8 pods in one topology, and 7 in the other. Now we want to remove 4 pods. Assuming that the remaining scores are equal, wouldn't this result in (4, 7) pods in each topology?

I believe you also assume one Pod matching PodAffinity exists in the second topology (the one with 7 Pods).
Then, Yes: All 4 Pods would be picked from the first topology (the one which had 8 Pods first).

I think something along those lines might be necessary. Maybe it can be in the HPA object.
But all those things can be discussed in a KEP :)

I'm not yet very sure about the right place for adding a new field or annotation (if we do).
Alternatively, it could be in Replicaset, or maybe directly in PodAffinity/TopologySpreadConstraints spec.
Yup, we can continue this topic when proceeding to KEP.

I'm starting to think that sig-scheduling doesn't have much ownership on this topic.

The main group in charge of this proposal should definitely be SIG Apps.
But I think SIG Scheduling would be the owner of the PodAffinity / PodTopologySpread plugins (other plugins would be owned by SIG Apps), since we own those features.

@sanposhiho
Member Author

sanposhiho commented May 23, 2024

  1. my outlined algorithm in Replicaset scaling down takes inter-pods scheduling constraints into consideration #124306 (comment) should basically cover your first point, but you claim there is no.

You mentioned -

because on POD actions CREATE/UPDATE we can basically update the value
#124306 (comment)

and that directly violates my first point ↓.

First, for scalability, we have to implement an algorithm that does not update a deletion-cost value every time another Pod under the same replicaset is added/removed.

Again, updating the value every time Pod create/update/delete happens is definitely not scalable.


you don't show why PodAffinity (or likely more used is podAntiAffinity) is a problem at all. At least I can not think of an example that would show that one of the affinity configurations could have an interest in downscaling behavior, but maybe I miss something.

Let's say -

  • you have two kinds of Pods, app1 and app2, which send requests to each other.
  • you want to put app1 Pod and app2 Pod closer so that you can reduce the network overhead. (very typical scenario for people to use PodAffinity.)
  • app1-pod-A and app2-pod-A are on Node-A, and app1-pod-B and app2-pod-B are on Node-B.
  • But, app2-pod-B is removed for some reason (e.g., scale down of app2, preemption, node pressure etc)
  • As a result, app1-pod-A and app2-pod-A are on Node-A, and only app1-pod-B is on Node-B.

In this case, when we want to scale down app1, it's better to delete app1-pod-B, because app1-pod-A's PodAffinity is still satisfied while app1-pod-B's is not.


I don't understand what you mean by scaling down happens from outside.

In #124306 (comment), I meant that:

  • The ReplicaSet controller is the actor that actually picks Pod(s) and deletes them.
  • From outside the ReplicaSet controller, you cannot tell when the controller will start to delete Pods.
  • So, you have to keep setting pod-deletion-cost every time a Pod create/update/delete happens, so that it's fine for a scale-down to happen at any time.

@szuecs
Member

szuecs commented May 27, 2024

Thanks for your answers, they make a lot of sense!
I am just unsure if we talk about two different things in the part below:

First, for scalability, we have to implement an algorithm that does not update a deletion-cost value every time another Pod under the same replicaset is added/removed.

Again, updating the value every time Pod create/update/delete happens is definitely not scalable.

Let's go through a more detailed example of why I have no scalability concern, or at least why I don't see one yet.

If podX is created, it likely has no topology yet because it is not yet scheduled (for example, the node/zone topology would not be there yet), so we can just short-circuit and do nothing.
Then we get an update of podX because it's now scheduled to a node, which means we also have the zone topology derived from the node. Now we have to set the cost for podX. We never have to update the cost for this podX again.
This does not depend on the number of pods, because we would literally have something like map[string]map[topology]int, where the string key is derived from the ReplicaSet or Deployment (I am unsure about this one because of how Deployments work); for example, the key might be {namespace}/{name}. Map access of course needs to be locked, but this won't be an issue because the update frequency is not measured in µs; it's measured in seconds, even in a big cluster with >5k nodes and >40k pods. The int is incremented during the pod update that sets the cost, and decremented on the pod deletion event.

Let's think about the abnormal termination cases:

  • node termination (could be hard reset or just rolling update so kubelet will deregister the node object)
  • pod eviction (normal node termination will trigger pod eviction)
  • pod deletion (kubectl delete pod podX)

These all boil down to a pod delete event from the API server's point of view (maybe I'm missing something; a node hard reset might be caught by the etcd TTL for node objects, I'm not sure about that one).

So let's think about delete pod podX event:

  • we decrement the map[string]map[topology]int by one per pod
  • we have to store the "hole" in another data structure map[string]map[topology]map[int]struct{}
  • On the next pod create/update event, we check whether we have to back-fill a "hole" for that pod, and if so we derive the pod's cost value from there.

No LIST on every update is required.
Where do you see a scalability issue?
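
Roughly, the bookkeeping I have in mind could look like the following sketch (the type and method names are just illustrative; it only shows the counter-and-holes structure, not the annotation patching or the event handling):

import "sync"

// deletionCostTracker sketches the counters described above: one counter per owner
// (e.g. "{namespace}/{name}") and topology domain, plus the set of freed ("hole")
// cost values to back-fill.
type deletionCostTracker struct {
  mu     sync.Mutex
  counts map[string]map[string]int              // owner -> domain -> highest cost handed out
  holes  map[string]map[string]map[int]struct{} // owner -> domain -> freed costs
}

// nextCost returns the pod-deletion-cost to assign when a pod of `owner` is
// scheduled into `domain` (called once per pod, on the update that sets the node).
func (t *deletionCostTracker) nextCost(owner, domain string) int {
  t.mu.Lock()
  defer t.mu.Unlock()
  // Back-fill a hole first, if one exists.
  for c := range t.holes[owner][domain] {
    delete(t.holes[owner][domain], c)
    return c
  }
  if t.counts == nil {
    t.counts = map[string]map[string]int{}
  }
  if t.counts[owner] == nil {
    t.counts[owner] = map[string]int{}
  }
  t.counts[owner][domain]++
  return t.counts[owner][domain]
}

// release records the cost of a deleted pod as a hole to be reused later.
func (t *deletionCostTracker) release(owner, domain string, cost int) {
  t.mu.Lock()
  defer t.mu.Unlock()
  if t.holes == nil {
    t.holes = map[string]map[string]map[int]struct{}{}
  }
  if t.holes[owner] == nil {
    t.holes[owner] = map[string]map[int]struct{}{}
  }
  if t.holes[owner][domain] == nil {
    t.holes[owner][domain] = map[int]struct{}{}
  }
  t.holes[owner][domain][cost] = struct{}{}
}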

I hope what I wrote makes sense to you. Maybe you are right and the scheduler approach is better. I want to understand whether that's also true for the delete pod podX event, because that's the real problem to solve. I don't see how this can be much better in the scheduler. Maybe you can shed some more light there. :)

@sanposhiho
Member Author

sanposhiho commented May 30, 2024

Thanks for the detailed explanation. Now I got your idea and understand why you don't see a scalability issue.

That should work for a simple scenario, but I still see several points that need to be considered:

  • As already mentioned, if we want to bring in PodAffinity (or possibly more factors in the future), it's nearly impossible to calculate the best deletion-cost for each Pod.
  • Conflicts if a Pod already has a deletion-cost annotation from users.
  • Conflicts with other existing scale-down logic.
    • This proposal considers how to integrate with the existing logic, while yours doesn't.
  • Now I understand that you don't update the cost value once it's set; still, it would consume one API call per scheduled Pod (= around 300 req/s, which is the scheduling throughput we have). If we implement this directly in the ReplicaSet controller (= my proposal), there would be no additional API calls at all.

Overall, I don't see much benefit in doing this with deletion cost in the first place, and I still believe that implementing it in the ReplicaSet controller, as this proposal does, would be much simpler.

@szuecs
Member

szuecs commented Jun 4, 2024

@sanposhiho thanks for your answers!
The goal of pod deletion costs would also be to integrate with the ReplicaSet controller, but you are right about the first 2-3 points, of course.

I am looking forward to a solution to the problem.

@PradeepKumar-15

@sanposhiho thank you for taking up this problem, which is very crucial to maintaining a steady and balanced state in an auto-scaled environment. I am only reaching out to check whether you are still pursuing a solution to this problem and whether this GitHub issue is still on your to-do list. Thanks in advance.

@sanposhiho
Member Author

sanposhiho commented Aug 15, 2024

The current status here is that I have to get a sponsor from SIG Apps (@kubernetes/sig-apps-misc), who can be a reviewer, in order to start a KEP. (Plus, it may or may not need some discussion before actually starting the KEP, like the one I had with @atiratree.)
But, as for my own situation, I actually moved to another company and honestly I don't have a big passion to pursue this one right now. That's the reason for the lack of recent activity here.

@szuecs
Member

szuecs commented Aug 15, 2024

@sanposhiho thanks for the clear statement!

@rtomadpg

We recently started using topologySpreadConstraints and we indeed noticed that descheduling (scale-in) does not take the requested spread into account.

Since I could not find a graph illustrating the issue anywhere, I thought it valuable to add one. The graph shows the number of pods per zone for a Deployment with HPA enabled. After scale-out the balance is usually fine (we use the soft/preferred constraint with a skew of 1), but the balance gets disturbed on most scale-ins.

[graph: number of Pods per zone over time for the HPA-enabled Deployment; balanced after scale-out, imbalanced after most scale-ins]

Note that topologySpreadConstraints is still great, because without it we sometimes had zones with no pods at all.

@szuecs
Member

szuecs commented Aug 23, 2024

@rtomadpg see also #124149 , I documented the problem there.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 21, 2024
@szuecs
Member

szuecs commented Nov 21, 2024

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Nov 21, 2024
@szuecs
Member

szuecs commented Nov 21, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 21, 2024
@yogeek

yogeek commented Nov 22, 2024

Same experience as @rtomadpg on our side: we were quite surprised to end up with replicas in the same AZ despite using podTopologySpreadConstraints, after a scale-down by the HPA...

When looking for information about this behavior, I found multiple discussions that seem linked to the topic:
#4301
#107598
#115911
#127199
#124149
#123541

As I am not entirely sure, can you confirm they are all related to the same topic, please?
If this is the case, wouldn't it be clearer to gather all the information in a single discussion?

@szuecs
Member

szuecs commented Nov 22, 2024

#124149 documents the bug and the others are different proposals to fix it.
We need a volunteer to create a proper KEP that summarizes the proposals to fix the bug.

@YamasouA
Contributor

@szuecs
Hi.
I'm very interested in this issue.
I don't have many contributions to Kubernetes yet, but is there anything I can help with?

@sanposhiho
Member Author

Again, the blocker is that we need a sponsor from SIG Apps to review the KEP and related implementation PRs.
I actually wrote the top description in a form very similar to a KEP, so I can take the KEP author role. But, as I mentioned, I probably won't be able to do the implementation fully, bandwidth-wise, and would need some help from the community.

@kubernetes/sig-apps-misc Can anyone take a look at the proposal here?

@atiratree
Member

atiratree commented Nov 27, 2024

I still think the biggest problem is the separation of concerns, as I mentioned in #124306 (comment). And the sheer complexity: multiple plugins managing multiple ReplicaSets in a single decision step (e.g. balancing a Deployment during a rollout), and configuring everything as different applications have different dynamic needs (which will keep increasing).

descheduler is too late in the scale down process, because that already happened.

  • This was only an example, there are multiple options where the logic could live.
  • It does not have to be late, please see the following:

The ReplicaSet controller is the actor that actually picks Pod(s) and deletes them.

See #124306 (comment); I described there various reasons (not only the scalability one) why I didn't go with the pod-deletion-cost idea.

ReplicaSet does not trigger the scale down of its pods by itself. You can create a workload that scales down ReplicaSets only once the pod's metadata has been updated. So the pods do not necessarily have to be updated often. And you could achieve the desired behavior with additional controller and controller.kubernetes.io/pod-deletion-cost annotations. This is not user friendly, but could be done as an experimental feature.

There are also other alternatives:

  • As suggested earlier, we could teach the ReplicaSet to use the Evacuation API (would have to be opt-in), where any component/controller could inject its own behavior.
  • There is also a discussion about having a built-in descheduling behavior (or even the descheduler), which would alleviate the current problems: please see KEP4328: Affinity Based Eviction enhancements#4329

Again, the blocker is that we need a sponsor from SIG Apps to review the KEP and related implementation PRs.

There needs to be a consensus from sig-apps on the feasibility/value of this approach, before the KEP can be accepted.

@sanposhiho
Member Author

the separation of concerns as I mentioned in #124306 (comment)

Doesn't #124306 (comment) answer all of that? (Or did a specific answer of mine not satisfy you? Let me know in that case.)

multiple plugins managing multiple ReplicaSets in a single decision step (e.g. balancing a Deployment during a rollout)

The calculation scope in a single decision is only the Pods from the one ReplicaSet being scaled down. As mentioned in the Non Goals, the RS controller does not consider all Pods and does not consider the global state of the cluster.
Also, balancing a Deployment during a rollout is the scheduler's responsibility; we recently added matchLabelKeys for better balancing during a rollout.

ReplicaSet does not trigger the scale down of its pods by itself. You can create a workload that scales down ReplicaSets only once the pod's metadata has been updated. So the pods do not necessarily have to be updated often. And you could achieve the desired behavior with additional controller and controller.kubernetes.io/pod-deletion-cost annotations. This is not user friendly, but could be done as an experimental feature.

There are various actors that scale down the Pods (HPA, KEDA, or other custom solutions in the community), depending on each user's cluster.
How can vanilla Kubernetes add the deletion-cost before every scale-down happens?

It should be implemented in the RS controller, given that all clusters (that want this feature) must have the RS controller, whereas each cluster might use a different scaling component.

As suggested earlier, we could teach the ReplicaSet to use the Evacuation API (kubernetes/enhancements#4565) (would have to be opt-in), where any component/controller could inject its own behavior.

So, who would be the Evacuation controller that achieves this feature? Are we going to have a built-in Evacuation controller that drives the evacuation for those built-in features?
Or are you saying that users would still have to implement a custom solution via the Evacuation API? (Note that this issue argues that vanilla Kubernetes should support this behavior.)

There is also a discussion about having a built-in descheduling behavior (or even the descheduler), which would alleviate the current problems

We know the descheduler (or a similar descheduling strategy like kubernetes/enhancements#4329) tries to solve the problem. But if we were all satisfied with those, we wouldn't have the issue(s) here.
What we need is for the deletion from the RS controller to consider topology spreading in the first place; the current process with the descheduler is (1) the RS controller deletes a random Pod, (2) the deletion may or may not break the spreading, (3) every time it breaks, the descheduler rebalances the Pods at the cost of additional disruption.

@atiratree
Member

Doesn't #124306 (comment) answer all of that? (Or did a specific answer of mine not satisfy you? Let me know in that case.)

I meant the whole thread from there. In short, I am not sure it fits well within the responsibilities of the ReplicaSet.

The calculation scope in a single decision is only the Pods from the one ReplicaSet being scaled down. As mentioned in the Non Goals, the RS controller does not consider all Pods and does not consider the global state of the cluster.
Also, balancing a Deployment during a rollout is the scheduler's responsibility; we recently added matchLabelKeys for better balancing during a rollout.

I am not entirely convinced that this is a Non Goal. This may be enough in the beginning (depending on which strategies are implemented first). But when we consider all the de/scheduling rules, we should at least consider all the pods of the same application, which may be part of different revisions/replicasets. And what do we do if a part of the application is controlled by a different controller?

How can vanilla Kubernetes add the deletion-cost before every scale-down happens?

True, not possible in vanilla k8s.

So, who would be the Evacuation controller that achieves this feature? Are we going to have a built-in Evacuation controller that drives the evacuation for those built-in features?
Or are you saying that users would still have to implement a custom solution via the Evacuation API? (Note that this issue argues that vanilla Kubernetes should support this behavior.)

I do not know where it should live and who is going to maintain it. And it is not up to me to make the final decision. My opinion is just that it would be better for this functionality not to be tightly coupled with the replicaset controller.

What we need is for the deletion from the RS controller to consider topology spreading in the first place; the current process with the descheduler is (1) the RS controller deletes a random Pod, (2) the deletion may or may not break the spreading, (3) every time it breaks, the descheduler rebalances the Pods at the cost of additional disruption.

Agree, it is not an ideal solution.

@sanposhiho
Member Author

I am not entirely convinced that this is a Non Goal. This may be enough in the beginning (depending on which strategies are implemented first). But when we consider all the de/scheduling rules, we should at least consider all the pods of the same application, which may be part of different revisions/replicasets. And what do we do if a part of the application is controlled by a different controller?

So, it comes down to the question of how much degradation in RS controller reconciliation we can accept. If the performance is within an acceptable range even with the heavy calculation needed to consider all Pods from different RSs, then that's the best, of course. If not, we have to compromise and consider a single RS only.
But, actually, I believe the spreading across multiple RSs would be protected, not 100% perfectly, but to some extent, even with a single-RS consideration.

I do not know where it should live and who is going to maintain it. And it is not up to me to make the final decision. My opinion is just that it would be better for this functionality not to be tightly coupled with the replicaset controller.

I don't much care where this feature is implemented, directly in the RS controller or via the Evacuation API as you proposed. We would be happy if it's implemented in any way. The only hope is to have it implemented within vanilla Kubernetes by some means.

And, maintenance-wise, yes, I agree it'd be nicer for this feature to be decoupled from the RS controller. Then maybe we (sig-scheduling) could have control of this kind of scale-down logic that considers the scheduling constraints.


One off-topic thought that comes to mind: it might be an interesting idea to implement an Evacuation API controller within the scheduler. For example, let's say the cluster is packed and one RS is being scaled down (= we can delete one Pod).
The scheduler might want to choose the best RS Pod to delete based on the unschedulable Pods, similar to what it does in the preemption process; e.g., if we delete this RS Pod on Node-a, some unschedulable Pods could go there.

@soltysh
Contributor

soltysh commented Nov 28, 2024

Having read through this entire issue and all the linked ones, the one theme that comes up is adding additional functionality to the core RS controller, which, as many have mentioned, is currently in the critical path for a lot of applications. I can understand that desire, but at the same time, as one of the owners of the RS controller, I need to make sure we don't bloat the functionality of controllers unnecessarily before fully considering all possible options.

I've seen several comments quickly dismissing the usefulness of PodDeletionCost, which is already available although theoretically not fully stable, which means we have some room for improvements based on the feedback. There's a valuable suggestion from @bgrant0607 in #4301 (comment):

To avoid continuous renumbering, the values wouldn't need to be dense, but they could be periodically compacted, perhaps leaving some gaps, like BASIC program line numbers :-).

I'm also against mixing the functionality of a controller with the scheduler, so the idea of injecting PodAffinity, PodTopologySpread, and other strictly scheduler functionality into any of the existing controllers is definitely not desirable from a maintenance perspective.

Having said that, I welcome anyone interested to jump on any of the next sig-apps calls to discuss this topic more.

@atiratree
Member

I have added this issue to the Motivation section (see number 3) of the Evacuation API. And added a Descheduling and Downscaling section that describes how this problem could be solved.

Please let me know if you see any issues with the proposed solution.

@sanposhiho
Member Author

sanposhiho commented Nov 30, 2024

I've seen several comments quickly dismissing the usefulness of PodDeletionCost, which is already available although theoretically not fully stable (kubernetes/enhancements#2255), which means we have some room for improvements based on the feedback

The problem is that PodDeletionCost is static, and we have to keep updating it to control the order of deletion flexibly.
We might be able to utilize it for this use case if we somehow allowed PodDeletionCost to be dynamic, e.g., we give a label selector for PodDeletionCost, define the score to be derived from the number of Pods matching the label selector, and the RS controller calculates the score based on that definition.
...but then I'd prefer to use the Evacuation API over that, which would also allow us to decouple this implementation from the RS controller.

@thesuperzapper

thesuperzapper commented Dec 1, 2024

Separate from this Evacuation API idea, I think we can significantly improve the usefulness of pod-deletion-cost with the proposal from #123541 to extend the scale API to allow temporarily setting pod-deletion-cost during downscale, thus avoiding its core issues while being fully backwards-compatible.

I say this because currently, the evacuation API KEP (kubernetes/enhancements#4565) does not seem to address the core purpose of pod-deletion-cost, which is influencing which replicas are removed during down-scale (when reducing the replicas of a deployment).

@sanposhiho
Member Author

sanposhiho commented Dec 1, 2024

Left a comment on your issue to move the discussion.

@atiratree
Member

I say this because currently, the evacuation API KEP (kubernetes/enhancements#4565) does not seem to address the core purpose of pod-deletion-cost, which is influencing which replicas are removed during down-scale (when reducing the replicas of a deployment).

The KEP does not implement the whole stack (scheduling, downscaling), but the new API tries to address exactly this problem of influencing which pods get deleted and when.
