
PROPOSAL - extend 'scale' subresource API to support pod-deletion-cost #123541

Open
thesuperzapper opened this issue Feb 27, 2024 · 17 comments

thesuperzapper commented Feb 27, 2024

Background

The idea of letting users customize the way Deployments (ReplicaSets) remove Pods when replicas are decreased has been floating around since at least 2017, with other issues dating back to 2015.

Since Kubernetes 1.22, the controller.kubernetes.io/pod-deletion-cost annotation proposed in KEP-2255 has been available in beta.

There have been several other proposals, but this one should supersede them.

Problem

Problem 1: It's too hard to get/update pod-deletion cost

It is currently too hard to get/update the controller.kubernetes.io/pod-deletion-cost annotation for all Pods in a Deployment/ReplicaSet. This makes it difficult to use pod-deletion-cost in practice.

The main issue is that the pod-deletion-cost annotation must be updated BEFORE the replicas count is decreased, which means that any system that wants to use pod-deletion-cost must do all of the following (see the sketch after this list):

  • Track which Pods are currently in the Deployment (possibly under multiple ReplicaSets)
  • Clean up any pod-deletion-cost annotations that were not used.
  • When scaling down, update the pod-deletion-cost of the Pods that will be deleted, and THEN update the replicas count.
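
For illustration, here is roughly what that annotation-first workflow looks like today with plain kubectl (the pod and deployment names are placeholders; the annotation value is a string-encoded integer):

# 1. mark the Pod we want removed first as the "cheapest" to delete
kubectl annotate pod pod-1 \
  --namespace my-namespace \
  --overwrite \
  controller.kubernetes.io/pod-deletion-cost="-100"

# 2. only AFTER the annotation is in place, decrease the replica count
kubectl scale deployment my-deployment \
  --namespace my-namespace \
  --replicas=2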

This difficulty often prompts people to use the pod-deletion-cost annotation in a way that is NOT recommended, such as building a controller that continuously updates the pod-deletion-cost annotation even when no scale-down is happening (a stated anti-pattern).

Problem 2: HorizontalPodAutoscaler can't use pod-deletion-cost

There is no sensible way to extend the HorizontalPodAutoscaler resource to make use of pod-deletion-cost when scaling Deployments, because introducing complicated Pod-specific logic to update pod-deletion-cost annotations is inevitably going to be brittle.

Proposal

Overview

The general idea is to make it easier to read/write the controller.kubernetes.io/pod-deletion-cost annotation for all Pods in the Deployment/ReplicaSet. To achieve this, we can extend the existing Scale v1 subresource to be able to read/write the controller.kubernetes.io/pod-deletion-cost annotations of Pods in the Deployment/ReplicaSet.

Current State

We already have a special Scale v1 subresource, which can be used by autoscalers to do things like:

  • GET: the current replicas of a Deployment
  • GET: the current label selector of a Deployment
  • PATCH: the current replicas of a Deployment

Example 1: GET: /apis/apps/v1/namespaces/{namespace}/deployments/{name}/scale:

> curl -s localhost:8001/apis/apps/v1/namespaces/my-namespace/deployments/my-deployment/scale | yq .
kind: Scale
apiVersion: autoscaling/v1
metadata:
  name: my-deployment
  namespace: my-namespace
spec:
  replicas: 2
status:
  replicas: 2
  selector: my-label=my-label-value

NOTE: the HorizontalPodAutoscaler already uses this API to do its scaling in a resource-agnostic way
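
For reference, the PATCH direction can already be exercised with plain kubectl today (names are placeholders):

# update only spec.replicas via the existing scale subresource
kubectl patch deployment my-deployment \
  --namespace my-namespace \
  --subresource='scale' \
  --type='merge' \
  --patch='{"spec": {"replicas": 3}}'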

Future State

We can extend the Scale v1 subresource with two new fields:

  • spec.podDeletionCosts: used to PATCH the controller.kubernetes.io/pod-deletion-cost annotation on specific Pods
  • status.podDeletionCosts: used to GET the current pod-deletion-cost of Pods in the Deployment

Example 1: GET: /apis/apps/v1/namespaces/{namespace}/deployments/{name}/scale:

> curl -s localhost:8001/apis/apps/v1/namespaces/my-namespace/deployments/my-deployment/scale | yq .
kind: Scale
apiVersion: autoscaling/v1
metadata:
  name: my-deployment
  namespace: my-namespace
spec:
  replicas: 3

  ## NOTE: this is empty because this is a GET, not a PATCH
  ##       (we could make it non-empty, but it would be redundant)
  podDeletionCosts: {}

status:
  replicas: 3
  selector: my-label=my-label-value

  ## all pods with non-empty `controller.kubernetes.io/pod-deletion-cost` annotations
  ## NOTE: this might include Pods from multiple ReplicaSets, if the Deployment is rolling out
  ## NOTE: a value of 0 means the annotation is explicitly set to 0, not that it is missing
  ##       (pods with no annotation are NOT included in this list, for efficiency)
  podDeletionCosts:
    pod-1: 0
    pod-2: 100
    pod-3: 200

Example 2: kubectl patch ... --subresource=scale:

# this command does two things:
#  1. scale the Deployment to 2 replicas
#  2. set the `pod-deletion-cost` of `pod-1` to -100 (making it much more likely to be deleted)
kubectl patch deployment my-deployment \
  --namespace my-namespace \
  --subresource='scale' \
  --type='merge' \
  --patch='{"spec": {"replicas": 2, "podDeletionCosts": {"pod-1": -100}}}'

Benefits / Drawbacks

The main benefits of this approach are:

  • Autoscaler systems can easily check the current pod-deletion-cost values and then update them during scale-down as appropriate, with no need to make hundreds of Pod GET requests.
  • Autoscaler systems can use a single PATCH to change spec.replicas AND update the pod-deletion-cost of Pods.
  • It does not require significant changes to how scaling is implemented. We can piggyback on the existing work of pod-deletion-cost annotations.

The main drawbacks are:

  • This still only applies to Deployments/ReplicaSets (because pod-deletion-cost is a feature of ReplicaSets)
  • It is slightly strange to automatically update the controller.kubernetes.io/pod-deletion-cost annotation:
    • However, we should remember that the Pods are "managed" by the Deployment, so it IS appropriate for the ReplicaSet controller to update the Pod's definition.

User Stories

User 1: Manual Scaling

As a user, I want to be able to scale down a Deployment and influence which Pods are deleted based on my knowledge of the current state of the system.

For example, say I am running a stateful application with 3 replicas:

  • I know that pod-1 is currently idle, but pod-2 and pod-3 are both busy.
  • I want to scale down to 2 replicas, but I want to make sure that pod-1 is deleted first, because it is idle.

To achieve this, I can do the following:

  1. Verify the Deployment is not currently rolling out (so there is only one active ReplicaSet)
  2. Use kubectl get ... --subresource=scale to see the current pod-deletion-cost of all Pods in the Deployment
  3. Use kubectl patch ... --subresource=scale to BOTH (see the sketch after this list):
    • set replicas to 2
    • update the pod-deletion-cost of pod-1 to a value that makes it more likely to be deleted
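
Under this proposal, that workflow might look roughly like the following (spec.podDeletionCosts and status.podDeletionCosts are the fields proposed here, not an existing API):

# 1. read the scale subresource, including the proposed status.podDeletionCosts
kubectl get deployment my-deployment \
  --namespace my-namespace \
  --subresource='scale' \
  --output=yaml

# 2. in a single request: scale to 2 replicas and mark pod-1 as the preferred deletion candidate
kubectl patch deployment my-deployment \
  --namespace my-namespace \
  --subresource='scale' \
  --type='merge' \
  --patch='{"spec": {"replicas": 2, "podDeletionCosts": {"pod-1": -100}}}'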

User 2: Custom Autoscalers

As a developer of a custom autoscaler, I want to use application-specific metrics to influence which Pods are deleted during scale-down to minimize the impact on my application and its users.

To achieve this, I can do the following:

  1. Keep track of what the Pods are doing, so I can make informed decisions about which pods are best to delete.
  2. When the time comes to scale down (see the curl sketch after this list):
    • Use the Scale subresource to read the pod-deletion-cost of all Pods in the Deployment
    • Use the Scale subresource to update the replicas AND the pod-deletion-cost of Pods as appropriate
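
As a rough sketch, a custom autoscaler talking to the API server directly (here via kubectl proxy on localhost:8001, matching the earlier examples) could do something like this; podDeletionCosts is the field proposed above, not an existing API:

# read the current scale subresource (including the proposed status.podDeletionCosts)
curl -s localhost:8001/apis/apps/v1/namespaces/my-namespace/deployments/my-deployment/scale

# single PATCH: drop replicas to 2 and bias pod-1 towards deletion
curl -s -X PATCH \
  -H 'Content-Type: application/merge-patch+json' \
  --data '{"spec": {"replicas": 2, "podDeletionCosts": {"pod-1": -100}}}' \
  localhost:8001/apis/apps/v1/namespaces/my-namespace/deployments/my-deployment/scale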

User 3: HorizontalPodAutoscaler

At least initially, the HorizontalPodAutoscaler will not directly use this feature, because it is primarily concerned with scaling replicas based on a metric, and does not know about application-specific factors that might influence which Pods should be deleted.

However, this feature will make it easier for the HorizontalPodAutoscaler to be extended to have "pod-deletion-cost" awareness in the future.

@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Feb 27, 2024
k8s-ci-robot (Contributor)

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Feb 27, 2024
thesuperzapper (Author)

/sig autoscaling
/sig apps
/wg batch

@k8s-ci-robot k8s-ci-robot added sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/apps Categorizes an issue or PR as relevant to SIG Apps. wg/batch Categorizes an issue or PR as relevant to WG Batch. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 27, 2024
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Apps Feb 27, 2024
AxeZhan (Member) commented Feb 28, 2024

/cc

sftim (Contributor) commented Feb 28, 2024

If we had 10000 Pods for a workload, how would using this subresource work? Any concerns there?

kwiesmueller (Member)

I'm a bit worried about the scale too. It reminds me a little of services blowing up with too many pod IPs.
And what would the semantics be for patching? If one only wants to update a subset of pods, does the patch support that (e.g. SSA)?

Also, it seems like a new style of interaction where the handler of the scale subresource needs to patch annotations in all pods. How would we do that? What are the risks? Could this cause requests to stall if we fan out a bunch of patches?
This ties an annotation very explicitly to a non-annotation field. Maybe the annotation should first move to a proper API field (or something different like Pod Disruption Budgets).

I'm not very familiar with the annotation, but from a quick read I would rather expect users to run a cron or something that updates them at certain intervals. Updating them at scale-down certainly works, but these annotations feel like metrics, so I wonder if there is a better system to handle this in general (maybe using the metrics API).

remiville

If no pod-deletion-cost is set as a parameter of the scale operation (only pod names/IPs), it could implicitly default to -MAX_INT

thesuperzapper (Author)

I'm a bit worried about the scale too. It reminds me a little of services blowing up with too many pod IPs. And what would the semantic be for patching? If one only wants to update a subset of pods, does the patch support that (e.g. SSA).

The idea of only updating specific Pods is inherent to the design of spec.podDeletionCosts. Think of spec.podDeletionCosts as a "transient" list of "requests" to update the pod-deletion-cost annotations of specific Pods; the goal of the Deployment/ReplicaSet controller is to empty this list by applying the requested values to the Pods.

For example, if a 3-replica Deployment (my-deployment) has a ReplicaSet which has 3 Pods (pod-1, pod-2, and pod-3), consider the following scale patch on my-deployment:

spec:
  replicas: 2
  podDeletionCosts:
    pod-1: -100

The deployment controller can answer this request by doing the following (IN ORDER):

  1. Updating the pod-deletion-cost annotation of pod-1 to be -100
  2. Updating the spec.replicas of the deployment to be 2

We can be quite lax about processing spec.podDeletionCosts. For example, if the controller encounters a non-existent Pod name (or one not owned by the Deployment/ReplicaSet being patched), then it can just ignore that key and continue silently (possibly creating an event to bubble this up to the user).


Also, it seems like a new style of interaction where the handler of the scale subresource needs to patch annotations in all pods. How would we do that? What are the risks? Could this cause requests to stall if we fan out a bunch of patches? This ties an annotation very explicitly to a non-annotation field. Maybe the annotation should first move to a proper API field (or something different like Pod Disruption Budgets).

Unless the user requests a patch of all pods (by listing every pod in spec.podDeletionCosts), the controller should only need to make a small number of patches. But as a protection mechanism, we could have a timeout or a maximum number of spec.podDeletionCosts keys.

Regarding whether we should introduce a new Pod field, and remove the controller.kubernetes.io/pod-deletion-cost annotation, I think this was not done previously in KEP-2255 because such a field would not be applicable to all Pods (only ones in ReplicaSets), and we don't really like extending the Pod API.


I'm not very familiar with the annotation, but from a quick read would more expect that users run a cron or something that updates them at certain intervals. Updating them at scale down certainly works, but these annotations feel like metrics so I wonder if there is a better system to handle this in general (maybe using metrics API).

KEP-2255 explicitly recommends against updating pod-deletion-cost using automated methods, and only recommends updating it just before scale-down (which is also noted in the docs).

It's really important NOT to think of pod-deletion-cost as a metric, because the absolute value is not important. Rather, the relative values across all Pods in a ReplicaSet (at the moment of scale-down) are what matter. During a scale-down, the controller removes the Pods with the lowest pod-deletion-cost first (with 0 being the implied default for Pods that have no pod-deletion-cost annotation).
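
For example, with the annotations below (shown as they would appear in each Pod's metadata; note the value is a string), scaling down by one replica would remove pod-1 first, because -100 is lower than both pod-2's implied default of 0 and pod-3's 200:

# pod-1 (lowest cost, removed first)
metadata:
  annotations:
    controller.kubernetes.io/pod-deletion-cost: "-100"

# pod-3 (highest cost, kept longest)
metadata:
  annotations:
    controller.kubernetes.io/pod-deletion-cost: "200"

# pod-2 carries no annotation, so its cost is the implied default of 0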

thesuperzapper (Author)

If we had 10000 Pods for a workload, how would using this subresource work? Any concerns there?

@sftim Here is my thinking about Deployments with MANY pods.

In the "update" direction:

  • The controller only needs to take an action for Pods explicitly listed in spec.podDeletionCosts, the overall number of Pods that happen to be in the Deployment is not relevant. (Unless the user does something like try and update all 10,000 Pods at the same time, but we could have a "max updates per request" for scale patches to mitigate this).

In the "read" direction:

  • The status.podDeletionCosts (note: status) should contain only the Pods which have a pod-deletion-cost annotation set, which should be small or empty in most cases.
    • If the pod-deletion-cost annotation is used correctly (only setting it just before scale-down, not with a controller), the number of Pods with the annotation should be very low (only non-0 during scale-down), and the frequency of changes to the annotations should be very small.
  • Rather than checking the annotations of all Pods on each request (by looping over all the owned Pods), we can store this information in the status of the ReplicaSet itself (similar to the existing status.availableReplicas and status.fullyLabeledReplicas, but as a mapping of pod_name -> pod_deletion_cost rather than a single integer); see the sketch after this list.
    • The ReplicaSet controller is already checking things about all the Pods it "owns", and watching for changes in "ready" status, so we can use a similar watcher pattern to reconcile the new status.podDeletionCosts field of the ReplicaSet in an efficient way.
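
As a rough sketch, the resulting ReplicaSet status might look something like this (status.podDeletionCosts is the field proposed here; the other status fields and the object names are placeholders for what already exists today):

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: my-deployment-5d4c8f7b9
  namespace: my-namespace
status:
  replicas: 3
  readyReplicas: 3
  availableReplicas: 3
  fullyLabeledReplicas: 3
  ## proposed: only Pods that currently have a pod-deletion-cost annotation
  podDeletionCosts:
    pod-1: -100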

sftim (Contributor) commented Feb 28, 2024

I'm worried that this won't scale and will be a burden to maintain. You don't need to convince me, but SIG Apps and SIG Node might want to see your working.

alculquicondor (Member)

FYI #124306

Although it's considerably more limited than what is proposed here

smarterclayton (Contributor) commented May 23, 2024

In wg-serving we've identified a few places where an improved pod deletion cost would be relevant, although not quite aligned with this proposal.

One LLM serving use case is being able to bias scale-down towards specific replicas based on their load, which for many use cases varies over a few to tens of seconds. In these cases an upstream component may be biasing traffic flow to specific instances (since LLM latency is proportional to traffic, controlling the simultaneous requests to a server lets you achieve different latency goals), and it would like a way to ensure scale-down happens on the instances that are not at their target traffic (i.e. autoscaling should remove the ones that the upstream component is already steering traffic away from).

I would generally not think of the solution as being specific to a particular workload controller - implementing it as a field on ReplicaSets unfortunately prevents StatefulSets, Jobs, or Argo Rollouts from participating. In the use case above, we'd want to be thinking about Pods (from the traffic-shaping level), not the controller that created those Pods.

x13n (Member) commented Jul 17, 2024

A comment on CUJs, not on this specific proposal: One more use case to consider is the alignment of behavior between parent controllers and kube-scheduler. If kube-scheduler is configured to target most utilized nodes first, it would make sense for the parent controller to delete pods from least utilized nodes first. Similarly, if kube-scheduler favors spreading by targeting least utilized nodes with new pods, the controllers should then favor deleting pods from most utilized nodes first. Otherwise the scheduler settings are not actually meaningful in clusters with a lot of autoscaling.

k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 15, 2024
thesuperzapper (Author)

/remove-lifecycle stale

This is a longstanding, important feature request, as it avoids needing to patch the pod-deletion-cost annotations directly onto Pods.

Being able to set them temporarily for a specific downscale would be a game changer for Kubernetes Deployment autoscaling, as it would make it much easier to provide candidates for deletion.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 15, 2024
atiratree (Member)

This feature and its problems have been discussed in #124306. It could potentially be solved by the proposed Evacuation API: #124306 (comment)

sanposhiho (Member) commented Dec 1, 2024

Like other folks, I see the same scalability concern here, and I don't see a valid solution for it.

In the "update" direction:
The controller only needs to take an action for Pods explicitly listed in spec.podDeletionCosts, the overall number of Pods that happen to be in the Deployment is not relevant. (Unless the user does something like try and update all 10,000 Pods at the same time, but we could have a "max updates per request" for scale patches to mitigate this).

So, what if there are 10k Pods and the workload is about to get scaled down to 1? How do we update the annotations of the 10k-1 Pods in a scalable way, and who does it? (Apparently the SIG Apps maintainers don't want to include any scheduling-related logic in the ReplicaSet controller (see the comments from folks on my issue), and I believe they would not want to do so in the Deployment controller either, for the same reason.)
I would prefer the Evacuation API over this proposal.

Transferring your comment here:

I say this because currently, the evacuation API KEP (kubernetes/enhancements#4565) does not seem to address the core purpose of pod-deletion-cost, which is influencing which replicas are removed during down-scale (when reducing the replicas of a deployment).
#124306 (comment)

The Evacuation API does... right? As @atiratree mentioned in the KEP's motivation and the descheduling/downscaling section.

atiratree (Member) commented Dec 1, 2024

So, what if there're 10k Pods, and it's about to get scaled down to 1? How do we update the annotation of 10k-1 Pods in a scalable way?

Yeah, this is very difficult during the scale subresource request.

We do not connect to the deployment/replicaset controller; we only manipulate the API objects. The advantage of the scale subresource is that it is generic and can be used with many controllers. So to implement this we would either:

  • Pass the pod costs to the underlying object (e.g. the Deployment), but that does not scale well and would require support and API changes from all the controllers. And it is impossible to enforce such support for CRDs.
  • Update the pods before scaling. But how do we manage such a large number of pods and possible update failures? And without breaking the API contract?

I think an easier alternative is to update all the pods before calling the scale endpoint.
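
As a rough sketch of that alternative (names and counts are placeholders), the caller would annotate the affected Pods first and only then issue the normal scale request:

# 1. annotate the Pods that should be removed first with a low deletion cost
for pod in pod-1 pod-4 pod-7; do
  kubectl annotate pod "$pod" \
    --namespace my-namespace \
    --overwrite \
    controller.kubernetes.io/pod-deletion-cost="-100"
done

# 2. then scale down as usual; the ReplicaSet controller prefers the low-cost Pods for deletion
kubectl scale deployment my-deployment \
  --namespace my-namespace \
  --replicas=7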

The Evacuation API does... right? like @atiratree mentioned at the KEP's motivation and the descheduling/downscaling section.

Yes. It does not implement the workload scaling and scheduling itself, but it could be used as a scalable delivery mechanism for the downscaling that other components (e.g. HPA, descheduler) can use and build upon.
