
PROPOSAL - extend 'scale' subresource API to support pod-deletion-cost #123541

Open
thesuperzapper opened this issue Feb 27, 2024 · 17 comments

thesuperzapper commented Feb 27, 2024

Background

The idea of letting users customize the way Deployments (ReplicaSets) remove Pods when replicas are decreased has been floating around since at least 2017, with other issues dating back to 2015.

Since Kubernetes 1.22, the controller.kubernetes.io/pod-deletion-cost annotation proposed in KEP-2255 has been available in beta.

There have been several other proposals, but this one should supersede them.

Problem

Problem 1: It's too hard to get/update pod-deletion cost

It is currently too hard to get/update the controller.kubernetes.io/pod-deletion-cost annotation for all Pods in a Deployment/ReplicaSet. This makes it difficult to use pod-deletion-cost in practice.

The main issue is that the pod-deletion-cost annotation must be updated BEFORE the replicas count is decreased, which means that any system that wants to use pod-deletion-cost must do all of the following (see the sketch after this list):

  • Track which Pods are currently in the Deployment (possibly under multiple ReplicaSets)
  • Clean up any pod-deletion-cost annotations that were not used.
  • When scaling down, update the pod-deletion-cost of the Pods that will be deleted, and THEN update the replicas count.
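
For illustration, here is roughly what that annotation-first workflow looks like today with plain kubectl (the pod and deployment names are placeholders; the annotation value is a string-encoded integer):

# 1. mark the Pod we want removed first as the "cheapest" to delete
kubectl annotate pod pod-1 \
  --namespace my-namespace \
  --overwrite \
  controller.kubernetes.io/pod-deletion-cost="-100"

# 2. only AFTER the annotation is in place, decrease the replica count
kubectl scale deployment my-deployment \
  --namespace my-namespace \
  --replicas=2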

This difficulty often prompts people to use the pod-deletion-cost annotation in a way that is NOT recommended, such as building a controller that continuously updates the pod-deletion-cost annotation even when no scale-down is happening (a stated anti-pattern).

Problem 2: HorizontalPodAutoscaler can't use pod-deletion-cost

There is no sensible way to extend the HorizontalPodAutoscaler resource to make use of pod-deletion-cost when scaling Deployments, because introducing complicated Pod-specific logic to update pod-deletion-cost annotations is inevitably going to be brittle.

Proposal

Overview

The general idea is to make it easier to read/write the controller.kubernetes.io/pod-deletion-cost annotation for all Pods in the Deployment/ReplicaSet. To achieve this, we can extend the existing Scale v1 subresource to be able to read/write the controller.kubernetes.io/pod-deletion-cost annotations of Pods in the Deployment/ReplicaSet.

Current State

We already have a special Scale v1 subresource, which can be used by autoscalers to do things like:

  • GET: the current replicas of a Deployment
  • GET: the current label selector of a Deployment
  • PATCH: the current replicas of a Deployment

Example 1: GET: /apis/apps/v1/namespaces/{namespace}/deployments/{name}/scale:

> curl -s localhost:8001/apis/apps/v1/namespaces/my-namespace/deployments/my-deployment/scale | yq .
kind: Scale
apiVersion: autoscaling/v1
metadata:
  name: my-deployment
  namespace: my-namespace
spec:
  replicas: 2
status:
  replicas: 2
  selector: my-label=my-label-value

NOTE: the HorizontalPodAutoscaler already uses this API to do its scaling in a resource-agnostic way
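
For reference, the PATCH direction can already be exercised with plain kubectl today (names are placeholders):

# update only spec.replicas via the existing scale subresource
kubectl patch deployment my-deployment \
  --namespace my-namespace \
  --subresource='scale' \
  --type='merge' \
  --patch='{"spec": {"replicas": 3}}'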

Future State

We can extend the Scale v1 subresource with two new fields:

  • spec.podDeletionCosts: used to PATCH the controller.kubernetes.io/pod-deletion-cost annotation on specific Pods
  • status.podDeletionCosts: used to GET the current pod-deletion-cost of Pods in the Deployment

Example 1: GET: /apis/apps/v1/namespaces/{namespace}/deployments/{name}/scale:

> curl -s localhost:8001/apis/apps/v1/namespaces/my-namespace/deployments/my-deployment/scale | yq .
kind: Scale
apiVersion: autoscaling/v1
metadata:
  name: my-deployment
  namespace: my-namespace
spec:
  replicas: 3

  ## NOTE: this is empty because this is a GET, not a PATCH
  ##       (we could make it non-empty, but it would be redundant)
  podDeletionCosts: {}

status:
  replicas: 3
  selector: my-label=my-label-value

  ## all pods with non-empty `controller.kubernetes.io/pod-deletion-cost` annotations
  ## NOTE: this might include Pods from multiple ReplicaSets, if the Deployment is rolling out
  ## NOTE: a value of 0 means the annotation is explicitly set to 0, not that it is missing
  ##       (pods with no annotation are NOT included in this list, for efficiency)
  podDeletionCosts:
    pod-1: 0
    pod-2: 100
    pod-3: 200

Example 2: kubectl patch ... --subresource=scale:

# this command does two things:
#  1. scale the Deployment to 2 replicas
#  2. set the `pod-deletion-cost` of `pod-1` to -100 (making it much more likely to be deleted)
kubectl patch deployment my-deployment \
  --namespace my-namespace \
  --subresource='scale' \
  --type='merge' \
  --patch='{"spec": {"replicas": 2, "podDeletionCosts": {"pod-1": -100}}}'

Benefits / Drawbacks

The main benefits of this approach are:

  • Autoscaler systems can easily check the current pod-deletion-cost values and then update them during scale-down as appropriate, with no need to make hundreds of Pod GET requests.
  • Autoscaler systems can use a single PATCH to change spec.replicas AND update the pod-deletion-cost of Pods.
  • It does not require significant changes to how scaling is implemented. We can piggyback on the existing work of pod-deletion-cost annotations.

The main drawbacks are:

  • This still only applies to Deployments/ReplicaSets (because pod-deletion-cost is a feature of ReplicaSets)
  • It is slightly strange to automatically update the controller.kubernetes.io/pod-deletion-cost annotation:
    • However, we should remember that the Pods are "managed" by the Deployment, so it IS appropriate for the ReplicaSet controller to update the Pod's definition.

User Stories

User 1: Manual Scaling

As a user, I want to be able to scale down a Deployment and influence which Pods are deleted based on my knowledge of the current state of the system.

For example, say I am running a stateful application with 3 replicas:

  • I know that pod-1 is currently idle, but pod-2 and pod-3 are both busy.
  • I want to scale down to 2 replicas, but I want to make sure that pod-1 is deleted first, because it is idle.

To achieve this, I can do the following:

  1. Verify the Deployment is not currently rolling out (so there is only one active ReplicaSet)
  2. Use kubectl get ... --subresource=scale to see the current pod-deletion-cost of all Pods in the Deployment
  3. Use kubectl patch ... --subresource=scale to BOTH (see the sketch after this list):
    • set replicas to 2
    • update the pod-deletion-cost of pod-1 to a value that makes it more likely to be deleted
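
Under this proposal, that workflow might look roughly like the following (spec.podDeletionCosts and status.podDeletionCosts are the fields proposed here, not an existing API):

# 1. read the scale subresource, including the proposed status.podDeletionCosts
kubectl get deployment my-deployment \
  --namespace my-namespace \
  --subresource='scale' \
  --output=yaml

# 2. in a single request: scale to 2 replicas and mark pod-1 as the preferred deletion candidate
kubectl patch deployment my-deployment \
  --namespace my-namespace \
  --subresource='scale' \
  --type='merge' \
  --patch='{"spec": {"replicas": 2, "podDeletionCosts": {"pod-1": -100}}}'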

User 2: Custom Autoscalers

As a developer of a custom autoscaler, I want to use application-specific metrics to influence which Pods are deleted during scale-down to minimize the impact on my application and its users.

To achieve this, I can do the following:

  1. Keep track of what the Pods are doing, so I can make informed decisions about which pods are best to delete.
  2. When the time comes to scale down (see the curl sketch after this list):
    • Use the Scale subresource to read the pod-deletion-cost of all Pods in the Deployment
    • Use the Scale subresource to update the replicas AND the pod-deletion-cost of Pods as appropriate
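
As a rough sketch, a custom autoscaler talking to the API server directly (here via kubectl proxy on localhost:8001, matching the earlier examples) could do something like this; podDeletionCosts is the field proposed above, not an existing API:

# read the current scale subresource (including the proposed status.podDeletionCosts)
curl -s localhost:8001/apis/apps/v1/namespaces/my-namespace/deployments/my-deployment/scale

# single PATCH: drop replicas to 2 and bias pod-1 towards deletion
curl -s -X PATCH \
  -H 'Content-Type: application/merge-patch+json' \
  --data '{"spec": {"replicas": 2, "podDeletionCosts": {"pod-1": -100}}}' \
  localhost:8001/apis/apps/v1/namespaces/my-namespace/deployments/my-deployment/scale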

User 3: HorizontalPodAutoscaler

At least initially, the HorizontalPodAutoscaler will not directly use this feature, because it is primarily concerned with scaling replicas based on a metric, and does not know about application-specific factors that might influence which Pods should be deleted.

However, this feature will make it easier for the HorizontalPodAutoscaler to be extended to have "pod-deletion-cost" awareness in the future.

@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Feb 27, 2024
k8s-ci-robot (Contributor)

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Feb 27, 2024
thesuperzapper (Author)

/sig autoscaling
/sig apps
/wg batch

@k8s-ci-robot k8s-ci-robot added sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/apps Categorizes an issue or PR as relevant to SIG Apps. wg/batch Categorizes an issue or PR as relevant to WG Batch. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 27, 2024
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Apps Feb 27, 2024
AxeZhan (Member) commented Feb 28, 2024

/cc

sftim (Contributor) commented Feb 28, 2024

If we had 10000 Pods for a workload, how would using this subresource work? Any concerns there?

kwiesmueller (Member)

I'm a bit worried about the scale too. It reminds me a little of services blowing up with too many pod IPs.
And what would the semantics be for patching? If one only wants to update a subset of pods, does the patch support that (e.g. SSA)?

Also, it seems like a new style of interaction where the handler of the scale subresource needs to patch annotations in all pods. How would we do that? What are the risks? Could this cause requests to stall if we fan out a bunch of patches?
This ties an annotation very explicitly to a non-annotation field. Maybe the annotation should first move to a proper API field (or something different like Pod Disruption Budgets).

I'm not very familiar with the annotation, but from a quick read I would rather expect users to run a cron or something that updates them at certain intervals. Updating them at scale-down certainly works, but these annotations feel like metrics, so I wonder if there is a better system to handle this in general (maybe using the metrics API).

remiville

If no pod-deletion-cost is set as a parameter of the scale operation (only pod names/IPs), it could implicitly default to -MAX_INT

thesuperzapper (Author)

I'm a bit worried about the scale too. It reminds me a little of services blowing up with too many pod IPs. And what would the semantic be for patching? If one only wants to update a subset of pods, does the patch support that (e.g. SSA).

The idea of only updating specific Pods is inherent to the design of spec.podDeletionCosts. Think of spec.podDeletionCosts as a "transient" list of "requests" to update the pod-deletion-cost annotations of specific Pods; the goal of the Deployment/ReplicaSet controller is to empty this list by applying the requested values to the Pods.

For example, if a 3-replica Deployment (my-deployment) has a ReplicaSet which has 3 Pods (pod-1, pod-2, and pod-3), consider the following scale patch on my-deployment:

spec:
  replicas: 2
  podDeletionCosts:
    pod-1: -100

The deployment controller can answer this request by doing the following (IN ORDER):

  1. Updating the pod-deletion-cost annotation of pod-1 to be -100
  2. Updating the spec.replicas of the deployment to be 2

We can be quite lax about processing spec.podDeletionCosts. For example, if the controller encounters a non-existent Pod name (or one not owned by the Deployment/ReplicaSet being patched), then it can just ignore that key and continue silently (possibly creating an event to bubble this up to the user).


Also, it seems like a new style of interaction where the handler of the scale subresource needs to patch annotations in all pods. How would we do that? What are the risks? Could this cause requests to stall if we fan out a bunch of patches? This ties an annotation very explicitly to a non-annotation field. Maybe the annotation should first move to a proper API field (or something different like Pod Disruption Budgets).

Unless the user requests a patch of all pods (by listing every pod in spec.podDeletionCosts), the controller should only need to make a small number of patches. But as a protection mechanism, we could have a timeout or a maximum number of spec.podDeletionCosts keys.

Regarding whether we should introduce a new Pod field, and remove the controller.kubernetes.io/pod-deletion-cost annotation, I think this was not done previously in KEP-2255 because such a field would not be applicable to all Pods (only ones in ReplicaSets), and we don't really like extending the Pod API.


I'm not very familiar with the annotation, but from a quick read would more expect that users run a cron or something that updates them at certain intervals. Updating them at scale down certainly works, but these annotations feel like metrics so I wonder if there is a better system to handle this in general (maybe using metrics API).

KEP-2255 explicitly recommends against updating pod-deletion-cost using automated methods, and only recommends updating it just before scale-down (which is also noted in the docs).

It's really important NOT to think of pod-deletion-cost as a metric, because the absolute value is not important. Rather, the relative values across all Pods in a ReplicaSet (at the moment of scale-down) are what matter. During a scale-down, the controller removes the Pods with the lowest pod-deletion-cost first (with 0 being the implied default for Pods that have no pod-deletion-cost annotation).
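
For example, with the annotations below (shown as they would appear in each Pod's metadata; note the value is a string), scaling down by one replica would remove pod-1 first, because -100 is lower than both pod-2's implied default of 0 and pod-3's 200:

# pod-1 (lowest cost, removed first)
metadata:
  annotations:
    controller.kubernetes.io/pod-deletion-cost: "-100"

# pod-3 (highest cost, kept longest)
metadata:
  annotations:
    controller.kubernetes.io/pod-deletion-cost: "200"

# pod-2 carries no annotation, so its cost is the implied default of 0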

thesuperzapper (Author)

If we had 10000 Pods for a workload, how would using this subresource work? Any concerns there?

@sftim Here is my thinking about Deployments with MANY pods.

In the "update" direction:

  • The controller only needs to take an action for Pods explicitly listed in spec.podDeletionCosts, the overall number of Pods that happen to be in the Deployment is not relevant. (Unless the user does something like try and update all 10,000 Pods at the same time, but we could have a "max updates per request" for scale patches to mitigate this).

In the "read" direction:

  • The status.podDeletionCosts (note: status) should contain only the Pods which have a pod-deletion-cost annotation set, which should be small or empty in most cases.
    • If the pod-deletion-cost annotation is used correctly (only setting it just before scale-down, not with a controller), the number of Pods with the annotation should be very low (only non-0 during scale-down), and the frequency of changes to the annotations should be very small.
  • Rather than checking the annotations of all Pods on each request (by looping over all the owned Pods), we can store this information in the status of the ReplicaSet itself (similar to the existing status.availableReplicas and status.fullyLabeledReplicas, but as a mapping of pod_name -> pod_deletion_cost rather than a single integer); see the sketch after this list.
    • The ReplicaSet controller is already checking things about all the Pods it "owns", and watching for changes in "ready" status, so we can use a similar watcher pattern to reconcile the new status.podDeletionCosts field of the ReplicaSet in an efficient way.
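
As a rough sketch, the resulting ReplicaSet status might look something like this (status.podDeletionCosts is the field proposed here; the other status fields and the object names are placeholders for what already exists today):

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: my-deployment-5d4c8f7b9
  namespace: my-namespace
status:
  replicas: 3
  readyReplicas: 3
  availableReplicas: 3
  fullyLabeledReplicas: 3
  ## proposed: only Pods that currently have a pod-deletion-cost annotation
  podDeletionCosts:
    pod-1: -100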

sftim (Contributor) commented Feb 28, 2024

I'm worried that this won't scale and will be a burden to maintain. You don't need to convince me, but SIG Apps and SIG Node might want to see your working.

alculquicondor (Member)

FYI #124306

Although it's considerably more limited than what is proposed here

smarterclayton (Contributor) commented May 23, 2024

In wg-serving we've identified a few places where an improved pod deletion cost would be relevant, although not quite aligned with this proposal.

One LLM serving use case is being able to bias scale-down towards specific replicas based on their load, which for many use cases varies over a few to tens of seconds. In these cases an upstream component may be biasing traffic flow to specific instances (since LLM latency is proportional to traffic, controlling the simultaneous requests to a server lets you achieve different latency goals), and it would like a way to ensure scale-down happens on the instances that are not at their target traffic (i.e. autoscaling should remove the ones that the upstream component is already steering traffic away from).

I would generally not think of the solution as being specific to a particular workload controller - implementing it as a field on ReplicaSets unfortunately prevents StatefulSets, Jobs, or Argo Rollouts from participating. In the use case above, we'd want to be thinking about Pods (from the traffic-shaping level), not the controller that created those Pods.

x13n (Member) commented Jul 17, 2024

A comment on CUJs, not on this specific proposal: One more use case to consider is the alignment of behavior between parent controllers and kube-scheduler. If kube-scheduler is configured to target most utilized nodes first, it would make sense for the parent controller to delete pods from least utilized nodes first. Similarly, if kube-scheduler favors spreading by targeting least utilized nodes with new pods, the controllers should then favor deleting pods from most utilized nodes first. Otherwise the scheduler settings are not actually meaningful in clusters with a lot of autoscaling.

k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 15, 2024
thesuperzapper (Author)

/remove-lifecycle stale

This is a longstanding, important feature request, as it avoids needing to patch the pod-deletion-cost annotations directly onto Pods.

Being able to set them temporarily for a specific downscale would be a game changer for Kubernetes Deployment autoscaling, as it would make it much easier to provide candidates for deletion.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 15, 2024
atiratree (Member)

This feature and its problems have been discussed in #124306. It could potentially be solved by the proposed Evacuation API: #124306 (comment)

sanposhiho (Member) commented Dec 1, 2024

Like other folks, I see the same scalability concern here, and I don't see a valid solution for it.

In the "update" direction:
The controller only needs to take an action for Pods explicitly listed in spec.podDeletionCosts, the overall number of Pods that happen to be in the Deployment is not relevant. (Unless the user does something like try and update all 10,000 Pods at the same time, but we could have a "max updates per request" for scale patches to mitigate this).

So, what if there are 10k Pods and the workload is about to get scaled down to 1? How do we update the annotations of the 10k-1 Pods in a scalable way, and who does it? (Apparently the SIG Apps maintainers don't want to include any scheduling-related logic in the ReplicaSet controller (see the comments from folks on my issue), and I believe they would not want to do so in the Deployment controller either, for the same reason.)
I would prefer the Evacuation API over this proposal.

Transferring your comment here:

I say this because currently, the evacuation API KEP (kubernetes/enhancements#4565) does not seem to address the core purpose of pod-deletion-cost, which is influencing which replicas are removed during down-scale (when reducing the replicas of a deployment).
#124306 (comment)

The Evacuation API does... right? As @atiratree mentioned in the KEP's motivation and the descheduling/downscaling section.

atiratree (Member) commented Dec 1, 2024

So, what if there're 10k Pods, and it's about to get scaled down to 1? How do we update the annotation of 10k-1 Pods in a scalable way?

Yeah, this is very difficult during the scale subresource request.

We do not connect to the deployment/replicaset controller; we only manipulate the API objects. The advantage of the scale subresource is that it is generic and can be used with many controllers. So to implement this we would either:

  • Pass the pod costs to the underlying object (e.g. the Deployment), but that does not scale well and would require support and API changes from all the controllers. And it is impossible to enforce such support for CRDs.
  • Update the pods before scaling. But how do we manage such a large number of pods and possible update failures? And without breaking the API contract?

I think an easier alternative is to update all the pods before calling the scale endpoint.
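
As a rough sketch of that alternative (names and counts are placeholders), the caller would annotate the affected Pods first and only then issue the normal scale request:

# 1. annotate the Pods that should be removed first with a low deletion cost
for pod in pod-1 pod-4 pod-7; do
  kubectl annotate pod "$pod" \
    --namespace my-namespace \
    --overwrite \
    controller.kubernetes.io/pod-deletion-cost="-100"
done

# 2. then scale down as usual; the ReplicaSet controller prefers the low-cost Pods for deletion
kubectl scale deployment my-deployment \
  --namespace my-namespace \
  --replicas=7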

The Evacuation API does... right? like @atiratree mentioned at the KEP's motivation and the descheduling/downscaling section.

Yes. It does not implement the workload scaling and scheduling itself, but it could be used as a scalable delivery mechanism for the downscaling that other components (e.g. HPA, descheduler) can use and build upon.
