PROPOSAL - extend 'scale' subresource API to support pod-deletion-cost
#123541
Comments
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the `triage/accepted` label.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/sig autoscaling |
/cc |
If we had 10000 Pods for a workload, how would using this subresource work? Any concerns there? |
I'm a bit worried about the scale too. It reminds me a little of services blowing up with too many pod IPs. Also, it seems like a new style of interaction where the handler of the scale subresource needs to patch annotations in all pods. How would we do that? What are the risks? Could this cause requests to stall if we fan out a bunch of patches? I'm not very familiar with the annotation, but from a quick read I would rather expect that users run a cron or something that updates them at certain intervals. Updating them at scale-down certainly works, but these annotations feel like metrics, so I wonder if there is a better system to handle this in general (maybe using the metrics API). |
If no pod-deletion-cost is set as a parameter of the scale operation (only the Pods' names/IPs), it could be implicitly -MAX_INT by default |
The idea of only updating specific Pods is inherent to the design of this proposal. For example, if a 3-replica Deployment receives a scale request like:

```yaml
spec:
  replicas: 2
  podDeletionCosts:
    pod-1: -100
```

The deployment controller can answer this request by doing the following (IN ORDER):
We can be quite lax about processing
Unless the user requests a patch of all pods (by listing every pod in Regarding whether we should introduce a new Pod field, and remove the
The KEP-2255 actually explicitly recommends against updating the It's really important NOT think of |
@sftim Here is my thinking about Deployments with MANY pods. In the "update" direction:
In the "read" direction:
|
I'm worried that this won't scale and will be a burden to maintain. You don't need to convince me, but SIG Apps and SIG Node might want to see your working. |
FYI #124306, although it's quite a bit more limited than what is proposed here. |
In wg-serving we've identified a few places where an improved pod deletion cost would be relevant, although not quite aligned with this proposal. One LLM serving use case is to be able to bias scale down towards specific replicas based on their load, which varies over a few to tens of seconds for many use cases. In these cases an upstream component may be biasing traffic flow to specific instances (since LLM latency is proportional to traffic, by controlling the simultaneous requests to a server you can achieve different latency goals), and would like to have a way to ensure scale down happens on the instances that are not at their target traffic (i.e. autoscaling should scale down the one that the upstream component is already steering traffic away from).

I would generally not think of the solution as being specific to a particular workload controller - implementing it as a field on replica sets unfortunately prevents statefulsets or jobs or argo rollouts from participating. In the use case above, we'd want to be thinking about pods (from the traffic shaping level), not the controller that created those pods. |
A comment on CUJs, not on this specific proposal: One more use case to consider is the alignment of behavior between parent controllers and kube-scheduler. If kube-scheduler is configured to target most utilized nodes first, it would make sense for the parent controller to delete pods from least utilized nodes first. Similarly, if kube-scheduler favors spreading by targeting least utilized nodes with new pods, the controllers should then favor deleting pods from most utilized nodes first. Otherwise the scheduler settings are not actually meaningful in clusters with a lot of autoscaling. |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/remove-lifecycle stale

This is a longstanding important feature, to avoid needing to patch the `pod-deletion-cost` annotations ahead of time. Being able to set them temporarily for a specific downscale would be a game changer for Kubernetes Deployment autoscaling, as it would make it much easier to provide candidates for deletion. |
This feature and its problems have been discussed in #124306. It could potentially be solved by the proposed Evacuation API: #124306 (comment) |
I see the same scalability concern on this, like other folks, and I don't see a valid solution for that.
So, what if there are 10k Pods, and it's about to get scaled down to 1? How do we update the annotations of 10k-1 Pods in a scalable way? And who would do it? (<- apparently SIG-Apps maintainers don't want to include any scheduling-related logic in the RS controller (see comments on my issue from folks). I believe they also would not want to do it in the deployment controller, for the same reason) Transfer your comment here:
The Evacuation API does... right? like @atiratree mentioned at the KEP's motivation and the descheduling/downscaling section. |
Yeah, this is very difficult during the scale subresource request. We do not connect to the deployment/replicaset controller, we only manipulate the API objects. The advantage of the scale is that it is generic and can be used for many controllers. So to implement this we either:
I think an easier alternative is to update all the pods before calling the scale endpoint.
Yes. It does not implement the workload scaling and scheduling itself, but it could be used as a scalable delivery mechanism for the downscaling that other components (e.g. HPA, descheduler) can use and build upon. |
Background
The idea of letting users customize the way Deployments (ReplicaSets) remove Pods when `replicas` is decreased has been floating around since at least 2017, with other issues dating back to 2015.

Since Kubernetes 1.22, the `controller.kubernetes.io/pod-deletion-cost` annotation proposed in KEP-2255 is available in BETA.

There have been several other proposals, but this one should supersede them.
Problem
Problem 1: It's too hard to get/update pod-deletion-cost

It is currently too hard to get/update the `controller.kubernetes.io/pod-deletion-cost` annotation for all Pods in a Deployment/ReplicaSet. This makes it difficult to use `pod-deletion-cost` in practice.

The main issue is that the `pod-deletion-cost` annotation must be updated BEFORE the `replicas` count is decreased. This means that any system that wants to use `pod-deletion-cost` must either:

- keep the annotations continuously up to date, leaving behind stale `pod-deletion-cost` annotations that were not used, or
- first update the `pod-deletion-cost` of the Pods that will be deleted, and THEN update the `replicas` count.

This difficulty often prompts people to use the `pod-deletion-cost` annotation in a way that is NOT recommended, such as making a controller to update the `pod-deletion-cost` annotation even when no scale-down is happening (which is a stated anti-pattern).

Problem 2: HorizontalPodAutoscaler can't use pod-deletion-cost
There is no sensible way to extend the HorizontalPodAutoscaler resource to be able to make use of `pod-deletion-cost` when scaling Deployments. This is because introducing complicated Pod-specific logic to update `pod-deletion-cost` annotations is inevitably going to be brittle.

Proposal
Overview
The general idea is to make it easier to read/write the `controller.kubernetes.io/pod-deletion-cost` annotation for all Pods in the Deployment/ReplicaSet. To achieve this, we can extend the existing `Scale` v1 subresource to be able to read/write the `controller.kubernetes.io/pod-deletion-cost` annotations of Pods in the Deployment/ReplicaSet.

Current State
We already have a special `Scale` v1 subresource, which can be used by autoscalers to do things like:

Example 1: GET `/apis/apps/v1/namespaces/{namespace}/deployments/{name}/scale`:
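For illustration, a response from this endpoint is a `Scale` object (schema per `autoscaling/v1`; the Deployment name and selector here are hypothetical), shaped roughly like:

```yaml
apiVersion: autoscaling/v1
kind: Scale
metadata:
  name: my-deployment
  namespace: default
spec:
  replicas: 3        # desired replica count
status:
  replicas: 3        # observed replica count
  selector: app=my-deployment
```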
NOTE: the HorizontalPodAutoscaler already uses this API to do its scaling in a resource-agnostic way.
Future State
We can extend the `Scale` v1 subresource with two new fields:

- `spec.podDeletionCosts`: used to PATCH the `controller.kubernetes.io/pod-deletion-cost` annotation on specific Pods
- `status.podDeletionCosts`: used to GET the current `pod-deletion-cost` of Pods in the Deployment

Example 1: GET `/apis/apps/v1/namespaces/{namespace}/deployments/{name}/scale`:
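To make the proposed read path concrete, here is a sketch of what the extended GET response could look like (field placement as proposed above; Pod names and values are illustrative, not a finalized schema):

```yaml
apiVersion: autoscaling/v1
kind: Scale
metadata:
  name: my-deployment
  namespace: default
spec:
  replicas: 3
status:
  replicas: 3
  selector: app=my-deployment
  podDeletionCosts:              # proposed new field
    my-deployment-abc12: 100
    my-deployment-def34: 0
    my-deployment-ghi56: -100    # cheapest to delete
```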
Example 2: `kubectl patch ... --subresource=scale`:
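A sketch of what such a patch could look like. Note that `kubectl patch --subresource=scale` itself already exists (introduced around kubectl v1.24), but the `podDeletionCosts` field is only proposed here, so this command is illustrative and will not work against any current cluster:

```shell
# Scale down to 2 replicas and, in the same request, mark pod-1
# as the preferred deletion candidate (lower cost = deleted first).
kubectl patch deployment my-deployment --subresource=scale \
  --type=merge \
  -p '{"spec": {"replicas": 2, "podDeletionCosts": {"pod-1": -100}}}'
```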
Benefits / Drawbacks
The main benefits of this approach are:

- Systems can easily see what the current `pod-deletion-cost` values are, and then update them during scale-down as appropriate. No need to make hundreds of Pod GET requests.
- A single request can both decrease `spec.replicas` AND update the `pod-deletion-cost` of Pods.
- There is no longer a need to individually patch Pods to set `pod-deletion-cost` annotations.

The main drawbacks are:

- The `Scale` subresource is generic, while this feature is workload-specific (strictly speaking, `pod-deletion-cost` is a feature of ReplicaSets)
- It builds further on the `controller.kubernetes.io/pod-deletion-cost` annotation:

User Stories
User 1: Manual Scaling
As a user, I want to be able to scale down a Deployment and influence which Pods are deleted based on my knowledge of the current state of the system.
For example, say I am running a stateful application with 3 replicas: `pod-1` is currently idle, but `pod-2` and `pod-3` are both busy. When scaling down, I want to ensure `pod-1` is deleted first, because it is idle.

To achieve this, I can do the following:
1. Use `kubectl get ... --subresource=scale` to see the current `pod-deletion-cost` of all Pods in the Deployment
2. Use `kubectl patch ... --subresource=scale` to BOTH:
   - set `replicas` to `2`
   - set the `pod-deletion-cost` of `pod-1` to a value that makes it more likely to be deleted

User 2: Custom Autoscalers
As a developer of a custom autoscaler, I want to use application-specific metrics to influence which Pods are deleted during scale-down to minimize the impact on my application and its users.
To achieve this, I can do the following:
1. Use the `Scale` subresource to read the `pod-deletion-cost` of all Pods in the Deployment
2. Use the `Scale` subresource to update the `replicas` AND the `pod-deletion-cost` of Pods as appropriate

User 3: HorizontalPodAutoscaler
At least initially, the HorizontalPodAutoscaler will not directly use this feature, because it is primarily concerned with scaling `replicas` based on a metric, and does not know about application-specific factors that might influence which Pods should be deleted.

However, this feature will make it easier for the HorizontalPodAutoscaler to be extended to have "pod-deletion-cost" awareness in the future.
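Taken together, the proposed semantics can be illustrated with a small self-contained sketch (plain Python with hypothetical names; this simulates the intended behavior, it is not the Kubernetes API): the controller first applies the requested `podDeletionCosts`, then deletes the lowest-cost Pods first, matching KEP-2255's ordering.

```python
# Illustrative simulation of the proposed "scale + podDeletionCosts" semantics.
# All names are hypothetical; this is plain Python, not the Kubernetes API.

def apply_scale(pods, scale_spec):
    """Apply a proposed Scale patch: record per-pod deletion costs first,
    then delete the lowest-cost pods until `replicas` is reached.

    pods: dict of pod name -> deletion cost (default 0, per KEP-2255)
    scale_spec: {"replicas": int, "podDeletionCosts": {name: cost}}
    """
    # Step 1: record the requested deletion costs on the pods.
    for name, cost in scale_spec.get("podDeletionCosts", {}).items():
        if name in pods:
            pods[name] = cost
    # Step 2: delete lowest-cost pods first (KEP-2255 ordering) until
    # the desired replica count is reached.
    excess = len(pods) - scale_spec["replicas"]
    victims = sorted(pods, key=lambda n: pods[n])[:max(excess, 0)]
    for name in victims:
        del pods[name]
    return pods

pods = {"pod-1": 0, "pod-2": 0, "pod-3": 0}
remaining = apply_scale(pods, {"replicas": 2, "podDeletionCosts": {"pod-1": -100}})
print(sorted(remaining))  # → ['pod-2', 'pod-3']; pod-1 had the lowest cost
```

The key design point this captures is atomicity from the caller's perspective: one request carries both the new replica count and the deletion preferences, so there is no window where stale costs can misdirect a scale-down.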