Replicaset scaling down takes inter-pods scheduling constraints into consideration #124306
Comments
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I think in the existing deletion process, various aspects such as startup time, Pod status, and restart count are prioritized and checked sequentially. If you were to break the existing process down into finer plugins, it seems like you would need to assign priorities or weights to each plugin. How do you plan to handle this aspect?
In what scenarios would the scheduling constraints not be considered when deciding which Pods to scale down? What exactly does "best-effort" mean in this context?
Does this mean targeting only Pods belonging to a specific ReplicaSet?
If PodTopologySpread were to support the uneven Pod spreading as proposed in #124306, it would mean that the PodTopologySpread plugin in the ReplicaSet controller would start considering uneven Pod placement when deleting Pods. Sounds great!
The ReplicaSet controller doesn't perform Pod creation and deletion processes concurrently, so wouldn't a delay in the scale-down process lead to a delay in the start of the scale-out process? I feel like this would only have an impact in special cases where HPA behavior settings and Pod loads spike at short intervals.
We can just, for example, return very high scores if a Pod is not scheduled or not in the Running state.
There are many scenarios that would be missed by the implementation of this proposal, e.g.,
It means the controller doesn't try to keep the scheduling constraints of Pods that aren't owned by the ReplicaSet being scaled down.
Yes, exactly. We can easily support that by changing the way scores are normalized in the PodTopologySpread ScaleDownPlugin.
Actually, the controller has workers that run concurrently. But your concern is on point either way: given the workers are shared between the scale-up and scale-down processes, we cannot say there'd be zero impact on scale-up throughput. I'm not sure whether the impact would be big enough for people to be concerned about, though.
Nice proposal.
Can you elaborate more on the example? Why would all Pods from domain-c be removed? Do we need to break the existing process (start time / pod status / restart count) down into individual plugins? If so, what is the correct order for executing these plugins? If there are multiple plugins, can they be executed concurrently?
The higher the score a Pod gets, the more preferable it is to remove that Pod.
Technically, we don't need to. We can have a separate plugin for each piece of the existing logic, which I think would look great, but if we want, we can have one plugin for all the existing logic.
I updated the interface design a bit while thinking about the answer to this part.
You've made changes to call the Score plugins concurrently, so now...
Yes.
No. It'd be just like: call all Score() plugins → aggregate all scores from them → call NormalizeScore() plugins → pick the highest-scored Pods to delete. Any particular reason/idea behind asking this?
Thanks! My words may have been clumsy, but the processing flow was just as I had imagined.
If each Score plugin is executed in parallel, we'll hold a list of PodWithScore (pointer of Pod object and its score) in memory for each Score plugin, right? This is necessary to aggregate the lists of PodWithScore scored by each plugin before calling NormalizeScore plugins. I was a little concerned about whether memory usage would be high due to the lists of PodWithScore if there are many ReplicaSets to scale down and a large number of Pods associated with those ReplicaSets.
Yes, memory consumption would likely be higher.
@kubernetes/sig-apps-proposals @kubernetes/sig-scheduling-leads Can anyone give input on this proposal?
Yes, both proposals are actually similar, trying to use pod-deletion-cost.
First, for scalability, we have to implement an algorithm that does not update a deletion-cost value every time another Pod under the same ReplicaSet is added/removed. But actually, every Pod addition/removal directly means a change in the situation of each domain, and thus possibly a change in the desired order of deleting Pods under the same ReplicaSet. I couldn't come up with any great idea myself, and I don't see a clear solution in either issue. Furthermore, if we bring Pod Affinity in on top of that (or maybe more stuff in the future), it's nearly impossible to calculate the best deletion-cost for each Pod without frequently updating the cost value.
A general problem is that we don't know from the outside when scaling down happens, so we have to keep putting a deletion-cost on Pods to represent the desired order of deletion. It's much more straightforward to calculate it in the ReplicaSet controller when scaling down actually happens. That is the idea this proposal came from.
I think something like this would be really helpful. My use case is that I don't want the replica set controller to prioritize removing pods that are colocated - it makes sense to me why it's the safest default, but I want to optimize for the # of pods packed on each node during scale down. This solution would definitely allow me to do that. I'd also be open to helping with the implementation of whatever solution is chosen if needed.
What if there are multiple pods in the same topology? Which one will be picked? Will we recalculate every time we delete one pod? IIUC, HPA might remove multiple pods in one iteration, but I could be wrong. In general, the proposed solution is very narrow, so not problematic from an API point-of-view. But this also implies that it wouldn't solve other use cases where the "deletion cost" depends on what's running on the Pod. So two questions:
cc @atiratree for a SIG Apps perspective.
Supposing we're going to remove only one Pod and two candidate Pods are on the same Node.
Yes, the HPA controller could. And no, we wouldn't recalculate every time we delete one Pod.
pod-deletion-cost will be implemented as a plugin, and starting time will be too. So, those two questions boil down to a similar point: how we can integrate the new factors (pod affinity / topology spreading) with the existing factors.
Regarding PodAffinity: it depends on how high a score we give based on pod-deletion-cost/starting time relative to the PodAffinity plugin. i.e., if we want to weigh pod-deletion-cost (or starting time) more than PodAffinity, we can let the pod-deletion-cost (starting time) plugin return a higher score than PodAffinity's.
But, regarding the PodTopologySpread plugin, PodTopologySpread's NormalizeScore() runs after all scores are aggregated, so it could change the order determined by pod-deletion-cost (or starting time). If we are concerned about such breakage by the PodTopologySpread plugin, we can do, for example, ...
Let's say you have 8 pods in one topology, and 7 in the other. Now we want to remove 4 pods. Assuming that the remaining scores are equal, wouldn't this result in (4, 7) pods in each topology?
I think something along those lines might be necessary. Maybe it can be in the HPA object. But all those things can be discussed in a KEP :) I'm more interested in hearing from people whether the overall proposal is enough for most needs.
/sig autoscaling
Actually, I'm starting to think that sig-scheduling doesn't have much ownership of this topic.
I believe you also assume one Pod matching PodAffinity exists in the second topology (the one with 7 Pods).
I'm not yet very sure about the right place for adding a new field or annotation (if we do).
The main group in charge of this proposal should definitely be SIG-Apps, though.
You mentioned -
and that directly violates my first point ↓.
Again, updating the value every time a Pod create/update/delete happens is definitely not scalable.
Let's say -
In this case, when we want to scale down app1, it's better to delete app1-pod-B, because the PodAffinity on app1-pod-A is still satisfied while app1-pod-B's is not.
In #124306 (comment), I meant that:
Thanks for your answers, it makes a lot of sense!
Let's have a more detailed example of why I have no scalability concern, or at least don't see one yet. If podX is created, it likely doesn't have the topology yet, since it is not yet scheduled, so for example node/zone topologies would not be there; in that case we can just short-circuit and do nothing. Let's think about the abnormal termination cases:
These will all boil down to a ... So let's think about ...
There is no LIST on every update required. I hope what I write makes sense to you. Maybe you are right and the scheduler approach is better. I want to understand if that's true also for the ...
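For reference, here is a minimal sketch (not anyone's agreed design) of the kind of annotation update such a side controller would perform. It uses the real controller.kubernetes.io/pod-deletion-cost annotation; the function name, the clientset wiring, and how/when the cost value is computed are assumptions.

```go
// Illustrative sketch only: how a separate controller could bump the
// pod-deletion-cost annotation of one Pod, as discussed above. The cost
// value and when it is computed are assumptions.
package deletioncost

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// setDeletionCost patches controller.kubernetes.io/pod-deletion-cost on a Pod
// so that the ReplicaSet controller prefers to keep it (higher cost) or
// delete it (lower cost) during the next scale-down.
func setDeletionCost(ctx context.Context, cs kubernetes.Interface, namespace, name string, cost int) error {
	patch := []byte(fmt.Sprintf(
		`{"metadata":{"annotations":{"controller.kubernetes.io/pod-deletion-cost":"%d"}}}`, cost))
	_, err := cs.CoreV1().Pods(namespace).Patch(ctx, name, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}
```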
Thanks for the detailed explanation. Now I got your idea and understand why you don't see a scalability issue. That should work for a simple scenario, but I still see many points that need to be considered -
Overall, I don't see much benefit in doing this with deletion cost in the first place, and I still believe implementing it in the ReplicaSet controller, as this proposal suggests, would be much simpler.
@sanposhiho thanks for your answers! I am looking forward to a solution to the problem.
@sanposhiho thank you for taking up this problem, which is very crucial for maintaining a steady and balanced state in an auto-scaled environment. I am only reaching out to check whether you are still pursuing a solution to this problem and whether this issue is still on your to-do list. Thanks in advance.
The current status here is that I have to get a sponsor from SIG-Apps (@kubernetes/sig-apps-misc), who can be a reviewer, to start a KEP. (Plus, it may or may not need some discussion before actually starting the KEP, like I did with @atiratree.)
@sanposhiho thanks for the clear statement!
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/priority important-longterm
/remove-lifecycle stale
Same experience as @rtomadpg on our side: we were quite surprised to end up with replicas in the same AZ, despite using podTopologySpreadConstraints, after a scale-down by the HPA. When looking for information about this behavior, I found multiple discussions that seem linked to the topic. As I am not entirely sure, can you confirm they are all related to the same topic, please?
#124149 documents the bug and the others are different proposals to fix it.
@szuecs |
Again, the blocker is that we need a sponsor from sig-apps to review the KEP and related implementation PRs. @kubernetes/sig-apps-misc Can anyone take a look at the proposal here?
I still think the biggest problem is the separation of concerns, as I mentioned in #124306 (comment). And the sheer complexity: multiple plugins managing multiple ReplicaSets in a single decision step (e.g. balancing a Deployment during a rollout), and configuring everything, as different applications have different dynamic needs (which will keep increasing).
ReplicaSet does not trigger the scale-down of its pods by itself. You can create a workload that scales down ReplicaSets only once the pods' metadata has been updated. So the pods do not necessarily have to be updated often. And you could achieve the desired behavior with an additional controller and ...
There are also other alternatives:
There needs to be a consensus from sig-apps on the feasibility/value of this approach, before the KEP can be accepted.
Doesn't #124306 (comment) answer all of that? (Or did some specific answer of mine not satisfy you? Let me know in that case.)
The calculation scope is only Pods from the one ReplicaSet being scaled down, in a single decision, as mentioned in the Non Goal section.
There are various actors that scale down Pods (HPA, KEDA, or other custom solutions in the community), depending on each user's cluster. It should be implemented in the RS controller, given that all clusters (that want this feature) must have the RS controller, whereas each cluster might use a different scaling component.
So, who would be the Evacuation controller to achieve this feature? Are we gonna have a built-in Evacuation controller that controls the evacuation for those built-in features?
We know the descheduler (or a similar descheduling strategy like kubernetes/enhancements#4329) tries to solve the problem. But if we were all satisfied with those, we wouldn't have the issue(s) here.
I meant the whole thread from there. In short, I am not sure it fits well within the responsibilities of the ReplicaSet.
I am not entirely convinced that this is a Non Goal. This may be enough in the beginning (depending on which strategies are implemented first). But when we consider all the de/scheduling rules, we should at least consider all the pods of the same application, which may be part of different revisions/replicasets. And what do we do if a part of the application is controlled by a different controller?
True, not possible in vanilla k8s.
I do not know where it should live and who is going to maintain it. And it is not up to me to make the final decision. My opinion is just that it would be better for this functionality not to be tightly coupled with the replicaset controller.
Agree, it is not an ideal solution.
So, it ends up being a question of how much degradation in RS controller reconciliation we could accept. If the performance is within an acceptable range even with the heavy calculation needed to consider all Pods from different RSs, then that's the best, of course. If not, we have to compromise and consider only a single RS.
I don't much care where this feature is implemented, directly in the RS controller or via the Evacuation API like you proposed. We would be happy if it's implemented in any way; the only hope is having it implemented within vanilla Kubernetes in some form. And, maintenance-wise, yes, I agree that it'd be nicer for this feature to be decoupled from the RS controller. Then, maybe we - sig-scheduling - could have control of that kind of scaling down that considers the scheduling constraints.
One off-topic thought that comes to mind: it might be an interesting idea to implement the Evacuation API controller within the scheduler. For example, let's say the cluster is packed and one RS is being scaled down (= we can delete one Pod).
Having read through this entire issue and all the linked ones, the one theme that comes up is adding additional functionality to the core RS controller, which, as many have mentioned, is currently in the critical path for a lot of applications. I can understand that desire, but at the same time, as one of the owners of the RS controller, I need to make sure we don't bloat the functionality of controllers unnecessarily before fully considering all possible options. I've seen several comments quickly negating the usefulness of PodDeletionCost, which is already available, although theoretically not fully stable, which means we have some room for improvements based on feedback. There's a valuable suggestion from @bgrant0607 in #4301 (comment):
I'm also against mixing the functionality of a controller with the scheduler, so the idea of injecting PodAffinity, PodTopologySpread, or other strictly scheduler functionality into any of the existing controllers is definitely not desirable from a maintenance perspective. Having said that, I welcome anyone interested to jump on any of the next sig-apps calls to discuss this topic more.
I have added this issue to the Motivation section (see number 3) of the Evacuation API, and added a Descheduling and Downscaling section that describes how this problem could be solved. Please let me know if you see any issues with the proposed solution.
The problem is that PodDeletionCost is static, and we have to keep updating it to control the order of deletion flexibly.
Separate from this Evacuation API idea, I think we can significantly improve the usefulness of pod-deletion-cost with the proposal from #123541 to extend the ... I say this because currently, the Evacuation API KEP (kubernetes/enhancements#4565) does not seem to address the core purpose of pod-deletion-cost, which is influencing which replicas are removed during a down-scale (when reducing the replicas of a Deployment).
Left a comment on your issue to move the discussion.
The KEP does not implement the whole stack (scheduling, downscaling), but the new API tries to address exactly this problem of influencing which pods get deleted and when.
/sig apps
/sig scheduling
/kind feature
/assign
Want to get feedback on this proposal. I can proceed to KEP process later, if it looks worthy for it to everyone.
Summary
Refactor the ReplicaSet scaling-down implementation for better extensibility,
and take inter-Pod scheduling constraints into account during scaling down.
Motivation
When scaling down happens, the ReplicaSet controller determines which Pod(s) to remove based on starting time etc. (ref), but not on scheduling constraints.
Inter-Pod scheduling constraints (PodAffinity/PodTopologySpread) depend on other Pods' placement, but they're honored only at scheduling time.
It'd be better if we could, at least, keep spreading or affinity to some extent during scaling down in vanilla Kubernetes, that is, without installing the descheduler etc.
Actually, we've seen many issues asking for this feature; the oldest one I could find is #4301!
Goal
Introduce the ScaleDownScorePlugin/ScaleDownNormalizeScorePlugin interface, and implement each scaling-down factor as a plugin.
Non Goal
This proposal doesn't mean -
- the scheduling constraints are always guaranteed after scaling down; it's best-effort, and only Pods owned by the ReplicaSet being scaled down are considered.
- accepting custom ScaleDownScorePlugin/ScaleDownNormalizeScorePlugin from users, like the scheduling framework does. But, that's out-of-scope in this proposal.
(An elaborated explanation of these Non Goals: #124306 (comment))
Proposal
We'll introduce a new interface named ScaleDownScorePlugin/ScaleDownNormalizeScorePlugin.
- Call Score of each plugin concurrently.
- Call NormalizeScore of each plugin one by one.
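To make this concrete, here is a minimal Go sketch of what the interfaces and the scoring flow could look like; the exact signatures, the PodWithScore type, and the PickPodsToDelete helper are illustrative assumptions, not the actual implementation.

```go
// Illustrative sketch only: the interface names come from the proposal, but
// the signatures and helper below are assumptions.
package scaledown

import (
	"sort"
	"sync"

	v1 "k8s.io/api/core/v1"
)

// PodWithScore pairs a scale-down candidate Pod with its aggregated score.
type PodWithScore struct {
	Pod   *v1.Pod
	Score int64
}

// ScaleDownScorePlugin scores a single candidate Pod; a higher score means
// the Pod is a better candidate for deletion.
type ScaleDownScorePlugin interface {
	Name() string
	Score(pod *v1.Pod) int64
}

// ScaleDownNormalizeScorePlugin adjusts the aggregated scores of all
// candidates, e.g. to keep topology spreading during scale-down.
type ScaleDownNormalizeScorePlugin interface {
	Name() string
	NormalizeScore(pods []PodWithScore)
}

// PickPodsToDelete calls Score of each plugin concurrently, aggregates the
// scores, calls NormalizeScore of each plugin one by one, and returns the n
// highest-scored Pods.
func PickPodsToDelete(candidates []*v1.Pod, scorers []ScaleDownScorePlugin,
	normalizers []ScaleDownNormalizeScorePlugin, n int) []*v1.Pod {

	scores := make([]int64, len(candidates))
	var mu sync.Mutex
	var wg sync.WaitGroup
	for _, s := range scorers {
		wg.Add(1)
		go func(s ScaleDownScorePlugin) {
			defer wg.Done()
			for i, pod := range candidates {
				v := s.Score(pod)
				mu.Lock()
				scores[i] += v
				mu.Unlock()
			}
		}(s)
	}
	wg.Wait()

	pods := make([]PodWithScore, len(candidates))
	for i, pod := range candidates {
		pods[i] = PodWithScore{Pod: pod, Score: scores[i]}
	}
	// NormalizeScore plugins run sequentially and may rewrite the scores.
	for _, norm := range normalizers {
		norm.NormalizeScore(pods)
	}

	sort.Slice(pods, func(i, j int) bool { return pods[i].Score > pods[j].Score })
	out := make([]*v1.Pod, 0, n)
	for i := 0; i < n && i < len(pods); i++ {
		out = append(out, pods[i].Pod)
	}
	return out
}
```

With this shape, the existing criteria and the new PodAffinity/PodTopologySpread factors all become plugins that only need to agree on the score range.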
How each plugin is implemented
We'll implement plugins for PodAffinity / PodTopologySpread (and the existing logic to determine which Pods to delete).
Each implementation would be like the following:
PodAffinity
It'll be implemented as a ScaleDownScorePlugin.
It gives a high score to a candidate Pod if the Node that the candidate Pod is running on doesn't have a Pod matching the PodAffinity of the candidate Pod.
In order to lower the cost of calculation, it's probably better to use the scheduler's snapshot in the ReplicaSet controller too.
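A hedged sketch of such a plugin, reusing the plugin shape from the sketch above: affinity matching is heavily simplified (required terms only, node-level co-location, no topologyKey/namespace handling) and the score constant is an arbitrary assumption.

```go
// Illustrative sketch only: a PodAffinity scale-down scorer, heavily simplified.
package scaledown

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

const affinityViolatedScore = 100 // hypothetical score constant

// PodAffinityScorer scores scale-down candidates: a Pod whose Node has no
// peer matching its PodAffinity gets a high score (preferred for deletion).
type PodAffinityScorer struct {
	// PodsByNode would typically come from a shared snapshot/informer cache.
	PodsByNode map[string][]*v1.Pod
}

func (s *PodAffinityScorer) Name() string { return "PodAffinity" }

func (s *PodAffinityScorer) Score(pod *v1.Pod) int64 {
	if pod.Spec.Affinity == nil || pod.Spec.Affinity.PodAffinity == nil {
		return 0
	}
	for _, term := range pod.Spec.Affinity.PodAffinity.RequiredDuringSchedulingIgnoredDuringExecution {
		sel, err := metav1.LabelSelectorAsSelector(term.LabelSelector)
		if err != nil {
			continue
		}
		matched := false
		for _, other := range s.PodsByNode[pod.Spec.NodeName] {
			if other.UID != pod.UID && sel.Matches(labels.Set(other.Labels)) {
				matched = true
				break
			}
		}
		if !matched {
			// This term is no longer satisfied by a co-located Pod:
			// the candidate loses little by being deleted.
			return affinityViolatedScore
		}
	}
	return 0 // affinity still satisfied on this Node; prefer keeping the Pod
}
```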
PodTopologySpread
It'll be implemented as a ScaleDownNormalizeScorePlugin.
Example:
Before PodTopologySpread NormalizeScore(): (score table omitted)
After PodTopologySpread NormalizeScore(): (score table omitted)
Before PodTopologySpread, all Pods from Domain-c could be removed, but afterwards it's not likely that all Pods from the same domain are removed.
When actually implementing it, we can consider more; for example, in the above case, if the controller is trying to remove 5 Pods, all Pods in domain-b could be removed at the same time. So, maybe we want to avoid all Pods having the same score after the normalization (e.g., change it to PodD:99, PodE:100, PodF:101).
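A hedged sketch of such a normalize plugin, reusing the PodWithScore type from the sketch in the Proposal section; the boost and stagger values are arbitrary assumptions.

```go
// Illustrative sketch only: a PodTopologySpread-style NormalizeScore plugin.
// It boosts the scores of Pods in over-populated domains and staggers scores
// inside a domain so that not all of them share the top score.
package scaledown

import (
	"sort"

	v1 "k8s.io/api/core/v1"
)

// TopologySpreadNormalizer groups candidates by a topology label on their
// Nodes (looked up via NodeLabels) and rewrites the scores.
type TopologySpreadNormalizer struct {
	TopologyKey string                       // e.g. "topology.kubernetes.io/zone"
	NodeLabels  map[string]map[string]string // node name -> labels (from a snapshot)
}

func (n *TopologySpreadNormalizer) Name() string { return "PodTopologySpread" }

func (n *TopologySpreadNormalizer) NormalizeScore(pods []PodWithScore) {
	domainOf := func(p *v1.Pod) string {
		return n.NodeLabels[p.Spec.NodeName][n.TopologyKey]
	}

	// Count candidates per topology domain.
	count := map[string]int{}
	for _, p := range pods {
		count[domainOf(p.Pod)]++
	}

	// Deterministic order so the per-domain stagger is stable.
	sort.SliceStable(pods, func(i, j int) bool { return pods[i].Pod.Name < pods[j].Pod.Name })

	indexInDomain := map[string]int64{}
	for i := range pods {
		d := domainOf(pods[i].Pod)
		// Pods in larger domains get a bigger boost, so deletions drain the
		// most-populated domains first and spreading is roughly preserved.
		boost := int64(count[d]) * 100
		// Stagger scores within a domain (e.g. 100, 99, 98, ...) so that a
		// multi-Pod scale-down doesn't take every Pod from the same domain.
		pods[i].Score += boost - indexInDomain[d]
		indexInDomain[d]++
	}
}
```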
Existing sorting logic
Also note that the existing logic to determine which Pod(s) to remove (starting time, deletion-cost, etc.) can also be implemented as a ScaleDownScorePlugin.
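For illustration, a hedged sketch of how part of that existing logic (unscheduled or not-running Pods first, then younger Pods first) could be expressed as one such plugin; the score values are arbitrary assumptions.

```go
// Illustrative sketch only: folding some of the existing deletion criteria
// into a single hypothetical ScaleDownScorePlugin.
package scaledown

import (
	"time"

	v1 "k8s.io/api/core/v1"
)

// LegacyCriteriaScorer mimics part of today's ordering: Pods that are not
// scheduled or not running get the highest scores, and among running Pods,
// more recently started Pods score higher (i.e. are deleted first).
type LegacyCriteriaScorer struct{}

func (LegacyCriteriaScorer) Name() string { return "LegacyCriteria" }

func (LegacyCriteriaScorer) Score(pod *v1.Pod) int64 {
	if pod.Spec.NodeName == "" {
		return 1000 // not scheduled yet: cheapest to remove
	}
	if pod.Status.Phase != v1.PodRunning {
		return 900 // scheduled but not running
	}
	if pod.Status.StartTime == nil {
		return 800
	}
	// Younger Pods get higher scores; the age contribution is capped at 800.
	ageMinutes := int64(time.Since(pod.Status.StartTime.Time).Minutes())
	score := 800 - ageMinutes
	if score < 0 {
		score = 0
	}
	return score
}
```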
Risk (Downside) and Mitigation
Negative impact on the throughput of the ReplicaSet controller
Scaling down would be slower than with the current implementation because of the additional calculation.
I believe keeping the current scale-down throughput is not a hard/critical requirement.
But also, given the controller has workers that run concurrently and are shared between the scale-up/scale-down processes, we cannot say there'd be zero impact on scale-up throughput.
We have to make sure the throughput (both scale-up and scale-down) is still acceptable after the implementation change.
Backwards compatibility
#124306 (comment)
If we just implement the proposal and make it the default behavior, we would break the existing scaling-down behavior.
If we are concerned about breaking that behavior with this proposal, we can, for example, ...
I'd prefer (1). If we choose (2), the benefit from this proposal would be small.
Alternative
Somehow use controller.kubernetes.io/pod-deletion-cost; but in any case, we'd likely end up having to update Pods frequently, and I didn't choose that way.
(An elaborated explanation of why I didn't choose the deletion-cost idea: #124306 (comment))
Reference