
Having programable, the waiting time before the "autoscale-up/down" is effective #56335

Closed
rsd35410 opened this issue Nov 24, 2017 · 25 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling.

Comments

@rsd35410

rsd35410 commented Nov 24, 2017

Is this a BUG REPORT or FEATURE REQUEST?:

FEATURE REQUEST
@kubernetes/sig-autoscaling-feature-requests
@kubernetes/sig/autoscaling

What happened:

There is no way to change the waiting time before the "autoscale-up/down" becomes effective.
In addition, when an "autoscale-up" is performed, the workload is not scaled directly to the max number defined inside the HPA.

What you expected to happen:

Having these parameters configurable in the template definition of the HPA, like:

  • "scale-up" waiting time
  • "scale-down" waiting time
  • "scale-up" step
  • "scale-down" step
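As a sketch only, such per-HPA knobs might look like the following (the four `scale*` field names are invented for illustration and are not part of any real HPA API; the rest follows the autoscaling/v2beta1 schema):

```yaml
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: example-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example
  minReplicas: 2
  maxReplicas: 100
  # Hypothetical fields matching the list above -- not a real API:
  scaleUpWaitSeconds: 60      # "scale-up" waiting time
  scaleDownWaitSeconds: 300   # "scale-down" waiting time
  scaleUpStep: 10             # max pods added per scaling event
  scaleDownStep: 5            # max pods removed per scaling event
```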

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@k8s-github-robot k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Nov 24, 2017
@rsd35410 rsd35410 changed the title Having the wating time before the "scale-up/down" is effective programable Having programable, the waiting time before the "autoscale-up/down" is effective Nov 24, 2017
@jhorwit2
Contributor

/sig autoscaling
/kind feature

@k8s-ci-robot k8s-ci-robot added sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. kind/feature Categorizes issue or PR as related to a new feature. labels Nov 25, 2017
@k8s-github-robot k8s-github-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Nov 25, 2017
@mattjmcnaughton
Contributor

Hi @rsd35410! Cool idea!

Just a couple of clarifying questions to make sure I'm understanding. Currently, it's possible to specify the HorizontalPodAutoscalerUpscaleForbiddenWindow and HorizontalPodAutoscalerDownscaleForbiddenWindow when creating the HorizontalController. I believe changing these settings would give you control over the "scale-up/scale-down" intervals you mention. However, these settings are global across all horizontal pod autoscalers (as they are set on the controller). Are you hoping to have multiple autoscalers in the same cluster, each with different upscale and downscale forbidden windows?
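For reference, these controller-level forbidden windows correspond to kube-controller-manager flags (the values shown are the defaults as documented around that time):

```
kube-controller-manager \
  --horizontal-pod-autoscaler-upscale-delay=3m0s \
  --horizontal-pod-autoscaler-downscale-delay=5m0s
```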

Curious about your thoughts, @DirectXMan12 and @MaciekPytel? I personally think allowing window specification on a per-HPA basis could be a useful change (especially given our discussions on autoscaler performance, @MaciekPytel). The main downside I see is that it requires changing the public API. Fortunately, I don't think it would need to be a breaking change, as we could always fall back to the controller values. I'm happy to take a stab at implementing these changes if we do go down this path.

For your second suggestion, do you mind providing an example of a time when "when the "autoscale-up" is performed, it is not directly scaled to the max number defined inside the HPA"? Do you mean that if you have 3 pods operating at 120% of their desired capacity, the number of desired pods is the same, regardless of whether maxReplicas is 10 or 100?

@rsd35410
Author

Hi @mattjmcnaughton

Yes, my idea is to have multiple autoscalers in the same cluster, possibly with different 'custom metrics', and thus with different upscale and downscale forbidden windows for each.

For my second suggestion, I will try to explain my use case:
I have developed an application with my own 'custom metric', which is more or less the number of 'jobs' in a message queue.
So, according to the number of 'jobs', the HPA autoscales so that there are as many job processors as there are jobs in the message queue.
Unfortunately, when the HPA scales up, the number of replicas is not immediately the one defined by the 'custom metric'. It takes time to grow to the value of the 'custom metric'.

That's why I'm asking for a kind of 'step' in order to have this programmable.

I hope my explanation is clear enough.

@MaciekPytel
Contributor

Hi @rsd35410, @mattjmcnaughton,
Sorry for the late reply. I think it makes sense to add more options to control HPA speed. Rather than just adding a few fields, I'd prefer a more systematic approach: a wider discussion in the SIG on how to give users more control and improve HPA reaction speed in general. However, we're already too late for 1.9 anyway, and most people are stuck in last-minute pre-release bugfixes, so this may not be the best time. Let's come back to this after the 1.9 release, if that's ok with you.

@DirectXMan12
Contributor

specification on a per hpa basis could be a useful change

Yeah, we've discussed this quite a bit in the past, and the conclusion has always been that the forbidden window is an implementation detail that we really shouldn't need at all, so we shouldn't expose it as a knob. There's an argument to be made about a pragmatic approach, but even then, it probably shouldn't be an API field (at most an annotation).

As for the step field, I'm not certain how it helps your use case, but then I'm not certain I understand exactly what you're describing.

As for the use case itself, it's probably a better idea to scale on the ratio between incoming and processed messages, weighted slightly so that you also work down any additional backlog.

The problem with scaling directly on the number of jobs in the queue is that picking a good target number is strange (do you actually want to always have 3 items in the queue, or would you rather just be able to process items as they come in?), and beyond that, increasing the number of replicas in proportion to the queue length is not necessarily the best way to scale:

What if jobs are coming in just as fast as you are processing them, but there's a 10-item backlog? Then, with a target of 1 backlog item, you'll get 10 times as many pods as you have now, which is probably not what you want -- you probably just want 1 or 2 extra pods to work down the backlog.
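The pitfall described above follows from the HPA's core proportional formula, desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric). A minimal sketch (simplified: no tolerance band, no min/max clamping), assuming a queue-length metric averaged per pod:

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float) -> int:
    """HPA proportional formula, stripped of tolerance and clamping."""
    return math.ceil(current_replicas * current_metric / target_metric)

# Steady state: each of 10 pods sees an average backlog of 1 item (the target).
print(desired_replicas(10, 1.0, 1.0))   # -> 10

# Same processing rate, but a lingering backlog of 10 items per pod:
# the formula asks for 10x the pods, when 1-2 extra pods would drain it.
print(desired_replicas(10, 10.0, 1.0))  # -> 100
```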

@mattjmcnaughton
Contributor

@MaciekPytel @DirectXMan12 that makes total sense! I agree with not wanting to "double down" on adding customization to an implementation detail that you feel shouldn't really be necessary. I'll give some thought to if there is a good alternative and we can sync after the 1.9 release. Good luck getting 1.9 out the door :)

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 4, 2018
@MaciekPytel
Contributor

/remove-lifecycle stale

I'm not convinced we want to expose the fields as requested in this issue, but the problem of arbitrary forbidden periods and 2x limit on scale-up is still there and needs to be addressed.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 5, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 3, 2018
@DirectXMan12
Contributor

For future readers of this issue, we're brainstorming improvements here: https://docs.google.com/document/d/1Gy90Rbjazq3yYEUL-5cvoVBgxpzcJC9vcfhAkkhMINs/edit#heading=h.9oka059ig9n5

@foxish
Contributor

foxish commented Jun 26, 2018

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 26, 2018
@foxish
Contributor

foxish commented Jun 26, 2018

Specifying the HPA windows for up-scale and down-scale on a per-HPA basis makes a lot of sense. We have several HPAs and some of them need aggressive scale-up and some of them do not (and should not). @DirectXMan12, is this specific use of instance-specific cooldown also being discussed as part of the effort you linked?

@DirectXMan12
Contributor

@foxish yes, it's on the list of things we're discussing.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 14, 2018
@varkey

varkey commented Oct 15, 2018

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 15, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 13, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 12, 2019
@gliush
Contributor

gliush commented Feb 25, 2019

Hey there, I'm working on an RFC to add some configuration parameters to the HPA. Would the following RFC help solve your issues?
https://docs.google.com/document/d/1Su2tzHEtjc7_orTVbKZEhSFGWiHk16JqA4Z7zJU2oNs/edit
I'd appreciate any feedback.

@dturn

dturn commented Mar 15, 2019

Are the "scale-up" & "scale-down" waiting time ideas still being discussed in this issue? If not, can someone point me to where I should look? The Google doc seems to have gone quiet. I'd be happy to open a new issue/doc; I have an app that I'd like to scale via HPA, but it is rather sensitive to thrashing.

@gliush
Contributor

gliush commented Apr 2, 2019

@dturn: if you're talking about my document, the work is being done here:
k8s PR with API changes: #74525
KEP issue: kubernetes/enhancements#853
KEP itself: kubernetes/enhancements#883

@sqerison

sqerison commented Apr 9, 2019

I'm also interested in having "scale-up" & "scale-down" waiting times and a "scale-step", because the HPA adds as many pods as it can rather than scaling in proportion to how far the metric is over the threshold.
/remove-lifecycle stale

@gliush
Contributor

gliush commented Apr 9, 2019

@sqerison : This is covered by my KEP.

@sqerison

sqerison commented Apr 9, 2019

@gliush, I see, thanks. But it is still not merged and not released.
I'm working with a 1.12.* version and not planning to upgrade soon.

Is there some possibility to add new pods smoothly, e.g. 10 at a time?

Currently, the HPA jumps from 100 to 200 pods in just one minute when the metric is only 10-20% above the threshold.
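That 100-to-200 jump is consistent with the controller's built-in per-step scale-up cap, which (in the controller source of that era) limits each scaling step to roughly double the current replica count, with a small floor. A sketch of that cap, assuming the historical factor of 2 and minimum of 4:

```python
def scale_up_limit(current_replicas: int,
                   factor: float = 2.0, minimum: int = 4) -> int:
    """Max replicas reachable in one scale-up step, mirroring the HPA
    controller's calculateScaleUpLimit (factor/minimum as in its source)."""
    return max(int(factor * current_replicas), minimum)

print(scale_up_limit(100))  # -> 200: 100 pods can double to 200 in one step
print(scale_up_limit(1))    # -> 4: very small deployments may quadruple
```

The desired count itself still comes from the proportional formula, so with a metric far over target the controller will scale as high as this cap allows on every iteration.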

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
