
Having programable, the waiting time before the "autoscale-up/down" is effective #56335

Closed
rsd35410 opened this issue Nov 24, 2017 · 25 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling.

Comments

@rsd35410

rsd35410 commented Nov 24, 2017

Is this a BUG REPORT or FEATURE REQUEST?:

FEATURE REQUEST
@kubernetes/sig-autoscaling-feature-requests
@kubernetes/sig/autoscaling

What happened:

There is no way to change the waiting time before the "autoscale-up/down" becomes effective.
In addition, when an "autoscale-up" is performed, the workload is not scaled directly to the max number defined inside the HPA.

What you expected to happen:

Having these parameters configurable in the template definition of the HPA, like:

  • "scale-up" waiting time
  • "scale-down" waiting time
  • "scale-up" step
  • "scale-down" step
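As a sketch only, such per-HPA knobs might look like the following (the four `scale*` field names are invented for illustration and are not part of any real HPA API; the rest follows the autoscaling/v2beta1 schema):

```yaml
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: example-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example
  minReplicas: 2
  maxReplicas: 100
  # Hypothetical fields matching the list above -- not a real API:
  scaleUpWaitSeconds: 60      # "scale-up" waiting time
  scaleDownWaitSeconds: 300   # "scale-down" waiting time
  scaleUpStep: 10             # max pods added per scaling event
  scaleDownStep: 5            # max pods removed per scaling event
```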

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@k8s-github-robot k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Nov 24, 2017
@rsd35410 rsd35410 changed the title Having the wating time before the "scale-up/down" is effective programable Having programable, the waiting time before the "autoscale-up/down" is effective Nov 24, 2017
@jhorwit2
Contributor

/sig autoscaling
/kind feature

@k8s-ci-robot k8s-ci-robot added sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. kind/feature Categorizes issue or PR as related to a new feature. labels Nov 25, 2017
@k8s-github-robot k8s-github-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Nov 25, 2017
@mattjmcnaughton
Contributor

Hi @rsd35410! Cool idea!

Just a couple of clarifying questions to make sure I'm understanding. Currently, it's possible to specify the HorizontalPodAutoscalerUpscaleForbiddenWindow and HorizontalPodAutoscalerDownscaleForbiddenWindow when creating the HorizontalController. I believe changing these settings would give you control over the "scale-up/scale-down" intervals you mention. However, these settings are global across all horizontal pod autoscalers (as they are set on the controller). Are you hoping to have multiple autoscalers in the same cluster, each with different upscale and downscale forbidden windows?
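For reference, these controller-level forbidden windows correspond to kube-controller-manager flags (the values shown are the defaults as documented around that time):

```
kube-controller-manager \
  --horizontal-pod-autoscaler-upscale-delay=3m0s \
  --horizontal-pod-autoscaler-downscale-delay=5m0s
```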

Curious about your thoughts, @DirectXMan12 and @MaciekPytel? I personally think allowing window specification on a per-HPA basis could be a useful change (especially given our discussions on autoscaler performance, @MaciekPytel). The main downside I see is that it requires changing the public API. Fortunately, I don't think it would need to be a breaking change, as we could always fall back to the controller values. I'm happy to take a stab at implementing these changes if we do go down this path.

For your second suggestion, do you mind providing an example of a time when "when the "autoscale-up" is performed, it is not directly scaled to the max number defined inside the HPA"? Do you mean that if you have 3 pods operating at 120% of their desired capacity, the number of desired pods is the same, regardless of whether maxReplicas is 10 or 100?

@rsd35410
Author

Hi @mattjmcnaughton

Yes, my idea is to have multiple autoscalers in the same cluster, possibly with different 'custom metrics', and thus with different upscale and downscale forbidden windows for each.

For my second suggestion, I will try to explain my use case:
I have developed an application with my own 'custom metric', which is more or less the number of 'jobs' in a message queue.
So, according to the number of 'jobs', the HPA autoscales so that there are as many job processors as there are jobs in the message queue.
Unfortunately, when the HPA scales up, the number of replicas is not immediately the one defined by the 'custom metric'. It takes time to grow to the value of the 'custom metric'.

That's why I'm asking for a kind of 'step' in order to have this programmable.

I hope my explanation is clear enough.

@MaciekPytel
Contributor

Hi @rsd35410, @mattjmcnaughton,
Sorry for the late reply. I think it makes sense to add more options to control HPA speed. Rather than just adding a few fields, I'd prefer a more systematic approach: a wider discussion in the SIG on how to give users more control and improve HPA reaction speed in general. However, we're already too late for 1.9 anyway, and most people are stuck in last-minute pre-release bugfixes, so this may not be the best time. Let's come back to this after the 1.9 release, if that's ok with you.

@DirectXMan12
Contributor

specification on a per hpa basis could be a useful change

Yeah, we've discussed this quite a bit in the past, and the conclusion has always been that the forbidden window is an implementation detail that we really shouldn't need at all, so we shouldn't expose it as a knob. There's an argument to be made about a pragmatic approach, but even then, it probably shouldn't be an API field (at most an annotation).

As for the step field, I'm not certain how it helps your use case, but then I'm not certain I understand exactly what you're describing.

As for the use case itself, it's probably a better idea to scale on the ratio between incoming and processed messages, weighted slightly so that you also work down any additional backlog.

The problem with scaling directly on the number of jobs in the queue is that picking a good target number is strange (do you actually want to always have 3 items in the queue, or would you rather just be able to process items as they come in?), and beyond that, increasing the number of replicas in proportion to the queue length is not necessarily the best way to scale:

What if jobs are coming in just as fast as you are processing them, but there's a 10-item backlog? Then, with a target of 1 backlog item, you'll get 10 times as many pods as you have now, which is probably not what you want -- you probably just want 1 or 2 extra pods to work down the backlog.
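The pitfall described above follows from the HPA's core proportional formula, desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric). A minimal sketch (simplified: no tolerance band, no min/max clamping), assuming a queue-length metric averaged per pod:

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float) -> int:
    """HPA proportional formula, stripped of tolerance and clamping."""
    return math.ceil(current_replicas * current_metric / target_metric)

# Steady state: each of 10 pods sees an average backlog of 1 item (the target).
print(desired_replicas(10, 1.0, 1.0))   # -> 10

# Same processing rate, but a lingering backlog of 10 items per pod:
# the formula asks for 10x the pods, when 1-2 extra pods would drain it.
print(desired_replicas(10, 10.0, 1.0))  # -> 100
```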

@mattjmcnaughton
Contributor

@MaciekPytel @DirectXMan12 that makes total sense! I agree with not wanting to "double down" on adding customization to an implementation detail that you feel shouldn't really be necessary. I'll give some thought to if there is a good alternative and we can sync after the 1.9 release. Good luck getting 1.9 out the door :)

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 4, 2018
@MaciekPytel
Contributor

/remove-lifecycle stale

I'm not convinced we want to expose the fields as requested in this issue, but the problem of arbitrary forbidden periods and 2x limit on scale-up is still there and needs to be addressed.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 5, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 3, 2018
@DirectXMan12
Contributor

For future readers of this issue, we're brainstorming improvements here: https://docs.google.com/document/d/1Gy90Rbjazq3yYEUL-5cvoVBgxpzcJC9vcfhAkkhMINs/edit#heading=h.9oka059ig9n5

@foxish
Contributor

foxish commented Jun 26, 2018

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 26, 2018
@foxish
Contributor

foxish commented Jun 26, 2018

Specifying the HPA windows for up-scale and down-scale on a per-HPA basis makes a lot of sense. We have several HPAs and some of them need aggressive scale-up and some of them do not (and should not). @DirectXMan12, is this specific use of instance-specific cooldown also being discussed as part of the effort you linked?

@DirectXMan12
Contributor

@foxish yes, it's on the list of things we're discussing.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 14, 2018
@varkey

varkey commented Oct 15, 2018

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 15, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 13, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 12, 2019
@gliush
Contributor

gliush commented Feb 25, 2019

Hey there, I'm working on an RFC to add some configuration parameters to the HPA. Would the following RFC help solve your issues?
https://docs.google.com/document/d/1Su2tzHEtjc7_orTVbKZEhSFGWiHk16JqA4Z7zJU2oNs/edit
I'd appreciate any feedback.

@dturn

dturn commented Mar 15, 2019

Are the "scale-up" & "scale-down" waiting time ideas still being discussed in this issue? If not, can someone point me to where I should look? The Google doc seems to have gone quiet. I'd be happy to open a new issue/doc; I have an app that I'd like to scale via HPA, but it is rather sensitive to thrashing.

@gliush
Contributor

gliush commented Apr 2, 2019

@dturn: if you're talking about my document, the work is being done here:
k8s PR with API changes: #74525
KEP issue: kubernetes/enhancements#853
KEP itself: kubernetes/enhancements#883

@sqerison

sqerison commented Apr 9, 2019

I'm also interested in having "scale-up" & "scale-down" waiting times and a "scale-step", because the HPA adds as many pods as it can rather than scaling in proportion to how far the metric is over the threshold.
/remove-lifecycle stale

@gliush
Contributor

gliush commented Apr 9, 2019

@sqerison : This is covered by my KEP.

@sqerison

sqerison commented Apr 9, 2019

@gliush, I see, thanks. But it is still not merged and not released.
I'm working with a 1.12.* version and not planning to upgrade soon.

Is there some possibility to add new pods smoothly, e.g. 10 at a time?

Currently, the HPA jumps from 100 to 200 pods in just one minute when the metric is only 10-20% above the threshold.
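That 100-to-200 jump is consistent with the controller's built-in per-step scale-up cap, which (in the controller source of that era) limits each scaling step to roughly double the current replica count, with a small floor. A sketch of that cap, assuming the historical factor of 2 and minimum of 4:

```python
def scale_up_limit(current_replicas: int,
                   factor: float = 2.0, minimum: int = 4) -> int:
    """Max replicas reachable in one scale-up step, mirroring the HPA
    controller's calculateScaleUpLimit (factor/minimum as in its source)."""
    return max(int(factor * current_replicas), minimum)

print(scale_up_limit(100))  # -> 200: 100 pods can double to 200 in one step
print(scale_up_limit(1))    # -> 4: very small deployments may quadruple
```

The desired count itself still comes from the proportional formula, so with a metric far over target the controller will scale as high as this cap allows on every iteration.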

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
