
Int overflow in hpa causing incorrect replica count #126892

Open
sheepster1 opened this issue Aug 23, 2024 · 5 comments · May be fixed by #126979
Assignees
Labels
  • kind/bug: Categorizes issue or PR as related to a bug.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
  • sig/autoscaling: Categorizes an issue or PR as relevant to SIG Autoscaling.

Comments

@sheepster1

What happened?

The setup:
I am using KEDA with the Prometheus scaler. The query returns the lag of the message queue I am consuming from, and the threshold is set to 0.1.

What happened:
The lag kept increasing for a long time, and the replica count reached the configured maximum as expected. Everything ran fine for a while, but once the lag value reached 214,748,364, the HPA reduced the replicas from the max limit to 1.

What I think is the problem:
When the lag passes 214,748,364, the calculation here divides it by the threshold of 0.1, and the result exceeds the max int32 value; the overflow causes the HPA to scale to the minimum value, 1.
It also looks like several other places in this file cast a 64-bit float to a 32-bit int. Should there be an overflow check everywhere this is done?
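For illustration, here is a minimal Go sketch of the suspected failure mode. This is not the HPA source itself; it only assumes the desired replica count is computed roughly as ceil(metricValue / threshold) and then converted to int32, and toInt32Clamped is a hypothetical example of the kind of guard asked about above.

```go
package main

import (
	"fmt"
	"math"
)

// toInt32Clamped is a hypothetical guard (not from the Kubernetes codebase)
// that clamps a float64 into the int32 range before converting.
func toInt32Clamped(v float64) int32 {
	if v > math.MaxInt32 {
		return math.MaxInt32
	}
	if v < math.MinInt32 {
		return math.MinInt32
	}
	return int32(v)
}

func main() {
	metricValue := 214_748_365.0 // external metric value just past the reported boundary
	threshold := 0.1             // target value from the report

	// ceil(214,748,365 / 0.1) ≈ 2,147,483,650, which exceeds math.MaxInt32 (2,147,483,647).
	desired := math.Ceil(metricValue / threshold)

	// Per the Go spec, converting an out-of-range float64 to int32 yields an
	// implementation-dependent value, so this is not the large positive count
	// the caller intended.
	unchecked := int32(desired)

	fmt.Printf("desired=%.0f unchecked=%d clamped=%d\n", desired, unchecked, toInt32Clamped(desired))
}
```

On the cluster in the report, the out-of-range result apparently ended up at or below the minimum replica bound, which would explain the scale-down to 1; a clamp like the sketch above would instead keep the count pinned at the top of the int32 range.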

What did you expect to happen?

I expected the replica count to stay at the max value, or alternatively to get an error indicating that the external metric value exceeded the maximum supported value.

How can we reproduce it (as minimally and precisely as possible)?

Use an external metric, set its value above 214,748,364, and set the threshold to 0.1.
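As a rough aid for picking reproduction values (illustrative only, not part of the reproduction itself): any metric value above math.MaxInt32 * threshold, roughly 214,748,364.7 for a threshold of 0.1, pushes the computed replica count past the int32 range.

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	threshold := 0.1
	boundary := float64(math.MaxInt32) * threshold // ≈ 214,748,364.7 for a 0.1 threshold

	// Values at or below the boundary stay inside int32; anything above it overflows.
	for _, v := range []float64{214_748_364, 214_748_365} {
		overflows := v/threshold > float64(math.MaxInt32)
		fmt.Printf("metric=%.0f boundary=%.1f overflows int32: %v\n", v, boundary, overflows)
	}
}
```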

Anything else we need to know?

No response

Kubernetes version

1.29

Cloud provider

AWS EKS

OS version


Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@sheepster1 added the kind/bug label on Aug 23, 2024
@k8s-ci-robot added the needs-sig and needs-triage labels on Aug 23, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@neolit123
Member

/sig autoscaling

@k8s-ci-robot added the sig/autoscaling label and removed the needs-sig label on Aug 27, 2024
@omerap12
Member

/assign

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Nov 25, 2024
@vaibhav2107
Member

As the PR is still open,
/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Dec 1, 2024