HPA: Fix int overflow in GetExternalPerPodMetricReplicas #127050

Open: omerap12 wants to merge 1 commit into master

Conversation

omerap12
Member

@omerap12 omerap12 commented Sep 1, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

Addresses potential integer overflows in the GetExternalPerPodMetricReplicas function, so that extreme metric values no longer produce an unexpected replica count.

Which issue(s) this PR fixes:

As part of #127022, we've identified a need to address potential integer overflows in our scaling calculations. This PR implements the fix for the GetExternalPerPodMetricReplicas function.

Special notes for your reviewer:

Does this PR introduce a user-facing change?

None

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

None

Signed-off-by: Omer Aplatony <omerap12@gmail.com>
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 1, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Sep 1, 2024
@k8s-ci-robot
Contributor

Hi @omerap12. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 1, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: omerap12
Once this PR has been reviewed and has the lgtm label, please assign mwielgus for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@@ -369,9 +369,21 @@ func (c *ReplicaCalculator) GetExternalPerPodMetricReplicas(statusReplicas int32
 	usageRatio := float64(usage) / (float64(targetUsagePerPod) * float64(replicaCount))
 	if math.Abs(1.0-usageRatio) > c.tolerance {
 		// update number of replicas if the change is large enough
-		replicaCount = int32(math.Ceil(float64(usage) / float64(targetUsagePerPod)))
+		replicaCountResult := math.Ceil(float64(usage) / float64(targetUsagePerPod))
Member

Should we compute the ratio first, check it against the bound, and only then apply math.Ceil and convert to int32?
Like this:

replicaCountResult := float64(usage) / float64(targetUsagePerPod)
if replicaCountResult > float64(math.MaxInt32) {
   replicaCount = math.MaxInt32
} else {
   replicaCount = int32(math.Ceil(replicaCountResult))
}

Member Author

Yes, that makes sense. I'll adjust, thanks :)

expectedReplicas: math.MaxInt32,
metric: &metricInfo{
name: "qps",
levels: []int64{math.MaxInt64},
Member

I don't quite understand the purpose of this test case. Does our test case test the effect of int overflow?

@googs1025
Member

I'm not quite following the backstory behind this fix. Any chance you could give me a scenario to reproduce it? Also, one thing that's got me curious – when it comes to handling integer overflows, why are we only addressing the maximum overflow and not the minimum? I'm wondering what kind of situations could lead to this issue.
If I've overlooked something, please point it out to me, thanks :)

@omerap12
Member Author

omerap12 commented Sep 1, 2024

I'm not quite following the backstory behind this fix. Any chance you could give me a scenario to reproduce it? Also, one thing that's got me curious – when it comes to handling integer overflows, why are we only addressing the maximum overflow and not the minimum? I'm wondering what kind of situations could lead to this issue. If I've overlooked something, please point it out to me, thanks :)

Hey @googs1025, thanks for your review! We might also want to consider addressing the minimum. Perhaps @sheepster1 could provide us with some additional insights.

@sheepster1

@googs1025 I wrote about how I encountered this issue over at #126892.
I'm not too familiar with the project. Could you elaborate on what other info would be relevant here?

The exact setup I am using is KEDA with the Prometheus scaler, querying a value large enough that value/0.1 exceeds maxInt. I think this can be mocked with a query like vector(300000000) and a threshold of 0.1, but I haven't tested this mock yet.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 30, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 30, 2024
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
Status: Needs Triage
5 participants