HPA scales up despite utilization being under target #129228

Open
onematchfox opened this issue Dec 16, 2024 · 3 comments
Labels
kind/bug, needs-triage, sig/autoscaling

Comments

onematchfox commented Dec 16, 2024

What happened?

We use Argo Rollouts to perform canary deployments of our services. During a canary deployment, new pods are brought up (the canary pods) which are included in the Status of the Rollout's scale subresource. When HPA is configured to scale on a metric with a high utilization ratio (generally memory), this results in HPA scaling out, despite the fact that the utilization is under the target.

This seems to be a result of the behaviour of the replica calculator where:

  • the recommended replica count is a simple function of the utilizationRatio and the total number of pods returned by the selector
  • missing pods with a utilization ratio below 1 are assumed to be part of a "scale down" and are treated as if they were consuming 100% of their resource requests (whereas in this case we are scaling out during a rollout)

In addition, the utilization ratio is never re-checked before the HPA increases the desired replicas, which leads to incorrect/misleading log messages like "New size: X; reason: memory resource utilization (percentage of request) above target".
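
For illustration, here is a minimal sketch of that calculation as I read it. The function name, parameters and hard-coded behaviour below are simplified and illustrative, not the actual replica calculator code; unready-pod handling and several edge cases are omitted:

package main

import (
	"fmt"
	"math"
)

// recommend is a simplified sketch of the HPA replica calculation as I
// understand it: the recommendation is driven by the usage ratio and the
// number of pods matched by the selector, not by the target workload's own
// replica count. Pods matched by the selector but missing metrics are
// assumed to be using 100% of their request when the ratio is below 1
// (the "scale down" assumption). targetFraction is e.g. 0.5 for a 50% target.
func recommend(currentReplicas int, usageRatio float64, podsWithMetrics, missingPods int, targetFraction, tolerance float64) int {
	if missingPods == 0 {
		if math.Abs(1.0-usageRatio) <= tolerance {
			return currentReplicas // change would be too small
		}
		// the recommendation scales with every pod matched by the selector,
		// not with the workload's own replica count
		return int(math.Ceil(usageRatio * float64(podsWithMetrics)))
	}
	if usageRatio < 1.0 {
		// "scale down" assumption: each missing pod is treated as if it were
		// using 100% of its request, i.e. contributes 1/targetFraction to the ratio
		usageRatio = (usageRatio*float64(podsWithMetrics) + float64(missingPods)/targetFraction) /
			float64(podsWithMetrics+missingPods)
	} else {
		// "scale up" assumption: missing pods are treated as using 0%
		usageRatio = usageRatio * float64(podsWithMetrics) / float64(podsWithMetrics+missingPods)
	}
	return int(math.Ceil(usageRatio * float64(podsWithMetrics+missingPods)))
}

func main() {
	// Repro below: 3 Deployment pods plus 3 canary pods matched by the selector,
	// all with metrics, at 42% of a 50% target (ratio 0.84). The recommendation
	// is ceil(0.84 * 6) = 6, above the Deployment's current 3 replicas, even
	// though utilization is below target.
	fmt.Println(recommend(3, 0.84, 6, 0, 0.5, 0.1))
}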

Further details can be found in issues logged for Argo Rollouts here and here.

What did you expect to happen?

HPA should not scale up when utilization is below target (i.e. when utilizationRatio < 1).
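
For reference, the algorithm documented for the HPA is desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue); with a ratio below 1 (and outside the tolerance), that formula can never produce a desired count above the current one, which is what makes the observed behaviour surprising.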

How can we reproduce it (as minimally and precisely as possible)?

The issue can be reproduced by simulating the behaviour of Argo Rollouts and creating a 2nd ReplicaSet whose pods carry labels matching an existing Deployment's selector. For example:

Create base Deployment and HorizontalPodAutoscaler resources:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: test
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - image: nginx
          name: nginx
          resources:
            requests:
              memory: "10Mi"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx
  namespace: test
spec:
  maxReplicas: 50
  minReplicas: 3
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 50

Wait to ensure that metrics are available and HPA is stable at 3 pods.

$ kubectl get pods,hpa
NAME                         READY   STATUS    RESTARTS   AGE
pod/nginx-7fb4f6d65f-jgdzm   1/1     Running   0          30s
pod/nginx-7fb4f6d65f-mtnxv   1/1     Running   0          30s
pod/nginx-7fb4f6d65f-tc6c8   1/1     Running   0          30s

NAME                                        REFERENCE          TARGETS           MINPODS   MAXPODS   REPLICAS   AGE
horizontalpodautoscaler.autoscaling/nginx   Deployment/nginx   memory: 42%/50%   3         50        3          30s
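
At this point the numbers behave as expected: the usage ratio is 42% / 50% = 0.84, and with only the 3 Deployment pods matched by the selector the recommendation is ceil(0.84 * 3) = 3, so no rescale happens.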

Create a 2nd replica set:

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: nginx-canary
  namespace: test
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
      role: canary
  template:
    metadata:
      labels:
        app: nginx
        role: canary
    spec:
      containers:
        - image: nginx
          name: nginx
          resources:
            requests:
              memory: "10Mi"

Observe that HPA will scale the existing Deployment:

$ kubectl get pods,hpa                                                                                     
NAME                         READY   STATUS    RESTARTS   AGE
pod/nginx-7fb4f6d65f-2hh7f   1/1     Running   0          2m36s
pod/nginx-7fb4f6d65f-6lfr5   1/1     Running   0          2m6s
pod/nginx-7fb4f6d65f-9wgfn   1/1     Running   0          6s
pod/nginx-7fb4f6d65f-gpzp9   1/1     Running   0          2m6s
pod/nginx-7fb4f6d65f-hdsff   1/1     Running   0          3m6s
pod/nginx-7fb4f6d65f-j24kp   1/1     Running   0          3m6s
pod/nginx-7fb4f6d65f-jgdzm   1/1     Running   0          4m36s
pod/nginx-7fb4f6d65f-kxbvz   1/1     Running   0          66s
pod/nginx-7fb4f6d65f-l4rzr   1/1     Running   0          2m36s
pod/nginx-7fb4f6d65f-mtnxv   1/1     Running   0          4m36s
pod/nginx-7fb4f6d65f-qvpqx   1/1     Running   0          3m6s
pod/nginx-7fb4f6d65f-rq2r8   1/1     Running   0          96s
pod/nginx-7fb4f6d65f-s52hj   1/1     Running   0          66s
pod/nginx-7fb4f6d65f-tc6c8   1/1     Running   0          4m36s
pod/nginx-7fb4f6d65f-zwpfc   1/1     Running   0          36s
pod/nginx-canary-2prb6       1/1     Running   0          3m41s
pod/nginx-canary-9l5rc       1/1     Running   0          3m41s
pod/nginx-canary-g79tk       1/1     Running   0          3m41s

NAME                                        REFERENCE          TARGETS           MINPODS   MAXPODS   REPLICAS   AGE
horizontalpodautoscaler.autoscaling/nginx   Deployment/nginx   memory: 42%/50%   3         50        14         4m36s
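
This matches the calculation described above: with 3 Deployment pods plus 3 canary pods matched by the selector and a ratio of roughly 0.84, the recommendation becomes ceil(0.84 * 6) = 6, above the Deployment's current 3 replicas, so the HPA scales up. Each scale-up adds more Deployment pods to the matched set, so the recommendation keeps exceeding the current replica count and the HPA creeps towards maxReplicas even though the reported utilization never goes above the target.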

Check events on HPA to see details on why scaling occurred:

$ kubectl describe hpa nginx
Name:                                                     nginx
Namespace:                                                test
Labels:                                                   <none>
Annotations:                                              <none>
CreationTimestamp:                                        Mon, 16 Dec 2024 10:20:04 +0100
Reference:                                                Deployment/nginx
Metrics:                                                  ( current / target )
  resource memory on pods  (as a percentage of request):  42% (4503324444m) / 50%
Min replicas:                                             3
Max replicas:                                             50
Deployment pods:                                          8 current / 8 desired
Conditions:
  Type            Status  Reason              Message
  ----            ------  ------              -------
  AbleToScale     True    ReadyForNewScale    recommended size matches current size
  ScalingActive   True    ValidMetricFound    the HPA was able to successfully calculate a replica count from memory resource utilization (percentage of request)
  ScalingLimited  False   DesiredWithinRange  the desired count is within the acceptable range
Events:
  Type     Reason                   Age                    From                       Message
  ----     ------                   ----                   ----                       -------
  Warning  FailedGetResourceMetric  4m35s (x3 over 4m46s)  horizontal-pod-autoscaler  unable to get metric memory: no metrics returned from resource metrics API
  Normal   SuccessfulRescale        3m31s                  horizontal-pod-autoscaler  New size: 6; reason: memory resource utilization (percentage of request) above target
  Normal   SuccessfulRescale        3m1s                   horizontal-pod-autoscaler  New size: 8; reason: memory resource utilization (percentage of request) above target
  Normal   SuccessfulRescale        2m31s                  horizontal-pod-autoscaler  New size: 10; reason: memory resource utilization (percentage of request) above target
  Normal   SuccessfulRescale        2m1s                   horizontal-pod-autoscaler  New size: 11; reason: memory resource utilization (percentage of request) above target
  Normal   SuccessfulRescale        91s                    horizontal-pod-autoscaler  New size: 13; reason: memory resource utilization (percentage of request) above target
  Normal   SuccessfulRescale        61s                    horizontal-pod-autoscaler  New size: 14; reason: memory resource utilization (percentage of request) above target
  Normal   SuccessfulRescale        31s                    horizontal-pod-autoscaler  New size: 15; reason: memory resource utilization (percentage of request) above target
  Normal   SuccessfulRescale        1s                     horizontal-pod-autoscaler  New size: 16; reason: memory resource utilization (percentage of request) above target
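
Note that every SuccessfulRescale event claims the metric is "above target" even though the reported utilization (42%) never exceeded the 50% target, which is the misleading messaging mentioned above.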

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
Client Version: v1.31.3
Kustomize Version: v5.4.2
Server Version: v1.30.5-gke.1443001

Cloud provider

GKE

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@onematchfox added the kind/bug label Dec 16, 2024
@k8s-ci-robot added the needs-sig and needs-triage labels Dec 16, 2024
@k8s-ci-robot (Contributor)

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@onematchfox (Author)

Linking #120875 as this is essentially the same issue; albeit in this case, the duplicated selectors are intentional.

@onematchfox (Author)

/sig autoscaling

@k8s-ci-robot added the sig/autoscaling label and removed the needs-sig label Dec 16, 2024