add additional log & UT for HPA #127394

Open · wants to merge 1 commit into base: master

Conversation

kevinteng525

What type of PR is this?

/kind feature

What this PR does / why we need it:

We are encountering an issue in production where the Horizontal Pod Autoscaler (HPA) sometimes scales up later than expected when advanced policies are defined. This issue cannot be reproduced with a testing application using the same strategy, nor can it be consistently reproduced with the same production application at a later time.

This PR adds additional logging to the HPA for enhanced troubleshooting, helping diagnose intermittent scale-delay issues. The current logs lack sufficient detail, and the HPA status does not retain enough information for effective diagnosis. This enhancement will provide greater visibility into metrics and events during scaling, aiding in identifying and resolving the delay problem. Additionally, unit tests have been added to help reproduce and address the issue.

Which issue(s) this PR fixes:

N/A

Special notes for your reviewer:

This enhancement addresses the need for better diagnostics in production environments where the scaling delay cannot be reproduced in testing pools or via unit tests alone.

As normalizeDesiredReplicasWithBehaviors needs to consider past recommendations and scale-up/scale-down events, this PR logs all relevant data to aid in troubleshooting.
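
For context, the added lines follow the klog structured-logging style this controller already uses. Below is a minimal sketch of the shape of such a call, based on the diff fragments quoted later in this thread; it is an illustration, not the exact committed code, and the variables are assumed to already be in scope in reconcileAutoscaler:

// Sketch only: mirrors the shape of the structured log call touched by this PR.
// desiredReplicas, currentReplicas, metricName, metricTimestamp, and reference
// are assumed to be in scope, as in the surrounding controller code.
logger := klog.FromContext(ctx)
logger.V(4).Info("Proposing desired replicas",
	"desiredReplicas", desiredReplicas,
	"currentReplicas", currentReplicas,
	"metric", metricName,
	"timestamp", metricTimestamp,
	"scaleTarget", reference)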

Before this PR
  • With log verbosity 2:
I0916 10:35:42.589751       7 horizontal.go:893] "Successfully rescaled" HPA="default/keda-hpa-nginx-scaledobject" scaleTarget="Deployment/default/nginx-deployment" currentReplicas=3 desiredReplicas=4 reason="external metric s0-prometheus-CPU(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: nginx-scaledobject,},MatchExpressions:[]LabelSelectorRequirement{},}) above target"
I0916 10:35:42.602209       7 horizontal.go:1423] "Successfully updated status" HPA="default/keda-hpa-nginx-scaledobject"
I0916 10:36:27.213778       7 horizontal.go:1423] "Successfully updated status" HPA="default/keda-hpa-nginx-scaledobject"
I0916 10:36:57.357731       7 horizontal.go:1423] "Successfully updated status" HPA="default/keda-hpa-nginx-scaledobject"
I0916 10:37:27.812401       7 horizontal.go:1423] "Successfully updated status" HPA="default/keda-hpa-nginx-scaledobject"
After this PR
  • With log verbosity 2:
I0916 10:35:42.581074       7 horizontal.go:851] "Proposing desired replicas" desiredReplicas=6 currentReplicas=3 metric="external metric s0-prometheus-CPU(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: nginx-scaledobject,},MatchExpressions:[]LabelSelectorRequirement{},})" timestamp="0001-01-01 00:00:00 +0000 UTC" scaleTarget="Deployment/default/nginx-deployment" metrics=[{"type":"External","external":{"metric":{"name":"s0-prometheus-CPU","selector":{"matchLabels":{"scaledobject.keda.sh/name":"nginx-scaledobject"}}},"current":{"value":"197m"}}},{"type":"External","external":{"metric":{"name":"s1-prometheus-TEST","selector":{"matchLabels":{"scaledobject.keda.sh/name":"nginx-scaledobject"}}},"current":{"value":"197m"}}},{"type":"External","external":{"metric":{"name":"s2-prometheus-Thread","selector":{"matchLabels":{"scaledobject.keda.sh/name":"nginx-scaledobject"}}},"current":{"value":"0"}}}]
I0916 10:35:42.589751       7 horizontal.go:893] "Successfully rescaled" HPA="default/keda-hpa-nginx-scaledobject" scaleTarget="Deployment/default/nginx-deployment" currentReplicas=3 desiredReplicas=4 reason="external metric s0-prometheus-CPU(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: nginx-scaledobject,},MatchExpressions:[]LabelSelectorRequirement{},}) above target"
I0916 10:35:42.602209       7 horizontal.go:1423] "Successfully updated status" HPA="default/keda-hpa-nginx-scaledobject"
I0916 10:35:57.785436       7 horizontal.go:851] "Proposing desired replicas" desiredReplicas=8 currentReplicas=4 metric="external metric s0-prometheus-CPU(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: nginx-scaledobject,},MatchExpressions:[]LabelSelectorRequirement{},})" timestamp="0001-01-01 00:00:00 +0000 UTC" scaleTarget="Deployment/default/nginx-deployment" metrics=[{"type":"External","external":{"metric":{"name":"s0-prometheus-CPU","selector":{"matchLabels":{"scaledobject.keda.sh/name":"nginx-scaledobject"}}},"current":{"value":"197m"}}},{"type":"External","external":{"metric":{"name":"s1-prometheus-TEST","selector":{"matchLabels":{"scaledobject.keda.sh/name":"nginx-scaledobject"}}},"current":{"value":"197m"}}},{"type":"External","external":{"metric":{"name":"s2-prometheus-Thread","selector":{"matchLabels":{"scaledobject.keda.sh/name":"nginx-scaledobject"}}},"current":{"value":"0"}}}]
I0916 10:35:57.785600       7 horizontal.go:906] "Decided not to scale" scaleTarget="Deployment/default/nginx-deployment" desiredReplicas=4 lastScaleTime="2024-09-16 10:35:42 +0000 UTC"
I0916 10:35:57.793778       7 horizontal.go:1423] "Successfully updated status" HPA="default/keda-hpa-nginx-scaledobject"
  • With log verbosity 4:
I0916 10:35:42.581074       7 horizontal.go:851] "Proposing desired replicas" desiredReplicas=6 currentReplicas=3 metric="external metric s0-prometheus-CPU(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: nginx-scaledobject,},MatchExpressions:[]LabelSelectorRequirement{},})" timestamp="0001-01-01 00:00:00 +0000 UTC" scaleTarget="Deployment/default/nginx-deployment" metrics=[{"type":"External","external":{"metric":{"name":"s0-prometheus-CPU","selector":{"matchLabels":{"scaledobject.keda.sh/name":"nginx-scaledobject"}}},"current":{"value":"197m"}}},{"type":"External","external":{"metric":{"name":"s1-prometheus-TEST","selector":{"matchLabels":{"scaledobject.keda.sh/name":"nginx-scaledobject"}}},"current":{"value":"197m"}}},{"type":"External","external":{"metric":{"name":"s2-prometheus-Thread","selector":{"matchLabels":{"scaledobject.keda.sh/name":"nginx-scaledobject"}}},"current":{"value":"0"}}}]
I0916 10:35:42.581358       7 horizontal.go:1155] Stabilizing recommendation: key=default/keda-hpa-nginx-scaledobject, currentReplicas=3, desiredReplicas=6, upRecommendation=6, downRecommendation=6, recommendations=[1 4 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6]
I0916 10:35:42.581414       7 horizontal.go:1011] Normalized desired replicas with behaviors - after stabilized recommendation: scaleTarget=Deployment/default/nginx-deployment, currentReplicas=3, minReplicas=3, stabilizedRecommendation=6, prenormalizedDesiredReplicas=6, reason=ScaleUpStabilized, message=recent recommendations were lower than current one, applying the lowest recent recommendation
I0916 10:35:42.581426       7 horizontal.go:1186] Converting desired replicas with behavior rate - scale up: key=default/keda-hpa-nginx-scaledobject, currentReplicas=3, desiredReplicas=6, scaleUpEvents=[{2 {13957253141225347267 1910484270720 0x74b4f40} false}], scaleDownEvents=[]
I0916 10:35:42.581442       7 horizontal.go:1021] Normalized desired replicas with behaviors - after rated recommendation: scaleTarget=Deployment/default/nginx-deployment, currentReplicas=3, minReplicas=3, stabilizedRecommendation=6, desiredReplicas=4, reason=ScaleUpLimit, message=the desired replica count is increasing faster than the maximum scale rate
I0916 10:35:42.589751       7 horizontal.go:893] "Successfully rescaled" HPA="default/keda-hpa-nginx-scaledobject" scaleTarget="Deployment/default/nginx-deployment" currentReplicas=3 desiredReplicas=4 reason="external metric s0-prometheus-CPU(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: nginx-scaledobject,},MatchExpressions:[]LabelSelectorRequirement{},}) above target"
I0916 10:35:42.602209       7 horizontal.go:1423] "Successfully updated status" HPA="default/keda-hpa-nginx-scaledobject"
I0916 10:35:57.785436       7 horizontal.go:851] "Proposing desired replicas" desiredReplicas=8 currentReplicas=4 metric="external metric s0-prometheus-CPU(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: nginx-scaledobject,},MatchExpressions:[]LabelSelectorRequirement{},})" timestamp="0001-01-01 00:00:00 +0000 UTC" scaleTarget="Deployment/default/nginx-deployment" metrics=[{"type":"External","external":{"metric":{"name":"s0-prometheus-CPU","selector":{"matchLabels":{"scaledobject.keda.sh/name":"nginx-scaledobject"}}},"current":{"value":"197m"}}},{"type":"External","external":{"metric":{"name":"s1-prometheus-TEST","selector":{"matchLabels":{"scaledobject.keda.sh/name":"nginx-scaledobject"}}},"current":{"value":"197m"}}},{"type":"External","external":{"metric":{"name":"s2-prometheus-Thread","selector":{"matchLabels":{"scaledobject.keda.sh/name":"nginx-scaledobject"}}},"current":{"value":"0"}}}]
I0916 10:35:57.785527       7 horizontal.go:1155] Stabilizing recommendation: key=default/keda-hpa-nginx-scaledobject, currentReplicas=4, desiredReplicas=8, upRecommendation=6, downRecommendation=8, recommendations=[1 4 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6]
I0916 10:35:57.785549       7 horizontal.go:1011] Normalized desired replicas with behaviors - after stabilized recommendation: scaleTarget=Deployment/default/nginx-deployment, currentReplicas=4, minReplicas=3, stabilizedRecommendation=6, prenormalizedDesiredReplicas=8, reason=ScaleUpStabilized, message=recent recommendations were lower than current one, applying the lowest recent recommendation
I0916 10:35:57.785557       7 horizontal.go:1186] Converting desired replicas with behavior rate - scale up: key=default/keda-hpa-nginx-scaledobject, currentReplicas=4, desiredReplicas=6, scaleUpEvents=[{1 {13957253790305771122 2515024632876 0x74b4f40} false}], scaleDownEvents=[]
I0916 10:35:57.785582       7 horizontal.go:1021] Normalized desired replicas with behaviors - after rated recommendation: scaleTarget=Deployment/default/nginx-deployment, currentReplicas=4, minReplicas=3, stabilizedRecommendation=6, desiredReplicas=4, reason=ScaleUpLimit, message=the desired replica count is increasing faster than the maximum scale rate
I0916 10:35:57.785600       7 horizontal.go:906] "Decided not to scale" scaleTarget="Deployment/default/nginx-deployment" desiredReplicas=4 lastScaleTime="2024-09-16 10:35:42 +0000 UTC"
I0916 10:35:57.793778       7 horizontal.go:1423] "Successfully updated status" HPA="default/keda-hpa-nginx-scaledobject"
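
Reading the verbosity-4 lines above: stabilization first clamps the raw recommendation (8) down to the lowest recent recommendation (6, ScaleUpStabilized), then the scale-up rate limiter caps the result at 4 (ScaleUpLimit); since the workload is already at 4 replicas, the controller logs "Decided not to scale". This intermediate decision chain is exactly what the pre-PR logs did not expose.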

Does this PR introduce a user-facing change?

No user-facing change.

Added enhanced logging to HPA for better diagnostics of scaling delays when advanced policies are defined.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Sep 16, 2024
@k8s-ci-robot
Contributor

Hi @kevinteng525. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 16, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kevinteng525
Once this PR has been reviewed and has the lgtm label, please assign mwielgus for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@@ -843,11 +843,13 @@ func (a *HorizontalController) reconcileAutoscaler(ctx context.Context, hpaShare
 		retErr = err
 	}
 
-	logger.V(4).Info("Proposing desired replicas",
+	logger.V(2).Info("Proposing desired replicas",
Member

Why do we need to change the original log level? If we need to troubleshoot a problem, we should increase the verbosity of the controller-manager logs instead of changing the level in the code.
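
(Raising controller-manager verbosity uses the standard klog --v flag, e.g. running kube-controller-manager with --v=4; how that flag is set depends on how the control plane is deployed.)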

Author

Changed the log level back to 4, thanks!

@pacoxu
Member

pacoxu commented Oct 10, 2024

/test pull-kubernetes-unit

1 similar comment
@kevinteng525
Author

/test pull-kubernetes-unit

@k8s-ci-robot
Contributor

@kevinteng525: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/test pull-kubernetes-unit

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@pacoxu
Member

pacoxu commented Oct 12, 2024

/ok-to-test

/test pull-kubernetes-unit

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 12, 2024
@kevinteng525 kevinteng525 force-pushed the master branch 2 times, most recently from b1f3879 to abce1d8 Compare October 13, 2024 08:13
@kevinteng525 kevinteng525 force-pushed the master branch 2 times, most recently from b060f39 to f56faef Compare October 20, 2024 08:47
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 20, 2024
@pacoxu
Member

pacoxu commented Nov 4, 2024

/cc @mengjiao-liu @gjtempleton
/priority important-longterm
/triage accepted
/assign @MaciekPytel

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Nov 4, 2024
@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 4, 2024
@mengjiao-liu
Member

/wg structured-logging
/area logging
/area test

@k8s-ci-robot k8s-ci-robot added wg/structured-logging Categorizes an issue or PR as relevant to WG Structured Logging. area/logging area/test labels Nov 4, 2024
"metric", metricName,
"timestamp", metricTimestamp,
"scaleTarget", reference)
"scaleTarget", reference,
Member

I checked and found that the modified logs below also use reference := fmt.Sprintf("%s/%s/%s", hpa.Spec.ScaleTargetRef.Kind, hpa.Namespace, hpa.Spec.ScaleTargetRef.Name) many times. You could apply the following pattern once at the top level of the call (perhaps below line 815), and the logs below would then also carry this key-value pair.

If other key-value pairs are used multiple times, you can also add them here.

if loggerV := logger.V(4); loggerV.Enabled() {
	// Attach the shared key-value pairs once; every subsequent log call
	// made through this logger (or through a ctx derived from it) will
	// include them automatically.
	logger = klog.LoggerWithValues(
		logger,
		"scaleTarget", reference,
		"currentReplicas", currentReplicas,
		...
	)
	ctx = klog.NewContext(ctx, logger)
}
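
(With this pattern, any downstream code that obtains its logger via klog.FromContext(ctx) inherits the attached key-value pairs automatically; that is standard klog/v2 contextual logging, not something specific to this PR.)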

Author

I think it's only for logging, and it's only used three times within this method; the rest are used in other methods. Switching to loggerV with V(4) might add complexity and reduce readability, and it also can't be reused at other verbosity levels. I'd suggest we keep it simple.

Member

It makes sense.

pkg/controller/podautoscaler/horizontal_test.go: two outdated review threads (resolved)
add additional log & UT for HPA
@mengjiao-liu (Member) left a comment

/lgtm
From the WG Structured Logging side.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 22, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: a96248aab189d27d8f3d9b60f6582f9f30f1fd01
