[clusteragent/autoscaling] Implement stabilization for horizontal recommendations #31547

jennchenn · 2024-11-27T22:20:39Z

What does this PR do?

Implements stabilization for horizontal recommendations. The algorithm follows the one the Kubernetes Horizontal Pod Autoscaler (HPA) uses for its stabilization window.

Motivation

We want to prevent frequent scaling actions from being applied when recommendations flap.

Describe how to test/QA your changes

Set up autoscaling
Configure upscale/downscale stabilization, e.g.

apiVersion: datadoghq.com/v1alpha1
kind: DatadogPodAutoscaler
metadata:
  name: better-cyclic-burner-query
spec:
  ...
  policy:
    downscale:
      stabilizationWindowSeconds: 300

Check that horizontal recommendations are limited by stabilization, for example when recommendations are flapping.

Possible Drawbacks / Trade-offs

Additional Notes

Relies on the changes in DataDog/datadog-operator#1519.

agent-platform-auto-pr · 2024-11-27T22:21:38Z

Test changes on VM

Use this command from test-infra-definitions to manually test this PR's changes on a VM:

inv aws.create-vm --pipeline-id=51624624 --os-family=ubuntu

Note: This applies to commit 1e6acbe

jennchenn · 2024-11-27T22:36:55Z

go.mod

@@ -160,7 +160,7 @@ require (
 	github.com/DataDog/datadog-agent/pkg/util/pointer v0.59.0
 	github.com/DataDog/datadog-agent/pkg/util/scrubber v0.59.0
 	github.com/DataDog/datadog-go/v5 v5.5.0
-	github.com/DataDog/datadog-operator v0.7.1-0.20241024104907-734366f3c0d1
+	github.com/DataDog/datadog-operator v0.7.1-0.20241111183642-43cd97e856a5


Temporary: this pins the operator to the commit from DataDog/datadog-operator#1519.

cit-pr-commenter · 2024-11-27T22:51:47Z

Regression Detector

Regression Detector Results

Run ID: 03a4ef77-a220-42ec-955f-10d4194315d8

Baseline: 3763407
Comparison: 1e6acbe
Diff

Optimization Goals: ✅ No significant changes detected

Fine details of change detection per experiment

perf	experiment	goal	Δ mean %	Δ mean % CI	trials	links
➖	uds_dogstatsd_to_api_cpu	% cpu utilization	+0.55	[-0.13, +1.24]	1	Logs
➖	quality_gate_idle	memory utilization	+0.47	[+0.44, +0.50]	1	Logs bounds checks dashboard
➖	file_to_blackhole_500ms_latency	egress throughput	+0.26	[-0.52, +1.03]	1	Logs
➖	file_to_blackhole_0ms_latency_http1	egress throughput	+0.15	[-0.75, +1.05]	1	Logs
➖	file_to_blackhole_300ms_latency	egress throughput	+0.04	[-0.62, +0.69]	1	Logs
➖	file_to_blackhole_100ms_latency	egress throughput	+0.01	[-0.68, +0.70]	1	Logs
➖	file_to_blackhole_0ms_latency	egress throughput	+0.00	[-0.91, +0.92]	1	Logs
➖	tcp_dd_logs_filter_exclude	ingress throughput	-0.01	[-0.02, +0.01]	1	Logs
➖	uds_dogstatsd_to_api	ingress throughput	-0.01	[-0.13, +0.11]	1	Logs
➖	file_to_blackhole_0ms_latency_http2	egress throughput	-0.01	[-0.85, +0.83]	1	Logs
➖	quality_gate_idle_all_features	memory utilization	-0.06	[-0.14, +0.02]	1	Logs bounds checks dashboard
➖	file_to_blackhole_1000ms_latency_linear_load	egress throughput	-0.07	[-0.54, +0.40]	1	Logs
➖	file_tree	memory utilization	-0.21	[-0.34, -0.08]	1	Logs
➖	file_to_blackhole_1000ms_latency	egress throughput	-0.34	[-1.13, +0.46]	1	Logs
➖	tcp_syslog_to_blackhole	ingress throughput	-1.27	[-1.34, -1.21]	1	Logs
➖	otel_to_otel_logs	ingress throughput	-1.67	[-2.36, -0.98]	1	Logs
➖	quality_gate_logs	% cpu utilization	-3.06	[-6.25, +0.13]	1	Logs

Bounds Checks: ✅ Passed

perf	experiment	bounds_check_name	replicates_passed	links
✅	file_to_blackhole_0ms_latency	lost_bytes	10/10
✅	file_to_blackhole_0ms_latency	memory_usage	10/10
✅	file_to_blackhole_0ms_latency_http1	lost_bytes	10/10
✅	file_to_blackhole_0ms_latency_http1	memory_usage	10/10
✅	file_to_blackhole_0ms_latency_http2	lost_bytes	10/10
✅	file_to_blackhole_0ms_latency_http2	memory_usage	10/10
✅	file_to_blackhole_1000ms_latency	memory_usage	10/10
✅	file_to_blackhole_1000ms_latency_linear_load	memory_usage	10/10
✅	file_to_blackhole_100ms_latency	lost_bytes	10/10
✅	file_to_blackhole_100ms_latency	memory_usage	10/10
✅	file_to_blackhole_300ms_latency	lost_bytes	10/10
✅	file_to_blackhole_300ms_latency	memory_usage	10/10
✅	file_to_blackhole_500ms_latency	lost_bytes	10/10
✅	file_to_blackhole_500ms_latency	memory_usage	10/10
✅	quality_gate_idle	memory_usage	10/10	bounds checks dashboard
✅	quality_gate_idle_all_features	memory_usage	10/10	bounds checks dashboard
✅	quality_gate_logs	lost_bytes	10/10
✅	quality_gate_logs	memory_usage	10/10

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

✅ = significantly better comparison variant performance
❌ = significantly worse comparison variant performance
➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
Its configuration does not mark it "erratic".
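
The three criteria above can be read as a single predicate. The sketch below is an illustration only and not the detector's actual code; the `experiment` type, its field names and `isRegression` are hypothetical, and the `erratic` flag is assumed to be false for the rows shown because the report does not list it.

```go
package main

import (
	"fmt"
	"math"
)

// experiment is a hypothetical summary of one row of the report.
type experiment struct {
	deltaMeanPct float64 // estimated Δ mean %
	ciLow        float64 // lower bound of the Δ mean % CI
	ciHigh       float64 // upper bound of the Δ mean % CI
	erratic      bool    // assumed configuration flag
}

// isRegression applies the three stated criteria: the effect is large enough,
// the confidence interval excludes zero, and the experiment is not erratic.
func isRegression(e experiment, tolerancePct float64) bool {
	bigEnough := math.Abs(e.deltaMeanPct) >= tolerancePct
	excludesZero := e.ciLow > 0 || e.ciHigh < 0
	return bigEnough && excludesZero && !e.erratic
}

func main() {
	// quality_gate_logs from the table: below tolerance and CI contains zero.
	logs := experiment{deltaMeanPct: -3.06, ciLow: -6.25, ciHigh: 0.13}
	// tcp_syslog_to_blackhole: CI excludes zero, but the effect is under 5%.
	syslog := experiment{deltaMeanPct: -1.27, ciLow: -1.34, ciHigh: -1.21}
	fmt.Println(isRegression(logs, 5.0), isRegression(syslog, 5.0))
}
```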

CI Pass/Fail Decision

✅ Passed. All Quality Gates passed.

quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.

vboulineau · 2024-12-04T18:30:05Z

pkg/clusteragent/autoscaling/workload/controller_horizontal.go

@@ -204,6 +204,30 @@ func (hr *horizontalController) computeScaleAction(
 		return nil, 0, errors.New(reason)
 	}

+	var evalAfter time.Duration


Is there a reason why this happens so early in the flow? I would expect stabilization to run after the outsideBoundaries check and after clamping targetDesiredReplicas between min and max.

vboulineau · 2024-12-04T18:33:02Z

pkg/clusteragent/autoscaling/workload/controller_horizontal.go

+		return originalTargetDesiredReplicas, limitReason
+	}
+
+	upRecommendation := originalTargetDesiredReplicas


Could we use the scaleDirection to do only the necessary calculation?

agent-platform-auto-pr · 2024-12-10T13:22:39Z

Package size comparison

Comparison with ancestor 3911e67300941f3b1a6910a1883f13bd28e403f3

Diff per package

package	diff	status	size	ancestor	threshold
datadog-agent-amd64-deb	0.00MB	✅	1270.67MB	1270.67MB	140.00MB
datadog-iot-agent-amd64-deb	0.00MB	✅	113.20MB	113.20MB	10.00MB
datadog-dogstatsd-amd64-deb	0.00MB	✅	78.32MB	78.32MB	10.00MB
datadog-heroku-agent-amd64-deb	0.00MB	✅	526.45MB	526.45MB	70.00MB
datadog-agent-x86_64-rpm	0.00MB	⚠️	1279.91MB	1279.91MB	140.00MB
datadog-agent-x86_64-suse	0.00MB	⚠️	1279.91MB	1279.91MB	140.00MB
datadog-iot-agent-x86_64-rpm	0.00MB	✅	113.26MB	113.26MB	10.00MB
datadog-iot-agent-x86_64-suse	0.00MB	✅	113.26MB	113.26MB	10.00MB
datadog-dogstatsd-x86_64-rpm	0.00MB	✅	78.40MB	78.40MB	10.00MB
datadog-dogstatsd-x86_64-suse	0.00MB	✅	78.40MB	78.40MB	10.00MB
datadog-agent-arm64-deb	0.00MB	✅	1004.85MB	1004.85MB	140.00MB
datadog-iot-agent-arm64-deb	0.00MB	✅	108.67MB	108.67MB	10.00MB
datadog-dogstatsd-arm64-deb	0.00MB	✅	55.59MB	55.59MB	10.00MB
datadog-agent-aarch64-rpm	0.00MB	✅	1014.06MB	1014.06MB	140.00MB
datadog-iot-agent-aarch64-rpm	0.00MB	✅	108.74MB	108.74MB	10.00MB

Decision

⚠️ Warning

agent-platform-auto-pr · 2024-12-19T23:42:08Z

Uncompressed package size comparison

Comparison with ancestor 37634072805c45b57216ee880e06e380257056e7

Diff per package

package	diff	status	size	ancestor	threshold
datadog-agent-amd64-deb	0.01MB	⚠️	1187.90MB	1187.89MB	140.00MB
datadog-agent-x86_64-rpm	0.01MB	⚠️	1197.16MB	1197.15MB	140.00MB
datadog-agent-x86_64-suse	0.01MB	⚠️	1197.16MB	1197.15MB	140.00MB
datadog-agent-aarch64-rpm	0.01MB	⚠️	943.12MB	943.11MB	140.00MB
datadog-agent-arm64-deb	0.01MB	⚠️	933.88MB	933.87MB	140.00MB
datadog-dogstatsd-amd64-deb	0.00MB	✅	78.57MB	78.57MB	10.00MB
datadog-dogstatsd-x86_64-rpm	0.00MB	✅	78.64MB	78.64MB	10.00MB
datadog-dogstatsd-x86_64-suse	0.00MB	✅	78.64MB	78.64MB	10.00MB
datadog-dogstatsd-arm64-deb	0.00MB	✅	55.77MB	55.77MB	10.00MB
datadog-heroku-agent-amd64-deb	0.00MB	✅	504.86MB	504.86MB	70.00MB
datadog-iot-agent-amd64-deb	0.00MB	✅	113.32MB	113.32MB	10.00MB
datadog-iot-agent-x86_64-rpm	0.00MB	✅	113.39MB	113.39MB	10.00MB
datadog-iot-agent-x86_64-suse	0.00MB	✅	113.39MB	113.39MB	10.00MB
datadog-iot-agent-arm64-deb	0.00MB	✅	108.79MB	108.79MB	10.00MB
datadog-iot-agent-aarch64-rpm	0.00MB	✅	108.86MB	108.86MB	10.00MB

Decision

⚠️ Warning

jennchenn · 2024-12-20T15:35:45Z

/merge

dd-devflow · 2024-12-20T15:35:53Z

Devflow running: `/merge`

View all feedbacks in Devflow UI.

2024-12-20 15:35:53 UTC ℹ️ MergeQueue: pull request added to the queue

The median merge time in main is 34m.

2024-12-20 16:08:05 UTC ℹ️ MergeQueue: This merge request was merged

jennchenn added 6 commits November 27, 2024 21:31

Implement basic stabilization algorithm

8d25837

Add stabilization algorithm basic unit test

88b4a40

Update horizontal event retention to account for stabilization window

c6c4906

Add unit tests for scaling with stabilization

49daae0

Pin operator version to pull spec changes

3d91206

Add tests for horizontal retention calculation

ba1cdf9

jennchenn added the team/containers, changelog/no-changelog, qa/done, and component/autoscaling labels Nov 27, 2024

jennchenn requested a review from a team as a code owner November 27, 2024 22:20

github-actions bot added the medium review label Nov 27, 2024

jennchenn commented Nov 27, 2024

vboulineau reviewed Dec 4, 2024

jennchenn added 3 commits December 9, 2024 14:42

Apply stabilization after checking constraints and scaling rules

960e0cb

Use scale direction to check which actions to compare to

0b09096

Merge remote-tracking branch 'origin/main' into jenn/CASCL-61_impleme…

58ab15d

…nt-horizontal-stabilization

jennchenn requested a review from vboulineau December 16, 2024 21:21

vboulineau approved these changes Dec 19, 2024

jennchenn added 3 commits December 19, 2024 20:26

Merge remote-tracking branch 'origin/main' into jenn/CASCL-61_impleme…

bda4790

…nt-horizontal-stabilization

Update operator commit reference

53a8f1d

fixup! Update operator commit reference

1e6acbe

dd-mergequeue bot merged commit 81efa42 into main Dec 20, 2024
230 checks passed

dd-mergequeue bot deleted the jenn/CASCL-61_implement-horizontal-stabilization branch December 20, 2024 16:08

github-actions bot added this to the 7.62.0 milestone Dec 20, 2024

louis-cqrl pushed a commit that referenced this pull request Dec 25, 2024

[clusteragent/autoscaling] Implement stabilization for horizontal rec…

64b2c4a

…ommendations (#31547)
