
TestAutoscaleSustaining scales to 8 instead of 10 #13679

Merged: 4 commits merged into knative:main on Feb 23, 2023

Conversation

mgencur (Contributor) commented Feb 7, 2023

There's an indication that the client node sending the requests is not able to generate enough load to actually scale the ksvc to 10. The test runs the "vegeta" tool and configures the number of workers that send requests. However, when the client machine doesn't have enough resources (possibly just 2 vCPUs, as with KinD) or other tests are running in parallel, the worker threads cannot generate enough traffic to scale the ksvc as desired.
We ran into this issue downstream as well; there were no errors on the Knative (cluster) side. Increasing the CPU resources for the client machine resolved the issue, and it doesn't happen anymore.
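For reference, a minimal sketch of how a vegeta-based load generator configures parallel workers (illustrative only, assuming the vegeta Go library; this is not the exact test helper used in knative/serving, and the target URL is hypothetical):

```go
package main

import (
	"time"

	vegeta "github.com/tsenart/vegeta/v12/lib"
)

func main() {
	// Each worker is a goroutine issuing requests. On a small client machine
	// (e.g. a 2-vCPU KinD host), there may not be enough CPU to keep all
	// workers busy, so the generated load stays below the intended rate.
	targeter := vegeta.NewStaticTargeter(vegeta.Target{
		Method: "GET",
		URL:    "http://example-ksvc.default.example.com", // hypothetical ksvc URL
	})
	attacker := vegeta.NewAttacker(vegeta.Workers(10))
	rate := vegeta.Rate{Freq: 100, Per: time.Second}

	var metrics vegeta.Metrics
	for res := range attacker.Attack(targeter, rate, 2*time.Minute, "sustain") {
		metrics.Add(res)
	}
	metrics.Close()
}
```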

Fixes #13049

Proposed Changes

  • Decrease the target scale for TestAutoscaleSustaining. Choose a reasonable default that works in KinD, in the Prow cluster, and also when running the test in a container.

I considered keeping the default of 10 for Prow and using 8 only when GOMAXPROCS is lower than 10, but that is complicated and unnecessary. Also, GOMAXPROCS by itself doesn't work well in a container, where it reports the number of CPUs of the whole node rather than the container's CPU limit.
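As a quick illustration of why GOMAXPROCS is an unreliable signal inside a container (a sketch only; without cgroup-aware tuning such as uber-go/automaxprocs, both values below typically reflect the host's CPU count):

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Inside a container, these usually report the node's CPU count,
	// not the container's cgroup CPU limit, so they cannot reliably
	// detect a resource-constrained client environment.
	fmt.Println("NumCPU:     ", runtime.NumCPU())
	fmt.Println("GOMAXPROCS: ", runtime.GOMAXPROCS(0)) // 0 queries the value without changing it
}
```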

The KinD tests ran 3 times in this PR without issues.

Release Note


There's an indication that the client node that sends the requests is not
able to generate enough load to actually scale the ksvc to 10. Trying to
find a lower bar for KinD tests.
@knative-prow bot added the do-not-merge/work-in-progress and area/test-and-release labels (Feb 7, 2023).
@knative-prow bot added the size/XS label (Feb 7, 2023).
codecov bot commented Feb 7, 2023

Codecov Report

Base: 86.24% // Head: 86.21% // Decreases project coverage by 0.03% ⚠️

Coverage data is based on head (b13a8fe) compared to base (0639c5f).
Patch has no changes to coverable lines.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #13679      +/-   ##
==========================================
- Coverage   86.24%   86.21%   -0.03%     
==========================================
  Files         197      197              
  Lines       14783    14774       -9     
==========================================
- Hits        12749    12737      -12     
- Misses       1733     1735       +2     
- Partials      301      302       +1     
Impacted Files Coverage Δ
pkg/reconciler/configuration/configuration.go 82.93% <0.00%> (-1.43%) ⬇️
pkg/reconciler/route/resources/ingress.go 94.80% <0.00%> (-0.20%) ⬇️


mgencur (Contributor, Author) commented Feb 7, 2023

Tests passed, run 1

@mgencur changed the title from "[WIP] TestAutoscaleSustaining scales to 8 instead of 10" to "TestAutoscaleSustaining scales to 8 instead of 10" (Feb 8, 2023).
@knative-prow bot removed the do-not-merge/work-in-progress label (Feb 8, 2023).
@@ -143,7 +138,7 @@ func TestAutoscaleSustaining(t *testing.T) {
 	}))
 	test.EnsureTearDown(t, ctx.Clients(), ctx.Names())
 
-	AssertAutoscaleUpToNumPods(ctx, 1, 10, time.After(2*time.Minute), false /* quick */)
+	AssertAutoscaleUpToNumPods(ctx, 1, 8, time.After(2*time.Minute), false /* quick */)
skonto (Contributor) commented Feb 8, 2023

If flakiness affects only the exponential case (not linear), we could adjust the target only for that case, to distinguish between the two algorithms under the "same" generated traffic. That assumes the flakiness is not the result of a bug in the statistics aggregation or some other failure. Thinking out loud about the latter: AFAIK the exponential algorithm favors the latest values in the window statistics, so I would expect it to catch up faster (compared to linear) if enough traffic is there? 🤔 cc @dprotaso @psschwei
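For intuition, a rough sketch (not the Knative autoscaler's actual aggregation code) of why a weighted-exponential average over the metric window tracks recent traffic changes faster than a plain linear mean; the window values, decay factor, and function names are made up for illustration:

```go
package main

import "fmt"

// linearAvg is the plain mean over the whole window.
func linearAvg(window []float64) float64 {
	sum := 0.0
	for _, v := range window {
		sum += v
	}
	return sum / float64(len(window))
}

// expWeightedAvg weights newer buckets more heavily: each older bucket's
// weight is multiplied by decay (0 < decay < 1).
func expWeightedAvg(window []float64, decay float64) float64 {
	weight, sum, totalWeight := 1.0, 0.0, 0.0
	for i := len(window) - 1; i >= 0; i-- { // newest bucket is last
		sum += window[i] * weight
		totalWeight += weight
		weight *= decay
	}
	return sum / totalWeight
}

func main() {
	// Traffic ramps up only in the most recent buckets.
	window := []float64{2, 2, 2, 2, 10, 10}
	fmt.Printf("linear:      %.2f\n", linearAvg(window))           // ~4.67, slow to react
	fmt.Printf("exponential: %.2f\n", expWeightedAvg(window, 0.5)) // ~8.10, closer to the new level
}
```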

mgencur (Contributor, Author) commented Feb 8, 2023

Hmm, I'd probably prefer setting the same target for both, because we're adjusting this for an environment that cannot achieve what the test requests (the test asks for a certain number of workers running in parallel).
For environments that can satisfy the test requirements, a different target isn't necessary, so it would complicate the test settings (and code) for the more common case as well. But I'm not sure.

mgencur (Contributor, Author) commented Feb 8, 2023

> afaik exponential favors latest values in the window statistics

In that case, would frequent changes in traffic also cause more frequent scaling up/down with the exponential algorithm than with the linear one? Perhaps insufficient resources on the client side make the traffic less stable, and the exponential algorithm would then react more quickly, scaling the ksvc up and down.

Contributor commented:

I wonder if it would be better to set different targets depending on environment, using the short flag to distinguish the constrained one (i.e. something like `targetPods := 10; if testing.Short() { targetPods = 8 }`)? That said, I'm not sure if there was a specific reason why we chose 10 to begin with or if it was just a nice round number.

(The weird thing is that these tests used to work on Kind too. If I remember correctly, it was right around the time they switched over to the systemd cgroups driver that things stopped working. But that's neither here nor there...)
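A minimal sketch of the pattern suggested above (illustrative only; `targetPods` is a hypothetical variable and the exact signature of AssertAutoscaleUpToNumPods is assumed, so this is not the change that was merged):

```go
	// Default target for full-size environments such as the Prow cluster.
	targetPods := 10.0
	if testing.Short() {
		// -short marks a resource-constrained environment such as KinD.
		targetPods = 8.0
	}
	AssertAutoscaleUpToNumPods(ctx, 1, targetPods, time.After(2*time.Minute), false /* quick */)
```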

mgencur (Contributor, Author) commented:

> I wonder if it would be better to set different targets depending on environment, using the short flag to distinguish the constrained one (i.e. something like `targetPods := 10; if testing.Short() { targetPods = 8 }`)?

That's an option too, but 10 versus 8 doesn't make a meaningful difference to me. As you said, it's not clear where the value 10 came from, so I chose one common value (8) to simplify the test and keep it identical across all environments.

Contributor commented:

I think it's probably fine to just go with 8 (I don't know what scaling up two more pods would really tell us, other than we can handle double digits), but will give @dprotaso a chance to weigh in in case he has more context here.

Can you also drop the -short flag here:

? We added that just for this test, so we may as well get rid of it if we're fixing the issue that caused us to add it.

Contributor commented:

@dprotaso gentle ping.

psschwei (Contributor) left a comment

Sorry, should've gotten back to this one sooner...

/lgtm
/approve

@knative-prow bot added the lgtm label (Feb 23, 2023).
knative-prow bot commented Feb 23, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mgencur, psschwei

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow bot added the approved label (Feb 23, 2023).
@knative-prow bot merged commit 708374e into knative:main on Feb 23, 2023.
Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/test-and-release: Flags unit/e2e/conformance/perf test issues for product features.
  • lgtm: Indicates that a PR is ready to be merged.
  • size/XS: Denotes a PR that changes 0-9 lines, ignoring generated files.

Successfully merging this pull request may close these issues.

[flaky] TestAutoscaleSustaining/aggregation-weighted-exponential is flakey in kind