TestAutoscaleSustaining scales to 8 instead of 10 #13679
Conversation
There's an indication that the client node that sends the requests is not able to generate enough load to actually scale the ksvc to 10. Trying to find a lower bar for KinD tests.
Codecov Report: Base 86.24% // Head 86.21% // Decreases project coverage by 0.03%.
Additional details and impacted files:
@@ Coverage Diff @@
##             main   #13679      +/-   ##
==========================================
- Coverage   86.24%   86.21%   -0.03%
==========================================
  Files         197      197
  Lines       14783    14774       -9
==========================================
- Hits        12749    12737      -12
- Misses       1733     1735       +2
- Partials      301      302       +1
Tests passed, run 1
@@ -143,7 +138,7 @@ func TestAutoscaleSustaining(t *testing.T) {
	}))
	test.EnsureTearDown(t, ctx.Clients(), ctx.Names())

-	AssertAutoscaleUpToNumPods(ctx, 1, 10, time.After(2*time.Minute), false /* quick */)
+	AssertAutoscaleUpToNumPods(ctx, 1, 8, time.After(2*time.Minute), false /* quick */)
If the flakiness only affects the exponential case (not the linear one), we could adjust the target just for that case, to distinguish between the two algorithms for the "same" generated traffic. That assumes the flakiness is not the result of some bug in the statistics aggregation or of some other failure. Thinking out loud about the latter: afaik the exponential algorithm favors the latest values in the window statistics, so I would expect it to catch up faster (compared to linear) if enough traffic is there? 🤔 cc @dprotaso @psschwei
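(For intuition, here is a minimal sketch of that difference, with made-up helper names; this is not Knative's actual aggregation code. An exponentially weighted window average is dominated by the newest samples, whereas a plain linear average weights the whole window equally.)

```go
// Illustrative sketch only (not Knative's aggregation code): why an
// exponentially weighted window reacts faster to the newest samples
// than a plain (linear) average over the same window.
package main

import "fmt"

// linearAverage weights every sample in the window equally.
func linearAverage(window []float64) float64 {
	var sum float64
	for _, v := range window {
		sum += v
	}
	return sum / float64(len(window))
}

// exponentialAverage applies a smoothing factor alpha so that recent
// samples dominate older ones.
func exponentialAverage(window []float64, alpha float64) float64 {
	avg := window[0]
	for _, v := range window[1:] {
		avg = alpha*v + (1-alpha)*avg
	}
	return avg
}

func main() {
	// Concurrency samples where traffic ramps up only at the end of the window.
	window := []float64{2, 2, 2, 2, 10, 10}
	fmt.Println(linearAverage(window))           // ≈ 4.67 – slow to reflect the spike
	fmt.Println(exponentialAverage(window, 0.5)) // = 8.0  – catches up faster
}
```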
Hmm, I'd probably prefer setting the same target for both, because we're adjusting this for an environment that is not able to achieve what the test requests (the test asks for a certain number of workers that run in parallel).
For environments which can satisfy the test requirements, a different target is not necessary, so it would also complicate the test settings (and code) for the more common case. But I'm not sure.
afaik exponential favors latest values in the window statistics
In that case, would frequent changes in traffic also cause more frequent scaling up/down with the exponential algorithm than with the linear one? Perhaps insufficient resources on the client side would make the traffic more unstable, and the exponential algorithm would react more quickly, scaling the ksvc up/down.
I wonder if it would be better to set different targets depending on the environment, using the short flag to distinguish the constrained one (i.e. something like targetPods := 10; if testing.Short() { targetPods = 8 }, as in the sketch below)? That said, I'm not sure if there was a specific reason why we chose 10 to begin with or if it was just a nice round number.
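A rough sketch of what that suggestion could look like inside TestAutoscaleSustaining (illustrative only, not the change that was merged; the float64 type for the pod-count argument is assumed from the call shown in the diff above):

```go
// Sketch only: pick a lower autoscale target when running with -short
// (i.e. on constrained environments such as KinD).
targetPods := float64(10)
if testing.Short() {
	targetPods = 8
}
AssertAutoscaleUpToNumPods(ctx, 1, targetPods, time.After(2*time.Minute), false /* quick */)
```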
(The weird thing is that these tests used to work on KinD too. If I remember correctly, things stopped working right around the time of the switch to the systemd cgroups driver. But that's neither here nor there...)
I wonder if it would be better to set different targets depending on environment, using the short flag to distinguish the constrained one (i.e. something like targetPods := 10; if testing.Short() { targetPods = 8 })?
That's an option too, but having 10 or 8 doesn't make a meaningful difference to me. As you said, it's not clear where the value 10 came from, so I chose one common value (8) to simplify the test and keep it identical for all environments.
I think it's probably fine to just go with 8 (I don't know what scaling up two more pods would really tell us, other than that we can handle double digits), but I will give @dprotaso a chance to weigh in in case he has more context here.
Can you also drop the -short flag here: serving/.github/workflows/kind-e2e.yaml, line 331 in 9b9a951:
-short \
@dprotaso gentle ping.
Sorry, should've gotten back to this one sooner...
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: mgencur, psschwei
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
There's an indication that the client node that sends the requests is not able to generate enough load to actually scale the ksvc to 10. The test runs the "vegeta" tool and configures the number of workers that send requests. However, when the client machine doesn't have enough resources (possibly just 2 vCPUs, as on KinD), or other tests are running in parallel, the worker threads are not able to generate enough traffic to scale the ksvc as desired.
We ran into this issue downstream as well; there were no errors on the Knative (cluster) side. Increasing the CPU resources for the client machine resolved the issue and it hasn't happened since.
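For illustration, here is a minimal standalone sketch of vegeta-based load generation (using the github.com/tsenart/vegeta library directly, with a placeholder URL; this is not the test's actual helper code). The requested rate is only reached if the fixed pool of worker goroutines gets enough CPU, so a starved client keeps the observed concurrency, and therefore the pod count, below the target:

```go
// Illustrative sketch of the failure mode, not the test's helper code:
// vegeta only hits the requested rate if its worker goroutines get enough
// CPU, so on a small client node (e.g. 2 vCPUs) the generated load can stay
// below what is needed to scale the ksvc to the target pod count.
package main

import (
	"fmt"
	"time"

	vegeta "github.com/tsenart/vegeta/v12/lib"
)

func main() {
	rate := vegeta.Rate{Freq: 100, Per: time.Second} // desired request rate
	targeter := vegeta.NewStaticTargeter(vegeta.Target{
		Method: "GET",
		URL:    "http://example.ksvc.local", // placeholder URL
	})
	// The worker count bounds the concurrency; starved workers mean less load.
	attacker := vegeta.NewAttacker(vegeta.Workers(10), vegeta.MaxWorkers(10))

	var metrics vegeta.Metrics
	for res := range attacker.Attack(targeter, rate, 30*time.Second, "sustain") {
		metrics.Add(res)
	}
	metrics.Close()
	fmt.Printf("achieved rate: %.2f req/s (requested 100)\n", metrics.Rate)
}
```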
Fixes #13049
Proposed Changes
I was considering leaving the default at 10 for Prow and using 8 when GOMAXPROCS is lower than 10. But that is just complicated and unnecessary. Also, GOMAXPROCS by itself doesn't work well in a container, where it actually returns the number of CPUs of the whole node.
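(Illustrative only, not part of this PR: the Go runtime, at least in the versions current at the time, is not cgroup-aware, so inside a CPU-limited container both of the values below typically report the host node's CPU count unless GOMAXPROCS is set explicitly, e.g. via the GOMAXPROCS environment variable or a library such as uber-go/automaxprocs.)

```go
// Illustrative only: in a container with a CPU limit of, say, 2, both values
// typically still reflect the whole node's CPU count unless GOMAXPROCS is
// set explicitly.
package main

import (
	"fmt"
	"runtime"
)

func main() {
	fmt.Println("NumCPU:    ", runtime.NumCPU())      // CPUs visible to the process (usually the node's)
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0)) // defaults to NumCPU, not the cgroup limit
}
```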
The KinD tests ran 3 times in this PR without issues.
Release Note