E2E Flakiness: Eliminates client-side rate-limiting for AP&F drown-out test #96798
Conversation
(Maybe) Fixes: #96710
/retest
1 similar comment
/retest
/retest (retesting a few times to see if the flakiness is reproducible in this thread)
test/e2e/apimachinery/flowcontrol.go
Outdated
@@ -321,6 +323,7 @@ func createFlowSchema(f *framework.Framework, flowSchemaName string, matchingPre
 func makeRequest(f *framework.Framework, username string) *http.Response {
 	config := f.ClientConfig()
 	config.Impersonate.UserName = username
+	config.RateLimiter = nil
hmm, since each request uses a separate config (a deep copy of the internal config), do you think we should do f.ClientConfig()
once and pass that config around instead? alternatively, we can make ClientConfig return the right config with the rate limiter set to nil somehow
As for setting the rate limiter to nil on each ClientConfig: I don't think this actually removes the rate limiting.
I can't recall now, but I thought you had to set qps/burst to -1 to turn this off?
It looks like setting this to nil will make the REST library use the e2e framework-configured QPS limit:
kubernetes/staging/src/k8s.io/client-go/rest/config.go
Lines 333 to 337 in da75c26
	if rateLimiter == nil {
		qps := config.QPS
		if config.QPS == 0.0 {
			qps = DefaultQPS
		}
The e2e framework configured limit is 20 reqs/sec:
kubernetes/test/e2e/framework/framework.go
Line 145 in 540e41c
ClientQPS: 20,
kubernetes/test/e2e/framework/framework.go
Line 192 in 540e41c
config.QPS = f.Options.ClientQPS
I couldn't find any references to the -1 thing though.
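To make that default concrete, here is a small self-contained sketch that approximates client-go's fallback (rest.DefaultQPS, rest.DefaultBurst, and flowcontrol.NewTokenBucketRateLimiter are real client-go names; the effectiveLimiter helper itself is illustrative, not the library's code). The point: with RateLimiter left nil, a token bucket still gets built unless the QPS is negative.

```go
package main

import (
	"fmt"

	"k8s.io/client-go/rest"
	"k8s.io/client-go/util/flowcontrol"
)

// effectiveLimiter approximates what client-go does when config.RateLimiter is
// nil: fall back to config.QPS/Burst (or the library defaults) and build a
// token-bucket limiter unless QPS is negative. Illustrative, not library code.
func effectiveLimiter(config *rest.Config) flowcontrol.RateLimiter {
	if config.RateLimiter != nil {
		return config.RateLimiter
	}
	qps := config.QPS
	if qps == 0.0 {
		qps = rest.DefaultQPS // 5 by default in client-go
	}
	burst := config.Burst
	if burst == 0 {
		burst = rest.DefaultBurst // 10 by default in client-go
	}
	if qps <= 0 {
		return nil // only a negative QPS really disables client-side throttling
	}
	return flowcontrol.NewTokenBucketRateLimiter(qps, burst)
}

func main() {
	cfg := &rest.Config{QPS: 20} // the e2e framework's ClientQPS
	fmt.Println("limiter still created with RateLimiter=nil:", effectiveLimiter(cfg) != nil)
}
```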
nice catch! i crafted a new instance of rate-limiter with qps/burst set to -1/0, the client-side rate-limiting should be bypassed now.
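A minimal sketch of that kind of change (illustrative, not necessarily the exact diff that was merged; unthrottledConfig is a made-up helper name, while flowcontrol.NewFakeAlwaysRateLimiter is an existing client-go limiter that admits every request):

```go
package apimachinery

import (
	"k8s.io/client-go/rest"
	"k8s.io/client-go/util/flowcontrol"

	"k8s.io/kubernetes/test/e2e/framework"
)

// unthrottledConfig (hypothetical helper) copies the framework's client config
// for an impersonated user and removes client-side throttling entirely, so the
// test only exercises server-side priority and fairness.
func unthrottledConfig(f *framework.Framework, username string) *rest.Config {
	config := f.ClientConfig() // each call returns a copy of the internal config
	config.Impersonate.UserName = username

	// Attach a limiter that always admits requests immediately...
	config.RateLimiter = flowcontrol.NewFakeAlwaysRateLimiter()
	// ...or, alternatively, leave RateLimiter nil and set a negative QPS so the
	// client never builds a token bucket from the framework's 20-QPS default:
	// config.QPS = -1

	return config
}
```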
as discussed in the issue, it might be worth reducing the percentage thresholds too -- let's say 75% for highqps (doesn't really matter, since the real thing we're testing is lowqps's percentage) and 90% for lowqps (allows 5/50 requests to fail or not complete)? 5 requests is 1 second's worth.
Let's change the thresholds (and maybe add some waiting) in a second PR?
Fixes: #96803 (comment). Added a random suffix to the e2e resource names and usernames to support parallel runs.
/test pull-kubernetes-e2e-kind-ipv6
LGTM, thanks Min! I'll leave it to @lavalamp to take a look.
Sorry I did not review these tests in the first place. They are both misdirected. Note that these tests are evaluating whether the apiserver under test has served requests at an expected rate. That's a dicey proposition, since several factors outside the test's control affect how quickly the server under test actually serves requests.
The feature under test directly regulates concurrency, not rate. Better to test that.

The "should ensure that requests can't be drowned out (priority)" test is flaking because it requires a rate of service that can be crowded out by other activity when the server and/or its node is busy. Better to take the approach in https://github.com/kubernetes/kubernetes/blob/release-1.19/test/integration/apiserver/flowcontrol/concurrency_test.go , which looks at whether the two priority levels deliver their promised amount of concurrency. BTW: while that integration test examines whether the less demanding flow gets all its allowed concurrency, we could create an additional test that instead uses a single thread for the less demanding flow and checks that this flow suffers essentially no queuing (we cannot insist on absolutely no queuing, because the code path always goes through the queue, so there will always be some small amount of time spent there). For end-to-end testing, both sorts of test make sense as well.

The "should ensure that requests can't be drowned out (fairness)" test seems to be aimed at testing the fair queuing but uses a priority level that does no queuing! No small tweak will fix this; an entirely different test is required. Look at https://github.com/kubernetes/kubernetes/blob/release-1.19/staging/src/k8s.io/apiserver/pkg/util/flowcontrol/fairqueuing/queueset/queueset_test.go for ways to test the fair queuing.

Also, I added a couple of comments on #96646. But really, my remarks here are my belated review of that PR.
Concurrency is, on average, the product of rate and duration. A test that controls rate but not duration is not even posing a defined quantity of challenge.
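To spell out that arithmetic (illustrative numbers, not measurements from the test): by Little's law, average concurrency L equals arrival rate \lambda times mean time in system W, so the same request rate exerts very different concurrency pressure depending on how long each request takes.

```latex
L = \lambda \cdot W
\qquad\text{e.g.}\qquad
100\,\mathrm{s}^{-1} \times 0.05\,\mathrm{s} = 5
\quad\text{vs.}\quad
100\,\mathrm{s}^{-1} \times 0.5\,\mathrm{s} = 50
```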
	}
	clients := []client{
		// "highqps" refers to a client that creates requests at a much higher
		// QPS than its counter-part and well above its concurrency share limit.
		// In contrast, "lowqps" stays under its concurrency shares.
		// Additionally, the "highqps" client also has a higher matching
		// precedence for its flow schema.
		{username: "highqps", qps: 100.0, concurrencyMultiplier: 2.0, matchingPrecedence: 999},
		{username: "lowqps", qps: 5.0, concurrencyMultiplier: 0.5, matchingPrecedence: 1000},
		{username: highQPSClientName, qps: 100.0, concurrencyMultiplier: 2.0, matchingPrecedence: 999, expectedCompletedPercentage: 0.75},
Was there any quantitative reasoning for picking 100 here? If not and we just want this test to stop flaking, why not try a lower number?
mmm i guess we chose 100 for the highqps client as a random number? @adtac
there isn't much reasoning behind 100 other than that it's a nice round number that's large enough -- I don't have any objections to reducing the QPS a bit (as long as the ratio between the two clients stays large enough)
So #96874 is mainly about reducing the expected as well as attempted throughput.
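For reference, a hypothetical sketch of how the new expectedCompletedPercentage field might be enforced at the end of the test (checkCompletedFraction and its arguments are invented for illustration; framework.Failf is the real e2e failure helper):

```go
// checkCompletedFraction is a hypothetical helper, not the PR's actual code: it
// fails the test when a client completed fewer requests than its threshold allows.
// Example use: checkCompletedFraction(highQPSClientName, completed, attempted, 0.75)
func checkCompletedFraction(username string, completed, attempted int, expectedFraction float64) {
	actual := float64(completed) / float64(attempted)
	if actual < expectedFraction {
		framework.Failf("client %q completed only %.0f%% of its %d requests, expected at least %.0f%%",
			username, 100*actual, attempted, 100*expectedFraction)
	}
}
```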
a621ae6 to b0c52fd (compare)
/test pull-kubernetes-e2e-kind
This seems like it should make the test less flaky. I agree w/ Mike's comments that we need to make another pass on these tests though.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lavalamp, yue9944882

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/kind flake
/sig api-machinery
/kind flakiness
a few clues I noticed that could be related to the e2e flakiness:
kubernetes/test/e2e/framework/framework.go
Lines 145 to 146 in b2ecd1b
the rate-limiter is actually shared by the "elephant" and "mouse" clients, which makes the two compete at the client side.
in the fairness test, the matching flow-schema uses "*" to match the {high,low}qps users, which means traffic from the controllers/nodes also matches and impacts the test scenario.