E2E Flakiness: Eliminates client-side rate-limiting for AP&F drown-out test #96798

Merged
3 commits merged Nov 30, 2020

Conversation

yue9944882
Member

@yue9944882 commented Nov 23, 2020

NONE

/sig api-machinery
/kind flakiness

A few clues I noticed that could be related to the e2e flakiness:

  1. The e2e framework has a default client-side token-bucket rate limiter with QPS/burst = 20/50, while the "elephant" client in this e2e test runs at 100 QPS, so it is always being rate-limited on the client side (see the sketch after this list):

ClientQPS: 20,
ClientBurst: 50,

  2. The rate limiter is actually shared by the "elephant" and "mouse" clients, which makes the two compete on the client side.

  3. In the fairness test, the matching flow schema uses "*" to match the {high,low}qps users; this pulls traffic from the controllers/nodes into the test scenario.
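
To make clue 1 concrete, here's a tiny standalone sketch (not part of this PR; it just exercises client-go's token-bucket limiter with the e2e defaults) showing why a 100-QPS caller ends up throttled:

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/flowcontrol"
)

func main() {
	// Same defaults as the e2e framework: 20 QPS with a burst of 50.
	limiter := flowcontrol.NewTokenBucketRateLimiter(20, 50)

	start := time.Now()
	for i := 0; i < 100; i++ {
		limiter.Accept() // blocks until a token is available
	}
	// The first ~50 calls drain the burst immediately; the remaining ~50 are
	// paced at 20/s, so this takes roughly 2.5s instead of the ~1s a
	// 100-QPS client expects.
	fmt.Printf("issued 100 requests in %v\n", time.Since(start))
}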

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 23, 2020
@k8s-ci-robot
Contributor

@yue9944882: The label(s) kind/flakiness cannot be applied, because the repository doesn't have them

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 23, 2020
@k8s-ci-robot k8s-ci-robot requested review from ncdc and thockin November 23, 2020 03:26
@yue9944882
Member Author

(Maybe) Fixes: #96710

@k8s-ci-robot k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Nov 23, 2020
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 23, 2020
@yue9944882
Member Author

/retest

1 similar comment
@yue9944882
Member Author

/retest

@yue9944882
Member Author

/retest

(retesting a few times to see if the flakiness is reproducible in this thread)

test/e2e/apimachinery/flowcontrol.go (outdated review thread, resolved)
test/e2e/apimachinery/flowcontrol.go (outdated review thread, resolved)
@@ -321,6 +323,7 @@ func createFlowSchema(f *framework.Framework, flowSchemaName string, matchingPre
func makeRequest(f *framework.Framework, username string) *http.Response {
config := f.ClientConfig()
config.Impersonate.UserName = username
config.RateLimiter = nil
Member

Hmm, since each request uses a separate config (a deep copy of the internal config), do you think we should call f.ClientConfig() once and pass that config around instead? Alternatively, we could make ClientConfig return the right config with the rate limiter set to nil somehow.

Setting the rate limiter to nil on each ClientConfig doesn't actually remove the rate limiting, I think.

Member

I can't recall now, but I thought you had to set qps/burst to -1 to turn this off?

Member

It looks like setting this to nil will make the REST library use the e2e framework-configured QPS limit:

if rateLimiter == nil {
	qps := config.QPS
	if config.QPS == 0.0 {
		qps = DefaultQPS
	}

The e2e framework configured limit is 20 reqs/sec:

config.QPS = f.Options.ClientQPS

I couldn't find any references to the -1 thing though.
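
If I remember right, that same function continues roughly like this (paraphrased from memory, not an exact quote of client-go), which is where a non-positive QPS would mean no token-bucket limiter gets built at all:

	burst := config.Burst
	if config.Burst == 0 {
		burst = DefaultBurst
	}
	if qps > 0 {
		rateLimiter = flowcontrol.NewTokenBucketRateLimiter(qps, burst)
	}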

Member Author

Nice catch! I crafted a new instance of the rate limiter with QPS/burst set to -1/0, so the client-side rate limiting should be bypassed now.
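
Roughly, the idea looks like this (a sketch only — the helper name is made up and this isn't necessarily the exact diff in this PR):

package e2e

import (
	"k8s.io/client-go/rest"
	"k8s.io/client-go/util/flowcontrol"
	"k8s.io/kubernetes/test/e2e/framework"
)

// unthrottledConfig is a hypothetical helper: it returns an impersonating
// rest.Config whose client-side throttling is effectively disabled.
func unthrottledConfig(f *framework.Framework, username string) *rest.Config {
	config := f.ClientConfig()
	config.Impersonate.UserName = username
	// A limiter that always admits requests immediately.
	config.RateLimiter = flowcontrol.NewFakeAlwaysRateLimiter()
	// Equivalent effect: leave RateLimiter nil and set a negative QPS so
	// client-go skips building its token-bucket limiter (see the excerpt
	// quoted above).
	//   config.RateLimiter = nil
	//   config.QPS = -1
	return config
}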

@adtac
Member

adtac commented Nov 23, 2020

As discussed in the issue, it might be worth reducing the percentage thresholds too -- say 75% for highqps (which doesn't really matter, since the real thing we're testing is lowqps's percentage) and 90% for lowqps (which allows 5 of 50 requests to fail or not complete)? At 5 QPS, 5 requests is one second's worth.
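
(For reference, assuming the lowqps client runs at 5 QPS over a 10-second window -- which is what 50 total requests at 5 QPS implies -- a 90% completion threshold tolerates 5 failed or incomplete requests, and 5 requests / 5 QPS = 1 second of slack.)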

@lavalamp
Member

Let's change the thresholds (and maybe add some waiting) in a second PR?

@MikeSpreitzer
Member

@MikeSpreitzer

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 24, 2020
@yue9944882
Member Author

Let's change the thresholds (and maybe add some waiting) in a second PR?

I updated the failing threshold for the priority test to 80% for both highqps and lowqps clients, while for the fairness test the failing thresholds are updated to 75% and 90%. WDYT? @lavalamp @adtac

@yue9944882
Member Author

Fixes: #96803 (comment)

Added a random suffix to the e2e resources and usernames to support parallel runs.

@adtac
Member

adtac commented Nov 24, 2020

/test pull-kubernetes-e2e-kind-ipv6
/test pull-kubernetes-e2e-kind
(unrelated flakes)

LGTM, thanks Min! I'll leave it to @lavalamp to take a look.

@fedebongio
Contributor

/assign @adtac @lavalamp
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 24, 2020
@MikeSpreitzer
Member

Sorry I did not review these tests in the first place. They are both mis-directed.

Note that these tests are evaluating whether the apiserver under test has served requests at an expected rate. That's a dicey proposition, since we have:

  • no strong guarantees on the power of that server,
  • no strong guarantees on what else is going on at the same time,
  • no control over (or really even an idea of) how long it takes to serve one request.

The feature under test directly regulates concurrency, not rate. Better to test that.

The "should ensure that requests can't be drowned out (priority)" test is flaking because it is requiring a rate of service that can be crowded out by other activity when the server and/or its node is busy. Better to take the approach in https://github.com/kubernetes/kubernetes/blob/release-1.19/test/integration/apiserver/flowcontrol/concurrency_test.go , which looks to see whether the two priority levels deliver their promised amount of concurrency. BTW: while that integration test examines whether the less demanding flow gets all its allowed concurrency, we could create an additional test that instead uses a single thread for the less demanding flow and checks that this flow suffers essentially no queuing (we can not insist on absolutely no queuing because the code path always goes through the queue, so there will always be some small amount of time there). For end-to-end testing both sorts of test make sense as well.

The "should ensure that requests can't be drowned out (fairness)" test seems to be aimed at testing the fair queuing but uses a priority level that does no queuing! No small tweak will fix this, an entirely different test is required. Look at https://github.com/kubernetes/kubernetes/blob/release-1.19/staging/src/k8s.io/apiserver/pkg/util/flowcontrol/fairqueuing/queueset/queueset_test.go for ways to test the fair queuing.

Also, I added a couple of comments on #96646. But really, my remarks here are my belated review of that PR.

@MikeSpreitzer
Member

Concurrency is, on average, the product of rate and duration. A test that controls rate but not duration is not even posing a defined quantity of challenge.
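
To put illustrative numbers on that (Little's law, with made-up figures): average concurrency L = λ · W, so an offered rate of λ = 100 requests/s with an average service time of W = 0.25 s demands L = 100 × 0.25 = 25 concurrent requests, while the same rate with W = 0.05 s demands only 5. Controlling λ without knowing W therefore leaves the demanded concurrency undefined.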

}
clients := []client{
// "highqps" refers to a client that creates requests at a much higher
// QPS than its counter-part and well above its concurrency share limit.
// In contrast, "lowqps" stays under its concurrency shares.
// Additionally, the "highqps" client also has a higher matching
// precedence for its flow schema.
{username: "highqps", qps: 100.0, concurrencyMultiplier: 2.0, matchingPrecedence: 999},
{username: "lowqps", qps: 5.0, concurrencyMultiplier: 0.5, matchingPrecedence: 1000},
{username: highQPSClientName, qps: 100.0, concurrencyMultiplier: 2.0, matchingPrecedence: 999, expectedCompletedPercentage: 0.75},
Member

Was there any quantitative reasoning for picking 100 here? If not and we just want this test to stop flaking, why not try a lower number?

Member Author

Mmm, I guess we chose 100 for the highqps client as a random number? @adtac

Member

there isn't much reasoning behind 100 other than that it's a nice round number that's large enough -- I don't have any objections to reducing the QPS a bit (as long as the ratio between the two clients stays large enough)

Member

So #96874 is mainly about reducing the expected as well as attempted throughput.

@yue9944882 force-pushed the flaky/apnf-e2e-drown-test branch from a621ae6 to b0c52fd on November 26, 2020 10:38
@adtac
Member

adtac commented Nov 30, 2020

/test pull-kubernetes-e2e-kind
(unrelated flake)

@lavalamp
Member

This seems like it should make the test less flaky. I agree w/ Mike's comments that we need to make another pass on these tests though.

/lgtm
/approve
/milestone v1.20

@k8s-ci-robot k8s-ci-robot added this to the v1.20 milestone Nov 30, 2020
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 30, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lavalamp, yue9944882

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 30, 2020
@spiffxp
Member

spiffxp commented Nov 30, 2020

/kind flake

@k8s-ci-robot k8s-ci-robot added kind/flake Categorizes issue or PR as related to a flaky test. and removed do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Nov 30, 2020
@k8s-ci-robot k8s-ci-robot merged commit e0c587b into kubernetes:master Nov 30, 2020
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/flake Categorizes issue or PR as related to a flaky test. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
7 participants