E2E Flakiness: Eliminates client-side rate-limiting for AP&F drown-out test #96798

Merged
3 commits merged Nov 30, 2020

Conversation

yue9944882
Member

@yue9944882 commented Nov 23, 2020

NONE

/sig api-machinery
/kind flakiness

A few clues I noticed that could be related to the e2e flakiness:

  1. The e2e framework has a default client-side token-bucket rate limiter with QPS/burst = 20/50, while the "elephant" client in this e2e test runs at 100 QPS, so it is always being rate-limited on the client side (see the sketch after this list):

ClientQPS: 20,
ClientBurst: 50,

  2. The rate limiter is actually shared by the "elephant" and "mouse" clients, which makes the two compete on the client side.

  3. In the fairness test, the matching flow schema uses "*" to match the {high,low}qps users; this pulls traffic from the controllers/nodes into the test scenario.
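
To make clue 1 concrete, here's a tiny standalone sketch (not part of this PR; it just exercises client-go's token-bucket limiter with the e2e defaults) showing why a 100-QPS caller ends up throttled:

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/flowcontrol"
)

func main() {
	// Same defaults as the e2e framework: 20 QPS with a burst of 50.
	limiter := flowcontrol.NewTokenBucketRateLimiter(20, 50)

	start := time.Now()
	for i := 0; i < 100; i++ {
		limiter.Accept() // blocks until a token is available
	}
	// The first ~50 calls drain the burst immediately; the remaining ~50 are
	// paced at 20/s, so this takes roughly 2.5s instead of the ~1s a
	// 100-QPS client expects.
	fmt.Printf("issued 100 requests in %v\n", time.Since(start))
}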

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 23, 2020
@k8s-ci-robot
Contributor

@yue9944882: The label(s) kind/flakiness cannot be applied, because the repository doesn't have them

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 23, 2020
@k8s-ci-robot k8s-ci-robot requested review from ncdc and thockin November 23, 2020 03:26
@yue9944882
Member Author

(Maybe) Fixes: #96710

@k8s-ci-robot k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Nov 23, 2020
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 23, 2020
@yue9944882
Member Author

/retest

1 similar comment
@yue9944882
Member Author

/retest

@yue9944882
Member Author

/retest

(retesting a few times to see if the flakiness is reproducible in this thread)

test/e2e/apimachinery/flowcontrol.go (outdated review thread, resolved)
test/e2e/apimachinery/flowcontrol.go (outdated review thread, resolved)
@@ -321,6 +323,7 @@ func createFlowSchema(f *framework.Framework, flowSchemaName string, matchingPre
func makeRequest(f *framework.Framework, username string) *http.Response {
config := f.ClientConfig()
config.Impersonate.UserName = username
config.RateLimiter = nil
Member

Hmm, since each request uses a separate config (a deep copy of the internal config), do you think we should call f.ClientConfig() once and pass that config around instead? Alternatively, we could make ClientConfig return the right config with the rate limiter set to nil somehow.

Setting the rate limiter to nil on each ClientConfig doesn't actually remove the rate limiting, I think.

Member

I can't recall now, but I thought you had to set qps/burst to -1 to turn this off?

Member

It looks like setting this to nil will make the REST library use the e2e framework-configured QPS limit:

if rateLimiter == nil {
	qps := config.QPS
	if config.QPS == 0.0 {
		qps = DefaultQPS
	}

The e2e framework configured limit is 20 reqs/sec:

config.QPS = f.Options.ClientQPS

I couldn't find any references to the -1 thing though.
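
If I remember right, that same function continues roughly like this (paraphrased from memory, not an exact quote of client-go), which is where a non-positive QPS would mean no token-bucket limiter gets built at all:

	burst := config.Burst
	if config.Burst == 0 {
		burst = DefaultBurst
	}
	if qps > 0 {
		rateLimiter = flowcontrol.NewTokenBucketRateLimiter(qps, burst)
	}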

Member Author

Nice catch! I crafted a new instance of the rate limiter with QPS/burst set to -1/0, so the client-side rate limiting should be bypassed now.
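
Roughly, the idea looks like this (a sketch only — the helper name is made up and this isn't necessarily the exact diff in this PR):

package e2e

import (
	"k8s.io/client-go/rest"
	"k8s.io/client-go/util/flowcontrol"
	"k8s.io/kubernetes/test/e2e/framework"
)

// unthrottledConfig is a hypothetical helper: it returns an impersonating
// rest.Config whose client-side throttling is effectively disabled.
func unthrottledConfig(f *framework.Framework, username string) *rest.Config {
	config := f.ClientConfig()
	config.Impersonate.UserName = username
	// A limiter that always admits requests immediately.
	config.RateLimiter = flowcontrol.NewFakeAlwaysRateLimiter()
	// Equivalent effect: leave RateLimiter nil and set a negative QPS so
	// client-go skips building its token-bucket limiter (see the excerpt
	// quoted above).
	//   config.RateLimiter = nil
	//   config.QPS = -1
	return config
}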

@adtac
Member

adtac commented Nov 23, 2020

As discussed in the issue, it might be worth reducing the percentage thresholds too -- say 75% for highqps (which doesn't really matter, since the real thing we're testing is lowqps's percentage) and 90% for lowqps (which allows 5 of 50 requests to fail or not complete)? At 5 QPS, 5 requests is one second's worth.
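
(For reference, assuming the lowqps client runs at 5 QPS over a 10-second window -- which is what 50 total requests at 5 QPS implies -- a 90% completion threshold tolerates 5 failed or incomplete requests, and 5 requests / 5 QPS = 1 second of slack.)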

@lavalamp
Member

Let's change the thresholds (and maybe add some waiting) in a second PR?

@MikeSpreitzer
Member

@MikeSpreitzer

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 24, 2020
@yue9944882
Member Author

Let's change the thresholds (and maybe add some waiting) in a second PR?

I updated the failing threshold for the priority test to 80% for both highqps and lowqps clients, while for the fairness test the failing thresholds are updated to 75% and 90%. WDYT? @lavalamp @adtac

@yue9944882
Member Author

Fixes: #96803 (comment)

Added a random suffix to the e2e resources and usernames to support parallel runs.

@adtac
Member

adtac commented Nov 24, 2020

/test pull-kubernetes-e2e-kind-ipv6
/test pull-kubernetes-e2e-kind
(unrelated flakes)

LGTM, thanks Min! I'll leave it to @lavalamp to take a look.

@fedebongio
Contributor

/assign @adtac @lavalamp
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 24, 2020
@MikeSpreitzer
Member

Sorry I did not review these tests in the first place. They are both mis-directed.

Note that these tests are evaluating whether the apiserver under test has served requests at an expected rate. That's a dicey proposition, since we have:

  • no strong guarantees on the power of that server,
  • no strong guarantees on what else is going on at the same time,
  • no control over (or really even an idea of) how long it takes to serve one request.

The feature under test directly regulates concurrency, not rate. Better to test that.

The "should ensure that requests can't be drowned out (priority)" test is flaking because it is requiring a rate of service that can be crowded out by other activity when the server and/or its node is busy. Better to take the approach in https://github.com/kubernetes/kubernetes/blob/release-1.19/test/integration/apiserver/flowcontrol/concurrency_test.go , which looks to see whether the two priority levels deliver their promised amount of concurrency. BTW: while that integration test examines whether the less demanding flow gets all its allowed concurrency, we could create an additional test that instead uses a single thread for the less demanding flow and checks that this flow suffers essentially no queuing (we can not insist on absolutely no queuing because the code path always goes through the queue, so there will always be some small amount of time there). For end-to-end testing both sorts of test make sense as well.

The "should ensure that requests can't be drowned out (fairness)" test seems to be aimed at testing the fair queuing but uses a priority level that does no queuing! No small tweak will fix this, an entirely different test is required. Look at https://github.com/kubernetes/kubernetes/blob/release-1.19/staging/src/k8s.io/apiserver/pkg/util/flowcontrol/fairqueuing/queueset/queueset_test.go for ways to test the fair queuing.

Also, I added a couple of comments on #96646. But really, my remarks here are my belated review of that PR.

@MikeSpreitzer
Member

Concurrency is, on average, the product of rate and duration. A test that controls rate but not duration is not even posing a defined quantity of challenge.
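
To put illustrative numbers on that (Little's law, with made-up figures): average concurrency L = λ · W, so an offered rate of λ = 100 requests/s with an average service time of W = 0.25 s demands L = 100 × 0.25 = 25 concurrent requests, while the same rate with W = 0.05 s demands only 5. Controlling λ without knowing W therefore leaves the demanded concurrency undefined.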

}
clients := []client{
// "highqps" refers to a client that creates requests at a much higher
// QPS than its counter-part and well above its concurrency share limit.
// In contrast, "lowqps" stays under its concurrency shares.
// Additionally, the "highqps" client also has a higher matching
// precedence for its flow schema.
{username: "highqps", qps: 100.0, concurrencyMultiplier: 2.0, matchingPrecedence: 999},
{username: "lowqps", qps: 5.0, concurrencyMultiplier: 0.5, matchingPrecedence: 1000},
{username: highQPSClientName, qps: 100.0, concurrencyMultiplier: 2.0, matchingPrecedence: 999, expectedCompletedPercentage: 0.75},
Member

Was there any quantitative reasoning for picking 100 here? If not and we just want this test to stop flaking, why not try a lower number?

Member Author

Mmm, I guess we chose 100 for the highqps client as a random number? @adtac

Member

there isn't much reasoning behind 100 other than that it's a nice round number that's large enough -- I don't have any objections to reducing the QPS a bit (as long as the ratio between the two clients stays large enough)

Member

So #96874 is mainly about reducing the expected as well as attempted throughput.

@yue9944882 force-pushed the flaky/apnf-e2e-drown-test branch from a621ae6 to b0c52fd on November 26, 2020 10:38
@adtac
Member

adtac commented Nov 30, 2020

/test pull-kubernetes-e2e-kind
(unrelated flake)

@lavalamp
Member

This seems like it should make the test less flaky. I agree w/ Mike's comments that we need to make another pass on these tests though.

/lgtm
/approve
/milestone v1.20

@k8s-ci-robot k8s-ci-robot added this to the v1.20 milestone Nov 30, 2020
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 30, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lavalamp, yue9944882

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 30, 2020
@spiffxp
Member

spiffxp commented Nov 30, 2020

/kind flake

@k8s-ci-robot k8s-ci-robot added kind/flake Categorizes issue or PR as related to a flaky test. and removed do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Nov 30, 2020
@k8s-ci-robot k8s-ci-robot merged commit e0c587b into kubernetes:master Nov 30, 2020
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/flake Categorizes issue or PR as related to a flaky test. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
7 participants