
priority & fairness: support dynamic max seats #118601

Merged 1 commit on Jul 18, 2023

Conversation

@andrewsykim (Member) commented Jun 10, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

'Max seats' calculated by the work estimator is currently hard-coded to 10. When using lower values for --max-requests-inflight, a single LIST request taking up 10 seats could end up using most, if not all, seats in the priority level. This change updates the work estimator so that 'max seats' is now calculated as MIN(0.15 x nominalCL, nominalCL / handSize), with an upper limit of 10. This ensures the seats taken by any request are proportional to the total available seats.
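For illustration, here is a minimal sketch of the calculation described above. The function and constant names are hypothetical, not the actual kube-apiserver code (the real WorkEstimatorConfig also carries a MinimumSeats value, as seen in the diff snippets below):

```go
package main

import (
	"fmt"
	"math"
)

// maximumSeatsUpperBound keeps the previous hard cap of 10 seats.
const maximumSeatsUpperBound = 10

// dynamicMaxSeats sketches the formula from this PR:
// maxSeats = min(0.15 * nominalCL, nominalCL / handSize), capped at 10.
func dynamicMaxSeats(nominalCL, handSize int) uint64 {
	seats := math.Min(0.15*float64(nominalCL), float64(nominalCL)/float64(handSize))
	maxSeats := uint64(math.Ceil(seats))
	if maxSeats > maximumSeatsUpperBound {
		maxSeats = maximumSeatsUpperBound
	}
	if maxSeats < 1 {
		maxSeats = 1
	}
	return maxSeats
}

func main() {
	// e.g. a priority level with nominalCL=13, handSize=6:
	// min(0.15*13, 13/6) = min(1.95, 2.17) -> ceil -> 2 seats
	fmt.Println(dynamicMaxSeats(13, 6)) // 2
	// a large priority level: min(36.75, 30.63) -> capped at 10
	fmt.Println(dynamicMaxSeats(245, 8)) // 10
}
```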

Which issue(s) this PR fixes:

ref #115409

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Update kube-apiserver's priority & fairness work estimator such that 'max seats' is MIN(0.15 x nominalCL, nominalCL / handSize)

This fixes a bug where clients with requests using hand size x max seats greater than the nominal concurrency limit can starve other requests in the same priority level.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot added do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress.), release-note (Denotes a PR that will be considered when it comes time to generate release notes.), kind/feature (Categorizes issue or PR as related to a new feature.), size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.), cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.) labels Jun 10, 2023
@k8s-ci-robot added area/apiserver, sig/api-machinery (Categorizes an issue or PR as relevant to SIG API Machinery.), needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.), needs-priority (Indicates a PR lacks a `priority/foo` label and requires one.) labels Jun 10, 2023
@andrewsykim (Member Author)

/retest

@andrewsykim changed the title from "[WIP] kube-apiserver: max seats calculated by APF work estimator should be no more than 1% of total inflight requests" to "kube-apiserver: max seats calculated by APF work estimator should be no more than 1% of total inflight requests" Jun 12, 2023
@k8s-ci-robot removed the do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress.) label Jun 12, 2023
if maximumSeats == 0 || maximumSeats > maximumSeatsUpperBound {
	maximumSeats = maximumSeatsUpperBound
}

return &WorkEstimatorConfig{
	MinimumSeats: minimumSeats,
	MaximumSeats: maximumSeats,
Member Author

Debating whether it's worthwhile to have separate min/max seat configuration for ListWorkEstimatorConfig and MutatingWorkEstimatorConfig -- that way this change only applies to the LIST estimator, which we know is more problematic at the moment. It's possible the mutating work estimator has similar issues, but we just don't have the feedback yet since it was introduced in a newer version.

Member Author

Leaning towards separating the config and revisiting once we have more feedback on watches.

Member

I lean the other way. The problem affects both. I would not do the work to separate them unless we have a specific need to.

@andrewsykim (Member Author)

/assgin @wojtek-t @MikeSpreitzer

@andrewsykim (Member Author)

/assign @wojtek-t @MikeSpreitzer

-func DefaultWorkEstimatorConfig() *WorkEstimatorConfig {
+func DefaultWorkEstimatorConfig(serverConcurrencyLimit int) *WorkEstimatorConfig {
+	// By default maximumSeats is 1% of total max inflight requests with an upper bound of 10
+	maximumSeats := uint64(math.Ceil((float64(serverConcurrencyLimit) * maximumSeatsFractionOfTotal)))
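To make the review discussion that follows concrete, here is a rough illustration of what the 1%-with-a-cap-of-10 calculation in the snippet above yields for a few server concurrency limits. The standalone program below is illustrative only; the constant values mirror the snippet, and the names are assumed:

```go
package main

import (
	"fmt"
	"math"
)

// Values mirroring the snippet above: maximumSeats is 1% of the
// server-wide concurrency limit, clamped to an upper bound of 10.
const (
	maximumSeatsFractionOfTotal = 0.01
	maximumSeatsUpperBound      = 10
)

func main() {
	for _, limit := range []int{50, 100, 600, 2000} {
		seats := uint64(math.Ceil(float64(limit) * maximumSeatsFractionOfTotal))
		if seats == 0 || seats > maximumSeatsUpperBound {
			seats = maximumSeatsUpperBound
		}
		// 50 -> 1, 100 -> 1, 600 -> 6, 2000 -> 10 (capped)
		fmt.Printf("serverConcurrencyLimit=%4d -> maximumSeats=%d\n", limit, seats)
	}
}
```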
Member

While I understand the motivation behind this change (and I guess I agree with it), I'm not convinced the change as it stands will not bring more damage than gain...

  1. We are running smallest control-planes with max-inflights < 100 - this change automatically means that every single request will occupy 1 seat - is that really what we want?

  2. We run large control-planes (in particular scalability tests) with < 1000 inflights, so we will never reach 10 max-seats for them. I'm fairly sure it's not really what we want.

So I guess I have two main concerns about this change:

  1. 1% seems too low to me
  2. A second question is whether we want to set this based on total max seats, or rather per priority level (where by per-priority-level in the borrowing world I rather mean something like "the max possible in-flights for a given priority level assuming I borrow everything I can", or something like that - so that it doesn't change frequently).

@MikeSpreitzer

Member Author

> We are running smallest control-planes with max-inflights < 100 - this change automatically means that every single request will occupy 1 seat - is that really what we want?

This is the idea, but maybe 1% is too aggressive. My assumption here is that most small control planes are not really using up to 10 seats anyway because they don't have a lot of objects stored, and this change will only really impact the medium-to-large clusters that are running hundreds or thousands of Pods.

But for example's sake, let's say we have a small control plane with --max-requests-inflight=100 and 500 pods: a single LIST would use up all seats in priority levels like global-default (assuming it has 10% of total seats), which I don't think we want either. Even looking at workload-low (assuming 40% of total seats) -- it would only take 4 daemonsets listing pods with a field selector to use all of workload-low. Even with borrowing, I'm skeptical it will alleviate seat capacity much.

> We run large control-planes (in particular scalability tests) with < 1000 inflights, so we will never reach 10 max-seats for them. I'm fairly sure it's not really what we want.

Actually, could we run this scalability test on this PR? Yes, max seats will never reach 10, but it will still be a single-digit value based on what is used in scalability tests. I'm curious whether the impact will be noticeable if max seats is reduced from 10 to 7 (as an example).

Member

> Actually, could we run this scalability test on this PR? Yes, max seats will never reach 10, but it will still be a single-digit value based on what is used in scalability tests. I'm curious whether the impact will be noticeable if max seats is reduced from 10 to 7 (as an example).

Unfortunately, the scalability test doesn't do a good enough job of saturating the control plane, so I don't think we will get useful data from it...

Now thinking more about my concerns:

  1. Let's say someone sets inflight to 50 - then 4% has exactly the same problem. And if we bump it too high, then it will become ineffective.
    So maybe my concern boils down to the fact that a single value is not what we want.
    We may need a function that takes total inflights as a parameter and returns a percent?

For low numbers it returns something higher, for higher numbers it decreases.
That said, I don't have a good suggestion for how to create this function in a more data-driven way...

> But for example's sake, let's say we have a small control plane with --max-requests-inflight=100 and 500 pods

That's unclear to me - listing 500 pods is quite a bit of work too.
But I can definitely imagine that this value:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/util/flowcontrol/request/config.go#L28
is just too low once we run better tests...

So yet another option would be to also compute this number based on max-requests-inflight.

  1. For that, I agree with Mike that my concern is much more related to the issue about the fact that max-width can exceed the allocated capacity of one priority level.
    I guess I'm fine with having that discussion separately from this PR.

@wojtek-t (Member)

/hold

To have some discussion on it.

@k8s-ci-robot added the do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command.) label Jun 12, 2023
@MikeSpreitzer (Member)

I agree that 1% seems too low. I would use the total server capacity, which is max-mutating + max-readonly. If we were to use 17% and round then that would give the same behavior as today in the default configuration (200 mutating + 400 readonly).

We already have an issue about the fact that max-width can exceed the allocated capacity of one priority level. Let's address that fully and head-on, as a separate issue.

@MikeSpreitzer (Member)

Oh, wait, I was off by a factor of 10. It would be 1.7%. Maybe use millis instead of percent? Or 2% and accept the change to 12?

@alexzielenski (Contributor)

/triage accepted

@k8s-ci-robot added triage/accepted (Indicates an issue or PR is ready to be actively worked on.) and removed needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.) labels Jun 13, 2023
@wojtek-t (Member)

> Oh, wait, I was off by a factor of 10. It would be 1.7%. Maybe use millis instead of percent? Or 2% and accept the change to 12?

@MikeSpreitzer - I think 2% is still too low (unless we make it configurable).
For the smallest/lightest clusters we're setting the total in-flights to <100, so 2% of it effectively means <2...

@MikeSpreitzer (Member)

We could set maxWidth to round(sqrt(serverCapacity/6.0)). For a server capacity (max mutating inflight + max read-only inflight) of 600, that gives the current value of 10. For a server capacity of 50, that gives a value of 3. For a server capacity of 1000, that gives a value of 13.
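For reference, a quick check of the numbers quoted in this proposal (this sqrt-based formula was one alternative floated in the discussion, not what the PR ultimately merged):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// round(sqrt(serverCapacity / 6.0)) for the capacities mentioned above.
	for _, capacity := range []float64{50, 600, 1000} {
		maxWidth := math.Round(math.Sqrt(capacity / 6.0))
		// 50 -> 3, 600 -> 10, 1000 -> 13
		fmt.Printf("server capacity %4.0f -> maxWidth %.0f\n", capacity, maxWidth)
	}
}
```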

…ax seats

Max seats from priority & fairness work estimator is now min(0.15 x
nominalCL, nominalCL/handSize)

'Max seats' calculated by work estimator is currently hard coded to 10.
When using lower values for --max-requests-inflight, a single
LIST request taking up 10 seats could end up using all if not most seats in
the priority level. This change updates the default work estimator
config such that 'max seats' is at most 10% of the
maximum concurrency limit for a priority level, with an upper limit of 10.
This ensures seats taken from a LIST request are proportional to the total
available seats.

Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
@MikeSpreitzer (Member) left a comment

/lgtm
Thanks!

@k8s-ci-robot added the lgtm ("Looks good to me", indicates that a PR is ready to be merged.) label Jul 17, 2023
@k8s-ci-robot (Contributor)

LGTM label has been added.

Git tree hash: 37313f51268123ef1ccb19ddb23c61e8d993b705

@MikeSpreitzer (Member)

/assign @deads2k

@wojtek-t (Member)

/lgtm
/approve

/hold cancel

Thanks!

@k8s-ci-robot removed the do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command.) label Jul 18, 2023
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andrewsykim, MikeSpreitzer, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@wojtek-t (Member)

After discussing it, we consider this more of a bug fix than a feature. We will be backporting it, so re-tagging.

/kind bug
/remove-kind feature

@k8s-ci-robot added kind/bug (Categorizes issue or PR as related to a bug.) and removed kind/feature (Categorizes issue or PR as related to a new feature.) labels Jul 26, 2023
@andrewsykim (Member Author)

Thanks for retagging -- I also updated the release notes to reflect this

k8s-ci-robot added a commit that referenced this pull request Jul 28, 2023
…#118601-origin-release-1.25

Automated cherry pick of #118601: priority & fairness: support dynamic max seats
k8s-ci-robot added a commit that referenced this pull request Jul 28, 2023
…#118601-origin-release-1.27

Automated cherry pick of #118601: priority & fairness: support dynamic max seats
k8s-ci-robot added a commit that referenced this pull request Jul 28, 2023
…#118601-origin-release-1.26

Automated cherry pick of #118601: priority & fairness: support dynamic max seats
@jim-barber-he

This is a long shot, but I'm wondering if this change is related to some problems we've been having since moving from k8s 1.27.4 to 1.27.5.

A number of times since the upgrade we have had the kube-apiserver fail its liveness probes and get restarted.
This has never happened to us before, and we're wondering if some sort of throttling on the kube-apiserver has started to happen.

As far as I can tell we are not hitting rate limits on our kube-apiservers and the control-plane nodes have plenty of spare CPU and RAM.
We have --max-mutating-requests-inflight=400 and --max-requests-inflight=800 passed to kube-apiserver, but at the time of the problems a PromQL query of sum by (instance)(apiserver_current_inflight_requests) shows a maximum of 30 inflight requests to one of the API servers.

We also have the EventRateLimit admission plugin enabled for the apiserver with a configuration like so.

apiVersion: eventratelimit.admission.k8s.io/v1alpha1
kind: Configuration
limits:
- burst: 500
  qps: 250
  type: Namespace
- burst: 250
  qps: 50
  type: User

Queries like sum by (instance) (rate(apiserver_request_total{job="apiserver"}[1m])) suggest these limits are not close to being hit either.
There are no logs from the kube-apiserver pods suggesting any sort of event rate limiting has kicked in (although I don't know if there would be logs of that kind if rate limiting did kick in).

The cluster has roughly 35 nodes running around 750 pods.

Just looking through the changes in 1.27.5, this was one of them, and since it affects how requests to the kube-apiserver are handled, it seemed worth asking.

@andrewsykim (Member Author)

@jim-barber-he I have a hard time seeing how this change could impact the apiserver's liveness probe, but anything is possible haha.

Can you open a separate issue with the full apiserver logs (not just the ones about the event rate limiter)? That might include more details on why the liveness probe fails.

@jim-barber-he commented Sep 8, 2023

Thanks.
I have raised the following issue, which has a link to a gist for the logs:
#120510
