
priority & fairness: support dynamic max seats #118601

Merged 1 commit on Jul 18, 2023

Conversation

@andrewsykim (Member) commented Jun 10, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

'Max seats' calculated by the work estimator is currently hard-coded to 10. When using lower values for --max-requests-inflight, a single LIST request taking up 10 seats could end up using most, if not all, seats in the priority level. This change updates the work estimator so that 'max seats' is now calculated as MIN(0.15 x nominalCL, nominalCL / handSize), with an upper limit of 10. This ensures the seats taken by any request are proportional to the total available seats.
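For illustration, here is a minimal sketch of the calculation described above. The function and constant names are hypothetical, not the actual kube-apiserver code (the real WorkEstimatorConfig also carries a MinimumSeats value, as seen in the diff snippets below):

```go
package main

import (
	"fmt"
	"math"
)

// maximumSeatsUpperBound keeps the previous hard cap of 10 seats.
const maximumSeatsUpperBound = 10

// dynamicMaxSeats sketches the formula from this PR:
// maxSeats = min(0.15 * nominalCL, nominalCL / handSize), capped at 10.
func dynamicMaxSeats(nominalCL, handSize int) uint64 {
	seats := math.Min(0.15*float64(nominalCL), float64(nominalCL)/float64(handSize))
	maxSeats := uint64(math.Ceil(seats))
	if maxSeats > maximumSeatsUpperBound {
		maxSeats = maximumSeatsUpperBound
	}
	if maxSeats < 1 {
		maxSeats = 1
	}
	return maxSeats
}

func main() {
	// e.g. a priority level with nominalCL=13, handSize=6:
	// min(0.15*13, 13/6) = min(1.95, 2.17) -> ceil -> 2 seats
	fmt.Println(dynamicMaxSeats(13, 6)) // 2
	// a large priority level: min(36.75, 30.63) -> capped at 10
	fmt.Println(dynamicMaxSeats(245, 8)) // 10
}
```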

Which issue(s) this PR fixes:

ref #115409

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Update kube-apiserver's priority & fairness work estimator such that 'max seats' is MIN(0.15 x nominalCL, nominalCL / handSize)

This fixes a bug where clients with requests using hand size x max seats greater than the nominal concurrency limit can starve other requests in the same priority level.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot added do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress.), release-note (Denotes a PR that will be considered when it comes time to generate release notes.), kind/feature (Categorizes issue or PR as related to a new feature.), size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.), cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.) labels Jun 10, 2023
@k8s-ci-robot added area/apiserver, sig/api-machinery (Categorizes an issue or PR as relevant to SIG API Machinery.), needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.), needs-priority (Indicates a PR lacks a `priority/foo` label and requires one.) labels Jun 10, 2023
@andrewsykim (Member Author)

/retest

@andrewsykim changed the title from "[WIP] kube-apiserver: max seats calculated by APF work estimator should be no more than 1% of total inflight requests" to "kube-apiserver: max seats calculated by APF work estimator should be no more than 1% of total inflight requests" Jun 12, 2023
@k8s-ci-robot removed the do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress.) label Jun 12, 2023
if maximumSeats == 0 || maximumSeats > maximumSeatsUpperBound {
	maximumSeats = maximumSeatsUpperBound
}

return &WorkEstimatorConfig{
	MinimumSeats: minimumSeats,
	MaximumSeats: maximumSeats,
Member Author

Debating whether it's worthwhile to have separate min/max seat configuration for ListWorkEstimatorConfig and MutatingWorkEstimatorConfig -- that way this change only applies to the LIST estimator, which we know is more problematic at the moment. It's possible the mutating work estimator has similar issues, but we just don't have the feedback yet since it was introduced in a newer version.

Member Author

Leaning towards separating the config and revisiting once we have more feedback on watches.

Member

I lean the other way. The problem affects both. I would not do the work to separate them unless we have a specific need to.

@andrewsykim (Member Author)

/assgin @wojtek-t @MikeSpreitzer

@andrewsykim (Member Author)

/assign @wojtek-t @MikeSpreitzer

-func DefaultWorkEstimatorConfig() *WorkEstimatorConfig {
+func DefaultWorkEstimatorConfig(serverConcurrencyLimit int) *WorkEstimatorConfig {
+	// By default maximumSeats is 1% of total max inflight requests with an upper bound of 10
+	maximumSeats := uint64(math.Ceil((float64(serverConcurrencyLimit) * maximumSeatsFractionOfTotal)))
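To make the review discussion that follows concrete, here is a rough illustration of what the 1%-with-a-cap-of-10 calculation in the snippet above yields for a few server concurrency limits. The standalone program below is illustrative only; the constant values mirror the snippet, and the names are assumed:

```go
package main

import (
	"fmt"
	"math"
)

// Values mirroring the snippet above: maximumSeats is 1% of the
// server-wide concurrency limit, clamped to an upper bound of 10.
const (
	maximumSeatsFractionOfTotal = 0.01
	maximumSeatsUpperBound      = 10
)

func main() {
	for _, limit := range []int{50, 100, 600, 2000} {
		seats := uint64(math.Ceil(float64(limit) * maximumSeatsFractionOfTotal))
		if seats == 0 || seats > maximumSeatsUpperBound {
			seats = maximumSeatsUpperBound
		}
		// 50 -> 1, 100 -> 1, 600 -> 6, 2000 -> 10 (capped)
		fmt.Printf("serverConcurrencyLimit=%4d -> maximumSeats=%d\n", limit, seats)
	}
}
```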
Member

While I understand the motivation behind this change (and I guess I agree with it), I'm not convinced the change as it stands will not bring more damage than gain...

  1. We are running smallest control-planes with max-inflights < 100 - this change automatically means that every single request will occupy 1 seat - is that really what we want?

  2. We run large control-planes (in particular scalability tests) with < 1000 inflights, so we will never reach 10 max-seats for them. I'm fairly sure it's not really what we want.

So I guess I have two main concerns about this change:

  1. 1% seems too low to me
  2. A second question is whether we want to set this based on total max seats, or rather per priority level (where by per-priority-level in the borrowing world I rather mean something like "the max possible in-flights for a given priority level assuming I borrow everything I can", or something like that - so that it doesn't change frequently).

@MikeSpreitzer

Member Author

> We are running smallest control-planes with max-inflights < 100 - this change automatically means that every single request will occupy 1 seat - is that really what we want?

This is the idea, but maybe 1% is too aggressive. My assumption here is that most small control planes are not really using up to 10 seats anyway because they don't have a lot of objects stored, and this change will only really impact the medium-to-large clusters that are running hundreds or thousands of Pods.

But for example's sake, let's say we have a small control plane with --max-requests-inflight=100 and 500 pods: a single LIST would use up all seats in priority levels like global-default (assuming it has 10% of total seats), which I don't think we want either. Even looking at workload-low (assuming 40% of total seats) -- it would only take 4 daemonsets listing pods with a field selector to use all of workload-low. Even with borrowing, I'm skeptical it will alleviate seat capacity much.

> We run large control-planes (in particular scalability tests) with < 1000 inflights, so we will never reach 10 max-seats for them. I'm fairly sure it's not really what we want.

Actually, could we run this scalability test on this PR? Yes, max seats will never reach 10, but it will still be a single-digit value based on what is used in scalability tests. I'm curious whether the impact will be noticeable if max seats is reduced from 10 to 7 (as an example).

Member

> Actually, could we run this scalability test on this PR? Yes, max seats will never reach 10, but it will still be a single-digit value based on what is used in scalability tests. I'm curious whether the impact will be noticeable if max seats is reduced from 10 to 7 (as an example).

Unfortunately, the scalability test doesn't do a good enough job of saturating the control plane, so I don't think we will get useful data from it...

Now thinking more about my concerns:

  1. Let's say someone sets inflight to 50 - then 4% has exactly the same problem. And if we bump it too high, then it will become ineffective.
    So maybe my concern boils down to the fact that a single value is not what we want.
    We may need a function that takes total inflights as a parameter and returns a percent?

For low numbers it returns something higher, for higher numbers it decreases.
That said, I don't have a good suggestion for how to create this function in a more data-driven way...

> But for example's sake, let's say we have a small control plane with --max-requests-inflight=100 and 500 pods

That's unclear to me - listing 500 pods is quite a bit of work too.
But I can definitely imagine that this value:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/util/flowcontrol/request/config.go#L28
is just too low once we run better tests...

So yet another option would be to also compute this number based on max-requests-inflight.

  1. For that, I agree with Mike that my concern is much more related to the issue about the fact that max-width can exceed the allocated capacity of one priority level.
    I guess I'm fine with having that discussion separately from this PR.

@wojtek-t (Member)

/hold

To have some discussion on it.

@k8s-ci-robot added the do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command.) label Jun 12, 2023
@MikeSpreitzer (Member)

I agree that 1% seems too low. I would use the total server capacity, which is max-mutating + max-readonly. If we were to use 17% and round then that would give the same behavior as today in the default configuration (200 mutating + 400 readonly).

We already have an issue about the fact that max-width can exceed the allocated capacity of one priority level. Let's address that fully and head-on, as a separate issue.

@MikeSpreitzer (Member)

Oh, wait, I was off by a factor of 10. It would be 1.7%. Maybe use millis instead of percent? Or 2% and accept the change to 12?

@alexzielenski (Contributor)

/triage accepted

@k8s-ci-robot added triage/accepted (Indicates an issue or PR is ready to be actively worked on.) and removed needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.) labels Jun 13, 2023
@wojtek-t (Member)

> Oh, wait, I was off by a factor of 10. It would be 1.7%. Maybe use millis instead of percent? Or 2% and accept the change to 12?

@MikeSpreitzer - I think 2% is still too low (unless we make it configurable).
For the smallest/lightest clusters we're setting the total in-flights to <100, so 2% of it effectively means <2...

@MikeSpreitzer (Member)

We could set maxWidth to round(sqrt(serverCapacity/6.0)). For a server capacity (max mutating inflight + max read-only inflight) of 600, that gives the current value of 10. For a server capacity of 50, that gives a value of 3. For a server capacity of 1000, that gives a value of 13.
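For reference, a quick check of the numbers quoted in this proposal (this sqrt-based formula was one alternative floated in the discussion, not what the PR ultimately merged):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// round(sqrt(serverCapacity / 6.0)) for the capacities mentioned above.
	for _, capacity := range []float64{50, 600, 1000} {
		maxWidth := math.Round(math.Sqrt(capacity / 6.0))
		// 50 -> 3, 600 -> 10, 1000 -> 13
		fmt.Printf("server capacity %4.0f -> maxWidth %.0f\n", capacity, maxWidth)
	}
}
```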

…ax seats

Max seats from priority & fairness work estimator is now min(0.15 x
nominalCL, nominalCL/handSize)

'Max seats' calculated by work estimator is currently hard coded to 10.
When using lower values for --max-requests-inflight, a single
LIST request taking up 10 seats could end up using all if not most seats in
the priority level. This change updates the default work estimator
config such that 'max seats' is at most 10% of the
maximum concurrency limit for a priority level, with an upper limit of 10.
This ensures seats taken from a LIST request are proportional to the total
available seats.

Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
@MikeSpreitzer (Member) left a comment

/lgtm
Thanks!

@k8s-ci-robot added the lgtm ("Looks good to me", indicates that a PR is ready to be merged.) label Jul 17, 2023
@k8s-ci-robot (Contributor)

LGTM label has been added.

Git tree hash: 37313f51268123ef1ccb19ddb23c61e8d993b705

@MikeSpreitzer (Member)

/assign @deads2k

@wojtek-t (Member)

/lgtm
/approve

/hold cancel

Thanks!

@k8s-ci-robot removed the do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command.) label Jul 18, 2023
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andrewsykim, MikeSpreitzer, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@wojtek-t (Member)

After discussing it, we consider this more of a bug fix than a feature. We will be backporting it, so re-tagging.

/kind bug
/remove-kind feature

@k8s-ci-robot added kind/bug (Categorizes issue or PR as related to a bug.) and removed kind/feature (Categorizes issue or PR as related to a new feature.) labels Jul 26, 2023
@andrewsykim (Member Author)

Thanks for retagging -- I also updated the release notes to reflect this

k8s-ci-robot added a commit that referenced this pull request Jul 28, 2023
…#118601-origin-release-1.25

Automated cherry pick of #118601: priority & fairness: support dynamic max seats
k8s-ci-robot added a commit that referenced this pull request Jul 28, 2023
…#118601-origin-release-1.27

Automated cherry pick of #118601: priority & fairness: support dynamic max seats
k8s-ci-robot added a commit that referenced this pull request Jul 28, 2023
…#118601-origin-release-1.26

Automated cherry pick of #118601: priority & fairness: support dynamic max seats
@jim-barber-he

This is a long shot, but I'm wondering if this change is related to some problems we've been having since moving from k8s 1.27.4 to 1.27.5.

A number of times since the upgrade we have had the kube-apiserver fail its liveness probes and get restarted.
This has never happened to us before, and we're wondering if some sort of throttling on the kube-apiserver has started to happen.

As far as I can tell we are not hitting rate limits on our kube-apiservers and the control-plane nodes have plenty of spare CPU and RAM.
We have --max-mutating-requests-inflight=400 and --max-requests-inflight=800 passed to kube-apiserver, but at the time of the problems a PromQL query of sum by (instance)(apiserver_current_inflight_requests) shows a maximum of 30 inflight requests to one of the API servers.

We also have the EventRateLimit admission plugin enabled for the apiserver with a configuration like so.

apiVersion: eventratelimit.admission.k8s.io/v1alpha1
kind: Configuration
limits:
- burst: 500
  qps: 250
  type: Namespace
- burst: 250
  qps: 50
  type: User

Queries like sum by (instance) (rate(apiserver_request_total{job="apiserver"}[1m])) suggest these limits are not close to being hit either.
There are no logs from the kube-apiserver pods suggesting any sort of event rate limiting has kicked in (although I don't know if there would be logs of that kind if rate limiting did kick in).

The cluster has roughly 35 nodes running around 750 pods.

Just looking through the changes in 1.27.5, this was one of them, and since it affects how requests to the kube-apiserver are handled, it seemed worth asking.

@andrewsykim (Member Author)

@jim-barber-he I have a hard time seeing how this change could impact the apiserver's liveness probe, but anything is possible haha.

Can you open a separate issue with the full apiserver logs (not just the ones about the event rate limiter)? That might include more details on why the liveness probe fails.

@jim-barber-he commented Sep 8, 2023

Thanks.
I have raised the following issue, which has a link to a gist for the logs:
#120510
