Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-4793: Revise APF Default Configuration #4795

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

linxiulei
Copy link

  • One-line PR description: Revise APF Default Configuration
  • Other comments:

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: linxiulei
Once this PR has been reviewed and has the lgtm label, please assign fedebongio for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Aug 19, 2024
@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Aug 19, 2024
@linxiulei linxiulei mentioned this pull request Aug 19, 2024
4 tasks
@linxiulei
Copy link
Author

/cc @MikeSpreitzer

Copy link
Member

@wojtek-t wojtek-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just briefly skimmed through the proposal (without looking into PRR, upgrades, etc.)

- There will no fundamental changes in APF's own implementation

## Proposal

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The KEP would benefit from a high-level description of how borrowing works. Basically to show that the nominal shares serve a bit like "starting values", but the system is not necessary trying to get to those values, but rather is accommodating to the incoming traffic.

Maybe even some simple example would be useful (that's probably true also for the generic APF documentation - many people are making false assumptions about how borrowing works... - @MikeSpreitzer )

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added description for APF and borrowing

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MikeSpreitzer - I was thinking about the website - keps are for devs only and it would be useful for cluster administrators.
The example I was thinking about (although feel free to suggest other examples) was the following:

  • let's say we have a default cluster configuration, no requests
  • suddenly we start getting infinite (for some definitely of infinite :) ) number of requests in one of the PLs
  • how the nominal shares for each PLs will be changing
  • then those requests stop coming
  • what will happen now?

@linxiulei - this description is generic enough that I don't think it brings much value currently...

Comment on lines 297 to 302
- group:
name: '*'
kind: Group
- user:
name: '*'
kind: User
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/kubernetes/api/blob/v0.31.0/flowcontrol/v1/types.go#L207 is the recommended way to say "match regardless of subject".

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I think only system:authenticated is needed here. Let catch-all handle system:unauthenticated

resources:
- events
verbs:
- '*'
Copy link
Member

@MikeSpreitzer MikeSpreitzer Aug 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we want to put only Event creations in this jail? I am thinking of an admin who wants to debug something while the cluster is under extreme duress. Maybe also some kind(s) of monitoring?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense to me. But I feel we need put all modification verbs here (i.e. create/update/delete), which should not be actions for an admin during extreme stress.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why only those verbs?

@wojtek-t pls see comments here. I also added explanation to make it clear.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Aug 20, 2024
- There will no fundamental changes in APF's own implementation

## Proposal

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MikeSpreitzer - I was thinking about the website - keps are for devs only and it would be useful for cluster administrators.
The example I was thinking about (although feel free to suggest other examples) was the following:

  • let's say we have a default cluster configuration, no requests
  • suddenly we start getting infinite (for some definitely of infinite :) ) number of requests in one of the PLs
  • how the nominal shares for each PLs will be changing
  • then those requests stop coming
  • what will happen now?

@linxiulei - this description is generic enough that I don't think it brings much value currently...


Name | Nominal Shares | Lendable | Proposed Borrowing Limit
--------------- | -------------: | -------: | -----------------------:
exempt | 0 | 50% | none
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that proposed borrowing limit is always "none" - I suggest removing this column completely.

Also, I suggest adding one more column being "guaranteed shares" (how many shares we will always have) [well, modulo exempt borrowing from others] so that we can see how that changes.

Also - provide a sum - whether it changes.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with merging and more columns.

Borrowing Limit is not none for Event because I want to avoid it borrowing too many and result in other PLs borrowing little. But a better solution is to have weighted borrowing.

- nonResourceRules:
resourceRules:
- apiGroups:
- events.k8s.io
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of components are still using core events.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added, didn't know that. Should we just specify "*" here since the resource is "Event" already?

@linxiulei linxiulei force-pushed the revise_apf branch 3 times, most recently from 9d5f261 to 0681509 Compare August 21, 2024 14:08
workload-low | 100 | 90% | none | 10 | 40 | 75% | none | 10
global-default | 20 | 50% | none | 10 | 10 | 50% | none | 10
catch-all | 5 | 0% | none | 5 | 5 | 0% | none | 5
event | NA (new) | NA (new) | NA (new) | NA (new) | 5 | 0% | 100% | 5
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we shouldn't adjust things in a way that SUM remains the same as it was exactly.
In other words if we provide some shares for this PL, subtract it from somewhere else.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also - why these shares are guaranteed? You wrote explicitly that events are best-effort, so I would expect these shares to be fully lendable.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two things we can remain the same:

  1. the guaranteed shares
  2. the nominal shares

But we can't keep both the same. In current version, I tend to keep the guaranteed shares the same as suggested by @MikeSpreitzer because it's really difficult to not change nominal shares when we want to increase shares for more critical PLs while minimizing wastes. For example, we have to significant increase nominal shares for leader-election, to compensate this and keep the same SUM of all nominal shares, we have to more significantly reduce the nominal shares of less critical PLs such as workload-high/low (in current KEP they are already reduced significantly so little room to further reduce).

re: events, I simply copied catch-all as I think events should have minimal guarantee at least. But I don't have strong opinion on it.

might increase or decrease according to actual usage with borrowing. However,
the owner priority level should have full Nominal Share as its current
concurrency share if in full utilization.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main question that is not touched on in this KEP is: how are we going to test it/validate that we will not visibly break someone.

I can easily buy that there are cases where this configuration helps a lot.
But I don't know how we mitigate a negative impact on some other setups.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good callout and this is very difficult as this KEP increases the SUM of nominal shares so it will definite dilute any custom APF configuration therefore visible user impacts.

I'm thinking out loud -- we can be only certain that it won't break k8s-promised scalability and performance characteristic (e.g. https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md) since there are tests to verify. Then we will be precautious on graduating this KEP to have user feedback to evaluate the risk of breaking existing configuration and provide mitigations. Though I admit that this is not ideal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since there are tests to verify.

Tests are never perfect.

Then we will be precautious on graduating this KEP to have user feedback to evaluate the risk of breaking existing configuration and provide mitigations. Though I admit that this is not ideal.

Up until this is enabled by default, noone uses it. If we enable it by default, it's too late, because we already break someone.
This is not a valid mitigation strategy....

Signed-off-by: Eric Lin <exlin@google.com>
@k8s-ci-robot
Copy link
Contributor

@linxiulei: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-enhancements-verify a927005 link true /test pull-enhancements-verify

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants