KEP-4793: Revise APF Default Configuration #4795
linxiulei commented on Aug 19, 2024
- One-line PR description: Revise APF Default Configuration
- Issue link: Revise APF default configuration #4793
- Other comments:
/cc @MikeSpreitzer
Just briefly skimmed through the proposal (without looking into PRR, upgrades, etc.)
keps/sig-api-machinery/4793-revise-apf-default-configuration/README.md
- There will be no fundamental changes in APF's own implementation

## Proposal
The KEP would benefit from a high-level description of how borrowing works. Basically, it should show that the nominal shares serve a bit like "starting values": the system is not necessarily trying to get to those values, but rather accommodates the incoming traffic.
Maybe even some simple example would be useful (that's probably true also for the generic APF documentation - many people are making false assumptions about how borrowing works... - @MikeSpreitzer )
@wojtek-t: I could submit an update to https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/1040-priority-and-fairness/README.md; is that what you are suggesting? Or would it go somewhere in the website (https://kubernetes.io/docs/concepts/cluster-administration/flow-control/ , or a new page?)?
Added description for APF and borrowing
@MikeSpreitzer - I was thinking about the website - keps are for devs only and it would be useful for cluster administrators.
The example I was thinking about (although feel free to suggest other examples) was the following:
- let's say we have a default cluster configuration, no requests
- suddenly we start getting an infinite (for some definition of infinite :) ) number of requests in one of the PLs
- how will the nominal shares of each PL change?
- then those requests stop coming
- what will happen now?
@linxiulei - this description is generic enough that I don't think it brings much value currently...
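A rough sketch of the scenario in the list above, offered as an illustration rather than a description of the actual implementation. It assumes the seat formulas given in the `LimitedPriorityLevelConfiguration` API documentation (nominal seats are the server limit split by shares and rounded up; lendable and borrowable seats are percentages of the nominal seats). The server limit of 250 seats, the restriction to three priority levels, and all helper names are assumptions made for this example.

```python
def ceil_div(a: int, b: int) -> int:
    """Ceiling of a / b for non-negative a and positive b."""
    return -(-a // b)


def round_pct(value: int, percent: int) -> int:
    """value * percent / 100, rounded half up, using integer arithmetic."""
    return (value * percent + 50) // 100


def seat_bounds(server_cl, levels):
    """Return {name: (min_cl, nominal_cl, max_cl)} for each priority level.

    levels maps name -> (nominal_shares, lendable_percent, borrowing_limit_percent).
    borrowing_limit_percent is None when borrowing is unlimited, in which
    case max_cl is None as well.
    """
    total_shares = sum(shares for shares, _, _ in levels.values())
    bounds = {}
    for name, (shares, lendable, borrow_limit) in levels.items():
        nominal_cl = ceil_div(server_cl * shares, total_shares)
        min_cl = nominal_cl - round_pct(nominal_cl, lendable)
        max_cl = None
        if borrow_limit is not None:
            max_cl = nominal_cl + round_pct(nominal_cl, borrow_limit)
        bounds[name] = (min_cl, nominal_cl, max_cl)
    return bounds


def peak_for_busy_level(busy, bounds):
    """Most seats `busy` can reach while every other level is idle.

    Idle levels can lend at most (nominal_cl - min_cl) seats each.
    """
    nominal = bounds[busy][1]
    lent = sum(nom - lo for name, (lo, nom, _) in bounds.items() if name != busy)
    peak = nominal + lent
    max_cl = bounds[busy][2]
    return peak if max_cl is None else min(peak, max_cl)


# Hypothetical cluster: 250 seats, three levels with the current default shares.
levels = {
    "workload-low": (100, 90, None),
    "global-default": (20, 50, None),
    "catch-all": (5, 0, None),
}
bounds = seat_bounds(250, levels)
print(bounds["workload-low"])                    # (20, 200, None)
print(peak_for_busy_level("catch-all", bounds))  # 210
```

Under these assumptions, a flood of requests into `catch-all` could push its limit from 10 seats up to 210 while the other levels are idle. Once the flood stops, the limit is no longer pushed up by demand and can shrink again, but never below the level's minimum (10 seats here, since nothing is lendable).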
keps/sig-api-machinery/4793-revise-apf-default-configuration/README.md
- group:
    name: '*'
  kind: Group
- user:
    name: '*'
  kind: User
https://github.com/kubernetes/api/blob/v0.31.0/flowcontrol/v1/types.go#L207 is the recommended way to say "match regardless of subject".
Thanks, I think only system:authenticated is needed here. Let catch-all handle system:unauthenticated.
resources:
- events
verbs:
- '*'
Maybe we want to put only Event creations in this jail? I am thinking of an admin who wants to debug something while the cluster is under extreme duress. Maybe also some kind(s) of monitoring?
Makes sense to me. But I feel we need to put all modification verbs here (i.e. create/update/delete), since an admin should not be taking those actions during extreme stress.
Why only those verbs?
@wojtek-t please see the comments here. I also added an explanation to make it clear.
Name            | Nominal Shares | Lendable | Proposed Borrowing Limit
--------------- | -------------: | -------: | -----------------------:
exempt          |              0 |      50% |                     none
Given that proposed borrowing limit is always "none" - I suggest removing this column completely.
Also, I suggest adding one more column being "guaranteed shares" (how many shares we will always have) [well, modulo exempt borrowing from others] so that we can see how that changes.
Also - provide a sum - whether it changes.
Done: merged the tables and added more columns.
The borrowing limit is not none for event because I want to keep it from borrowing so much that other PLs can borrow only a little. A better solution would be weighted borrowing.
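For reference, the "guaranteed shares" column suggested above can be reproduced with one line of arithmetic. This sketch assumes that guaranteed shares mean the non-lendable part of the nominal shares; `guaranteed_shares` is a hypothetical helper, and the inputs are the current default values for workload-low, global-default and catch-all.

```python
def guaranteed_shares(nominal_shares: int, lendable_percent: int) -> float:
    """Shares a priority level keeps even when it lends out everything it can.

    Assumption: guaranteed = nominal * (100 - lendable%) / 100.
    """
    return nominal_shares * (100 - lendable_percent) / 100


# Current defaults: (nominal shares, lendable percent).
current = {
    "workload-low": (100, 90),
    "global-default": (20, 50),
    "catch-all": (5, 0),
}
for name, (nominal, lendable) in current.items():
    print(name, guaranteed_shares(nominal, lendable))
```

This works in share units only; converting to seats additionally depends on the server-wide concurrency limit and the sum of all nominal shares.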
keps/sig-api-machinery/4793-revise-apf-default-configuration/README.md
- nonResourceRules:
  resourceRules:
  - apiGroups:
    - events.k8s.io
Most components are still using core events.
Added, didn't know that. Should we just specify "*" here since the resource is "Event" already?
keps/sig-api-machinery/4793-revise-apf-default-configuration/README.md
workload-low | 100 | 90% | none | 10 | 40 | 75% | none | 10
global-default | 20 | 50% | none | 10 | 10 | 50% | none | 10
catch-all | 5 | 0% | none | 5 | 5 | 0% | none | 5
event | NA (new) | NA (new) | NA (new) | NA (new) | 5 | 0% | 100% | 5
I'm wondering if we shouldn't adjust things in a way that SUM remains the same as it was exactly.
In other words if we provide some shares for this PL, subtract it from somewhere else.
Also - why are these shares guaranteed? You wrote explicitly that events are best-effort, so I would expect these shares to be fully lendable.
There are two things we could keep the same:
- the guaranteed shares
- the nominal shares
But we can't keep both the same. In the current version, I tend to keep the guaranteed shares the same, as suggested by @MikeSpreitzer, because it is really difficult to leave the nominal shares unchanged when we want to increase shares for more critical PLs while minimizing waste. For example, we have to significantly increase the nominal shares for leader-election; to compensate for this and keep the same SUM of all nominal shares, we would have to reduce the nominal shares of less critical PLs such as workload-high/low even more significantly (in the current KEP they are already reduced significantly, so there is little room for further reduction).
re: events, I simply copied catch-all, as I think events should have at least a minimal guarantee. But I don't have a strong opinion on it.
might increase or decrease according to actual usage with borrowing. However,
the owner priority level should have full Nominal Share as its current
concurrency share if in full utilization.
My main question that is not touched on in this KEP is: how are we going to test it/validate that we will not visibly break someone.
I can easily buy that there are cases where this configuration helps a lot.
But I don't know how we mitigate a negative impact on some other setups.
Good callout, and this is very difficult: this KEP increases the SUM of nominal shares, so it will definitely dilute any custom APF configuration and therefore have visible user impact.
I'm thinking out loud -- we can only be certain that it won't break the scalability and performance characteristics that k8s promises (e.g. https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md), since there are tests to verify them. Then we will be cautious about graduating this KEP, so that we have user feedback to evaluate the risk of breaking existing configurations and can provide mitigations. Though I admit that this is not ideal.
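To make the dilution concern concrete, here is a small arithmetic sketch. It assumes nominal seats are computed as ceil(ServerCL * shares / sum of all shares), as in the APF API documentation. The server limit of 600, the custom priority level with 30 shares, and the share sums of 250 and 300 are illustrative values, not figures from this KEP.

```python
def nominal_seats(server_cl: int, shares: int, total_shares: int) -> int:
    """Seats nominally assigned to one priority level (rounded up)."""
    return -((-server_cl * shares) // total_shares)


SERVER_CL = 600          # hypothetical server-wide concurrency limit
CUSTOM_SHARES = 30       # hypothetical custom priority level

before = nominal_seats(SERVER_CL, CUSTOM_SHARES, total_shares=250)
after = nominal_seats(SERVER_CL, CUSTOM_SHARES, total_shares=300)
print(before, after)     # 72 60
```

With these made-up numbers, a custom priority level that was never touched by its owner loses a sixth of its nominal seats simply because the defaults around it grew.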
> since there are tests to verify.
Tests are never perfect.
> Then we will be precautious on graduating this KEP to have user feedback to evaluate the risk of breaking existing configuration and provide mitigations. Though I admit that this is not ideal.
Until this is enabled by default, no one uses it. Once we enable it by default, it is too late, because we have already broken someone.
This is not a valid mitigation strategy...
Signed-off-by: Eric Lin <exlin@google.com>