add Authorization tracking request/error counts and latency metrics #117211
Conversation
Please note that we're already in Test Freeze. Fast forwards are scheduled to happen every 6 hours; the most recent run was: Tue Apr 11 14:07:05 UTC 2023.
Hi @HirazawaUi. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cc @shyamjvs, issue owner, for initial review. Thank you
	resultLabel = successLabel
case authorized == authorizer.DecisionNoOpinion:
	// continue to the next authorizer
	resultLabel = successLabel
It will be quite confusing to return a success when the authorizer didn't make a decision. I would understand a `result="success"` label of `authorization_attempts_total` to represent an authorization, which is not the case with DecisionNoOpinion.
Maybe we could just have a `decision` label instead of `result`?
In the Authorize interface, the Authorize action does not exactly match the presence of an error: when the error is non-nil and the action is not allow, withAuthorization will return an error. That situation cannot be described by a `decision` label alone, so I don't think we can simply use a `decision` label instead of `result`.
The metrics should be divided into the following cases, in order:
- When the action is allow, the error is ignored and the count of the metric with `decision="allowed"` and `result="success"` is incremented.
- When the error is non-nil, the count of the metric with `result="error"` is incremented.
- When the action is forbid, the error is also ignored and the count of the metric with `decision="forbid"` and `result="success"` is incremented.
- When the action is NoOpinion, the count of the metric with `decision="no opinion"` is incremented.
The above is just my shallow understanding after reading the code; do you think this is reasonable?
Please review my latest commit; it follows the above logic.
Hmm, I see thank you for the clarification. I have the feeling that separating decision and result will just be confusing to the general users and that the difference is an implementation detail that users don't necessarily need to be aware of.
I have two suggestions to make it simpler for the users:
- a single metric: `authorization_attempts_total{result="allowed|forbidden|no-opinion|error"}`
- two metrics: `authorization_attempts_total{decision="allowed|forbidden|no-opinion|"}` and `authorization_attempt_errors_total{}`
The second is more idiomatic when computing error rate, since you don't have to do `requests_total - requests_total{error}` in order to get the total number of errors, but at the same time, I am not super happy with having an empty decision label in case of errors.
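To illustrate the first option, here is a minimal sketch using a plain map in place of a Prometheus counter vector: with a single `result`-labeled metric, the error rate is the `error` count over the sum across all labels. All names here are illustrative.

```go
package main

import "fmt"

// counter simulates a labeled counter vector for option 1:
// a single authorization_attempts_total{result=...} metric.
type counter map[string]float64

func (c counter) inc(result string) { c[result]++ }

func main() {
	attempts := counter{}
	for _, r := range []string{"allowed", "allowed", "forbidden", "error", "no-opinion"} {
		attempts.inc(r)
	}
	// Error rate = errors / total across every result label.
	var total float64
	for _, v := range attempts {
		total += v
	}
	fmt.Printf("error rate: %.2f\n", attempts["error"]/total)
}
```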
Thanks for your explanation. I agree with you and have made the changes using the first suggestion; please review the code.
authorizationDecisionAnnotationCounter = metrics.NewCounterVec(
	&metrics.CounterOpts{
		Name: "authorization_decision_annotations_total",
Could you please shed some light on the use case for this metric?
It is the metric for the decision label. The name here wasn't very accurate and has been fixed in the latest commit. I described its use case in my reply to the last comment; please review, and let me know whether it is reasonable.
Force-pushed from 8ff18f0 to ab8dc07 (compare)
Please change the PR title from Authentication -> Authorization for accuracy. /ok-to-test
/priority important-soon
@HirazawaUi IMO this change deserves a release note, to let users know that they can now track their authz webhooks through these new apiserver metrics.
LGTM label has been added. Git tree hash: 247a4837d2bc5b2cefc7d33caa2e1b2331bc5c63
@HirazawaUi could you perhaps update the release note with the correct metric names and description?
Okay, I've modified it
I still see the incorrect metric names in the release note - |
I recommend this for the release note:
@@ -38,6 +39,10 @@ const (
	successLabel = "success"
	failureLabel = "failure"
	errorLabel   = "error"

	allowedLabel   = "allowed"
	forbiddenLabel = "forbidden"
Suggested change:
- forbiddenLabel = "forbidden"
+ deniedLabel = "denied"
`deny|denied` is the language used in the Kubernetes authorization docs and the API to refer to this case, although I see that the `forbid|forbidden` language is used in the code. IMO we should align the metric name with the language in the public-facing docs and the API instead of the code.
Thanks for the suggestion; changing it to denied seems more reasonable.
//Follow the behavior in withAuthorization, even encounter evaluation errors and still allow the request
//Will not judge whether there is an error
Suggested change:
- //Follow the behavior in withAuthorization, even encounter evaluation errors and still allow the request
- //Will not judge whether there is an error
+ // In withAuthorization(), it is possible to allow the request and still encounter an evaluation error.
But also, doesn't this mean we should have a separate errors_total metric? Because in the case where there are errors and they are also allowed, we are not reporting them in metrics, right?
This was discussed with @dgrisonnet earlier. The conclusion was that whether to expose an error result for a request that was still authorized successfully depends on whether users should know that implementation detail of the code, and we both felt that users don't need to care about how it is implemented.
If my comment caused confusion, I'll remove it.
Okay, I've changed it as recommended
Force-pushed from 9a8fb80 to 104c208 (compare)
/retest (unrelated failure)
LGTM label has been added. Git tree hash: 7cf7a9b1bf4089ef2d9f63d155b09be97e8ff58f
/retest
This is great - thanks! /lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: dgrisonnet, HirazawaUi, wojtek-t. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
/retest
…tency metrics

Description: "Apiserver adds two new metrics `authorization_attempts_total` and `authorization_duration_seconds` that allow users to monitor requests to authorization webhooks, split by result." (PR description)

Upstream PR, Issue, KEP, etc. links:
* Taken from upstream Kubernetes PR kubernetes#117211, which, at the time this patch was added, had not been merged yet.
* Partially fixes Kubernetes Issue kubernetes#117167

If this patch is based on an upstream commit, how (if at all) do this patch and the upstream source differ? TBD after change is merged by upstream

If this patch's changes have not been added by upstream, why not? Change is proposed to upstream

Other patches related to this patch: None

Changes made to this patch after its initial creation and reasons for these changes: None

Kubernetes version this patch can be dropped: TBD after change is merged by upstream
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
add Authorization tracking request/error counts and latency metrics
Which issue(s) this PR fixes:
Fixes #117167
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: