Add authentication overall latency metrics #82409
Conversation
(force-pushed from 23e37ab to 48ac11b)
/retest
(force-pushed from 48ac11b to 0ea2476)
Minor comments, otherwise this looks good. Thanks!
(resolved review threads on staging/src/k8s.io/apiserver/pkg/endpoints/filters/authentication.go)
(force-pushed from 65a03ec to e71827b)
(resolved review thread on staging/src/k8s.io/apiserver/pkg/endpoints/filters/authentication.go)
&metrics.HistogramOpts{
    Name:    "authentication_duration_seconds",
    Help:    "Authentication duration in seconds broken out by result.",
    Buckets: prometheus.ExponentialBuckets(0.001, 2, 10),
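For orientation, here is a minimal sketch of how such a histogram could be defined and registered with the k8s.io/component-base/metrics wrappers. It is an illustration only, not the PR's actual code; the variable name and the init-time registration are assumptions.

```go
package filters

import (
	"k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

// authenticationLatency is a hypothetical name; the single "result" label
// follows the Help text in the snippet above.
var authenticationLatency = metrics.NewHistogramVec(
	&metrics.HistogramOpts{
		Name:    "authentication_duration_seconds",
		Help:    "Authentication duration in seconds broken out by result.",
		Buckets: metrics.ExponentialBuckets(0.001, 2, 10),
	},
	[]string{"result"},
)

func init() {
	// Expose the metric through the API server's legacy Prometheus registry.
	legacyregistry.MustRegister(authenticationLatency)
}
```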
This means the largest bucket is only just over a second, right? I can see webhook tail latency being much higher. 15 steps would give you over 30 seconds, which is generally the timeout for a request. Granted, I forget what we set the timeout for webhooks to.
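For reference, and as a standalone check rather than anything from this PR, ExponentialBuckets(start, factor, count) produces count upper bounds start, start*factor, ..., start*factor^(count-1), so the ranges for 10 versus 15 buckets can be printed directly:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// 10 buckets: 0.001s .. 0.512s; anything slower falls into +Inf.
	fmt.Println(prometheus.ExponentialBuckets(0.001, 2, 10))
	// 15 buckets: 0.001s .. 16.384s, matching the cluster dump later in the thread.
	fmt.Println(prometheus.ExponentialBuckets(0.001, 2, 15))
}
```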
Thanks for the information.
I couldn't agree more.
(force-pushed from 79aacf2 to 34c28f2)
/lgtm
on the metrics side of things.
Add alpha tags to authentication_attempts explicitly.
(force-pushed from 34c28f2 to 0c0d69e)
@enj @mikedanese Minor updates since the last review:
/test pull-kubernetes-integration
/test pull-kubernetes-integration (flaky test).
/lgtm
&metrics.HistogramOpts{
    Name:    "authentication_duration_seconds",
    Help:    "Authentication duration in seconds broken out by result.",
    Buckets: metrics.ExponentialBuckets(0.001, 2, 15),
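To illustrate where such a histogram would typically be observed, here is a hedged sketch of a timing wrapper around the authenticator. The wrapper name, the result labels, and the authenticationLatency variable (from the earlier sketch) are assumptions, not the PR's actual filter code.

```go
package filters

import (
	"net/http"
	"time"

	"k8s.io/apiserver/pkg/authentication/authenticator"
)

// withAuthenticationMetrics is a hypothetical wrapper showing how the
// authentication filter could time each attempt and label it by result.
func withAuthenticationMetrics(auth authenticator.Request) authenticator.Request {
	return authenticator.RequestFunc(func(req *http.Request) (*authenticator.Response, bool, error) {
		start := time.Now()
		resp, ok, err := auth.AuthenticateRequest(req)

		result := "success"
		if err != nil || !ok {
			result = "failure"
		}
		// Observe wall-clock latency in seconds under the matching result label.
		authenticationLatency.WithLabelValues(result).Observe(time.Since(start).Seconds())
		return resp, ok, err
	})
}
```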
This bucketing might be a little coarse depending on the actual latency. We may need to adjust.
Could you give me some specific suggestions?
If I read this right, the buckets are:
n | 0.001 n^2
1 | 0.001
2 | 0.004
3 | 0.009
4 | 0.016
5 | 0.025
6 | 0.036
7 | 0.049
8 | 0.064
9 | 0.081
10 | 0.1
11 | 0.121
12 | 0.144
13 | 0.169
14 | 0.196
15 | 0.225
What if authentication takes a second or ten seconds? It'll get put in the 15th bucket along with the requests that took 0.2 seconds. In my experience, I've never been thrilled with exponential bucketing. Unless you actually have a good idea of performance before setting the parameters, it tends to cause weird data issues when you actually want to know how long things take.
The buckets don't seem to be what you think. Wait a moment, I'll dump them from my local cluster.
This comes from my local cluster.
# HELP authentication_duration_seconds [ALPHA] Authentication duration in seconds broken out by result.
# TYPE authentication_duration_seconds histogram
authentication_duration_seconds_bucket{result="success",le="0.001"} 1143
authentication_duration_seconds_bucket{result="success",le="0.002"} 1143
authentication_duration_seconds_bucket{result="success",le="0.004"} 1144
authentication_duration_seconds_bucket{result="success",le="0.008"} 1144
authentication_duration_seconds_bucket{result="success",le="0.016"} 1144
authentication_duration_seconds_bucket{result="success",le="0.032"} 1144
authentication_duration_seconds_bucket{result="success",le="0.064"} 1144
authentication_duration_seconds_bucket{result="success",le="0.128"} 1144
authentication_duration_seconds_bucket{result="success",le="0.256"} 1144
authentication_duration_seconds_bucket{result="success",le="0.512"} 1144
authentication_duration_seconds_bucket{result="success",le="1.024"} 1144
authentication_duration_seconds_bucket{result="success",le="2.048"} 1144
authentication_duration_seconds_bucket{result="success",le="4.096"} 1144
authentication_duration_seconds_bucket{result="success",le="8.192"} 1144
authentication_duration_seconds_bucket{result="success",le="16.384"} 1144
authentication_duration_seconds_bucket{result="success",le="+Inf"} 1144
authentication_duration_seconds_sum{result="success"} 0.14704390199999995
authentication_duration_seconds_count{result="success"} 1144
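A quick sanity check on the dump: the sum divided by the count gives an average of roughly 0.147 s / 1144 ≈ 0.13 ms per authentication, which is why essentially every observation lands in the smallest 1 ms bucket.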
@enj pointed out a similar issue, so I enlarged the bucket count from 10 to 15.
Oops, I did n^2 instead of 2^n. Those look fine for now although we may want higher density above 1 second in the future.
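To make the mix-up concrete, a small standalone snippet (not from the PR) prints the mistaken 0.001*n^2 reading next to the 0.001*2^(n-1) values that ExponentialBuckets actually produces:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	for n := 1; n <= 15; n++ {
		nSquared := 0.001 * float64(n*n)                  // mistaken reading: 0.225s at n=15
		exponential := 0.001 * math.Pow(2, float64(n-1)) // actual bucket bound: 16.384s at n=15
		fmt.Printf("%2d  %.3f  %.3f\n", n, nSquared, exponential)
	}
}
```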
OK. By the way, most latency metrics use the same buckets.
Now let's ask @liggitt to help review and approve.
cc @brancz @logicalhan on bucket count/size
Bucket count/sizes look okay to me.
/assign @liggitt
/lgtm
I'll defer to @kubernetes/sig-instrumentation-pr-reviews on bucket size/count.
/approve
/hold
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: liggitt, mikedanese, RainbowMango
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
I don't know the surrounding code too well. Is this metric describing a network request or largely in-process work? If it's a network request, then these buckets are fine.
It is recording metrics for the authentication filter, which can involve a network request (to check a credential stored in etcd or verify a token against a remote webhook).
/hold
/hold cancel
What type of PR is this?
/kind feature
What this PR does / why we need it:
Which issue(s) this PR fixes:
Part of #81028.
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
/assign @mikedanese
/cc @enj
/sig auth