Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add authentication overall latency metrics #82409

Conversation

RainbowMango
Copy link
Member

What type of PR is this?
/kind feature

What this PR does / why we need it:

Which issue(s) this PR fixes:
Part of #81028.

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Added metrics 'authentication_latency_seconds' that can be used to understand the latency of authentication.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

/assign @mikedanese
/cc @enj
/sig auth

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. sig/auth Categorizes an issue or PR as relevant to SIG Auth. labels Sep 6, 2019
@k8s-ci-robot k8s-ci-robot requested a review from enj September 6, 2019 09:51
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/apiserver sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Sep 6, 2019
@RainbowMango RainbowMango force-pushed the pr_add_authentication_overall_latency_metrics branch from 23e37ab to 48ac11b Compare September 6, 2019 10:31
@RainbowMango
Copy link
Member Author

/retest

@RainbowMango RainbowMango force-pushed the pr_add_authentication_overall_latency_metrics branch from 48ac11b to 0ea2476 Compare September 6, 2019 12:55
Copy link
Member

@mikedanese mikedanese left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments, otherwise this looks good. Thanks!

@RainbowMango RainbowMango force-pushed the pr_add_authentication_overall_latency_metrics branch 3 times, most recently from 65a03ec to e71827b Compare September 9, 2019 02:56
&metrics.HistogramOpts{
Name: "authentication_duration_seconds",
Help: "Authentication duration in seconds broken out by result.",
Buckets: prometheus.ExponentialBuckets(0.001, 2, 10),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means the largest bucket is only just over a second right? I can see webhook tail latency being much higher. 15 steps would give you over 30 seconds which is generally the timeout for a request. Granted I forget what we set the timeout for webhooks to.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your information.
I can't agree more.

@RainbowMango RainbowMango force-pushed the pr_add_authentication_overall_latency_metrics branch 2 times, most recently from 79aacf2 to 34c28f2 Compare September 9, 2019 08:38
Copy link
Member

@logicalhan logicalhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

on the metrics-side of things.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 9, 2019
Add alpha tags to authentication_attempts explicitly.
@RainbowMango RainbowMango force-pushed the pr_add_authentication_overall_latency_metrics branch from 34c28f2 to 0c0d69e Compare September 18, 2019 02:04
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 18, 2019
@RainbowMango
Copy link
Member Author

@enj @mikedanese
Please take a look.

Minor update after last time review:
Update prometheus.ExponentialBuckets to metrics.ExponentialBuckets, because metrics has provided a wrapper for bucket functionality.

@RainbowMango
Copy link
Member Author

/test pull-kubernetes-integration

@RainbowMango
Copy link
Member Author

/test pull-kubernetes-integration

fake test.

@mikedanese
Copy link
Member

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 19, 2019
&metrics.HistogramOpts{
Name: "authentication_duration_seconds",
Help: "Authentication duration in seconds broken out by result.",
Buckets: metrics.ExponentialBuckets(0.001, 2, 15),
Copy link
Member

@mikedanese mikedanese Sep 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bucketing might be a little abrasive depending on the actual latency. We may need to adjust.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you give me some specific instructions?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I read this right, the buckets are:

n | 0.001 n^2
1 | 0.001
2 | 0.004
3 | 0.009
4 | 0.016
5 | 0.025
6 | 0.036
7 | 0.049
8 | 0.064
9 | 0.081
10 | 0.1
11 | 0.121
12 | 0.144
13 | 0.169
14 | 0.196
15 | 0.225

What if authentication takes a second or ten seconds? It'll get put in the 15th bucket along with the ones the requests that took 0.2 seconds. In my experience, I've never been thrilled with exponential bucketing. Unless you actually have a good idea of performance before setting the parameters, it tends to cause weird data issues when you actually want to know how long stuff takes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The buckets seem not what you think. Wait a moment, I try to dump from my local cluster.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comes from my local cluster.

# HELP authentication_duration_seconds [ALPHA] Authentication duration in seconds broken out by result.
# TYPE authentication_duration_seconds histogram
authentication_duration_seconds_bucket{result="success",le="0.001"} 1143
authentication_duration_seconds_bucket{result="success",le="0.002"} 1143
authentication_duration_seconds_bucket{result="success",le="0.004"} 1144
authentication_duration_seconds_bucket{result="success",le="0.008"} 1144
authentication_duration_seconds_bucket{result="success",le="0.016"} 1144
authentication_duration_seconds_bucket{result="success",le="0.032"} 1144
authentication_duration_seconds_bucket{result="success",le="0.064"} 1144
authentication_duration_seconds_bucket{result="success",le="0.128"} 1144
authentication_duration_seconds_bucket{result="success",le="0.256"} 1144
authentication_duration_seconds_bucket{result="success",le="0.512"} 1144
authentication_duration_seconds_bucket{result="success",le="1.024"} 1144
authentication_duration_seconds_bucket{result="success",le="2.048"} 1144
authentication_duration_seconds_bucket{result="success",le="4.096"} 1144
authentication_duration_seconds_bucket{result="success",le="8.192"} 1144
authentication_duration_seconds_bucket{result="success",le="16.384"} 1144
authentication_duration_seconds_bucket{result="success",le="+Inf"} 1144
authentication_duration_seconds_sum{result="success"} 0.14704390199999995
authentication_duration_seconds_count{result="success"} 1144

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@enj has pointed the similar issue. So, I enlarged the bucketed number from 10 to 15.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, I did n^2 instead of 2^n. Those look fine for now although we may want higher density above 1 second in the future.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. By the way, most latency metric using the same bucket.

Now let's call @liggitt help review and approve.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @brancz @logicalhan on bucket count/size

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bucket count/sizes look okay to me.

@RainbowMango
Copy link
Member Author

/assign @liggitt

@mikedanese
Copy link
Member

/lgtm

@liggitt
Copy link
Member

liggitt commented Sep 24, 2019

I'll defer to @kubernetes/sig-instrumentation-pr-reviews on bucket size/count

@k8s-ci-robot k8s-ci-robot added the sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. label Sep 24, 2019
@liggitt
Copy link
Member

liggitt commented Sep 24, 2019

/approve

/hold
for @logicalhan or @brancz or delegate to take a look at the bucketing

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 24, 2019
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liggitt, mikedanese, RainbowMango

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 24, 2019
@brancz
Copy link
Member

brancz commented Sep 24, 2019

I don’t know the surrounding code too well. Is this metric describing a network request or largely in process? If it’s a network request then these buckets are fine.

@liggitt
Copy link
Member

liggitt commented Sep 24, 2019

Is this metric describing a network request or largely in process? If it’s a network request then these buckets are fine.

it is recording metrics for the authentication filter, which can involve a network request (to check a credential stored in etcd or verify a token against a remote webhook)

/hold
cancel

@RainbowMango
Copy link
Member Author

/hold cancel
:)

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 25, 2019
@k8s-ci-robot k8s-ci-robot merged commit 07025a5 into kubernetes:master Sep 25, 2019
@k8s-ci-robot k8s-ci-robot added this to the v1.17 milestone Sep 25, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants