Fix admission metrics in true units #72343

Merged: 2 commits, Jan 29, 2019

danielqsj
Contributor

@danielqsj danielqsj commented Dec 26, 2018

What type of PR is this?

/kind bug

What this PR does / why we need it:

The admission metrics are named *_admission_latencies_seconds and *_admission_latencies_seconds_summary, so their names imply seconds, but the values they actually report are in microseconds. This PR fixes the metrics so that they report seconds.

Which issue(s) this PR fixes:

Fixes #72342

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Fix admission metrics to report values in seconds.
Add metrics `*_admission_latencies_milliseconds` and `*_admission_latencies_milliseconds_summary` for backward compatibility; these will be removed in a future release.

@k8s-ci-robot k8s-ci-robot added needs-kind, needs-sig, needs-priority, release-note, size/XS, kind/bug and removed needs-kind labels Dec 26, 2018
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes, area/apiserver, sig/api-machinery and removed needs-sig labels Dec 26, 2018
Member

@logicalhan logicalhan left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Dec 26, 2018
@liggitt
Member

liggitt commented Dec 27, 2018

/unassign
/assign @jpbetz @sttts

@@ -206,9 +206,9 @@ func (m *metricSet) reset() {

// Observe records an observed admission event to all metrics in the metricSet.
func (m *metricSet) observe(elapsed time.Duration, labels ...string) {
	elapsedMicroseconds := float64(elapsed / time.Microsecond)
	m.latencies.WithLabelValues(labels...).Observe(elapsedMicroseconds)
Member

will changing the values here cause problems for monitoring already set up to track/alarm on these?

Member

Mainly, I can think of two ways this metric is (and can be) broken for people who have monitoring set up against it.

The first group would be people who use this metric and are aware that it emits microseconds even though its name says seconds. If they have set their alerts accordingly (this would be weird but not impossible), then we would break their monitoring with this fix.

The second group are people who use this metric as if it were working correctly, i.e. emitting latency in seconds. In that case, thresholds which are currently set for alerting would be off by orders of magnitude, and this fix would actually make those alerts start working as intended.

Personally, I think we should just fix it.

Member

I'm also for changing as this is an actual bug, but can you add an item to the changelog that this is a change?

Member

This is part of the metrics overhaul planned for 1.14, in which a number of metrics are changing, and we're documenting every single case, including what to change. As a middle ground, let's add an already-deprecated metric called admission_latencies_milliseconds_summary, so that people affected by the break only have to change the metric name and not do a unit conversion. I think this would work well, as 1.14 is the "metric migration" release: it ships both the deprecated metrics and the new metrics that follow best practices, and the deprecated ones will be removed in 1.15.

This one is an interesting case, as it not only fails to follow best practices but also mislabels its unit. Either way, it will need a separate, additional changelog notice.

Contributor Author

Agree with @brancz's proposal.
Added admission_latencies_milliseconds and admission_latencies_milliseconds_summary for backward compatibility. PTAL

@danielqsj
Contributor Author

/cc @brancz

@k8s-ci-robot k8s-ci-robot requested a review from brancz January 8, 2019 05:21
@k8s-ci-robot k8s-ci-robot added size/M and removed lgtm, size/XS labels Jan 18, 2019
@danielqsj
Contributor Author

/retest

Member

@brancz brancz left a comment

Just a small thing on consistency. Otherwise looks good.

Namespace: namespace,
Subsystem: subsystem,
Name: fmt.Sprintf("%s_admission_latencies_milliseconds", name),
Help: fmt.Sprintf(helpTemplate, "latency histogram in milliseconds (deprecated)"),
Member

Same here

Namespace: namespace,
Subsystem: subsystem,
Name: fmt.Sprintf("%s_admission_latencies_milliseconds_summary", name),
Help: fmt.Sprintf(helpTemplate, "latency summary in milliseconds (deprecated)"),
Member

We should be consistent with the deprecation warning. Let’s make sure the help text is preceded with (Deprecated) like the other metrics we have already deprecated.

Contributor Author

@brancz fixed.

@brancz
Member

brancz commented Jan 18, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Jan 18, 2019
@danielqsj
Contributor Author

@sttts @deads2k if you have time, can you help review this? Thanks

@jpbetz
Contributor

jpbetz commented Jan 22, 2019

Apologies for mis-labeling this metric. That's clearly my fault.

Note that we'll be doubling the memory utilization of the admission metrics, which might matter for those who use a lot of admission controllers. But given that we reduced the cardinality of these metrics in the last release, I don't think that will be a showstopper. Let's just document the deprecation plan so we know when we can remove the old metrics for good.

Namespace: namespace,
Subsystem: subsystem,
Name: fmt.Sprintf("%s_admission_latencies_milliseconds_summary", name),
Help: fmt.Sprintf("(Deprecated) "+helpTemplate, "latency summary in milliseconds"),
Contributor

Add which k8s version these will be removed in here in the help string?

Contributor Author

This is consistent with the deprecation warning in the other metrics we have deprecated.
Still, we will certainly announce the metrics migration/deprecation plan in the release notes or through other channels.
cc @brancz

Member

@brancz brancz Jan 23, 2019

In 1.14 the "old" metrics are deprecated, and their removal is targeted for 1.15. Let's explicitly document this in the KEP.

@brancz
Member

brancz commented Jan 23, 2019

@danielqsj let's make sure the deprecation plan is more thoroughly documented in the KEP. Do you want to take care of that?

@danielqsj
Contributor Author

@brancz sure. I will update the KEP with the deprecation plan and the latest PRs that are not yet covered.

@danielqsj
Contributor Author

@sttts @deads2k @smarterclayton if you have time, can you help review this? Thanks

@sttts
Contributor

sttts commented Jan 28, 2019

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danielqsj, sttts

@k8s-ci-robot k8s-ci-robot added the approved label Jan 28, 2019
@brancz
Member

brancz commented Jan 28, 2019

/retest

@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@k8s-ci-robot
Contributor

@danielqsj: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-e2e-kops-aws d9c57e7 link /test pull-kubernetes-e2e-kops-aws

Labels
approved, area/apiserver, cncf-cla: yes, kind/bug, lgtm, needs-priority, release-note, sig/api-machinery, size/M
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Admission metrics value not match their units
8 participants