Fix admission metrics in true units #72343
/lgtm
@@ -206,9 +206,9 @@ func (m *metricSet) reset() {

// Observe records an observed admission event to all metrics in the metricSet.
func (m *metricSet) observe(elapsed time.Duration, labels ...string) {
	elapsedMicroseconds := float64(elapsed / time.Microsecond)
	m.latencies.WithLabelValues(labels...).Observe(elapsedMicroseconds)
Will changing the values here cause problems for monitoring that is already set up to track or alarm on these metrics?
I can think of two groups of people with monitoring set up against this metric, and the fix affects them differently.
The first group knows that the metric emits microseconds even though its name says seconds. If they have set their alerts accordingly (this would be weird but not impossible), this fix would break their monitoring.
The second group uses this metric as if it worked correctly, i.e. as if it emitted latency in seconds. In that case, the thresholds currently set for alerting are off by orders of magnitude, and this fix would actually make those alerts start working as intended.
Personally, I think we should just fix it.
I'm also for changing as this is an actual bug, but can you add an item to the changelog that this is a change?
This is part of the metrics overhaul planned for 1.14, in which a number of metrics are changing, and we are documenting every single case, including what to change. As a middle ground, let's add an already-deprecated metric called admission_latencies_milliseconds_summary so that people affected by the break only have to change the metric name and do not need to do a unit conversion. I think this would work well, as 1.14 is the "metric migration" release: it carries both the deprecated metrics and the new ones that follow best practice, and the deprecated ones will be removed in 1.15.
This one is an interesting case, as it not only fails to follow the best practice but also labels its unit incorrectly. Either way, it will need a separate, additional changelog notice.
Agree with @brancz's proposal.
Added admission_latencies_milliseconds and admission_latencies_milliseconds_summary for backward compatibility. PTAL
/cc @brancz
/retest
Just a small thing on consistency. Otherwise looks good.
Namespace: namespace,
Subsystem: subsystem,
Name: fmt.Sprintf("%s_admission_latencies_milliseconds", name),
Help: fmt.Sprintf(helpTemplate, "latency histogram in milliseconds (deprecated)"),
Same here
Namespace: namespace,
Subsystem: subsystem,
Name: fmt.Sprintf("%s_admission_latencies_milliseconds_summary", name),
Help: fmt.Sprintf(helpTemplate, "latency summary in milliseconds (deprecated)"),
We should be consistent with the deprecation warning. Let’s make sure the help text is preceded with (Deprecated) like the other metrics we have already deprecated.
@brancz fixed.
/lgtm
Apologies for mislabeling this metric. That's clearly my fault. Note that we'll be doubling the memory utilization of the admission metrics, which might matter for those that use a lot of admission controllers. But given that we reduced the cardinality of these metrics in the last release, I don't think that will be a showstopper. Let's just document the deprecation plan so we know when we can remove the old metrics for good.
Namespace: namespace,
Subsystem: subsystem,
Name: fmt.Sprintf("%s_admission_latencies_milliseconds_summary", name),
Help: fmt.Sprintf("(Deprecated) "+helpTemplate, "latency summary in milliseconds"),
Should the help string also say which k8s version these will be removed in?
This is consistent with the deprecation warning in other metrics we have deprecated.
But surely, we will announce the metrics migration/deprecation plan in release notes or in other ways.
cc @brancz
In 1.14 the "old" metrics are deprecated, and removal is targeted for 1.15. Let's explicitly document this in the KEP.
@danielqsj let's make sure the deprecation plan is more thoroughly documented in the KEP. Do you want to take care of that?
@brancz sure. I will update the KEP with the deprecation plan and the latest PRs that are not yet covered.
@sttts @deads2k @smarterclayton if you have time, can you help review this? Thanks
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: danielqsj, sttts
/retest
/retest Review the full test history for this PR.
What type of PR is this?
/kind bug
What this PR does / why we need it:
The admission metrics are named *_admission_latencies_seconds and *_admission_latencies_seconds_summary. The names indicate seconds, but the reported values are actually in microseconds. This PR fixes these metrics so that they report seconds.
Which issue(s) this PR fixes:
Fixes #72342
Special notes for your reviewer:
Does this PR introduce a user-facing change?: