-
Notifications
You must be signed in to change notification settings - Fork 40.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding initial EndpointSlice metrics. #83257
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR need more discussion.
@@ -186,6 +187,8 @@ func (c *Controller) Run(workers int, stopCh <-chan struct{}) { | |||
return | |||
} | |||
|
|||
endpointslicemetrics.RegisterMetrics() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it common practice to do it in Run?
I believe this can be done in init()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not 100% sure if this works in init(), I was just going off of the few examples I found in other controllers (https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/persistentvolume/pv_controller_base.go#L302 and https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/attach_detach_controller.go#L363)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Their code sample registers in the init(). I'm going to assume either way it fine (threadsafe)
https://godoc.org/github.com/prometheus/client_golang/prometheus
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MustRegister grabs the Registry lock:
https://github.com/prometheus/client_golang/blob/master/prometheus/registry.go#L277
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved it to init, I think it does make more sense there.
Help: "Number of endpoints added on each Service sync", | ||
StabilityLevel: metrics.ALPHA, | ||
}, | ||
[]string{"service_name", "service_namespace"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think you can use service namespace/name as the labels since they are unbounded in a cluster. This may cause high memory usage for Prometheus collector
5ad4354
to
1b07419
Compare
Help: "Number of endpoints added on each Service sync", | ||
StabilityLevel: metrics.ALPHA, | ||
}, | ||
[]string{"service_name", "service_namespace"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have to be careful of any dimensions that can vary based on # of a class of objects: services, namespaces.
Help: "The number of EndpointSlice changes", | ||
StabilityLevel: metrics.ALPHA, | ||
}, | ||
[]string{"service_name", "service_namespace", "operation"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally speaking, you cannot put user input-derived fields in a metric, as the # of metrics will potentially grow w/out bound, not to mention if they are stored in internal systems, they would constitute PIIl.
1b07419
to
be23b14
Compare
ef89228
to
cd03db4
Compare
@bowei and @freehan Thanks for the feedback! I've made some big changes in response to that. The metrics are different now in hopes that they'll be more useful globally without relying on labels. There's also a new efficiency metric that shows the efficiency of endpoint distribution (essentially expected slices / actual slices) with 1 representing ideal allocation and anything less than that being less than ideal. |
cd03db4
to
46e19fc
Compare
|
||
// UpdateServicePortCache updates a ServicePortCache in the global cache for a | ||
// given Service and updates the corresponding metrics. | ||
func (c *Cache) UpdateServicePortCache(serviceNN types.NamespacedName, spCache ServicePortCache) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is worth comment on the detail of input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added comments describing the parameters here, wasn't quite sure how to format them though.
46e19fc
to
aa78d94
Compare
/retest |
1 similar comment
/retest |
aa78d94
to
e1ff9d6
Compare
@bowei I reworked the ServicePortCache interface a bit so it's not just passing a map around. Not quite sure this is what you were suggesting, but I think it's at least a bit closer. |
@@ -186,6 +187,8 @@ func (c *Controller) Run(workers int, stopCh <-chan struct{}) { | |||
return | |||
} | |||
|
|||
endpointslicemetrics.RegisterMetrics() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Their code sample registers in the init(). I'm going to assume either way it fine (threadsafe)
https://godoc.org/github.com/prometheus/client_golang/prometheus
@@ -186,6 +187,8 @@ func (c *Controller) Run(workers int, stopCh <-chan struct{}) { | |||
return | |||
} | |||
|
|||
endpointslicemetrics.RegisterMetrics() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MustRegister grabs the Registry lock:
https://github.com/prometheus/client_golang/blob/master/prometheus/registry.go#L277
e1ff9d6
to
48775e9
Compare
/retest |
/retest |
48775e9
to
724b142
Compare
/retest |
3 similar comments
/retest |
/retest |
/retest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
/retest |
/verify-owners |
1 similar comment
/verify-owners |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: freehan, robscott The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR adds some initial EndpointSlice metrics that cover the number of endpoints added or removed per sync, the number of endpoints in each slice, the total number of endpoints desired, and the total number of changes to endpoint slices, grouped by operation.
Special notes for your reviewer:
I'm very open to feedback and ideas for improving/changing these metrics. They've been helpful for me during some EndpointSlice scale testing but I'm interested in any other ideas.
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
/sig network
/cc @bowei @freehan
/priority important-longterm