-
Notifications
You must be signed in to change notification settings - Fork 7.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding metrics to CNI repair controller #29911
Conversation
Skipping CI for Draft Pull Request. |
c16f507
to
0d90ac0
Compare
cni/pkg/repair/repair.go
Outdated
@@ -145,6 +155,7 @@ func (bpr BrokenPodReconciler) deleteBrokenPod(pod v1.Pod) error { | |||
return nil | |||
} | |||
log.Infof("Pod detected as broken, deleting: %s/%s", pod.Namespace, pod.Name) | |||
defer bpr.Metrics.PodsRepaired.Increment() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why defer here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stewartbutler can you clarify?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only used defer here to avoid using a temp variable for the return value of the delete operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't follow, what is the temp value you are referring to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please ignore that. I was just being an idiot. @therealmitchconnors you can remove the defer.
(My insomnia-addled rationale was 'If I defer this, it will increment the metric only if the call succeeds and I don't have to create a temp variable to store the return value of the delete call', but with a defer it'd be called even if the repair called failed so there is no reason not to just increment it inline. Just put it down to me being dumb.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you wanted to do this properly you could store the return of the Delete operation, and only increment if err == null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we add a bit more granularity here? Perhaps record the successful and failed repairs. I am assuming this records all attempts at the pod repairs/deletes
cni/cmd/istio-cni-repair/main.go
Outdated
@@ -166,6 +176,12 @@ func logCurrentOptions(bpr *repair.BrokenPodReconciler, options *ControllerOptio | |||
} | |||
} | |||
|
|||
func init() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to init here, metric can init in the package where it defines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is placed in init() so that all metrics for the package can be stored in the metric struct. @stewartbutler can you clarify?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that was the goal. That way you could pass the metric struct to the reconciler and have a guarantee that it was initialized. I'm not sure if that requirement is changed now that you are using istio.io/pkg/monitoring though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The metric could just live and init in other package. As long as the package is imported directly or indirectly by the main package, the init function there will be triggered.
/retest |
Can we add labels to this metric? nodeName, Namespace, podName would be useful to start with |
@@ -44,19 +45,26 @@ type Filters struct { | |||
LabelSelectors string `json:"label_selectors"` | |||
} | |||
|
|||
type Metrics struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
curious: why intro this struct? Is this just for testing?
it seems like an exported repair.PodsRepaired
metric would work just as well? As it is not dimensioned, it doesn't seem like it needs to be passed around like a context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was the only way I could think of to set it up so I could mock the metrics in tests. If there's a more idiomatic way to do it, it can be removed.
cni/cmd/istio-cni-repair/main.go
Outdated
@@ -166,6 +176,12 @@ func logCurrentOptions(bpr *repair.BrokenPodReconciler, options *ControllerOptio | |||
} | |||
} | |||
|
|||
func init() { | |||
metrics.PodsRepaired = monitoring.NewSum("istio_cni_repair_pods_repaired_total", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
super nit: do we need repair
in the name twice? would something like: istio_cni_repaired_pods_total
be just as clear?
fwiw, repaired_pods
is slightly preferable to pods_repaired
to my ears. But I'm biased by https://cloud.google.com/apis/design/naming_convention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was treating istio_cni_repair
as a prefix, so that they are all in a block and someone quickly scanning them can find all the repair-related metrics if we later introduce other cni metrics.
Not particularly tied to that, but that was my rationale.
What do we intend to do with the Some high-level guidance on labels: https://prometheus.io/docs/practices/instrumentation/#do-not-overuse-labels |
I can understand the pod name concern. But the For eg:- we could have networking issues in certain nodes happening in an availability zone. Using the istio CNI metrics could be a good way to track this. |
For node - the metric from CNI doesn't need to return the node, since it already works on a per-node basis and prometheus scraping can/will/should/might be configured to already add the node automatically? I may be wrong here though, maybe it just adds pod/namespace generally |
Oh yes, prometheus does auto-add the instance. That should be enough |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM as a start. Thanks!
return bpr.client.CoreV1().Pods(pod.Namespace).Delete(context.TODO(), pod.Name, metav1.DeleteOptions{}) | ||
err := bpr.client.CoreV1().Pods(pod.Namespace).Delete(context.TODO(), pod.Name, metav1.DeleteOptions{}) | ||
if err == nil { | ||
bpr.Metrics.PodsRepaired.Increment() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels weird to get a metric for success and not have a corresponding one for failures. Failures are arguable more important?
Also, very nit, the pattern of err == nil
is a bit odd
if err := bpr.client.CoreV1().Pods(pod.Namespace).Delete(context.TODO(), pod.Name, metav1.DeleteOptions{}); err != nil {
return err
}
increment()
return nil
seems more standard to me
/retest |
Adds prometheus metrics to CNI repair controller
Fixes #19300