Adding metrics to CNI repair controller #29911

stewartbutler · 2021-01-07T16:11:13Z

Adds prometheus metrics to CNI repair controller

Fixes #19300

istio-testing · 2021-01-07T16:11:23Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

cni/cmd/istio-cni-repair/main.go

manifests/charts/istio-cni/templates/daemonset.yaml

bianpengyuan · 2021-02-11T18:33:47Z

cni/pkg/repair/repair.go

@@ -145,6 +155,7 @@ func (bpr BrokenPodReconciler) deleteBrokenPod(pod v1.Pod) error {
 		return nil
 	}
 	log.Infof("Pod detected as broken, deleting: %s/%s", pod.Namespace, pod.Name)
+	defer bpr.Metrics.PodsRepaired.Increment()


Why defer here?

@stewartbutler can you clarify?

I only used defer here to avoid using a temp variable for the return value of the delete operation.

I don't follow, what is the temp value you are referring to?

Please ignore that. I was just being an idiot. @therealmitchconnors you can remove the defer.

(My insomnia-addled rationale was 'If I defer this, it will increment the metric only if the call succeeds and I don't have to create a temp variable to store the return value of the delete call', but with a defer it'd be called even if the repair call failed, so there is no reason not to just increment it inline. Just put it down to me being dumb.)

If you wanted to do this properly, you could store the return value of the Delete operation and only increment if err == nil.

Shouldn't we add a bit more granularity here? Perhaps record successful and failed repairs separately. I am assuming this records all attempts at pod repairs/deletes.

cni/pkg/repair/repair.go

cni/cmd/istio-cni-repair/main.go

bianpengyuan · 2021-02-11T18:47:41Z

cni/cmd/istio-cni-repair/main.go

@@ -166,6 +176,12 @@ func logCurrentOptions(bpr *repair.BrokenPodReconciler, options *ControllerOptio
 	}
 }

+func init() {


You don't need init() here; the metric can be initialized in the package where it is defined.

I think this is placed in init() so that all metrics for the package can be stored in the metric struct. @stewartbutler can you clarify?

Yes, that was the goal. That way you could pass the metric struct to the reconciler and have a guarantee that it was initialized. I'm not sure if that requirement has changed now that you are using istio.io/pkg/monitoring though.

The metric could just live and be initialized in another package. As long as that package is imported directly or indirectly by the main package, its init function will be triggered.
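A rough single-file sketch of the initialization order being relied on here. The `sum` type, `newSum`, and `registered` are hypothetical stand-ins, not the istio.io/pkg/monitoring API. Go initializes package-level variables first, then runs init(), then main(), and it does this for every imported package before the importing package, so a metric declared at package level is ready before any caller can touch it.

```go
package main

import "fmt"

// sum is a hypothetical stand-in for a monitoring sum metric.
type sum struct{ name string }

func newSum(name string) *sum { return &sum{name: name} }

// podsRepaired is initialized when the package is loaded, before init().
var podsRepaired = newSum("istio_cni_repair_pods_repaired_total")

// registered is a hypothetical registry used only to observe ordering.
var registered []string

// init runs after package-level variables are set and before main.
func init() {
	registered = append(registered, podsRepaired.name)
}

func main() {
	fmt.Println(registered) // prints "[istio_cni_repair_pods_repaired_total]"
}
```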

cni/cmd/istio-cni-repair/main.go

therealmitchconnors · 2021-02-12T05:36:47Z

/retest

cni/cmd/istio-cni-repair/main.go

manifests/charts/istio-cni/templates/daemonset.yaml

tariq1890 · 2021-02-18T21:42:14Z

Can we add labels to this metric? nodeName, Namespace, podName would be useful to start with

douglas-reid · 2021-02-18T22:39:44Z

cni/pkg/repair/repair.go

@@ -44,19 +45,26 @@ type Filters struct {
 	LabelSelectors                  string `json:"label_selectors"`
 }

+type Metrics struct {


Curious: why introduce this struct? Is this just for testing?

It seems like an exported repair.PodsRepaired metric would work just as well? As it is not dimensioned, it doesn't seem like it needs to be passed around like a context.

It was the only way I could think of to set it up so I could mock the metrics in tests. If there's a more idiomatic way to do it, it can be removed.
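For reference, one common way to keep a counter mockable without a dedicated struct is to depend on a small interface. This is only a sketch under assumptions: `Incrementer`, `fakeCounter`, and `reconciler` are hypothetical names, and it is not necessarily what this PR ended up doing.

```go
package main

import "fmt"

// Incrementer is a hypothetical one-method interface that both a real
// metric and a test double could satisfy.
type Incrementer interface{ Increment() }

// fakeCounter is a test double that records how often it was incremented.
type fakeCounter struct{ calls int }

func (f *fakeCounter) Increment() { f.calls++ }

// reconciler depends only on the interface, so tests can inject a fake.
type reconciler struct{ podsRepaired Incrementer }

func (r reconciler) recordRepair() { r.podsRepaired.Increment() }

func main() {
	fake := &fakeCounter{}
	r := reconciler{podsRepaired: fake}
	r.recordRepair()
	r.recordRepair()
	fmt.Println(fake.calls) // prints "2"
}
```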

douglas-reid · 2021-02-18T22:53:48Z

cni/cmd/istio-cni-repair/main.go

@@ -166,6 +176,12 @@ func logCurrentOptions(bpr *repair.BrokenPodReconciler, options *ControllerOptio
 	}
 }

+func init() {
+	metrics.PodsRepaired = monitoring.NewSum("istio_cni_repair_pods_repaired_total",


super nit: do we need repair in the name twice? would something like: istio_cni_repaired_pods_total be just as clear?

fwiw, repaired_pods is slightly preferable to pods_repaired to my ears. But I'm biased by https://cloud.google.com/apis/design/naming_convention.

I was treating istio_cni_repair as a prefix, so that they are all in a block and someone quickly scanning them can find all the repair-related metrics if we later introduce other cni metrics.

Not particularly tied to that, but that was my rationale.

douglas-reid · 2021-02-19T17:18:21Z

Can we add labels to this metric? nodeName, Namespace, podName would be useful to start with

What do we intend to do with the podName label? Generally, that sort of detail is better left to logs for debugging, etc. I don't know how often repairs have to be made, but the number of potential pod names is unbounded and could grow quite large. nodeName is less concerning, but I'm still not entirely clear on the utility -- do we need to isolate issues at the node level in monitoring?

Some high-level guidance on labels: https://prometheus.io/docs/practices/instrumentation/#do-not-overuse-labels
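A small sketch of the cardinality concern, using made-up data: every distinct combination of label values becomes its own time series, so a per-pod label grows with the number of pods ever repaired, while a per-node label is bounded by the node count. The `repair` type, `seriesCount`, and the pod and node names are all hypothetical.

```go
package main

import "fmt"

// repair is a hypothetical record of one repaired pod.
type repair struct{ node, namespace, pod string }

// seriesCount returns how many distinct time series a counter would expose
// if each repair were labeled with the value returned by labels.
func seriesCount(repairs []repair, labels func(repair) string) int {
	seen := map[string]struct{}{}
	for _, r := range repairs {
		seen[labels(r)] = struct{}{}
	}
	return len(seen)
}

func main() {
	repairs := []repair{
		{"node-a", "default", "web-5d4f-abcde"},
		{"node-a", "default", "web-5d4f-fghij"},
		{"node-b", "default", "web-5d4f-klmno"},
	}
	byNode := seriesCount(repairs, func(r repair) string { return r.node })
	byPod := seriesCount(repairs, func(r repair) string { return r.namespace + "/" + r.pod })
	fmt.Println(byNode, byPod) // prints "2 3"
}
```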

tariq1890 · 2021-02-19T20:35:47Z

I can understand the pod name concern. But the nodeName is definitely useful for us. Isolating things at the node level is very important.

For example, we could have networking issues on certain nodes in one availability zone. The Istio CNI metrics could be a good way to track this.

howardjohn · 2021-02-19T20:38:04Z

For node - the metric from CNI doesn't need to return the node, since it already works on a per-node basis and prometheus scraping can/will/should/might be configured to already add the node automatically? I may be wrong here though, maybe it just adds pod/namespace generally

tariq1890 · 2021-02-19T20:51:00Z

Oh yes, prometheus does auto-add the instance. That should be enough

howardjohn

LGTM as a start. Thanks!

howardjohn · 2021-02-26T01:37:48Z

cni/pkg/repair/repair.go

-	return bpr.client.CoreV1().Pods(pod.Namespace).Delete(context.TODO(), pod.Name, metav1.DeleteOptions{})
+	err := bpr.client.CoreV1().Pods(pod.Namespace).Delete(context.TODO(), pod.Name, metav1.DeleteOptions{})
+	if err == nil {
+		bpr.Metrics.PodsRepaired.Increment()


It feels weird to get a metric for success and not have a corresponding one for failures. Failures are arguably more important?

Also, very nit, the pattern of err == nil is a bit odd

if err := bpr.client.CoreV1().Pods(pod.Namespace).Delete(context.TODO(), pod.Name, metav1.DeleteOptions{}); err != nil {
	return err
}
increment()
return nil

seems more standard to me

therealmitchconnors · 2021-03-04T23:06:37Z

/retest

stewartbutler requested a review from justinpettit January 7, 2021 16:11

istio-testing added the do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. label Jan 7, 2021

google-cla bot added the cla: yes Set by the Google CLA bot to indicate the author of a PR has signed the Google CLA. label Jan 7, 2021

istio-testing added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 7, 2021

howardjohn reviewed Jan 7, 2021

istio-testing added the needs-rebase Indicates a PR needs to be rebased before being merged label Feb 7, 2021

istio-policy-bot added the lifecycle/stale Indicates a PR or issue hasn't been manipulated by an Istio team member for a while label Feb 7, 2021

stewartbutler and others added 3 commits February 11, 2021 10:05

Adding metrics to CNI repair controller

8fcb4a9

translate prom calls to istio.io/monitoring

4ebefca

fmt

0d90ac0

therealmitchconnors force-pushed the cni-metrics branch from c16f507 to 0d90ac0 Compare February 11, 2021 18:06

istio-testing removed the needs-rebase Indicates a PR needs to be rebased before being merged label Feb 11, 2021

istio-policy-bot removed the lifecycle/stale Indicates a PR or issue hasn't been manipulated by an Istio team member for a while label Feb 11, 2021

therealmitchconnors requested a review from douglas-reid February 11, 2021 18:07

howardjohn reviewed Feb 11, 2021

bianpengyuan reviewed Feb 11, 2021

therealmitchconnors added 2 commits February 11, 2021 11:23

address feedback

8630f17

include stopch

5680fb6

therealmitchconnors removed the do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. label Feb 12, 2021

therealmitchconnors marked this pull request as ready for review February 12, 2021 00:46

therealmitchconnors requested review from a team as code owners February 12, 2021 00:46

bianpengyuan reviewed Feb 12, 2021

howardjohn reviewed Feb 12, 2021

douglas-reid reviewed Feb 18, 2021

therealmitchconnors added 2 commits February 18, 2021 16:40

update unit test to use opencensus

906503a

address pr feedback

08b18d7

therealmitchconnors added 2 commits February 19, 2021 13:57

update port num in manifest

639989d

add release node

ea40022

stewartbutler requested review from a team as code owners February 22, 2021 22:41

reformat release note

8e1f2be

howardjohn approved these changes Feb 26, 2021

istio-testing merged commit eb71f58 into istio:master Mar 4, 2021
