Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding metrics to CNI repair controller #29911

Merged
merged 10 commits into from
Mar 4, 2021

Conversation

stewartbutler
Copy link
Contributor

@stewartbutler stewartbutler commented Jan 7, 2021

Adds prometheus metrics to CNI repair controller

Fixes #19300

@istio-testing istio-testing added the do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. label Jan 7, 2021
@google-cla google-cla bot added the cla: yes Set by the Google CLA bot to indicate the author of a PR has signed the Google CLA. label Jan 7, 2021
@istio-testing istio-testing added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 7, 2021
@istio-testing
Copy link
Collaborator

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

cni/cmd/istio-cni-repair/main.go Show resolved Hide resolved
cni/cmd/istio-cni-repair/main.go Outdated Show resolved Hide resolved
@istio-testing istio-testing added the needs-rebase Indicates a PR needs to be rebased before being merged label Feb 7, 2021
@istio-policy-bot istio-policy-bot added the lifecycle/stale Indicates a PR or issue hasn't been manipulated by an Istio team member for a while label Feb 7, 2021
@istio-testing istio-testing removed the needs-rebase Indicates a PR needs to be rebased before being merged label Feb 11, 2021
@istio-policy-bot istio-policy-bot removed the lifecycle/stale Indicates a PR or issue hasn't been manipulated by an Istio team member for a while label Feb 11, 2021
@@ -145,6 +155,7 @@ func (bpr BrokenPodReconciler) deleteBrokenPod(pod v1.Pod) error {
return nil
}
log.Infof("Pod detected as broken, deleting: %s/%s", pod.Namespace, pod.Name)
defer bpr.Metrics.PodsRepaired.Increment()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why defer here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stewartbutler can you clarify?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only used defer here to avoid using a temp variable for the return value of the delete operation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow, what is the temp value you are referring to?

Copy link
Contributor Author

@stewartbutler stewartbutler Feb 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please ignore that. I was just being an idiot. @therealmitchconnors you can remove the defer.

(My insomnia-addled rationale was 'If I defer this, it will increment the metric only if the call succeeds and I don't have to create a temp variable to store the return value of the delete call', but with a defer it'd be called even if the repair called failed so there is no reason not to just increment it inline. Just put it down to me being dumb.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you wanted to do this properly you could store the return of the Delete operation, and only increment if err == null.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we add a bit more granularity here? Perhaps record the successful and failed repairs. I am assuming this records all attempts at the pod repairs/deletes

cni/pkg/repair/repair.go Outdated Show resolved Hide resolved
cni/cmd/istio-cni-repair/main.go Outdated Show resolved Hide resolved
cni/cmd/istio-cni-repair/main.go Show resolved Hide resolved
@@ -166,6 +176,12 @@ func logCurrentOptions(bpr *repair.BrokenPodReconciler, options *ControllerOptio
}
}

func init() {
Copy link
Contributor

@bianpengyuan bianpengyuan Feb 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to init here, metric can init in the package where it defines.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is placed in init() so that all metrics for the package can be stored in the metric struct. @stewartbutler can you clarify?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that was the goal. That way you could pass the metric struct to the reconciler and have a guarantee that it was initialized. I'm not sure if that requirement is changed now that you are using istio.io/pkg/monitoring though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metric could just live and init in other package. As long as the package is imported directly or indirectly by the main package, the init function there will be triggered.

@therealmitchconnors therealmitchconnors removed the do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. label Feb 12, 2021
@therealmitchconnors therealmitchconnors marked this pull request as ready for review February 12, 2021 00:46
@therealmitchconnors therealmitchconnors requested review from a team as code owners February 12, 2021 00:46
@therealmitchconnors
Copy link
Contributor

/retest

@tariq1890
Copy link
Contributor

Can we add labels to this metric? nodeName, Namespace, podName would be useful to start with

@@ -44,19 +45,26 @@ type Filters struct {
LabelSelectors string `json:"label_selectors"`
}

type Metrics struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious: why intro this struct? Is this just for testing?

it seems like an exported repair.PodsRepaired metric would work just as well? As it is not dimensioned, it doesn't seem like it needs to be passed around like a context.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was the only way I could think of to set it up so I could mock the metrics in tests. If there's a more idiomatic way to do it, it can be removed.

@@ -166,6 +176,12 @@ func logCurrentOptions(bpr *repair.BrokenPodReconciler, options *ControllerOptio
}
}

func init() {
metrics.PodsRepaired = monitoring.NewSum("istio_cni_repair_pods_repaired_total",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super nit: do we need repair in the name twice? would something like: istio_cni_repaired_pods_total be just as clear?

fwiw, repaired_pods is slightly preferable to pods_repaired to my ears. But I'm biased by https://cloud.google.com/apis/design/naming_convention.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was treating istio_cni_repair as a prefix, so that they are all in a block and someone quickly scanning them can find all the repair-related metrics if we later introduce other cni metrics.

Not particularly tied to that, but that was my rationale.

@douglas-reid
Copy link
Contributor

Can we add labels to this metric? nodeName, Namespace, podName would be useful to start with

What do we intend to do with the podName label? Generally, that sort of detail is better left to logs for debugging, etc. I don't know how often repairs have to be made, but the number potential pod names is unbounded and could grow quite large. nodeName is less concerning, but I'm still not entirely clear on the utility -- do we need to isolate issues at the node level in monitoring?

Some high-level guidance on labels: https://prometheus.io/docs/practices/instrumentation/#do-not-overuse-labels

@tariq1890
Copy link
Contributor

I can understand the pod name concern. But the nodeName is definitely useful for us. Isolating things at the node level is very important.

For eg:- we could have networking issues in certain nodes happening in an availability zone. Using the istio CNI metrics could be a good way to track this.

@howardjohn
Copy link
Member

For node - the metric from CNI doesn't need to return the node, since it already works on a per-node basis and prometheus scraping can/will/should/might be configured to already add the node automatically? I may be wrong here though, maybe it just adds pod/namespace generally

@tariq1890
Copy link
Contributor

tariq1890 commented Feb 19, 2021

Oh yes, prometheus does auto-add the instance. That should be enough

@stewartbutler stewartbutler requested review from a team as code owners February 22, 2021 22:41
Copy link
Member

@howardjohn howardjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as a start. Thanks!

return bpr.client.CoreV1().Pods(pod.Namespace).Delete(context.TODO(), pod.Name, metav1.DeleteOptions{})
err := bpr.client.CoreV1().Pods(pod.Namespace).Delete(context.TODO(), pod.Name, metav1.DeleteOptions{})
if err == nil {
bpr.Metrics.PodsRepaired.Increment()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels weird to get a metric for success and not have a corresponding one for failures. Failures are arguable more important?

Also, very nit, the pattern of err == nil is a bit odd

if err := bpr.client.CoreV1().Pods(pod.Namespace).Delete(context.TODO(), pod.Name, metav1.DeleteOptions{}); err != nil {
return err
}
increment()
return nil

seems more standard to me

@therealmitchconnors
Copy link
Contributor

/retest

@istio-testing istio-testing merged commit eb71f58 into istio:master Mar 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes Set by the Google CLA bot to indicate the author of a PR has signed the Google CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Istio CNI should provide metrics
8 participants