Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add AlertRoute for merging alertmanager configs for different teams #2927

Closed
sev3ryn opened this issue Dec 27, 2019 · 17 comments
Closed

Add AlertRoute for merging alertmanager configs for different teams #2927

sev3ryn opened this issue Dec 27, 2019 · 17 comments

Comments

@sev3ryn
Copy link
Contributor

sev3ryn commented Dec 27, 2019

What is missing?
I believe, ability for each team/project to configure their alertmanager routes, grouping options and own receivers is a missing part in Prometheus operator.

Why do we need it?
K8s CRD( let's name it AlertRoute) together with ServiceMonitor and PrometheusRule will give teams full control over their monitoring and alerting.

Is it something you will consider merging if I contribute a PR for it?

@sev3ryn sev3ryn changed the title Add AlertRouteDiscovery for merging Add AlertRoute for merging alertmanager configs for different teams Dec 27, 2019
@brancz
Copy link
Contributor

brancz commented Jan 6, 2020

I think this is a valid use case and one that we would want. Merging the Alertmanager routing tree from different configs is tricky but I think it can be done. If you want to write up a design for it that would be fantastic! Then we can discuss the design and ultimately we/you can get to implementing this! :)

@spencergilbert
Copy link

I know that Rancher does some level of merging of Alertmanager configs for it's monitoring solution, it could be a useful reference

@brancz
Copy link
Contributor

brancz commented Jan 8, 2020

Do you have a link to that?

@spencergilbert
Copy link

I think most of that logic lives here

@brancz
Copy link
Contributor

brancz commented Jan 9, 2020

I can't find config merging there (doesn't mean it doesn't exist but I couldn't find it). What you linked looks like a copy of the config.go file from the Alertmanager project.

@spencergilbert
Copy link

Sorry, let me keep digging - I haven't looked at the internals much, just the user end

@sev3ryn
Copy link
Contributor Author

sev3ryn commented Jan 12, 2020

I'm new to writing k8s CRD - please correct me if I'm saying something that is absolutely wrong:

the way I see design of it - we need 2 new k8s CRDs:

  1. AlertReceiver - same config as any of receivers https://prometheus.io/docs/alerting/configuration/#receiver but instead of fields with secrets you have to specify the k8s secret. Also there will be additional field templates: in which you specify list of configmaps to use as templates
  2. AlertRoute - same as https://prometheus.io/docs/alerting/configuration/#route . All created AlertRoutes will be treated as 1st level routes (after default one) and there will be option to specify subroutes in it.

all other configs - global settings, inhibit_rules will stay as now in Alertmanager CRD - just to store it in the end in configmap(same as prometheus config) and not secret. Default route should be created in AlertRoute CRD but with some special name.

@brancz
Copy link
Contributor

brancz commented Jan 13, 2020

I think that’s pretty close to what I would have designed as well. I think the templates are probably fine to inline into the receiver API, and the default route is something that I think I would require to be specified in the alertmanager if the “distributed config” Mode discussed here is chosen (maybe a reference to an object).

The tricky part about all of this is how to merge the route configs so that the first matching route doesn’t prevent an alert to not go into a later route that it was actually meant for.

@sev3ryn
Copy link
Contributor Author

sev3ryn commented Jan 13, 2020

I believe, issue with merging routes should be not that big - mostly people will use labels "team", "project", "app" for their specific alert route. At first, line in documentation "main routes are merged in random order" should do the job :)

@brancz
Copy link
Contributor

brancz commented Jan 13, 2020

I don't think those types of organizations are really the ones that benefit most from this feature. Organizations that are that well structured could already with a static alertmanager config accomplish everything. It's those that have very different requirements, maybe multi tenant environments, that really need this, and we need to ensure that we don't make it easy to do the wrong things 🙂 . I think this is perfectly doable with carefully thinking about continue fields in the generated alertmanager routing tree.

@brancz
Copy link
Contributor

brancz commented Jan 14, 2020

This is a dupe of #1528. Won't close this though as the discussion here is more active than on the other thread :) . The continue: true part was already captured there back then.

The way I see it the alertmanager object should have a config option that allows either enforcing "tenancy" based on the namespace label or configure them as user provided. In the "tenant aware" case we probably need to still be able to allow-list individual objects to not require to be namespaced for example for kubelet or node monitoring as they don't nominally belong to a namespace but rather infrastructure monitoring.

@simonpasquier
Copy link
Contributor

Hello! I've started a design document:
https://docs.google.com/document/d/1aVVttvocop8zNezwrNFbHh_HKRW_fhJUZqhmrr8fGK0/edit?usp=sharing

Feel free to comment there...

@spencergilbert
Copy link

Any progress on the design document @simonpasquier ?

@simonpasquier
Copy link
Contributor

@spencergilbert I'd say the document is more or less. I need to find time to work on the implementation but given the global situation, it's hard to tell when it will happen...

@ghostsquad
Copy link

ghostsquad commented Jun 22, 2020

@brancz

I don't think those types of organizations are really the ones that benefit most from this feature. Organizations that are that well structured could already with a static alertmanager config accomplish everything. It's those that have very different requirements, maybe multi tenant environments, that really need this, and we need to ensure that we don't make it easy to do the wrong things 🙂 . I think this is perfectly doable with carefully thinking about continue fields in the generated alertmanager routing tree.

Is this suggesting that it's desirable to have one global state/configuration for alertmanager? I'm attempting to enable the teams I support to be autonomous and not to step on each others toes, so I feel strongly against this statement. I'm asking for clarification though, not to start an argument. (this statement added for clarity)

When you say multi-tenant are you referring to untrusted entities co-existing? AFAIK, most k8s clusters are "multi-tenant" in that you have many teams running many different applications on the same cluster.

@brancz
Copy link
Contributor

brancz commented Jun 23, 2020

I was alluding to the fact that in a well structured environment with enforced practices/workflows, it's not even strictly necessary to have routes per team, as what Alertmanager can do out of the box can be perfectly sufficient to route to different slack channels or pagerduty receivers etc.

That said, it may still be desirable for other reasons to decentralize.

enable the teams I support to be autonomous and not to step on each others toes

That's practically what the Alertmanager CRD is about. The trust part doesn't really make a difference for this feature, I'm aware of both types existing as in multiple distinct "untrusted" entities sharing a cluster, as well as just split between teams, although the first being a more rare extreme.

@simonpasquier
Copy link
Contributor

Closed by #3451

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants