Add AlertRoute for merging alertmanager configs for different teams #2927

sev3ryn · 2019-12-27T20:09:26Z

What is missing?
I believe the ability for each team/project to configure its own Alertmanager routes, grouping options and receivers is missing from the Prometheus Operator.

Why do we need it?
A Kubernetes CRD (let's call it AlertRoute), together with ServiceMonitor and PrometheusRule, would give teams full control over their monitoring and alerting.

Is it something you will consider merging if I contribute a PR for it?

brancz · 2020-01-06T13:21:36Z

I think this is a valid use case and one that we would want. Merging the Alertmanager routing tree from different configs is tricky but I think it can be done. If you want to write up a design for it that would be fantastic! Then we can discuss the design and ultimately we/you can get to implementing this! :)

spencergilbert · 2020-01-08T16:49:16Z

I know that Rancher does some level of merging of Alertmanager configs for its monitoring solution; it could be a useful reference.

brancz · 2020-01-08T19:05:35Z

Do you have a link to that?

spencergilbert · 2020-01-08T19:13:43Z

I think most of that logic lives here

brancz · 2020-01-09T09:17:11Z

I can't find config merging there (doesn't mean it doesn't exist but I couldn't find it). What you linked looks like a copy of the config.go file from the Alertmanager project.

spencergilbert · 2020-01-09T15:22:17Z

Sorry, let me keep digging - I haven't looked at the internals much, just the user end

sev3ryn · 2020-01-12T15:00:03Z

I'm new to writing Kubernetes CRDs, so please correct me if anything I say here is wrong:

The way I see the design, we need two new Kubernetes CRDs:

AlertReceiver: the same configuration as any receiver (https://prometheus.io/docs/alerting/configuration/#receiver), but instead of fields that contain secrets you reference a Kubernetes Secret. There will also be an additional field, templates, in which you list the ConfigMaps to use as templates.
AlertRoute: the same as https://prometheus.io/docs/alerting/configuration/#route. All created AlertRoutes will be treated as first-level routes (after the default one), with an option to specify subroutes inside them.

All other configuration (global settings, inhibit_rules) stays in the Alertmanager CRD as it is now; the only change is that the final config would be stored in a ConfigMap (like the Prometheus config) rather than a Secret. The default route should also be created as an AlertRoute, but with a special reserved name.
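To make the AlertReceiver idea concrete, here is a minimal Python sketch of how an operator might render such an object into an Alertmanager receiver entry, resolving a Secret reference instead of reading an inline secret. Everything in it is an assumption for illustration: the AlertReceiver, SlackConfig and SecretKeyRef shapes, the render_receiver helper, the namespace/name receiver naming, and the dict that stands in for Secret lookups through the Kubernetes API. Only the output keys (slack_configs, channel, api_url) follow Alertmanager's receiver configuration.

```python
from dataclasses import dataclass, field


@dataclass
class SecretKeyRef:
    """Hypothetical pointer to one key of a Kubernetes Secret."""
    name: str
    key: str


@dataclass
class SlackConfig:
    """Hypothetical Slack settings; the webhook URL stays in a Secret."""
    channel: str
    api_url: SecretKeyRef


@dataclass
class AlertReceiver:
    """Hypothetical AlertReceiver object (only Slack is modelled)."""
    namespace: str
    name: str
    slack_configs: list = field(default_factory=list)


def render_receiver(recv, secrets):
    """Render an AlertReceiver into an Alertmanager receiver dict.

    `secrets` stands in for the Kubernetes API and maps
    (namespace, secret name) -> {key: value}. Secrets are looked up in the
    receiver's own namespace only.
    """
    def resolve(ref):
        try:
            return secrets[(recv.namespace, ref.name)][ref.key]
        except KeyError:
            raise KeyError(
                f"cannot resolve key {ref.key!r} of secret {recv.namespace}/{ref.name}"
            ) from None

    return {
        # Prefixing with the namespace keeps receiver names unique across teams.
        "name": f"{recv.namespace}/{recv.name}",
        "slack_configs": [
            {"channel": c.channel, "api_url": resolve(c.api_url)}
            for c in recv.slack_configs
        ],
    }


if __name__ == "__main__":
    secrets = {("team-a", "slack-webhook"): {"url": "https://hooks.slack.example/team-a"}}
    recv = AlertReceiver(
        "team-a",
        "oncall",
        [SlackConfig("#team-a-alerts", SecretKeyRef("slack-webhook", "url"))],
    )
    print(render_receiver(recv, secrets))
```

Restricting lookups to the receiver's own namespace is one design choice that keeps a team from referencing another team's Secrets.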

brancz · 2020-01-13T08:42:19Z

I think that’s pretty close to what I would have designed as well. I think the templates are probably fine to inline into the receiver API, and I would require the default route to be specified in the Alertmanager object if the “distributed config” mode discussed here is chosen (maybe as a reference to an object).

The tricky part about all of this is how to merge the route configs so that the first matching route doesn’t prevent an alert from reaching a later route that it was actually meant for.
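The problem can be reproduced with a small model of how Alertmanager walks its routing tree. This is a simplified Python sketch, not Alertmanager's code: it supports only equality matchers (the match field) and assumes every route names its receiver explicitly, whereas real child routes inherit settings from their parent. The two team routes in the example are hypothetical.

```python
def match_routes(route, labels):
    """Return the receivers an alert with `labels` would be delivered to.

    Simplified model of Alertmanager's routing: a route matches when all of
    its equality matchers match; its children are tried in order; the first
    matching child ends the search unless that child sets continue; a route
    handles the alert itself only if none of its children matched.
    """
    for key, value in route.get("match", {}).items():
        if labels.get(key) != value:
            return []
    receivers = []
    for child in route.get("routes", []):
        found = match_routes(child, labels)
        receivers.extend(found)
        if found and not child.get("continue", False):
            break
    if not receivers:
        receivers.append(route["receiver"])  # assumes an explicit receiver
    return receivers


if __name__ == "__main__":
    root = {
        "receiver": "default",
        "routes": [
            # Hypothetical team A route: every critical alert pages team A.
            {"receiver": "team-a-pager", "match": {"severity": "critical"}},
            # Hypothetical team B route: alerts labelled team=b.
            {"receiver": "team-b-slack", "match": {"team": "b"}},
        ],
    }
    # Team B's critical alert is swallowed by team A's broader route.
    print(match_routes(root, {"team": "b", "severity": "critical"}))  # ['team-a-pager']
    print(match_routes(root, {"team": "b", "severity": "warning"}))   # ['team-b-slack']
    print(match_routes(root, {"team": "c"}))                          # ['default']
```

Setting continue on team A's route would deliver the first alert to both team-a-pager and team-b-slack.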

sev3ryn · 2020-01-13T12:12:28Z

I believe the route-merging issue shouldn't be that big: most people will use labels like "team", "project" or "app" for their specific alert routes. Initially, a line in the documentation saying "main routes are merged in random order" should do the job :)

brancz · 2020-01-13T12:43:17Z

I don't think those types of organizations are really the ones that benefit most from this feature. Organizations that are that well structured could already accomplish everything with a static Alertmanager config. It's those that have very different requirements, maybe multi-tenant environments, that really need this, and we need to ensure that we don't make it easy to do the wrong things 🙂. I think this is perfectly doable by thinking carefully about the continue fields in the generated Alertmanager routing tree.
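One way to read the continue suggestion, sketched in Python under stated assumptions: every route contributed by a team becomes a first-level child of the admin-defined default route with continue set to true, so a match in one team's route never stops evaluation of the others. The build_routing_tree helper, the (namespace, name) keys, and the choice to place team routes before any routes already on the default route are hypothetical. Sorting the keys is one way to replace "merged in random order" with a deterministic result.

```python
import copy


def build_routing_tree(default_route, team_routes):
    """Merge per-team routes under an admin-defined default route.

    default_route: root route dict (receiver, group_by, ...).
    team_routes: {(namespace, name): route dict}, e.g. one entry per
    hypothetical AlertRoute object.
    """
    root = copy.deepcopy(default_route)
    generated = []
    # Deterministic order: sort by (namespace, name) rather than relying on
    # whatever order the objects were listed in.
    for key in sorted(team_routes):
        route = copy.deepcopy(team_routes[key])
        # Keep evaluating later siblings even after a match, so one team's
        # broad matcher cannot hide alerts from another team.
        route["continue"] = True
        generated.append(route)
    # Assumption: team routes go before any routes already defined on the
    # default route, so those cannot shadow them either.
    root["routes"] = generated + root.get("routes", [])
    return root


if __name__ == "__main__":
    tree = build_routing_tree(
        {"receiver": "default", "group_by": ["alertname"]},
        {
            ("team-b", "alerts"): {"receiver": "team-b-slack", "match": {"team": "b"}},
            ("team-a", "alerts"): {"receiver": "team-a-pager", "match": {"severity": "critical"}},
        },
    )
    for r in tree["routes"]:
        print(r["receiver"], r["continue"])
```

Because continue only controls whether siblings are still tried, each team keeps ordinary first-match behaviour inside its own subroutes. Note also that in Alertmanager a parent route only handles an alert when none of its children matched, so an alert that matches any team route no longer reaches the default receiver.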

brancz · 2020-01-14T12:33:17Z

This is a dupe of #1528. Won't close this though as the discussion here is more active than on the other thread :) . The continue: true part was already captured there back then.

The way I see it, the Alertmanager object should have a config option that either enforces "tenancy" based on the namespace label or leaves the routes entirely user-provided. In the "tenant-aware" case we probably still need to be able to allow-list individual objects so they aren't forced to be namespaced, for example kubelet or node monitoring, as those don't nominally belong to a namespace but rather to infrastructure monitoring.
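A minimal sketch of the "tenant-aware" option, assuming the namespace label is literally called namespace and that routes are plain dicts with equality matchers: each route from a namespaced object gets a matcher pinning it to that namespace, unless the object is on an allow-list such as one for kubelet or node monitoring. enforce_namespace and the allow-list format are hypothetical, not an existing Prometheus Operator API. Because Alertmanager only evaluates subroutes when their parent matches, pinning the top-level route is enough to scope the whole subtree.

```python
import copy


def enforce_namespace(route, namespace, name, allow_list, label="namespace"):
    """Return a copy of `route` that only matches alerts from `namespace`.

    route: route dict from a namespaced object (equality matchers in "match").
    allow_list: set of (namespace, name) pairs exempt from enforcement.
    label: alert label assumed to carry the namespace.
    """
    route = copy.deepcopy(route)
    if (namespace, name) in allow_list:
        # e.g. kubelet or node monitoring, which belongs to infrastructure
        # monitoring rather than to any one namespace.
        return route
    # Overwrite rather than merge: a tenant cannot point its route at
    # another namespace's alerts.
    route.setdefault("match", {})[label] = namespace
    return route


if __name__ == "__main__":
    allow = {("monitoring", "kubelet")}
    print(enforce_namespace({"receiver": "team-a", "match": {"namespace": "team-b"}}, "team-a", "alerts", allow))
    print(enforce_namespace({"receiver": "infra", "match": {"job": "kubelet"}}, "monitoring", "kubelet", allow))
```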

simonpasquier · 2020-02-18T09:11:07Z

Hello! I've started a design document:
https://docs.google.com/document/d/1aVVttvocop8zNezwrNFbHh_HKRW_fhJUZqhmrr8fGK0/edit?usp=sharing

Feel free to comment there...

spencergilbert · 2020-04-10T00:33:37Z

Any progress on the design document @simonpasquier ?

simonpasquier · 2020-04-10T15:15:29Z

@spencergilbert I'd say the document is more or less complete. I need to find time to work on the implementation, but given the global situation it's hard to tell when that will happen...

ghostsquad · 2020-06-22T00:57:26Z

@brancz

> I don't think those types of organizations are really the ones that benefit most from this feature. Organizations that are that well structured could already with a static alertmanager config accomplish everything. It's those that have very different requirements, maybe multi tenant environments, that really need this, and we need to ensure that we don't make it easy to do the wrong things 🙂 . I think this is perfectly doable with carefully thinking about continue fields in the generated alertmanager routing tree.

Is this suggesting that it's desirable to have one global state/configuration for Alertmanager? I'm attempting to enable the teams I support to be autonomous and not to step on each other's toes, so I feel strongly against this statement. I'm asking for clarification though, not to start an argument. (this statement added for clarity)

When you say multi-tenant, are you referring to untrusted entities co-existing? AFAIK, most k8s clusters are "multi-tenant" in that you have many teams running many different applications on the same cluster.

brancz · 2020-06-23T15:22:39Z

I was alluding to the fact that in a well-structured environment with enforced practices/workflows, it's not even strictly necessary to have routes per team, as what Alertmanager can do out of the box can be perfectly sufficient to route to different Slack channels or PagerDuty receivers, etc.

That said, it may still be desirable for other reasons to decentralize.

> enable the teams I support to be autonomous and not to step on each others toes

That's practically what the Alertmanager CRD is about. The trust aspect doesn't really make a difference for this feature. I'm aware that both kinds of setup exist: multiple distinct "untrusted" entities sharing a cluster, and a cluster simply split between teams, the former being the rarer extreme.

simonpasquier · 2020-10-28T16:33:22Z

Closed by #3451

sev3ryn added the kind/feature label Dec 27, 2019

sev3ryn changed the title from "Add AlertRouteDiscovery for merging" to "Add AlertRoute for merging alertmanager configs for different teams" Dec 27, 2019

brancz mentioned this issue Jan 20, 2020: Add Alertmanager config options to Alertmanager crd #2766 (Closed)

brancz mentioned this issue Jan 29, 2020: Allow to specify alert manager configuration path #2957 (Closed)

simonpasquier mentioned this issue Apr 10, 2020: Allow the configuration to be spread across multiple files. prometheus/alertmanager#1150 (Closed)

This was referenced Jun 4, 2020:
[Proposal] AlertmanagerRule CRD #3261 (Closed)
WIP: *: Add AlertmanagerConfig CRD #3265 (Closed)

bigangryrobot mentioned this issue Jun 17, 2020: [Proposal] Change Config Format prometheus/alertmanager#2302 (Open)

grdryn mentioned this issue Aug 27, 2020: Add AlertmanagerConfig CRD #3451 (Merged)

simonpasquier closed this as completed Oct 28, 2020
