[Stack Monitoring] PoC for kibana instrumentation using opentelemetry metrics sdk #128755
Description
We discussed a number of possible implementations for ongoing kibana instrumentations in (internal) https://github.com/elastic/observability-dev/issues/2054
In this issue we'll build a proof of concept for how that might work.
Here are the two options we'd like to build a PoC for. They should be very similar at the code level; the main difference is the collection mechanism (pull from metricbeat vs push to apm-server).
option 2: OpenTelemetry Metrics API prometheus endpoint with Elastic Agent prometheus input
Here we use the official otel metrics sdk and expose its metrics over the prometheus protocol, so that elastic-agent can poll them via the underlying metricbeat prometheus module.
graph LR
subgraph ElasticDeployment["Elastic Deployment"]
subgraph kibana
OtelMetricsSDK["Otel Metrics SDK"]
OtelMetricsPrometheusExporter["/metrics (prometheus-protocol)"]
OtelMetricsSDK-->OtelMetricsPrometheusExporter
click OtelMetricsSDK "https://opentelemetry.io/docs/instrumentation/js/getting-started/nodejs/#metrics"
end
subgraph elastic-agent
Metricbeat["metrics/prometheus"]
end
Metricbeat-->|"poll (prometheus protocol)"|OtelMetricsPrometheusExporter
Metricbeat-->|_bulk|elasticsearch
end
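To make the pull model concrete, here is a minimal, dependency-free sketch of what a `/metrics` response in the Prometheus text exposition format could look like. It is not the otel SDK: in the PoC the SDK's prometheus exporter would produce this output. `CounterRegistry`, `handleScrape` and the metric name `kibana_rule_executions_total` are hypothetical names chosen only for illustration.

```typescript
// Hypothetical stand-in for the otel metrics SDK plus its prometheus exporter.
type MetricLabels = { [key: string]: string };

interface CounterSample { labels: MetricLabels; value: number; }
interface CounterSeries { help: string; samples: { [key: string]: CounterSample }; }

// Escape backslash, double quote and newline, as the text format requires.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, '\\\\').replace(/"/g, '\\"').replace(/\n/g, '\\n');
}

// Render labels as {k="v",...}, sorted by key so the output is deterministic.
function formatLabels(labels: MetricLabels): string {
  const pairs = Object.keys(labels)
    .sort()
    .map((k) => k + '="' + escapeLabelValue(labels[k]) + '"');
  return pairs.length > 0 ? '{' + pairs.join(',') + '}' : '';
}

class CounterRegistry {
  private counters: { [name: string]: CounterSeries } = {};

  // Increment a counter, creating the series on first use.
  add(name: string, help: string, labels: MetricLabels = {}, delta: number = 1): void {
    if (!this.counters[name]) {
      this.counters[name] = { help: help, samples: {} };
    }
    const key = formatLabels(labels);
    const samples = this.counters[name].samples;
    if (!samples[key]) {
      samples[key] = { labels: labels, value: 0 };
    }
    samples[key].value += delta;
  }

  // Produce the body that a scraper would receive.
  render(): string {
    const lines: string[] = [];
    Object.keys(this.counters).forEach((name) => {
      const counter = this.counters[name];
      lines.push('# HELP ' + name + ' ' + counter.help);
      lines.push('# TYPE ' + name + ' counter');
      Object.keys(counter.samples).forEach((key) => {
        lines.push(name + key + ' ' + counter.samples[key].value);
      });
    });
    return lines.join('\n') + '\n';
  }
}

// Stand-in for the route that the metricbeat prometheus module would poll.
function handleScrape(registry: CounterRegistry, path: string): { status: number; body: string } {
  return path === '/metrics'
    ? { status: 200, body: registry.render() }
    : { status: 404, body: '' };
}

const registry = new CounterRegistry();
const help = 'Hypothetical rule execution count';
registry.add('kibana_rule_executions_total', help, { status: 'ok' });
registry.add('kibana_rule_executions_total', help, { status: 'ok' });
registry.add('kibana_rule_executions_total', help, { status: 'failed' });
const scrape = handleScrape(registry, '/metrics');
```

The point of the sketch is that kibana only keeps cumulative values in memory and renders them on request; the scrape interval and delivery to elasticsearch are entirely the agent's concern.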
option 3: OpenTelemetry Metrics API exported as OpenTelemetry Protocol
Here we use the official otel metrics sdk and push the metrics via OpenTelemetry Protocol (OTLP). OTLP is natively supported by Elastic APM, so we use apm-server to receive the data. There are some caveats for otel collection, but none of them should hinder the collection of platform observability metrics today.
Ideally this apm-server is managed by elastic-agent, but that work is still TBD. See 2022-01 - Elastic Agent Pipeline Runtime Environment for latest info.
graph LR
subgraph ElasticDeployment["Elastic Deployment"]
subgraph kibana
OtelMetricsSDK["Otel Metrics SDK"]
end
subgraph elastic-agent
APMServer["apm-server"]
end
OtelMetricsSDK-->|"push (OTLP)"|APMServer["apm-server"]
APMServer-->|_bulk|elasticsearch
end
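The push model moves scheduling and delivery into kibana itself. The sketch below illustrates that shape under loud assumptions: `PushExporter`, `PushPayload` and the `send` callback are hypothetical, the payload is a simplified stand-in rather than the real OTLP schema, and in the PoC the otel SDK's own exporter would take this role. It shows one design point worth validating, namely that values recorded while the receiver is unreachable are kept and sent on the next successful flush.

```typescript
// Hypothetical, simplified payload; NOT the real OTLP message layout.
interface MetricPoint { name: string; value: number; }
interface PushPayload { service: string; timestampMs: number; points: MetricPoint[]; }

// Stand-in for an HTTP/gRPC call to apm-server; returns true if accepted.
type SendFn = (payload: PushPayload) => boolean;

class PushExporter {
  private pending: { [name: string]: number } = {};

  constructor(
    private service: string,
    private send: SendFn,
    private now: () => number
  ) {}

  // Accumulate a delta for a metric until the next flush.
  record(name: string, delta: number = 1): void {
    this.pending[name] = (this.pending[name] || 0) + delta;
  }

  // Push everything recorded since the last successful flush.
  // On failure the pending values are kept so the next flush retries them.
  flush(): boolean {
    const names = Object.keys(this.pending);
    if (names.length === 0) {
      return true; // nothing to send
    }
    const payload: PushPayload = {
      service: this.service,
      timestampMs: this.now(),
      points: names.map((name) => ({ name: name, value: this.pending[name] })),
    };
    const accepted = this.send(payload);
    if (accepted) {
      this.pending = {};
    }
    return accepted;
  }
}

// Usage: the receiver is down for the first flush and up for the second.
const received: PushPayload[] = [];
let receiverUp = false;
const exporter = new PushExporter(
  'kibana',
  (payload) => {
    if (!receiverUp) {
      return false;
    }
    received.push(payload);
    return true;
  },
  () => 1000 // fixed clock to keep the example deterministic
);

exporter.record('kibana_http_requests_total', 2);
const firstFlush = exporter.flush(); // rejected, values stay pending
exporter.record('kibana_http_requests_total', 1);
receiverUp = true;
const secondFlush = exporter.flush(); // accepted, carries the accumulated value
```

In a real process `flush()` would be called on a timer; it is left to the caller here so the example stays synchronous.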
Some consumers to keep in mind (see internal companion issue):
- Stack Monitoring
- High Level Health API
- APM instrumentation of stack
- Telemetry (event-based telemetry) - could maybe be left as its own entity; the ones above are more critical to align
Steps
- Get otel SDK into kibana
- Add otel SDK instrumentation similar to what is already found in stack monitoring (using [ResponseOps] Visualize alerting metrics in Stack Monitoring #123726 as a recent/fresh example)
- Build comparable visualizations of otel data
- Validate both collection options
- Test deployment & validity on an ESS cluster
AC: Recording of PoC as walkthrough