[Feature Request] Application metrics support in platform #252
Open
Description
Objectives
- Enable users to track and trend performance over time.
- Help users benchmark performance against industry peers.
- Help users correlate metrics with trace data so they can interpret both together.
- Enable users to improve service availability by helping them diagnose and understand the underlying drivers of performance gaps.
- Provide timely and quantifiable information to users related to infrastructure failure or possible outage, which will minimize downtime and increase efficiency.
- Provide insight into the root cause, including whether the problem is caused by the app or the infrastructure, which will help reduce MTTD and MTTR.
- Prescribe actions to improve performance of the system whenever possible.
Use Case
Metrics are aggregated summary statistics that track parameters such as latency, traffic, errors, and saturation for entities over time, whereas alerting notifies users when metrics show that an application is deviating from normal or expected behavior. The two have interdependent use cases, but both ultimately serve the larger use case: detection and resolution of performance issues.
- Early problem detection
- Correlations
- Alerting
- Enhanced Availability
- Detect and resolve performance issues
- Decision Making
- Baselining and SLAs/ Change detection
- Predictions/ forecasting
Proposal
What application metrics should we support in Platform?
Ideally, we will collect the four golden signals out of the box for all applications and services:
- Latency: The time to complete requests
- Traffic: Number of requests per second served
- Errors: Application errors that occur when processing client requests or accessing resources.
- Saturation: The percentage or amount of resources currently being used
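
To make the four signals above concrete, here is a minimal sketch of how they could be derived from raw request records for one time window. The `Request` type, the `golden_signals` function, and all field names are hypothetical and only for illustration; they are not part of Hypertrace.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    """Hypothetical record of one served request."""
    duration_ms: float
    is_error: bool


def golden_signals(
    requests: List[Request],
    window_seconds: float,
    used_capacity: float,
    total_capacity: float,
) -> Dict[str, float]:
    """Summarize one time window into the four golden signals."""
    if window_seconds <= 0 or total_capacity <= 0:
        raise ValueError("window_seconds and total_capacity must be positive")
    count = len(requests)
    errors = sum(1 for r in requests if r.is_error)
    total_ms = sum(r.duration_ms for r in requests)
    return {
        # Latency: average time to complete a request in the window.
        "latency_avg_ms": total_ms / count if count else 0.0,
        # Traffic: requests served per second.
        "traffic_rps": count / window_seconds,
        # Errors: fraction of requests that failed.
        "error_rate": errors / count if count else 0.0,
        # Saturation: percentage of the resource currently in use.
        "saturation_pct": 100.0 * used_capacity / total_capacity,
    }
```

In practice, latency is more often reported as percentiles (p50, p95, p99) than as an average; the average is used here only to keep the sketch short.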
User stories
User Story | Importance |
---|---|
As a service/API/DB owner, I should be able to see application metrics related to my service/API/DB like latency, traffic, error and saturation metrics in the platform. | P1 |
As a system administrator, I should be able to identify issues with critical services without having to switch between different platforms. (e.g., an alert takes you to Grafana, and from Grafana you then need to come to Hypertrace to search for traces within that time range for service X.) | P1 |
As a service/API/DB owner, I should be able to change granularity and data points in consideration for metrics or create custom metrics as per the requirements. | P1 |
As a system administrator who has already set up Prometheus to collect metrics, I should be able to have Hypertrace accept metrics from Prometheus and display them in the platform. | P1 |
As a system administrator, I should be able to see SLA/ SLO deviation in application metrics and also trigger alert on the basis of deviation. | P2 |
As a system administrator, metrics in the Observability platform should be single source of truth for me. Considering that, those metrics should be reliable and trustworthy. Wherever possible, things should be configurable and control should be granular in order to avoid any false positives. | P1/ P2 |
As an administrator, I should be able to create custom dashboards with metrics from different services/ APIs/ DBs to keep an eye on critical metrics for critical applications. | P1/ P2 |
As a product owner, I should be able to get summary statistics of application metrics for a particular time range. | P2 |
As a product owner/ operator, I should be able to see a high-level summary of what’s going on in the guts of my software and ask for greater detail on demand. This can be achieved in different ways: dashboards that present the most commonly viewed data in an immediately intelligible manner can help users understand system state at a glance, and many different custom dashboard views can be created for different job functions or areas of interest. | P2 |
As a system admin/ product owner, I should be able to drill down from within summary displays to surface the information most pertinent to the current task. One example of this is correlating application metrics and traces for a given instant: clicking on a particular part of a metrics graph should show me all the related traces. | P2 |
As a system admin, I should be able to dynamically adjust the scale of graphs, toggle off unnecessary metrics, and overlay information from multiple systems to make the tool useful for investigations or root cause analysis. (related issue by Razorpay: #218 ) | P3 |
Questions to address (if any)
- NA
Work items
Phase 1: supporting existing metrics via Metric Store
Ingestion Tasks:
- explore converting the existing SQL (Pinot) based queries to PromQL queries #295
- extract the metrics from rawServiceView for call_count for service entities #294
- layout initial metrics ingestion pipeline with NoOP enricher (works on extracted metrics) #297
- explore Prometheus exposition format with a timestamp #307
- Implementation of exporter, see if we can directly pull from kafka #304
- validate out of order metrics ingestion #308
- Replace internal OtlpMetricsSerde with kafka stream protobuf serde #312
- Enhance metrics extractor/generator to address lag in pipeline
- Enhance metrics extractor/generator for error_count and latency metric
- Enable metrics pipeline in hypertrace-ingester and docker-compose setup #327
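
For the task above on the Prometheus exposition format with a timestamp: in the text format, a sample line is the metric name, an optional label set in braces, the value, and an optional timestamp in milliseconds since the epoch. A minimal sketch of producing such a line follows; the function name `format_sample` and the `service` label are assumptions for illustration, not names from the Hypertrace codebase.

```python
from typing import Dict, Union


def _escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(
    name: str,
    labels: Dict[str, str],
    value: Union[int, float],
    timestamp_ms: int,
) -> str:
    """Render one sample as a text exposition line with an explicit timestamp."""
    # Sort labels so the output is deterministic.
    label_text = ",".join(
        f'{key}="{_escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    selector = f"{name}{{{label_text}}}" if label_text else name
    return f"{selector} {value} {timestamp_ms}"


print(format_sample("call_count", {"service": "checkout"}, 42, 1620000000000))
```

Whether samples with explicit timestamps are accepted, particularly out-of-order ones, depends on the receiving store, which is what the validation task in the list above covers.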
Querying Tasks:
- define metric-handler with required metadata information
- implement selection criteria for metric-handler #324
- Build query service request to PromQL parser #320
- RestClient for executing PromQL queries, and support for Response parser #322
- Add support for parallel requests and merging of response for them for multi-metric request QS #325
- Build response builder - handle order by and merging of multiple queries response #326
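
As a rough illustration of the request-to-PromQL step in the querying tasks above, the sketch below turns a simplified request (metric, equality filters, group-by attributes, time window) into a PromQL string. The function `build_promql` and its parameters are hypothetical; the real query service request carries more structure, and label values are not escaped here.

```python
from typing import Dict, List


def build_promql(
    metric: str,
    filters: Dict[str, str],
    group_by: List[str],
    window_seconds: int,
    aggregation: str = "sum",
) -> str:
    """Build a PromQL expression aggregating a metric over a time window.

    Assumes `aggregation` is one of sum, min, max, avg, or count, each of
    which has a matching `<aggregation>_over_time` function in PromQL.
    """
    # Equality matchers only; sorted so the output is deterministic.
    matchers = ",".join(f'{key}="{val}"' for key, val in sorted(filters.items()))
    selector = f"{metric}{{{matchers}}}" if matchers else metric
    inner = f"{aggregation}_over_time({selector}[{window_seconds}s])"
    if group_by:
        return f"{aggregation} by ({', '.join(group_by)}) ({inner})"
    return f"{aggregation}({inner})"


print(build_promql("call_count", {"service_id": "abc"}, ["service_name"], 300))
```

A multi-metric request could call this once per metric and run the resulting queries in parallel, merging the responses afterwards, as the tasks above describe.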
Backlogs:
- Abstract out `dateTimeConvert` pinot construct query-service#112
- Function parity in Prometheus Request Handler query-service#114
- Support query with orderBy in Prometheus Request Handler query-service#115
- Update metricMap config in Prometheus View Definition query-service#116
- Conditional support for composite filter with `OR` operator in Prometheus Request Handler query-service#117
- Handle different metric types in Prometheus Request Handler query-service#118