[Feature Request] Application metrics support in platform #252

## Objectives

- Enable users to track and trend performance over time.
- Help users benchmark performance against industry peers.
- Help users correlate metrics with trace data so that both can be interpreted together.
- Enable users to improve service availability by helping them diagnose and understand the underlying drivers of performance gaps.
- Provide timely, quantifiable information about infrastructure failures or possible outages, to minimize downtime and increase efficiency.
- Provide insight into the root cause, including whether the problem lies in the application or the infrastructure, to help reduce MTTD and MTTR.
- Prescribe actions to improve system performance wherever possible.

## Use Case
Metrics are aggregated summary statistics that track parameters such as latency, traffic, errors, and saturation for entities over time, whereas alerting notifies users when metrics show an application deviating from normal or expected behavior. The two have interdependent use cases, but both ultimately serve the larger one: detection and resolution of performance issues.
- Early problem detection
    - Correlations
    - Alerting
    - Enhanced Availability
    - Detect and resolve performance issues
- Decision Making
    - Baselining and SLAs/change detection
    - Predictions/forecasting


## Proposal
### What application metrics should we support in the platform?
Ideally, we will collect the four golden signals out of the box for all applications and services:
- Latency: The time to complete requests
- Traffic: Number of requests per second served
- Errors: Application errors that occur when processing client requests or accessing resources.
- Saturation: The percentage or amount of resources currently being used
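
As a rough illustration of how the four signals above could be derived for one time window, here is a minimal Python sketch. The `Request` record, its field names, the `golden_signals` function, and the choice of a nearest-rank p99 for latency are all assumptions made for this example; they are not part of the Hypertrace codebase or of this proposal.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    """Hypothetical request record; field names are assumptions for illustration."""
    duration_ms: float
    is_error: bool


def golden_signals(requests: List[Request], window_seconds: float,
                   used_capacity: float, total_capacity: float) -> Dict[str, float]:
    """Summarize one time window into the four golden signals."""
    count = len(requests)
    durations = sorted(r.duration_ms for r in requests)
    if durations:
        # Nearest-rank p99: smallest value with at least 99% of samples at or below it.
        rank = max(1, math.ceil(0.99 * count))
        p99 = durations[rank - 1]
    else:
        p99 = 0.0
    errors = sum(1 for r in requests if r.is_error)
    return {
        "latency_p99_ms": p99,                          # Latency
        "traffic_rps": count / window_seconds,          # Traffic
        "error_rate": errors / count if count else 0.0, # Errors
        "saturation": used_capacity / total_capacity,   # Saturation
    }


if __name__ == "__main__":
    window = [Request(10, False), Request(20, False), Request(30, True), Request(40, False)]
    print(golden_signals(window, window_seconds=2, used_capacity=3, total_capacity=4))
```

In practice each signal would be computed per entity (service, API, or backend) and per time bucket; the sketch collapses that to a single window to keep the arithmetic visible.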

### User stories

| User Story | Importance |
| ---------- | ---------- |
| As a service/API/DB owner, I should be able to see application metrics related to my service/API/DB, such as latency, traffic, error, and saturation metrics, in the platform. | P1 |
| As a system administrator, I should be able to identify issues with critical services without having to switch between different platforms (e.g. an alert taking you to Grafana, and then from Grafana you need to come to Hypertrace to search for traces within that time range for service x). | P1 |
| As a service/API/DB owner, I should be able to change the granularity and the data points considered for metrics, or create custom metrics as per requirements. | P1 |
| As a system administrator who has already set up Prometheus to collect metrics, I want Hypertrace to accept metrics from Prometheus and display them in the platform. | P1 |
| As a system administrator, I should be able to see SLA/SLO deviation in application metrics and also trigger alerts based on that deviation. | P2 |
| As a system administrator, metrics in the observability platform should be the single source of truth for me. Given that, those metrics should be reliable and trustworthy. Wherever possible, things should be configurable and control should be granular in order to avoid false positives. | P1/P2 |
| As an administrator, I should be able to create custom dashboards with metrics from different services/APIs/DBs to keep an eye on critical metrics for critical applications. | P1/P2 |
| As a product owner, I should be able to get summary statistics of application metrics for a particular time range. | P2 |
| As a product owner/operator, I should be able to see a high-level summary of what's going on in the guts of my software and ask for greater detail on demand. This can be achieved in different ways:<br>Dashboards that present the most commonly viewed data in an immediately intelligible manner can help users understand system state at a glance.<br>Many different custom dashboard views can be created for different job functions or areas of interest. | P2 |
| As a system admin/product owner, I should be able to drill down from within summary displays to surface the information most pertinent to the current task.<br>One example is correlating application metrics and traces for an instant: clicking on a particular part of a metrics graph should show me all the related traces. | P2 |
| As a system admin, I should be able to dynamically adjust the scale of graphs, toggle off unnecessary metrics, and overlay information from multiple systems to make the tool useful for investigations or root cause analysis. (Related issue by Razorpay: [https://github.com/hypertrace/hypertrace/issues/218](https://github.com/hypertrace/hypertrace/issues/218)) | P3 |
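
The SLA/SLO deviation story and the false-positive concern in the table above can be made concrete with a small Python sketch. It assumes one possible alerting rule: fire only when a metric has exceeded its threshold for several consecutive windows, so that a single spike does not trigger an alert. The function name `slo_breached`, its parameters, and the rule itself are hypothetical and are not something this proposal specifies.

```python
from typing import List


def slo_breached(values: List[float], threshold: float, consecutive: int) -> bool:
    """Return True if the last `consecutive` values all exceed `threshold`.

    `values` holds one metric sample per window, oldest first
    (for example, p99 latency in milliseconds per minute).
    """
    if consecutive <= 0:
        raise ValueError("consecutive must be a positive integer")
    if len(values) < consecutive:
        # Not enough history yet to decide; do not alert.
        return False
    return all(v > threshold for v in values[-consecutive:])


if __name__ == "__main__":
    p99_latency_ms = [100, 250, 260, 270]
    if slo_breached(p99_latency_ms, threshold=200, consecutive=3):
        print("ALERT: p99 latency above SLO for 3 consecutive windows")
```

Raising `consecutive` trades detection speed for fewer false positives, which is the kind of granular control the single-source-of-truth story asks for.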

## Questions to address (if any)
- NA

## Work items
Phase 1: supporting existing metrics via Metric Store

Ingestion Tasks:

- [x] #295
- [ ] #294
- [x] #297
- [x] #307
- [x] #304
- [ ] #308
- [ ] #312
- [ ] Enhance metrics extractor/generator to address lag in pipeline
- [ ] Enhance metrics extractor/generator for error_count and latency metrics
- [ ] #327

Querying Tasks:

- [ ] Define metric-handler with required metadata information
- [x] #324
- [x] #320
- [x] #322
- [ ] #325
- [ ] #326

Backlogs:
- [ ] https://github.com/hypertrace/query-service/issues/112
- [ ] https://github.com/hypertrace/query-service/issues/114
- [ ] https://github.com/hypertrace/query-service/issues/115
- [ ] https://github.com/hypertrace/query-service/issues/116
- [ ] https://github.com/hypertrace/query-service/issues/117 
- [ ] https://github.com/hypertrace/query-service/issues/118
