
[Feature Request] Application metrics support in platform #252

Open
@JBAhire

Description

Objectives

  • Enable users to track and trend performance over time.
  • Help users benchmark performance against industry peers.
  • Help users correlate metrics with trace data so that each helps explain the other.
  • Enable users to improve service availability by helping them diagnose and understand the underlying drivers of performance gaps.
  • Provide timely, quantifiable information to users about infrastructure failures or possible outages, to minimize downtime and increase efficiency.
  • Provide insight into the root cause, including whether the problem is caused by the app or the infrastructure, to help reduce MTTD and MTTR.
  • Prescribe actions to improve performance of the system whenever possible.

Use Case

Metrics are aggregated summary statistics that track parameters such as latency, traffic, errors, and saturation for entities over time. Alerting notifies users when metrics show the application deviating from normal or expected behavior. The two have interdependent use cases, and both ultimately serve the larger one: detecting and resolving performance issues.

  • Early problem detection
    • Correlations
    • Alerting
    • Enhanced Availability
    • Detect and resolve performance issues
  • Decision Making
    • Baselining and SLAs/ Change detection
    • Predictions/ forecasting
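
To make the link between baselining and alerting concrete, here is a minimal sketch of one possible deviation check. It assumes a simple z-score rule over recent samples; the function name, the threshold of 3 standard deviations, and the sample values are hypothetical and not part of this proposal.

```python
from statistics import mean, stdev


def deviates_from_baseline(baseline, current, threshold=3.0):
    """Return True when `current` is more than `threshold` standard
    deviations away from the mean of the `baseline` samples."""
    if len(baseline) < 2:
        return False  # not enough history to estimate the spread
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any change counts as a deviation.
        return current != mu
    return abs(current - mu) / sigma > threshold


# Hypothetical latency samples (ms) forming the baseline.
history = [120, 118, 125, 122, 119, 121, 123]

print(deviates_from_baseline(history, 124))  # within normal variation
print(deviates_from_baseline(history, 310))  # far outside the baseline
```

A real alerting rule would likely need seasonality handling and configurable sensitivity to avoid false positives; this only illustrates the "metric deviates from expected behavior" idea.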

Proposal

What application metrics should we support in Platform?

Ideally, we will collect the four golden signals out of the box for all applications and services:

  • Latency: The time to complete requests
  • Traffic: Number of requests per second served
  • Errors: Application errors that occur when processing client requests or accessing resources.
  • Saturation: The percentage or amount of resources currently being used
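
As an illustration of how the four signals could be derived for one time window, here is a rough sketch. The `Request` record, the choice of p99 (nearest-rank) as the latency summary, and the `used`/`capacity` inputs for saturation are all assumptions made for the example, not a description of how the platform computes them.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time taken to complete the request
    is_error: bool      # True if the request failed


def golden_signals(requests, window_seconds, used, capacity):
    """Summarize one time window into latency, traffic, errors, saturation."""
    durations = sorted(r.duration_ms for r in requests)
    if durations:
        # Nearest-rank 99th percentile of request durations.
        rank = math.ceil(0.99 * len(durations))
        p99 = durations[rank - 1]
        error_rate = sum(r.is_error for r in requests) / len(requests)
    else:
        p99, error_rate = None, 0.0
    return {
        "latency_p99_ms": p99,
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": error_rate,
        "saturation": used / capacity,  # fraction of the resource in use
    }


# Hypothetical data: four requests observed in a 2-second window.
sample = [Request(10, False), Request(20, False),
          Request(30, False), Request(400, True)]
signals = golden_signals(sample, window_seconds=2, used=6, capacity=8)
print(signals)
```

In practice these values would more likely be derived from span data or pre-aggregated counters and histograms than from raw request lists.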

User stories

| User Story | Importance |
| --- | --- |
| As a service/API/DB owner, I should be able to see application metrics for my service/API/DB, such as latency, traffic, error, and saturation metrics, in the platform. | P1 |
| As a system administrator, I should be able to identify issues with critical services without having to switch between different platforms (e.g. an alert takes you to Grafana, and from Grafana you then need to come to Hypertrace to search for traces within that time range for service x). | P1 |
| As a service/API/DB owner, I should be able to change the granularity and the data points considered for metrics, or create custom metrics as required. | P1 |
| As a system administrator who has already set up Prometheus to collect metrics, I want Hypertrace to accept metrics from Prometheus and display them in the platform. | P1 |
| As a system administrator, I should be able to see SLA/SLO deviations in application metrics and trigger alerts based on those deviations. | P2 |
| As a system administrator, metrics in the observability platform should be my single source of truth, so they must be reliable and trustworthy. Wherever possible, things should be configurable and control should be granular, in order to avoid false positives. | P1/P2 |
| As an administrator, I should be able to create custom dashboards with metrics from different services/APIs/DBs to keep an eye on critical metrics for critical applications. | P1/P2 |
| As a product owner, I should be able to get summary statistics of application metrics for a particular time range. | P2 |
| As a product owner/operator, I should be able to see a high-level summary of what's going on in the guts of my software and ask for greater detail on demand. This can be achieved in different ways: dashboards that present the most commonly viewed data in an immediately intelligible manner can help users understand system state at a glance, and many different custom dashboard views can be created for different job functions or areas of interest. | P2 |
| As a system admin/product owner, I should be able to drill down from within summary displays to surface the information most pertinent to the current task. One example is correlating application metrics and traces for an instant: clicking on a particular point in a metrics chart should show me all the related traces. | P2 |
| As a system admin, I should be able to dynamically adjust the scale of graphs, toggle off unnecessary metrics, and overlay information from multiple systems to make the tool useful for investigations or root cause analysis (related issue by Razorpay: #218). | P3 |
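
The drill-down story (clicking a point in a metrics chart to see the related traces) could work roughly as sketched below. The `Trace` record and `traces_for_bucket` helper are hypothetical names invented for this example; they are not existing Hypertrace APIs, and a real implementation would query the trace store rather than filter an in-memory list.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    trace_id: str
    service: str
    start_time: float   # seconds since epoch
    duration_ms: float


def traces_for_bucket(traces, service, bucket_start, bucket_seconds):
    """Return the traces of `service` that started inside the metric
    bucket [bucket_start, bucket_start + bucket_seconds), slowest first."""
    bucket_end = bucket_start + bucket_seconds
    matching = [
        t for t in traces
        if t.service == service and bucket_start <= t.start_time < bucket_end
    ]
    return sorted(matching, key=lambda t: t.duration_ms, reverse=True)


# Hypothetical traces; the user clicks the 60-second bucket starting at t=1000.
all_traces = [
    Trace("a1", "checkout", 1005.0, 40.0),
    Trace("b2", "checkout", 1030.0, 900.0),
    Trace("c3", "search", 1010.0, 15.0),
    Trace("d4", "checkout", 1075.0, 55.0),
]
related = traces_for_bucket(all_traces, "checkout", 1000.0, 60)
print([t.trace_id for t in related])
```

Sorting slowest-first is one design choice among several; for an error-rate chart, ordering failed traces first might be more useful.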

Questions to address (if any)

  • NA

Work items

Phase 1: supporting existing metrics via Metric Store

Ingestion Tasks:

Querying Tasks:

Backlogs:
