- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives Considered
- Infrastructure Needed (Optional)
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
The Kubelet Resource Metrics Endpoint is a new kubelet metrics endpoint which serves metrics required by the cluster-level Resource Metrics API. The proposed design uses the prometheus text format, and provides the minimum required metrics for serving the Resource Metrics API.
The Kubelet Summary API is a source of both Resource and Monitoring Metrics. Because of its dual purpose, it does a poor job of both. It provides much more information than required by the Metrics Server, as demonstrated by kubernetes/kubernetes#68841. Additionally, we have pushed back on adding metrics to the Summary API for monitoring, such as DiskIO or tcp/udp metrics, because they are expensive to collect, and not required by all users.
This proposal deals with the first problem, which is that the Summary API is a poor provider of Resource Metrics. It proposes a purpose-built API for supplying Resource Metrics.
The Monitoring Architecture proposal established separate pipelines for Resource Metrics, and for Monitoring Metrics. The Core Metrics proposal describes the set of metrics that we consider core, and their uses. Note that the term “core” is overloaded, and this document will refer to these as Resource Metrics, since they are for first class kubernetes resources and are served by the Resource Metrics API at the cluster-level.
A previous proposal by @DirectXMan12 also proposed a prometheus endpoint. The kubernetes metrics overhaul KEP acknowledges the need to export fewer metrics from the kubelet. This new API is a step in that direction, as it eliminates the Metrics Server's dependency on the Summary API.
For the purposes of this document, I will use the following definitions:
- Resource Metrics: Metrics for the consumption of first-class resources (CPU, Memory, Ephemeral Storage) which are aggregated by the Metrics Server, and served by the Resource Metrics API
- Monitoring Metrics: Metrics for observability and introspection of the cluster, which are used by end-users, operators, devs, etc.
The Kubelet’s JSON Summary API is currently used by the Metrics Server. It contains far more metrics than are required by the Metrics Server.
Prometheus is commonly used for exposing metrics for kubernetes components, and the Prometheus Operator, which Sig-Instrumentation works on, is commonly used to deploy and manage metrics collection.
OpenMetrics is a new prometheus-based metric standard which supports both text and protobuf.
GRPC is commonly used for interfaces between components in kubernetes, such as the Container Runtime Interface. GRPC uses protocol-buffers (protobuf) for serialization and deserialization, which is more performant than other formats.
- [Primary] Provide the minimum set of metrics required to serve the Resource Metrics API
- [Secondary] Minimize the CPU and Memory footprint that the metrics server incurs when collecting metrics
- Perform efficiently at frequent (sub-second) rates of metrics collection
- [Secondary] Use a format that is familiar to the kubernetes community and is interoperable with commonly-used monitoring pipelines.
- Deprecate or remove the Summary API
- Add new Resource Metrics to the metrics server (e.g. Ephemeral Storage)
- Detail how the kubelet will collect metrics to support this API.
- Determine what the pipeline for “Monitoring” metrics will look like
The kubelet will expose an endpoint at /metrics/resource in prometheus text exposition format using the prometheus client library.
The metrics in this endpoint will make use of the Kubernetes Metrics Stability framework for stability and deprecation policies.
# Cumulative cpu time consumed by a container in seconds
Name: container_cpu_usage_seconds_total
Labels: container, pod, namespace
# Current working set of a container in bytes
Name: container_memory_working_set_bytes
Labels: container, pod, namespace
# Cumulative cpu time consumed by the node in seconds
Name: node_cpu_usage_seconds_total
Labels:
# Current working set of the node in bytes
Name: node_memory_working_set_bytes
Labels:
Explicit timestamps (see the prometheus exposition format docs) will be added to metrics because metrics are currently collected out-of-band by cAdvisor and cached. We make no guarantees about the age of metrics, but include the timestamp to allow readers to correctly calculate rates, etc. This deviates from the prometheus best practices, and we should attempt to migrate to synchronous collection during each scrape in the future.
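As a rough illustration of what a reader of this endpoint sees, the sketch below parses a hand-written payload in the prometheus text format, where each sample line may end with an explicit timestamp in milliseconds since the epoch. The payload values, the `parse_samples` helper, and the pod and container names are all made up for this example. The parser is deliberately minimal (it does not unescape label values) and is not the Metrics Server's actual implementation.

```python
import re

# Illustrative payload only: the values, timestamps, and names are invented.
SAMPLE = '''\
# HELP container_cpu_usage_seconds_total Cumulative cpu time consumed by a container in seconds
# TYPE container_cpu_usage_seconds_total counter
container_cpu_usage_seconds_total{container="app",namespace="default",pod="web-0"} 12.5 1700000000000
# TYPE node_memory_working_set_bytes gauge
node_memory_working_set_bytes 2147483648 1700000000000
'''

# name, optional {labels}, value, optional timestamp (milliseconds since epoch)
_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<ts>-?\d+))?$'
)
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return a list of (name, labels, value, timestamp_ms or None) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and HELP/TYPE comments
        match = _LINE.match(line)
        if match is None:
            raise ValueError(f"unparseable sample line: {line!r}")
        labels = dict(_LABEL.findall(match.group('labels') or ''))
        ts = match.group('ts')
        samples.append((
            match.group('name'),
            labels,
            float(match.group('value')),
            int(ts) if ts is not None else None,
        ))
    return samples


if __name__ == "__main__":
    for sample in parse_samples(SAMPLE):
        print(sample)
```

Because the timestamp travels with each sample, a reader can tell how old a cached value is instead of assuming it was collected at scrape time.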
Use separate metrics for node and containers to avoid “magic” container names, such as “machine”.
Currently the Metrics Server uses a 10s average of CPU usage provided by the kubelet summary API. The kubelet should provide the raw cumulative CPU usage so the metrics server can determine the time period over which it wants to take the rate.
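To make the rate calculation concrete, here is a minimal sketch of how a consumer could derive average CPU usage in cores from two scrapes of a cumulative counter such as container_cpu_usage_seconds_total. The function name `cpu_cores_used` and the tuple layout are assumptions for illustration, and the handling of counter resets is one possible policy rather than what the Metrics Server necessarily does.

```python
def cpu_cores_used(prev, curr):
    """Average CPU usage in cores between two samples.

    Each sample is a (cumulative_cpu_seconds, timestamp_ms) pair, using the
    explicit timestamp reported alongside the metric.
    Returns None when no meaningful rate can be computed.
    """
    prev_seconds, prev_ts_ms = prev
    curr_seconds, curr_ts_ms = curr

    elapsed_seconds = (curr_ts_ms - prev_ts_ms) / 1000.0
    if elapsed_seconds <= 0:
        return None  # same or out-of-order cached sample

    delta_seconds = curr_seconds - prev_seconds
    if delta_seconds < 0:
        return None  # counter went backwards, e.g. the container restarted

    return delta_seconds / elapsed_seconds


if __name__ == "__main__":
    # 7.5 CPU-seconds consumed over a 15 second window
    print(cpu_cores_used((100.0, 1_000), (107.5, 16_000)))
```

The window is whatever interval separates the two samples, so the consumer chooses the averaging period simply by choosing which scrapes to compare.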
Labels are named in accordance with the kubernetes instrumentation guidelines, and thus are named pod, rather than pod_name.
Example implementation: https://github.com/kubernetes/kubernetes/compare/master...dashpole:prometheus_core_metrics
OpenMetrics is an upcoming prometheus-based standard which has support for protocol buffers. By using this format when it becomes available, we can further improve the efficiency of the Resource Metrics Pipeline, while maintaining compatibility with other monitoring pipelines.
This experiment compares the current JSON Summary API to prometheus and GRPC at 1s and 30s scrape intervals. Prometheus uses basic text parsing, and grpc uses a basic Get() API.
The setup has 10 nodes, 500 pods, and 6500 containers (running pause). Nodes have 1 CPU core and 3.75GB memory. The same cluster was used for all benchmarks for consistency, with a different Metrics Server running. The values below are the maximum values reported during a 10 minute period.
We can see that GRPC has the lowest CPU usage of all formats tested, and is an order-of-magnitude improvement over the current JSON Summary API. Memory usage for both GRPC and Prometheus is similarly lower than that of the JSON Summary API.
After learning that the prometheus server achieves better performance with caching, I performed an additional round of tests. These used a metrics-server which caches metric descriptors it has parsed before, and tested with larger numbers of container metrics.
This experiment compares basic prometheus, optimized prometheus parsing and GRPC at 1s scrape intervals with higher numbers of container metrics. "Unoptimized Prometheus" uses basic text parsing, "Prometheus w/ Caching" borrows caching logic from the prometheus server to avoid re-parsing metric descriptors it has already parsed, and grpc uses a basic Get() API.
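The caching idea can be sketched as follows: since the series part of each line (metric name plus labels) is identical from scrape to scrape, a parser can key a cache on that raw text and only parse the value and timestamp on later scrapes. This is a simplified illustration in the spirit of that optimization, not the code used in the benchmark. The `CachingParser` class and its `misses` counter are hypothetical names.

```python
import re

_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


class CachingParser:
    """Parses sample lines, caching the parsed (name, labels) per series."""

    def __init__(self):
        self._cache = {}   # raw series text -> (name, labels)
        self.misses = 0    # how many times the series text had to be parsed

    @staticmethod
    def _parse_series(series):
        name, _, rest = series.partition('{')
        labels = tuple(sorted(_LABEL.findall(rest)))
        return name, labels

    def parse_line(self, line):
        # The series text ends at the closing brace, or at the first space
        # when the metric has no labels.
        end = line.rfind('}') + 1
        if end == 0:
            end = line.index(' ')
        series, fields = line[:end], line[end:].split()

        parsed = self._cache.get(series)
        if parsed is None:
            self.misses += 1
            parsed = self._parse_series(series)
            self._cache[series] = parsed

        name, labels = parsed
        value = float(fields[0])
        timestamp_ms = int(fields[1]) if len(fields) > 1 else None
        return name, labels, value, timestamp_ms


if __name__ == "__main__":
    parser = CachingParser()
    series = 'container_memory_working_set_bytes{container="app",pod="web-0"}'
    print(parser.parse_line(series + ' 1048576 1700000000000'))
    print(parser.parse_line(series + ' 2097152 1700000001000'))
    print("series parsed", parser.misses, "time(s)")
```

With many containers and a short scrape interval, nearly every line hits the cache, so the per-scrape cost is dominated by parsing two numbers per sample rather than re-parsing labels.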
The setup has 10 nodes, and up to 40,000 containers (running pause). Nodes have 2 CPU cores and 7.5GB memory. The same cluster was used for all benchmarks for consistency, with a different Metrics Server running. The values below are the maximum values reported during a 10 minute period.
This experiment "fakes" large numbers of containers by having the kubelet return 100 container metrics for each actual container run on the node.
Both gRPC and the optimized prometheus were able to scale to 40k containers. The gRPC implementation was more efficient by a factor of approx. 3.
[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Test the new endpoint with a node-e2e test similar to the current summary API test. Testgrid: https://testgrid.k8s.io/sig-node-kubelet#node-kubelet-features-master&include-filter-by-regex=ResourceMetricsAPI
Alpha:
- Implement the kubelet resource metrics endpoint as described above
Beta:
- Modify the metrics server to consume the kubelet resource metrics endpoint 3 releases after it is added to the kubelet
GA:
- Add node-e2e test
The kubelet can be upgraded or downgraded normally with respect to this feature. Users of the metrics endpoint, such as the metrics server, should use other kubelet metrics endpoints (such as the summary api) before downgrading.
This feature affects only the kubelet - in that it will expose the resource metrics for kubelet in a new endpoint, so there is no issue with version skew with other components.
- Feature gate (also fill in values in kep.yaml)
- Feature gate name:
- Components depending on the feature gate:
- Other
- Describe the mechanism: This feature exposes the /metrics/resource endpoint for kubelet, with all metrics annotated as STABLE. Note: Because this feature was built before the PRR process was established, it unfortunately does not adhere to the best practices of feature enablement/disablement
- Will enabling / disabling the feature require downtime of the control plane? No
- Will enabling / disabling the feature require downtime or reprovisioning of a node? No
It will expose the /metrics/resource endpoint for kubelet by default
No, this feature cannot be disabled once it has been enabled, since there is no feature flag for it. To roll back, one will have to downgrade the kubernetes version. Note: This feature was added in v1.14, so to disable it, one would need to switch back to a version older than v1.14.
/metrics/resource endpoint for kubelet will become available
Since there is no feature gate involved for this, there are no feature enablement/disablement tests.
A rollback can impact running workloads if clients, such as the metrics server, are relying on metrics provided by the endpoint. The rollback could break cluster functions, such as HPA, if the metrics were no longer available.
The following metrics exposed by the /metrics/resource endpoint could be used:
- node_memory_working_set_bytes
- pod_memory_working_set_bytes
We could compute node_memory_working_set_bytes - sum(pod_memory_working_set_bytes) to know if there's a memory leak.
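A minimal sketch of that calculation is shown below. It assumes the samples have already been scraped and parsed into `(metric_name, labels, value)` tuples. The function name `unattributed_node_memory`, the tuple layout, and the sample values are illustrative only.

```python
def unattributed_node_memory(samples):
    """Return node working set minus the sum of pod working sets, in bytes.

    `samples` is an iterable of (metric_name, labels, value) tuples taken
    from a single scrape of the kubelet's /metrics/resource endpoint.
    """
    node_bytes = None
    pod_bytes = 0.0
    for name, _labels, value in samples:
        if name == "node_memory_working_set_bytes":
            node_bytes = value
        elif name == "pod_memory_working_set_bytes":
            pod_bytes += value
    if node_bytes is None:
        raise ValueError("no node_memory_working_set_bytes sample found")
    return node_bytes - pod_bytes


if __name__ == "__main__":
    scrape = [
        ("node_memory_working_set_bytes", {}, 4e9),
        ("pod_memory_working_set_bytes", {"pod": "web-0", "namespace": "default"}, 1e9),
        ("pod_memory_working_set_bytes", {"pod": "web-1", "namespace": "default"}, 5e8),
    ]
    print(unattributed_node_memory(scrape))
```

A single value says little on its own, since the difference also covers memory used outside of pods. It is steady growth of this value across scrapes that would hint at a leak.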
No, because the feature was enabled (with no way to disable) since v1.14.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No
By checking kubelet's /metrics/resource endpoint
- Events
- Event Reason:
- API .status
- Condition name:
- Other field:
- Other (treat as last resort)
- Details: /metrics/resource endpoint for kubelet should show resource metrics
This feature introduces a metrics endpoint that can be used to establish SLOs.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
This feature introduces a metrics endpoint that can be used to determine health of kubelet.
Are there any missing metrics that would be useful to have to improve observability of this feature?
No
Kubelet
No
No
No
No
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
No. In fact, CPU usage is reduced compared to that of the Summary API, which was previously used by the metrics server.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No
No impact
/metrics/resource endpoint is not available
Memory leaks should be checked by looking at node_memory_working_set_bytes - sum(pod_memory_working_set_bytes). If the problem is severe, the kubernetes version should be downgraded so that the /metrics/resource endpoint is not exposed for kubelet. Keep in mind, users of these metrics should switch to other metrics endpoints (such as the summary api) before downgrading.
- 2019-01-24: Initial KEP published.
- 2019-01-29: Presentation to Sig-Node
- 2019-02-04: KEP gets LGTM and Approval
- 2019-02-07: Presentation to Sig-Instrumentation
- 2020-01-14: [1.18] Endpoint copied from /metrics/resource/v1alpha1 to /metrics/resource, and adopting the metrics stability framework: kubernetes/kubernetes#86282
- 2020-09-01: [1.20] /metrics/resource/v1alpha1 removed: kubernetes/kubernetes#94272
- 2021-06-28: Use kubelet's /metrics/resource endpoint in metrics-server: kubernetes-sigs/metrics-server#777
- 2023-08-23: [1.29] GA graduation, non conformance test added kubernetes/kubernetes#116897
- 2023-09-08: [1.29] Promoted test to conformance test kubernetes/kubernetes#120473
As demonstrated in the benchmarks above, the proto-based gRPC endpoint is the most efficient in terms of CPU and Memory usage. Such an endpoint could potentially be made even more efficient at high rates of collection by using streaming rather than scraping.
However, given the prevalence of the Prometheus format within the kubernetes community, gRPC is not as compatible with common monitoring pipelines. The endpoint would only be useful for supplying metrics for the Metrics Server, or monitoring components that integrate directly with it.
When using caching in the Metrics Server, the prometheus text format performs well enough for us to prefer prometheus over gRPC given the prevalence of prometheus in the community. When the OpenMetrics format becomes stable, we can get even closer to the performance of gRPC by using the proto-based format.
syntax = "proto3";

// Usage is a set of resources consumed
message Usage {
  int64 time = 1;
  uint64 cpu_usage_core_nanoseconds_total = 2;
  uint64 memory_working_set_bytes = 3;
}

// ContainerUsage is the resource usage for a single container
message ContainerUsage {
  string name = 1;
  Usage usage = 2;
}

// PodUsage is the resource usage for a pod
message PodUsage {
  string name = 1;
  string namespace = 2;
  repeated ContainerUsage containers = 3;
}

// MetricsResponse is sent by the kubelet in response to the Get RPC
message MetricsResponse {
  Usage node = 1;
  repeated PodUsage pods = 2;
}

// MetricsRequest is the empty request message for Kubelet
message MetricsRequest {}

// ResourceMetrics is the service advertised by the kubelet for usage metrics.
service ResourceMetrics {
  rpc Get(MetricsRequest) returns (MetricsResponse) {}
}