
Don't copy whole response during response marshalling #129304

Open
serathius opened this issue Dec 19, 2024 · 8 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.

Comments

@serathius
Contributor

What would you like to be added?

When experimenting with measuring memory usage for large LIST requests, I noticed one thing that surprised me. It's expected that the apiserver requires a lot of memory when listing from etcd: it needs to fetch the data and decode the etcd response. But what about listing from the watch cache?

I was surprised that listing from cache still required gigabytes of memory (10 concurrent LISTs of 1.5GB of data increased memory usage by 22GB). Why? The apiserver already has all the data it needs; we copy the structure of the data (e.g. when converting between types), but the memory needed for that should be minuscule. Most of the data should come from strings, which are immutable and can be shared. This led me to revisit an old discussion I had with @mborsz and his experimental work on streaming lists from etcd (master...mborsz:kubernetes:streaming), where he proposed implementing a custom streaming encoder. I looked at the current implementation of encoding:

The built-in json library encoder still marshals the whole value and writes it at once. While this is fine for single objects that weigh around 2MB, it's bad for LIST responses, which can be up to 2GB.

I did a PoC that confirmed my suspicions (master...serathius:kubernetes:streaming-encoder-list). A simple hack over the encoder reduced the memory needed from 26GB to 4GB.

Proposal:

  • Validate different options for streaming JSON and Proto encoders for LIST responses and enable them.

Options:

Other thoughts:

Why is this needed?

For LISTs served from the watch cache, it prevents allocating memory proportional to the size of the response. This makes memory usage more proportional to CPU usage, improving the cost estimations of APF, which cares only about memory.

For LISTs served from etcd, there is an ongoing proposal to serve them from the cache instead.

@serathius serathius added the kind/feature Categorizes issue or PR as related to a new feature. label Dec 19, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 19, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@serathius
Contributor Author

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 19, 2024
@dims
Member

dims commented Dec 19, 2024

cc @mengqiy @hakuna-matatah

@mborsz
Member

mborsz commented Dec 19, 2024

Great observation Marek!

I think there are two main unknowns with the proposal:

  • The potential impact on the other resource types
  • Complexity to deliver some production version of this logic

Having better visibility for both will be helpful to understand the tradeoff we are making.

@serathius
Contributor Author

serathius commented Dec 20, 2024

The potential impact on the other resource types --- IIUC the current experiments were for configmaps, for which this approach is clearly beneficial (as nearly all the data is in strings). The next step would probably be to benchmark different scenarios to understand the potential gain. Having similar benchmarks for e.g. pods, which may have a lot of nested structs and not that much data in strings, may help us understand the potential impact in other scenarios.

Tested pods, spreading the data in 1KB chunks across containers, init containers, volumes, and condition fields to create a large structure. The test results showed smaller, but still substantial, benefits: memory usage went down from 27GB to 9GB. With pods the base memory is around 7GB, so we reduce allocations from 20GB to 2GB. I can accept that :P
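The benchmark setup described above (spreading payload data in 1KB chunks across nested fields) could look roughly like this. The `pod`, `container`, and `makeLargePod` names are hypothetical, simplified stand-ins; the real test would use actual Pod objects:

```go
package main

import (
	"fmt"
	"strings"
)

// container is a simplified stand-in for a pod container.
type container struct {
	Name string
	Env  string // carries one payload chunk
}

// pod is a simplified stand-in for a Pod with nested structure.
type pod struct {
	Containers     []container
	InitContainers []container
	Conditions     []string
}

// makeLargePod spreads `total` bytes of payload across nested fields
// in chunkSize pieces, producing a large but deeply structured object
// rather than one big flat string.
func makeLargePod(total, chunkSize int) pod {
	chunk := strings.Repeat("x", chunkSize)
	var p pod
	for placed, i := 0, 0; placed < total; placed, i = placed+chunkSize, i+1 {
		switch i % 3 {
		case 0:
			p.Containers = append(p.Containers, container{Name: fmt.Sprintf("c%d", i), Env: chunk})
		case 1:
			p.InitContainers = append(p.InitContainers, container{Name: fmt.Sprintf("i%d", i), Env: chunk})
		case 2:
			p.Conditions = append(p.Conditions, chunk)
		}
	}
	return p
}

func main() {
	p := makeLargePod(10*1024, 1024) // ~10KB spread over 10 chunks
	fmt.Println(len(p.Containers), len(p.InitContainers), len(p.Conditions))
	// prints 4 3 3
}
```

Encoding objects shaped like this stresses the struct-copying paths of the encoder, whereas flat configmap-style objects mostly exercise shared string data, which is why the two cases benefit differently.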

@serathius
Contributor Author

serathius commented Dec 20, 2024

Complexity to deliver some production version of this logic -- availability of good serialization libraries, whether we can deliver all the fancy features like "as=Table"

I think we can skip them for now. Our main focus should be on the default client behavior, which is plain JSON. One sad thing is that we don't really have any JSON performance coverage: scalability tests run everything in proto, based on the fact that K8s would not hit 5k nodes if we used JSON at all. Still, I would like to have Proto implemented just to see the impact in scalability tests.

@sftim
Contributor

sftim commented Dec 21, 2024

I imagine we'd like to triage this as accepted. What are the downsides?
