
Don't copy whole response during response marshalling #129304

Open
serathius opened this issue Dec 19, 2024 · 8 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.

Comments

@serathius
Contributor

What would you like to be added?

When experimenting with measuring memory usage for large LIST requests, I noticed one thing that surprised me. It's expected that the apiserver requires a lot of memory when listing from etcd: it needs to fetch the data and decode the etcd response. But what about listing from the watch cache?

I was surprised that listing from cache still required gigabytes of memory (10 concurrent LISTs of 1.5GB of data increased memory usage by 22GB). Why? The apiserver already has all the data it needs; we copy the structure of the data (e.g. when converting between types), but the memory needed for that should be minuscule. Most of the data should come from strings, which are immutable and can be shared. This led me to revisit an old discussion I had with @mborsz and his experimental work on streaming lists from etcd (master...mborsz:kubernetes:streaming), where he proposed implementing a custom streaming encoder. I looked at the current implementation of encoding:

The built-in json library encoder still marshals the whole value and writes it at once. While this is fine for single objects that weigh around 2MB, it's bad for LIST responses, which can be up to 2GB.

I did a PoC that confirmed my suspicions (master...serathius:kubernetes:streaming-encoder-list). A simple hack over the encoder reduced the memory needed from 26GB to 4GB.

Proposal:

  • Validate different options for streaming JSON and Proto encoders for LIST responses and enable them.

Options:

Other thoughts:

Why is this needed?

For LISTs served from the watch cache, it prevents allocating memory proportional to the size of the response. This makes memory usage more proportional to CPU usage, improving the cost estimations of APF, which cares only about memory.

For LISTs served from etcd, there is an ongoing proposal to serve them from the cache instead.

@serathius serathius added the kind/feature Categorizes issue or PR as related to a new feature. label Dec 19, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 19, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@serathius
Contributor Author

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 19, 2024
@dims
Member

dims commented Dec 19, 2024

cc @mengqiy @hakuna-matatah

@mborsz
Member

mborsz commented Dec 19, 2024

Great observation Marek!

I think there are two main unknowns with the proposal:

  • The potential impact on the other resource types
  • Complexity to deliver some production version of this logic

Having better visibility for both will be helpful to understand the tradeoff we are making.

@serathius
Contributor Author

serathius commented Dec 20, 2024

The potential impact on the other resource types --- IIUC the current experiments were for configmaps, for which this approach is clearly beneficial (as nearly all the data is in strings). The next step would probably be to benchmark different scenarios to understand the potential gain. Having similar benchmarks for e.g. pods, which may have a lot of nested structs and not that much data in strings, may help us understand the potential impact in other scenarios.

Tested pods, spreading the data in 1KB chunks across containers, init containers, volumes, and condition fields to create a large structure. The test results showed smaller, but still substantial, benefits: memory usage went down from 27GB to 9GB. With pods the base memory is around 7GB, so we reduce allocations from 20GB to 2GB. I can accept that :P
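The benchmark setup described above (spreading payload data in 1KB chunks across nested fields) could look roughly like this. The `pod`, `container`, and `makeLargePod` names are hypothetical, simplified stand-ins; the real test would use actual Pod objects:

```go
package main

import (
	"fmt"
	"strings"
)

// container is a simplified stand-in for a pod container.
type container struct {
	Name string
	Env  string // carries one payload chunk
}

// pod is a simplified stand-in for a Pod with nested structure.
type pod struct {
	Containers     []container
	InitContainers []container
	Conditions     []string
}

// makeLargePod spreads `total` bytes of payload across nested fields
// in chunkSize pieces, producing a large but deeply structured object
// rather than one big flat string.
func makeLargePod(total, chunkSize int) pod {
	chunk := strings.Repeat("x", chunkSize)
	var p pod
	for placed, i := 0, 0; placed < total; placed, i = placed+chunkSize, i+1 {
		switch i % 3 {
		case 0:
			p.Containers = append(p.Containers, container{Name: fmt.Sprintf("c%d", i), Env: chunk})
		case 1:
			p.InitContainers = append(p.InitContainers, container{Name: fmt.Sprintf("i%d", i), Env: chunk})
		case 2:
			p.Conditions = append(p.Conditions, chunk)
		}
	}
	return p
}

func main() {
	p := makeLargePod(10*1024, 1024) // ~10KB spread over 10 chunks
	fmt.Println(len(p.Containers), len(p.InitContainers), len(p.Conditions))
	// prints 4 3 3
}
```

Encoding objects shaped like this stresses the struct-copying paths of the encoder, whereas flat configmap-style objects mostly exercise shared string data, which is why the two cases benefit differently.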

@serathius
Contributor Author

serathius commented Dec 20, 2024

Complexity to deliver some production version of this logic -- availability of good serialization libraries, whether we can deliver all the fancy features like "as=Table"

I think we can skip them for now. Our main focus should be on the default client behavior, which is plain JSON. One sad thing is that we don't really have any JSON performance coverage: scalability tests run everything in proto, based on the fact that K8s would not hit 5k nodes if we used JSON at all. Still, I would like to have Proto implemented just to see the impact in scalability tests.

@sftim
Contributor

sftim commented Dec 21, 2024

I imagine we'd like to triage this as accepted. What are the downsides?
