Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[meta] Remote write 2.0 #13105

Open
9 of 24 tasks
cstyan opened this issue Nov 8, 2023 · 28 comments
Open
9 of 24 tasks

[meta] Remote write 2.0 #13105

cstyan opened this issue Nov 8, 2023 · 28 comments

Comments

@cstyan
Copy link
Member

cstyan commented Nov 8, 2023

This issue serves as a public place to track and discuss the progress of remote write v1.1 2.0, which is described in this spec document, and was first worked on here.

Still in progress via #13052 and a number of other PRs.

UPDATE Dec 12, 2023: we're changing this to be 2.0, we don't intend for the protobuf format in the new version to be backwards compatible

Issues (as up-to-date as possible)

In-progress

TODO

Complete

@cstyan cstyan changed the title Remote write 1.1 [meta] Remote write 1.1 Nov 8, 2023
@npazosmendez
Copy link
Contributor

On the question of: do we need string interning for Remote Write 1.1? Can't we reduce the payload sizes by just moving from Snappy to other compression algorithms? E.g. zstd (Victoria Metrics), gzip (OTLP).

The experiment

Comparison of Remote Write v1 vs the WIP Remote Write v1.1 (string interning), using the e2e bench scripts from #13102, which spawn multiple sender->receiver pairs with different versions. Clarifications:

  • Senders and receivers are fully working, with their own variation of the protocol. Senders scrape and send, receivers receive and write to their tsdb correctly. No stubs.
  • All senders are scraping an entire real kubernetes namespace, which includes over 100 pods exposing metrics. Many of these instances are microservices from mimir.
  • I'm using the flate package from https://github.com/klauspost/compress , which showed better results than compress/flate
  • I'm using the pure-go zstd package from https://github.com/klauspost/compress to avoid CGO wrappers.
  • *-v11-min-* are the instances using string interning.
  • The string interning technique is the following: the request has a single long string which consists of the concatenation of all strings. References are pairs of (offset_in_longstring uint32, length_of_string uint32). At the time of writing this is not the exact format from remote write 2.0: new proto format with string interning  #13052, but one we are still working on and will be sharing there soon.
  • I've tried to optimize the compression and decompression code for all algorithms. Especially regarding garbage, reusing buffers when possible. The hacky experimental code can be found here.

The results

image

The main takeway is: although just changing the current protocol from Snappy to (for instance) zstd already reduces the payload sizes significantly, string interning reduces the receiver's CPU significantly in addition to network usage. This is because the uncompressed payload is smaller and easier to decode, which makes the unmarshaling much faster and memory efficient. On top of string interning, changing the compression algorithm can be considered, to get even better network usage and mantaining the same CPU gains on the receiver (potentially, based on these results, this needs more digging).

Extra notes

  • These results are not the final performance of the string-interned protocol, as we are still tweaking things and experimenting. However, we think they are enough to make a case for string interning vs just changing the compression in the current protocol.
  • Additionally changing the compression from Snappy to another algorithm on top of string interning will be considered separately. There's also other algorithms we've been looking into, like brotli, S2, and also moving knobs in all algorithms (i.e. zstd-fast, zstd-best-compression, etc). The results I'm sharing here are only meant to answer the question above.

@cstyan
Copy link
Member Author

cstyan commented Nov 11, 2023

Incoming wall of text, skip to the end if you just want the conclusion instead of the steps to get there.

An update for everyone. We've essentially thrown out the map->proto based implementation because of resource usage concerns, especially on the receiving side.

First minimized format

To that end, I've got some more benchmark results. The first is a comparison of the initial map based approach against a 'minimized' format, which packs symbol offset/len into uint32. This is based on an experiment by @pracucci.

The results of benchmarking
min-results-1
min-results-2

As you can see, the wire bytes and receiver CPU usage were reduced even further. What was concerning was the increased CPU usage.

The minimized format is as follows:

message MinimizedTimeSeries {
  // Sorted list of label name-value pair references. This list's len is always multiple of 4,
  // packing tuples of (label name offset, label name length, label value offset, label value length).
  repeated uint32 label_symbols = 1 [(gogoproto.nullable) = false];

  ...
  ...
}

In the write request the symbols are now replaced with the following instead of a map.

  // The symbols table. All symbols are concatenated strings. To read the symbols table, it's required
  // to know the offset:length range of the actual symbol to read from this string.
  string symbols = 4;

Later iterations replace the repeated uint32 label_symbols with other types of integer representations of offset/length combinations into that same symbols string.

Minimized format optimizations

@npazosmendez found documentation that when uint32 is used in protobuf the generated marshalling code uses varint, see here. We then tried replacing the uint32 with fixed32 since we were always using the whole 32 bits anyways.

With that change we saw sane CPU usage results
single-fixed32-results

However, as I mentioned this new format worked by packing both the offset and length of each symbol into a single uint32. The offset was packed into 12 bits and the length into 20 bits. This has obvious limitations.

As remote write has no control or say over the # of labels a single metric can have, nor the length of the symbols for the label set, we shouldn't rely on the limitations that bit shifting/packing into smaller data sizes. However, it would be nice if we could still get the increased wire bytes reduction that comes with such a change.

Additional minimized format experiments

Nico tried using two fixed32 values for the offset/length, giving each the full 32 bits rather than 12/20. Obviously the wire bytes reduction was less than the single uint32 approach, so again he modified the format to use uint32 except this time he tried modifying the generated marshaling code since it results in extra CPU usage. See here. From profiles we noticed that this was mostly due to increased GC. The short version is that the generated marshaling code for varint creates a new buffer, writes data to it, and then copies that data to the final proto buffer. We can do the encoding in place with the nice trick he implemented in the previous link.

The benchmark results are as follows
min-32-optimized

I was curious about the sender's side CPU usage and wire bytes differences for other possible changes to the format. To that end, yesterday and today I've written some additional minimized formats to experiment with.

defintions

From here on I'll refer to the formats as follows:

  • base the current remote write 1.0 format
  • reduced the originally proposed map -> proto based interning format
  • min-32bit-optimized the 32bit varint format with Nico's optimized marshal function
  • min-64-fixed the minimized format but with the offset and length packed into a single 64bit value (32 bits each)
  • min-32-fixed the minimized format but with the offset and length each getting their own 32bit fixed value

various min format benchmarks

Note that I've modified the panel for the CPU usage diff. to be a range query timeseries panel. Comparing the CPU usage is a bit tricky, there's another comparison further down. For this set of experiments I also increased the batch size from the default to 5k since this is more indicative of a large production workload.

The results were as follows:
2023-11-10-150307_2517x911_scrot
2023-11-10-150316_2518x916_scrot

For the CPU usage, IMO the most interesting comparison would be to see what % of the time range was the CPU usage for one of the new formats less than the CPU usage of the current base format:
2023-11-10-150950_1659x1062_scrot

Roughly speaking, this makes it look like:

  • CPU usage for Nicos format min-32bit-optimized is below that of base ~60% of the time, and it has a wire bytes reduction of ~44%
  • CPU usage for min-64-fixed is below that of base ~75-80% of the time, but it only has a wire bytes reduction of ~28%
  • CPU usage for min-32-fixed is below that of base ~85-90% of the time, it has a wire bytes reduction of ~32%

If we look at just the heap memory usage:
2023-11-10-161810_954x1005_scrot

  • for both the sender and reciever the two fixed bit length options consistently use the least memory
  • the optimized varint format is always using less memory on the receiver side, and using less most of the time on the senders side

Conclusion

Based on the data in the previous section, I think both the formats min-32bit-optimized and min-32-fixed are our best options. The decision would come down to whether we want to use custom proto marshaling code. The modification from the generated code is fairly minimal, but IMO we would probably want to somehow add a test that would fail if (when we eventually replace gogoproto, for example) if the custom marshaling was no longer more efficient than the generated version.

@npazosmendez
Copy link
Contributor

npazosmendez commented Nov 13, 2023

Great summary. To add on the marshaling code optimizations:

  • This is a clean way to do it, which doesn't edit the generated code: c9b6ddd
  • To avoid tweaking the marshaling directly, we can try changing repeated uint32 label_symbols with bytes label_symbols (i.e. have the label_symbols be a []byte), and then we take care of the varint encoding when building the proto struct. I expect this will generate more memory efficient code, but didn't try yet. I will try it and report back

EDIT: I tried what I said above in the second item, using a plain []byte in the protobuf message and manually encode the uint32 varints into the struct. This is the code.

As I expected, the generated unmarshaling code does not allocate that extra memory the previous version did, so there's no need to tweak the code manually. Some quick benchmarks I ran show that the performance is the same as what @cstyan calls min-32bit-optimized above, which is expected. The downside is the proto message definition is weird; we are defining a varint array as a []byte just to hack around the inefficient generated code.

Between these two options, I think it's about whether we prefer a weird message definition vs a weird marshaling code. I personally lean towards the latter; a more efficient code generator may come in the future.

@bwplotka
Copy link
Member

bwplotka commented Nov 16, 2023

Thanks a lot! Epic work everyone! To me the benefits of min-32bit-optimized are great (thanks @cstyan for explaining all on Slack in various threads too). I think interning would make our protocol even more efficient compared to the alternatives, which will be important for adoption.

Before the final decision, I would love to try reproducing the results, especially without genproto (which is deprecated). I might have time for that next week 🤞🏽

BTW I asked some protobuf experts internally at Google, so far ppl are surprised this yields better compression ratio that just compression. Waiting for more opinions, but that seems to be the reality (plus better decoding CPU usage).

Other considerations

  1. If we choose min-32bit-optimized I think I would vote for repeated uint32 label_symbols. Makes the proto simpler to adopt and if one wants to optimize marshalling they can do that, plus it might be only Go or only Go + gogo artifact.
  2. If we do interning, the format is similar to 1.0, but still quite different (so the resulted code for everybody). I wonder if making this Remote Write 2.0 (not 1.1) is not a better move. The additional benefit is that IF we spot big issues with adoption in certain languages (unlikely, but possible), it will be trivial to add other changes like metadata and created timestamp to 1.0 without iterning and make 1.1 at some point. Essentially it gives us more flexibility. WDYT?

PS: Maybe silly question, I know we tried map, but did we tried array of strings (given unit32 ref) as a table?

@npazosmendez
Copy link
Contributor

so far ppl are surprised this yields better compression ratio that just compression.

Note that no interning + a more-compressing algorithm like zstd yields better compression than interning + snappy. However, interning + zstd can be even better. And there's also the CPU and memory benefits.

If we choose min-32bit-optimized I think I would vote for repeated uint32 label_symbols. Makes the proto simpler to adopt and if one wants to optimize marshalling they can do that, plus it might be only Go or only Go + gogo artifact.

I agree with this.

PS: Maybe silly question, I know we tried npazosmendez#2 (comment), but did we tried array of strings (given unit32 ref) as a table?

Almost. We've been experimenting with another format that @pracucci suggested, which makes the symbol table the following []byte: varint_string1_len,string1_data,varint_string2_len,string2_data,..., which is essentially (AFAIU) how protobuf would encode a []string. And that has been showing great results, maybe even better than min-32bit-optimized. We plan to push that into the draft PR and share the stats here very soon. We didn't try a []string directly though; I suspect the proto code generator will generate inefficient code, but we should try (and consider manually optimizing if necessary).

@bwplotka
Copy link
Member

bwplotka commented Nov 16, 2023

Yup, I expect us to really use interning + zstd at the end and recommend that.

Almost. We've been experimenting with another format that @pracucci suggested, which makes the symbol table the following []byte: varint_string1_len,string1_data,varint_string2_len,string2_data,..., which is essentially (AFAIU) how protobuf would encode a []string. And that has been showing great results, maybe even better than min-32bit-optimized. We plan to push that into the draft PR and share the stats here very soon. We didn't try a []string directly though; I suspect the proto code generator will generate inefficient code, but we should try (and consider manually optimizing if necessary).

Ack, []string symbols has the additional benefit of using single uint as ref, not two (offset and length), it also allows users to use global ids or something (for some stateful iteration of this protocol one day). Your []byte idea encoding is pretty solid too (also allows having one int as ref).

@cstyan
Copy link
Member Author

cstyan commented Nov 17, 2023

My fault, conversation in lots of different places. The latest comparison results are here: npazosmendez#3 (comment)

As @npazosmendez mentioned the formats we've most recently experimented with a few more changes to the format that produce even better results. Notably, in one we're changing the symbol table to be a byte slice instead of repeated uint32 or fixed32. In the other the symbols table is still repeated fixed32 values, however those values are just the offset, and the symbols string is now a byte slice of varint len + string varing len + string .... We also tried a format that combined both.

so far ppl are surprised this yields better compression ratio that just compression.

Note that no interning + a more-compressing algorithm like zstd yields better compression than interning + snappy. However, interning + zstd can be even better. And there's also the CPU and memory benefits.

I think the issue was that with just changing the compression we use more CPU. We can get comparable wire bytes reduction results but the CPU cost is too high. The likely cause of the results we're seeing with the new format + zstd is that we have to compress was less data overall than without the interning change.

If we do interning, the format is similar to 1.0, but still quite different (so the resulted code for everybody). I wonder if making this Remote Write 2.0 (not 1.1) is not a better move. The additional benefit is that IF we spot big issues with adoption in certain languages (unlikely, but possible), it will be trivial to add other changes like metadata and created timestamp to 1.0 without iterning and make 1.1 at some point. Essentially it gives us more flexibility. WDYT?

We talked about this on slack but I'll post my thoughts here in case anyone else is curious. In general for me it boils down to this, I want to get these improvements (along with the metadata change and created timestamp) out into a Prometheus release asap behind a feature flag. I don't want this work to have to wait for a prom v3.0 to go out. I know I probably should have a stronger opinion, and with the amount of changes it's more accurately a 2.0 than a 1.1, but I don't. As long as it can go into an official release behind a feature flag as soon as it's ready, and not have to wait for a prom 3.0, I'm fine with calling it rw 1.1 or 2.0.

TL; DR: we will be changing the format to include string interning and may still potentially allow for alternative compression to be used.

@bwplotka
Copy link
Member

bwplotka commented Nov 17, 2023

Thanks!

Just as a documentation piece: There is an elephant in the room to mention in the proposal and test one day--the Arrow format. Quite a good work from Otel trying this out, but also a bit brave to make the protocol stateful, tricky to scale on receiving end, so it's epic we have stateless one here. I never played with Arrow but it feels like fully another protocol. Worth to compare at some point, while we would integrate Prometheus to export data with Intel, but it's looks extremely complex (just metrics mapping is mind-blowing (unless it's Otel specific complexity?)). Way more involved than our symbol table ideas. It's hard to argue that Arrow would be good immediate "quick to adopt" next version of remote write given epic efficiency wins we got so far with 1.1 or 2.0 remote write work, at least that's my understanding. Does it make sense? Happy with other opinions here (:

@bwplotka
Copy link
Member

FYI, I am playing with non gogoproto and array ideas. In the meanwhile to keep 1.1 version and make an incremental proto change instead we could make 1.1 looks similar to this:

message WriteRequest {
  repeated TimeSeries timeseries = 1 [(gogoproto.nullable) = false];
  reserved  2;
  reserved  3;

  // The symbols table. All symbols are concatenated strings. To read the symbols table, it's required
  // to know the offset:length range of the actual symbol to read from this string.
  string symbols = 4;
}

// TimeSeries represents samples and labels for a single time series.
message TimeSeries {
  // For a timeseries to be valid, and for the samples and exemplars
  // to be ingested by the remote system properly, the labels or label symbols field is required.

  // Deprecated in 1.1, use interned label symbols instead.
  repeated Label labels = 1 [(gogoproto.nullable) = false];

  // Sorted list of label name-value pair references. This list's len is always multiple of 4,
  // packing tuples of (label name offset, label name length, label value offset, label value length).
  // Offsets point to the WriteRequest.symbols.
  repeated uint32 label_symbols = 10;

  // Sorted by time, oldest sample first.
  repeated Sample samples = 2 [(gogoproto.nullable) = false];
  repeated Exemplar exemplars = 3 [(gogoproto.nullable) = false];
  repeated Histogram histograms = 4 [(gogoproto.nullable) = false];
  Metadata metadata = 5;
  // Optional created timestamp for the metric in ms format,
  // if the first sample in samples does not contain 0 value.
  // See model/timestamp/timestamp.go for conversion from time.Time
  // to Prometheus timestamp.
  int64 created_timestamp = 6;
}

It would allow backward and forward compatibility and nice migration process (most of types will be reused).

@npazosmendez
Copy link
Contributor

npazosmendez commented Nov 24, 2023

Nice @bwplotka. But I'm thinking: protobuf (de)serialization backward/forward compatibility would not give us protocol backward/forward compatibility. For a request itself to be backw/forw compatible, a sender would need to fill up both labels and label_symbol fields in the request so it's usable for both versions. If, for example, they fill only 1.1 fields and a 1.0 receiver successfully deserializes the request, it will see no labels; even if it handles that gracefully by rejecting the request, I don't see a real benefit if being able to decode the request. It might be better to just fail to deserialize.

On the other hand, the benefit I do see with this is a receiver can accept both 1.0 and 1.1 more easily without duplicating too many code paths.

LMK what you think, I could also be missing something that you thought about.


Separate note: this is tangential to protocol negotiation. We did discuss making 1.1 receivers accept 1.0 to make migrations easier, though not necessarily through a compatible protobuf, but through HTTP headers or other means of negotiation. We intend to put together a design doc to discuss these options.

@bwplotka
Copy link
Member

If, for example, they fill only 1.1 fields and a 1.0 receiver successfully deserializes the request, it will see no labels;

Yup, I think we are on the same page--essentially for full compatibility we would need to duplicate data (which defies the point of interning) and when, let's say after some content negotiations, receivers might get errors on reading empty labels (e.g. when they forget to fix some parts of the code).

I agree, we need content negotiation or at least some configuration. 👍🏽

Didn't think about it for a long time, but I think I still prefered reused messages so far:

  1. I noticed in the code migration, or even when supporting receiving (or sending) both versions, reusing msgs is just much easier. Less code, less opportunity for error for everyone.
  2. It's clear what changed vs old protocol -- much easier to adopt.
  3. Less of questioning around 2.0 vs 1.1. To me message reuse allows us to say this change is 1.1 as essentially there are "less" changes 🤔
  4. Maybe questionable, but following 3. it allows adding metadata & CT to 1.0 (kind of - if we want metadata to be also interned... it's not so simple).
  5. Do we really care if in error case (e.g. receiver code is wrong or content negotiation failed) receiver will get errors on empty labels for time series vs.. some cryptic protobuf deserialization error when they try deserialize Minimized... to WriteRequest accidently? I would argue both experiences are similar.

@npazosmendez
Copy link
Contributor

  1. I noticed in the code migration, or even when supporting receiving (or sending) both versions, reusing msgs is just much easier. Less code, less opportunity for error for everyone.

This is the most convincing point for me. Of course, we'd need to make sure sending a 1.1 request to an old 1.0 receiver works as expected: fail with a 4xx.

Do we really care if in error case (e.g. receiver code is wrong or content negotiation failed) receiver will get errors on empty labels for time series vs.. some cryptic protobuf deserialization error when they try deserialize Minimized... to WriteRequest accidently? I would argue both experiences are similar

I wasn't thinking about the error message for the user. My reasoning was: why would we want to successfully deserialize a payload that we know is invalid? I would rather fail fast instead of relying on post-deserializing validation, which can be bug prone.

Now that I'm more convinced about your proposal, the answer to that (IMHO) is "1": reusing the struct will make things easier and cleaner in a number of other ways, and that benefit may be greater.

@bwplotka
Copy link
Member

bwplotka commented Nov 24, 2023

Very initial benchmark (MARSHAL only) (I added/cleaned some code and reused new proposed proto in https://github.com/prometheus/prometheus/tree/bwplotka/prw-initial based on your amazing work https://github.com/npazosmendez/prometheus/pull/3/files).

I wouldn't be too excited here as I am just learning how you optimized certain things and implemented all so far (still learning), but reusing your benchmark quickly:

goos: darwin
goarch: arm64
pkg: github.com/prometheus/prometheus/storage/remote
                                            │ varstring.txt │            []string.txt             │
                                            │    sec/op     │    sec/op     vs base               │
BuildMinimizedWriteRequest/66-instances-4      19.38n ± 26%   15.38n ± 45%        ~ (p=0.132 n=6)
BuildMinimizedWriteRequest/330-instances-4     85.33n ±  3%   60.84n ±  2%  -28.71% (p=0.002 n=6)
BuildMinimizedWriteRequest/3300-instances-4    798.8n ±  6%   554.2n ±  2%  -30.62% (p=0.002 n=6)
geomean                                        109.7n         80.33n        -26.78%

                                            │   varstring.txt   │              []string.txt               │
                                            │ compressedSize/op │ compressedSize/op  vs base              │
BuildMinimizedWriteRequest/66-instances-4           1.557k ± 0%         1.509k ± 0%  -3.08% (p=0.002 n=6)
BuildMinimizedWriteRequest/330-instances-4          5.694k ± 0%         5.233k ± 0%  -8.10% (p=0.002 n=6)
BuildMinimizedWriteRequest/3300-instances-4         52.47k ± 0%         47.27k ± 0%  -9.90% (p=0.002 n=6)
geomean                                             7.748k              7.200k       -7.07%

It feels the repeated string symbols so proto like below is bit faster, so far 🤔

message WriteRequest {
  repeated TimeSeries timeseries = 1 [(gogoproto.nullable) = false];
  reserved  2;
  reserved  3;

  // The symbols table.
  repeated string symbols = 5;
}

Bench script FYI

#!/bin/zsh

set -xe

export ver=[]string && \
 go test ./storage/remote -run '^$' -bench '^BenchmarkBuildMinimizedWriteRequest' -benchtime 500x -count 6 \
 -cpu 4 \
 -benchmem \
 -memprofile=${ver}.mem.pprof -cpuprofile=${ver}.cpu.pprof \
 | tee ${ver}.txt

It's motivation for trying []string on scale and receive side if anything else 🙈

@bwplotka
Copy link
Member

bwplotka commented Nov 24, 2023

And I felt the pain of custom marshal optimization code... adding fields to proto alone is extremely prone to errors 😱 But tests could quickly catch it! ;p

I did the benchmark and "oh it's so fast"... realizing I forgot to marshal new repeated symbols... ;p

@bwplotka
Copy link
Member

bwplotka commented Nov 24, 2023

My reasoning was: why would we want to successfully deserialize a payload that we know is invalid? I would rather fail fast instead of relying on post-deserializing validation, which can be bug prone.

Yea. You can't do much without labels these days (e.g. lack of metric name), but maybe some receives might ingest that and corrupt DB? 🤔 Sounds indeed quite bad.

The alternative to me is call it 2.0 and create a different "package" version truly:

message WriteRequest {
  // The symbols table for all strings used in WriteRequest message.
  repeated string symbols = 1;
  
  repeated TimeSeries timeseries = 2 [(gogoproto.nullable) = false];
}

// TimeSeries represents samples and labels for a single time series.
message TimeSeries {
  // For a timeseries to be valid, and for the samples and exemplars
  // to be ingested by the remote system properly, the labels or label symbols field is required.

  // Sorted list of label name-value pair references. This list's len is always multiple of 2,
  // packing tuples of (label name ref, label value ref) refs to WriteRequests.symbols.
  repeated uint32 label_symbols = 1;

  // Sorted by time, oldest sample first.
  repeated Sample samples = 2 [(gogoproto.nullable) = false];
  repeated Exemplar exemplars = 3 [(gogoproto.nullable) = false];
  repeated Histogram histograms = 4 [(gogoproto.nullable) = false];
  Metadata metadata = 5;
  // Optional created timestamp for the metric in ms format,
  // if the first sample in samples does not contain 0 value.
  // See model/timestamp/timestamp.go for conversion from time.Time
  // to Prometheus timestamp.
  int64 created_timestamp = 6;
}

@npazosmendez
Copy link
Contributor

It feels the repeated string symbols so proto like below is bit faster, so far

This would be great, the proto message would be much cleaner. I would try benchmarking this with the e2e-bench scripts for a better comparison: https://github.com/prometheus/prometheus/tree/remote-write-1.1/scripts/remotewrite11-bench . That would require the receiver end too, of course. If it looks better than the current minimized version (or even the same), I would definitely go with this one.

Regarding the manual (un)marshaling optimizations:

  • The repeated string symbols version didn't require manual optimizations AFAICT, or did it?
  • Long shot, but if we still see the need for these manual optimizations we can try other code generator like https://github.com/planetscale/vtprotobuf (ty @bboreham for the pointer). That one has some other cool options like unmarshal_unsafe. But I don't know how we feel about experimenting with these.

@bwplotka
Copy link
Member

bwplotka commented Nov 24, 2023

Yea, there is also csproto on top of vtprotobuf, plus changing out of gogo can change things a lot.

I plan to setup Prometheus + receiver code indeed and benchmark more. The soft deadline for this work could be next DevSummit (next week Thur) 🙈 No idea if viable. Otherwise decision will be in January (:

@GiedriusS
Copy link
Contributor

Another concern that I have seems to not be mentioned everywhere. What will happen to the v1.0 version? Will implementors be able to choose whichever version they prefer or only the newest one? Compression is nice but it might not be feasible to implement on certain systems like microcontrollers. With very limited RAM it might be hard to build a symbol table or anything like that. Are you planning to implement an escape hatch for such small systems?

@bwplotka
Copy link
Member

Yes, there will be always option to negotiate/configure what protocol to use. However for your microcontroller case, new protocol will be only better. You can't partially decode proto anyway, and symbols table makes the whole message significantly smaller, even without compression. This means less to allocate for deserialized msg.

@cstyan
Copy link
Member Author

cstyan commented Nov 28, 2023

@bwplotka @npazosmendez

I did the benchmark and "oh it's so fast"... realizing I forgot to marshal new repeated symbols... ;p

is the current set of results here now accurate?

I tried vtproto and wasn't able to get it to perform as well as our current gogoproto format. There's likely still something I'm missing, but because we no longer have gogoproto nullable = false all repeated fields in message are now slices of pointers. I tried various optimizations but this essentially leads to a shit ton more allocations and, as a consequence, GC.

FYI, I am playing with non gogoproto and array ideas. In the meanwhile to keep 1.1 version and make an incremental proto change instead we could make 1.1 looks similar to this:

Personally I don't really like the idea of maintaining this combined format. The code for buffering samples and building write requests is already pretty messy with the amount of "optional" fields we have, which is one thing we're reducing a bit in v1.1/v2. I also feel like the communities desire for fully backwards compatible messages (ie sending all the data in all the fields) is going to be pretty low. People want less resource usage, not more.

Do we really care if in error case (e.g. receiver code is wrong or content negotiation failed) receiver will get errors on empty labels for time series vs.. some cryptic protobuf deserialization error when they try deserialize Minimized... to WriteRequest accidently? I would argue both experiences are similar.

I think the answer here should be defined within the spec., and IMO failing with a consistent error message "remote write proto format unrecognized/not correct". Something like that. I agree with Nicos point that we should optimize the experience for the sender; they should receive an error back immediately if the receiver gets a RW proto format it doesn't support, otherwise return an error about invalid data if the unmarshalling failed.

@GiedriusS

Compression is nice but it might not be feasible to implement on certain systems like microcontrollers. With very limited RAM it might be hard to build a symbol table or anything like that. Are you planning to implement an escape hatch for such small systems?

The new format should end up being the same or better in all cases. IIRC, in our benchmarks for the senders side we use more or less the same amount of CPU but less memory and less bytes over the wire. On the receivers side resource usage is down across the board.

@cstyan
Copy link
Member Author

cstyan commented Nov 28, 2023

here's the benchmark results;

vtproto

goos: linux
goarch: amd64
pkg: github.com/prometheus/prometheus/storage/remote
cpu: AMD Ryzen 9 5950X 16-Core Processor
BenchmarkBuildWriteRequest/2_instances-32         	    1000	     25648 ns/op	      1145 compressedSize/op	    4230 B/op	      66 allocs/op
BenchmarkBuildWriteRequest/10_instances-32        	    1000	    115557 ns/op	      5445 compressedSize/op	   21141 B/op	     330 allocs/op
BenchmarkBuildWriteRequest/1k_instances-32        	    1000	   1304689 ns/op	     62865 compressedSize/op	  216002 B/op	    3300 allocs/op
PASS
ok  	github.com/prometheus/prometheus/storage/remote	1.501s

gogoproto

goos: linux
goarch: amd64
pkg: github.com/prometheus/prometheus/storage/remote
cpu: AMD Ryzen 9 5950X 16-Core Processor
BenchmarkBuildWriteRequest/2_instances-32         	    1000	     17313 ns/op	      1933 compressedSize/op	      80 B/op	       1 allocs/op
BenchmarkBuildWriteRequest/10_instances-32        	    1000	     85166 ns/op	      8156 compressedSize/op	      80 B/op	       1 allocs/op
BenchmarkBuildWriteRequest/1k_instances-32        	    1000	    862581 ns/op	     78215 compressedSize/op	      80 B/op	       1 allocs/op
PASS
ok  	github.com/prometheus/prometheus/storage/remote	1.015s

@bwplotka
Copy link
Member

bwplotka commented Nov 29, 2023

Did some early benchmark without any extra optimizations, just all plugins in fastest possible mode.

Using unmarshal_unsafe on vtproto which is only on main (not released yet) yields the best results for me--better than gogo proto (optimized).

Full results (sorry for not sorted diff):

>> comparing BenchmarkPRWSerialize
goos: darwin
goarch: arm64
pkg: github.com/bwplotka/go-proto-bench/prw
               │ v1testgogo.txt │               v2.txt               │      v2testcsproto.txt       │         v2testgogo-opt.txt          │           v2testgogo.txt           │         v2testvtproto.txt          │
               │     sec/op     │   sec/op     vs base               │   sec/op     vs base         │    sec/op     vs base               │    sec/op     vs base              │   sec/op     vs base               │
PRWSerialize-4     330.5µ ± 11%   474.2µ ± 1%  +43.46% (p=0.002 n=6)   344.5µ ± 2%  ~ (p=0.065 n=6)   256.4µ ± 10%  -22.41% (p=0.002 n=6)   347.9µ ± 12%  +5.25% (p=0.041 n=6)   276.8µ ± 3%  -16.24% (p=0.002 n=6)

               │  v1testgogo.txt   │                  v2.txt                  │            v2testcsproto.txt             │            v2testgogo-opt.txt            │              v2testgogo.txt              │            v2testvtproto.txt             │
               │ serializedSize/op │ serializedSize/op  vs base               │ serializedSize/op  vs base               │ serializedSize/op  vs base               │ serializedSize/op  vs base               │ serializedSize/op  vs base               │
PRWSerialize-4        1055.8k ± 0%         794.0k ± 0%  -24.80% (p=0.002 n=6)         794.0k ± 0%  -24.80% (p=0.002 n=6)         794.0k ± 0%  -24.80% (p=0.002 n=6)         794.0k ± 0%  -24.80% (p=0.002 n=6)         794.0k ± 0%  -24.80% (p=0.002 n=6)

               │ v1testgogo.txt │               v2.txt                │          v2testcsproto.txt          │         v2testgogo-opt.txt          │            v2testgogo.txt            │          v2testvtproto.txt          │
               │      B/op      │     B/op      vs base               │     B/op      vs base               │     B/op      vs base               │     B/op       vs base               │     B/op      vs base               │
PRWSerialize-4    1032.0Ki ± 0%   776.0Ki ± 0%  -24.81% (p=0.002 n=6)   776.0Ki ± 0%  -24.81% (p=0.002 n=6)   776.0Ki ± 0%  -24.81% (p=0.002 n=6)   1182.3Ki ± 0%  +14.56% (p=0.002 n=6)   776.0Ki ± 0%  -24.81% (p=0.002 n=6)

               │ v1testgogo.txt │            v2.txt             │       v2testcsproto.txt       │      v2testgogo-opt.txt       │              v2testgogo.txt              │       v2testvtproto.txt       │
               │   allocs/op    │ allocs/op   vs base           │ allocs/op   vs base           │ allocs/op   vs base           │   allocs/op    vs base                   │ allocs/op   vs base           │
PRWSerialize-4       1.000 ± 0%   1.000 ± 0%  ~ (p=1.000 n=6) ¹   1.000 ± 0%  ~ (p=1.000 n=6) ¹   1.000 ± 0%  ~ (p=1.000 n=6) ¹   2001.000 ± 0%  +200000.00% (p=0.002 n=6)   1.000 ± 0%  ~ (p=1.000 n=6) ¹
¹ all samples are equal
>> comparing BenchmarkPRWDeserialize
goos: darwin
goarch: arm64
pkg: github.com/bwplotka/go-proto-bench/prw
                 │ v1testgogo.txt │               v2.txt                │          v2testcsproto.txt          │         v2testgogo-opt.txt         │           v2testgogo.txt           │         v2testvtproto.txt          │
                 │     sec/op     │    sec/op     vs base               │    sec/op     vs base               │   sec/op     vs base               │   sec/op     vs base               │   sec/op     vs base               │
PRWDeserialize-4     1598.3µ ± 1%   1188.2µ ± 1%  -25.66% (p=0.002 n=6)   1196.9µ ± 6%  -25.12% (p=0.002 n=6)   724.0µ ± 1%  -54.70% (p=0.002 n=6)   725.1µ ± 3%  -54.64% (p=0.002 n=6)   538.7µ ± 1%  -66.30% (p=0.002 n=6)

                 │ v1testgogo.txt │               v2.txt                │          v2testcsproto.txt          │         v2testgogo-opt.txt          │           v2testgogo.txt            │          v2testvtproto.txt          │
                 │      B/op      │     B/op      vs base               │     B/op      vs base               │     B/op      vs base               │     B/op      vs base               │     B/op      vs base               │
PRWDeserialize-4     5.422Mi ± 0%   2.833Mi ± 0%  -47.75% (p=0.002 n=6)   2.986Mi ± 0%  -44.93% (p=0.002 n=6)   2.674Mi ± 0%  -50.68% (p=0.002 n=6)   2.674Mi ± 0%  -50.68% (p=0.002 n=6)   1.684Mi ± 0%  -68.94% (p=0.002 n=6)

                 │ v1testgogo.txt │               v2.txt                │          v2testcsproto.txt          │         v2testgogo-opt.txt          │           v2testgogo.txt            │         v2testvtproto.txt          │
                 │   allocs/op    │  allocs/op    vs base               │  allocs/op    vs base               │  allocs/op    vs base               │  allocs/op    vs base               │  allocs/op   vs base               │
PRWDeserialize-4     52.014k ± 0%   34.146k ± 0%  -34.35% (p=0.002 n=6)   36.146k ± 0%  -30.51% (p=0.002 n=6)   22.145k ± 0%  -57.42% (p=0.002 n=6)   22.145k ± 0%  -57.42% (p=0.002 n=6)   8.036k ± 0%  -84.55% (p=0.002 n=6)
>> comparing BenchmarkPRWDeserializeToBase
goos: darwin
goarch: arm64
pkg: github.com/bwplotka/go-proto-bench/prw
                       │ v1testgogo.txt │               v2.txt                │          v2testcsproto.txt          │         v2testgogo-opt.txt          │           v2testgogo.txt           │         v2testvtproto.txt          │
                       │     sec/op     │    sec/op     vs base               │    sec/op     vs base               │    sec/op     vs base               │   sec/op     vs base               │   sec/op     vs base               │
PRWDeserializeToBase-4     1810.8µ ± 1%   1420.9µ ± 1%  -21.54% (p=0.002 n=6)   1438.8µ ± 3%  -20.55% (p=0.002 n=6)   947.4µ ± 13%  -47.68% (p=0.002 n=6)   954.9µ ± 1%  -47.27% (p=0.002 n=6)   781.5µ ± 1%  -56.84% (p=0.002 n=6)

                       │ v1testgogo.txt │               v2.txt                │          v2testcsproto.txt          │         v2testgogo-opt.txt          │           v2testgogo.txt            │          v2testvtproto.txt          │
                       │      B/op      │     B/op      vs base               │     B/op      vs base               │     B/op      vs base               │     B/op      vs base               │     B/op      vs base               │
PRWDeserializeToBase-4     6.157Mi ± 0%   3.568Mi ± 0%  -42.05% (p=0.002 n=6)   3.720Mi ± 0%  -39.57% (p=0.002 n=6)   3.408Mi ± 0%  -44.64% (p=0.002 n=6)   3.408Mi ± 0%  -44.64% (p=0.002 n=6)   2.418Mi ± 0%  -60.72% (p=0.002 n=6)

                       │ v1testgogo.txt │               v2.txt               │         v2testcsproto.txt          │         v2testgogo-opt.txt         │           v2testgogo.txt           │         v2testvtproto.txt          │
                       │   allocs/op    │  allocs/op   vs base               │  allocs/op   vs base               │  allocs/op   vs base               │  allocs/op   vs base               │  allocs/op   vs base               │
PRWDeserializeToBase-4      56.02k ± 0%   38.15k ± 0%  -31.90% (p=0.002 n=6)   40.15k ± 0%  -28.33% (p=0.002 n=6)   26.15k ± 0%  -53.32% (p=0.002 n=6)   26.15k ± 0%  -53.32% (p=0.002 n=6)   12.04k ± 0%  -78.51% (p=0.002 n=6)

NOTE: PRWDeserializeToBase means unmarshal + symbolization
NOTE: PRWSerialize means marshal of already build proto msg (e.g. desymbolized for v2*)

Benchmark: https://github.com/bwplotka/go-proto-bench/blob/main/prw/benchmark_test.go
vtprotobuf code for marshal/unmarshal/symbolize: https://github.com/bwplotka/go-proto-bench/blob/main/prw/v2testvtproto/custom.go#L24

Summary

My goal was to make sure that without gogoproto our new potential protocol with unit32 with []string symbols does not have any efficiency surprises.

I think it does not, so we should be able to proceed with that and see if we could have decision on tomorrows DevSummit 🤗 cc @cstyan @npazosmendez

The decision to keep gogoproto or not in Prometheus should be separate (I won't be using gogoproto for my receiving production code going forward).

@bwplotka
Copy link
Member

However, what's the final proto form with that []string idea?

I would propose something like https://github.com/bwplotka/go-proto-bench/blob/main/proto/prw/v2testvtproto/write.proto, so PRW 2.0, with another package e.g. remotewrite.v2 (!), separate to remote read even. Published to https://buf.build/product/bsr, documented etc. 2.0 since we make clear breaking changes and minor version for proto feels weird 🤔

Do you want to clear out our https://docs.google.com/document/d/1PljkX3YLLT-4f7MqrLt7XCVPG3IsjRREzYrUzBxCPV0/edit proposal @cstyan to discuss the final details? 🤗

@cstyan
Copy link
Member Author

cstyan commented Nov 29, 2023

Had this all typed up last night but forgot hit Comment.

As usual I might have messed something up when hacking this together, and this isn't apples to apples since I'm using the vtproto generated protocode still, but this is comparing building the write request via MarshalVT vs gogoproto.Marshal():

BenchmarkBuildWriteRequest/2_instances-32         	    1000	      7833 ns/op	       655.0 compressedSize/op	    2040 B/op	       8 allocs/op
BenchmarkBuildWriteRequest/10_instances-32        	    1000	     37824 ns/op	      3099 compressedSize/op	    8248 B/op	      10 allocs/op
BenchmarkBuildWriteRequest/1k_instances-32        	    1000	    380232 ns/op	     30485 compressedSize/op	   91269 B/op	      15 allocs/op
BenchmarkBuildWriteRequestOld/2_instances-32      	    1000	     12330 ns/op	       655.0 compressedSize/op	    3224 B/op	      10 allocs/op
BenchmarkBuildWriteRequestOld/10_instances-32     	    1000	     60776 ns/op	      3099 compressedSize/op	   13690 B/op	      12 allocs/op
BenchmarkBuildWriteRequestOld/1k_instances-32     	    1000	    596341 ns/op	     30485 compressedSize/op	  151937 B/op	      18 allocs/op
PASS
ok  	github.com/prometheus/prometheus/storage/remote	1.316s

@cstyan
Copy link
Member Author

cstyan commented Nov 30, 2023

I need to double check things, I might have broken the code Nico originally wrote slightly for using zstd and s2, but here's some compression benchmarks:
2023-11-29_18-42
2023-11-29_18-42_1

In terms of the naming format; just the compression type name is the default performance, speed is that compression algorithms fastest compression options, and comp is it's best compression performance options. Note that all the results here are comparing the new proto format using snappy. None of the above comparisons are relative to the existing v1 format.

We can see that zstd's best compression option gives us an additional 42% network bytes reduction but at the cost of more than doubling the memory usage on the sending side, so that option is out. Flate's speed compression option gives us the least additional network bytes reduction, but is potentially a sender CPU reduction (the panels here are not completely accurate, those will be updated alongside new results in a few days).

The other options all give us roughly 35% additional network bytes reduction. Both zstd and zstd-speed come at the cost of a memory increase on the sending side relative to using snappy as well as a small CPU usage increase. The other flate options seem good.

Here's some quick Go benchmark results, this is the same BenchmarkBuildWriteRequest we've been running for the proto changes itself, but iterating over doing the compression with each different option.

BenchmarkCompression_BuildMinimizedWriteRequest/comp:snappy-66-instances-32         	    1000	         7.220 ns/op	      1503 compressedSize/op	       0 B/op	       0 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:snappy-330-instances-32        	    1000	        30.97 ns/op	      5171 compressedSize/op	       0 B/op	       0 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:snappy-3300-instances-32       	    1000	       290.5 ns/op	     46874 compressedSize/op	       0 B/op	       0 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:zstd_speed-66-instances-32     	    1000	        24.08 ns/op	      1232 compressedSize/op	       0 B/op	       0 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:zstd_speed-330-instances-32    	    1000	        69.18 ns/op	      3837 compressedSize/op	       0 B/op	       0 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:zstd_speed-3300-instances-32   	    1000	       512.8 ns/op	     30987 compressedSize/op	       0 B/op	       0 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:zstd_comp-66-instances-32      	    1000	        75.71 ns/op	      1148 compressedSize/op	       0 B/op	       0 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:zstd_comp-330-instances-32     	    1000	       328.4 ns/op	      3475 compressedSize/op	       0 B/op	       0 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:zstd_comp-3300-instances-32    	    1000	      4899 ns/op	     30533 compressedSize/op	       0 B/op	       0 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:flate_speed-66-instances-32    	    1000	        29.27 ns/op	      1312 compressedSize/op	       0 B/op	       0 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:flate_speed-330-instances-32   	    1000	        65.91 ns/op	      4518 compressedSize/op	       0 B/op	       0 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:flate_speed-3300-instances-32  	    1000	       515.3 ns/op	     40519 compressedSize/op	       0 B/op	       0 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:flate_comp-66-instances-32     	    1000	        78.70 ns/op	      1210 compressedSize/op	       0 B/op	       0 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:flate_comp-330-instances-32    	    1000	       393.0 ns/op	      4007 compressedSize/op	       0 B/op	       0 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:flate_comp-3300-instances-32   	    1000	     12140 ns/op	     35726 compressedSize/op	       0 B/op	       0 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:zstd-66-instances-32           	    1000	        26.22 ns/op	      1168 compressedSize/op	       0 B/op	       0 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:zstd-330-instances-32          	    1000	        77.58 ns/op	      3537 compressedSize/op	       0 B/op	       0 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:zstd-3300-instances-32         	    1000	      1031 ns/op	     30435 compressedSize/op	       0 B/op	       0 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:flate-66-instances-32          	    1000	        35.13 ns/op	      1219 compressedSize/op	       0 B/op	       0 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:flate-330-instances-32         	    1000	        85.52 ns/op	      4176 compressedSize/op	       0 B/op	       0 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:flate-3300-instances-32        	    1000	       796.5 ns/op	     37987 compressedSize/op	       0 B/op	       0 allocs/op

As to why the benchmark is showing 0 allocs now, my first thought is that the ResetTimer call hides the memory usage here somehow. Unfortunately I don't have more time to dig tonight. Here's the same results with the ResetTimer call removed:

BenchmarkCompression_BuildMinimizedWriteRequest/comp:snappy-66-instances-32         	    1000	     16765 ns/op	      1503 compressedSize/op	      12 B/op	       0 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:snappy-330-instances-32        	    1000	     73649 ns/op	      5171 compressedSize/op	       4 B/op	       0 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:snappy-3300-instances-32       	    1000	    677708 ns/op	     46874 compressedSize/op	      34 B/op	       0 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:zstd_speed-66-instances-32     	    1000	     34159 ns/op	      1232 compressedSize/op	    7056 B/op	       0 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:zstd_speed-330-instances-32    	    1000	    121023 ns/op	      3837 compressedSize/op	   39122 B/op	       2 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:zstd_speed-3300-instances-32   	    1000	   1049921 ns/op	     30987 compressedSize/op	   10193 B/op	       8 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:zstd_comp-66-instances-32      	    1000	    190442 ns/op	      1148 compressedSize/op	 1416340 B/op	       0 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:zstd_comp-330-instances-32     	    1000	   1118620 ns/op	      3475 compressedSize/op	 1416516 B/op	       0 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:zstd_comp-3300-instances-32    	    1000	   5443924 ns/op	     30533 compressedSize/op	 1423822 B/op	       1 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:flate_speed-66-instances-32    	    1000	     38763 ns/op	      1312 compressedSize/op	       4 B/op	       0 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:flate_speed-330-instances-32   	    1000	    109702 ns/op	      4518 compressedSize/op	     834 B/op	       0 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:flate_speed-3300-instances-32  	    1000	    934826 ns/op	     40519 compressedSize/op	     975 B/op	       0 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:flate_comp-66-instances-32     	    1000	     87680 ns/op	      1210 compressedSize/op	       4 B/op	       0 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:flate_comp-330-instances-32    	    1000	    436596 ns/op	      4007 compressedSize/op	      13 B/op	       0 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:flate_comp-3300-instances-32   	    1000	  12160781 ns/op	     35726 compressedSize/op	      51 B/op	       0 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:zstd-66-instances-32           	    1000	     44861 ns/op	      1168 compressedSize/op	   55179 B/op	       1 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:zstd-330-instances-32          	    1000	    123016 ns/op	      3537 compressedSize/op	    7136 B/op	       0 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:zstd-3300-instances-32         	    1000	   1451450 ns/op	     30435 compressedSize/op	   14630 B/op	       1 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:flate-66-instances-32          	    1000	     44479 ns/op	      1219 compressedSize/op	    1083 B/op	       0 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:flate-330-instances-32         	    1000	    128034 ns/op	      4176 compressedSize/op	    1096 B/op	       0 allocs/op
BenchmarkCompression_BuildMinimizedWriteRequest/comp:flate-3300-instances-32        	    1000	   1323094 ns/op	     37987 compressedSize/op	      84 B/op	       0 allocs/op

These set of benchmarks make it more obvious how fast snappy is, any of our alternatives here are at least 2x as slow for roughly 20-35% additional compression.

@cstyan
Copy link
Member Author

cstyan commented Dec 12, 2023

Today I've decided to benchmark just snappy and s2 for the following reason; I think we should replace base golang/snappy with s2's snappy compatible packages now. Even without changing the underlying compression type we can get some amount of network bytes reduction and resource usage reduction. We also won't be requiring remote write receivers to include a new dependency, which in some cases might not actually exist in their language.

To that end, we can't use the base s2 library, but we can use it's set of Snappy compatible compressions. The short version, s2 at it's base takes Snappy's format and improves upon it to get even more compression in a similar format. Think of s2 like something inbetween base snappy and another compression library like zstd. Slightly more cpu for slightly more compression ratio.

The snappy compatible s2 encoding is still decode-able by the google/snappy, but is potentially slightly more compressed and requires less resources to decode.

First lets look at base benchmarks; this is a comparison of building the current write request format (not minimized) and then compressing it with each option:

goos: linux
goarch: amd64
pkg: github.com/prometheus/prometheus/storage/remote
cpu: AMD Ryzen 9 5950X 16-Core Processor
BenchmarkBuildWriteRequest/2_instances-snappy-32         	    1000	     17488 ns/op	      1933 compressedSize/op	      80 B/op	       1 allocs/op
BenchmarkBuildWriteRequest/10_instances-snappy-32        	    1000	     82674 ns/op	      8156 compressedSize/op	      80 B/op	       1 allocs/op
BenchmarkBuildWriteRequest/100_instances-snappy-32        	    1000	    853947 ns/op	     78215 compressedSize/op	      80 B/op	       1 allocs/op
BenchmarkBuildWriteRequest/2_instances-s2-snappy-compat-better-32         	    1000	     22388 ns/op	      1936 compressedSize/op	      80 B/op	       1 allocs/op
BenchmarkBuildWriteRequest/10_instances-s2-snappy-compat-better-32        	    1000	    116538 ns/op	      7683 compressedSize/op	      80 B/op	       1 allocs/op
BenchmarkBuildWriteRequest/100_instances-s2-snappy-compat-better-32        	    1000	   1101922 ns/op	     75514 compressedSize/op	      80 B/op	       1 allocs/op
BenchmarkBuildWriteRequest/2_instances-s2-snappy-compat-32                	    1000	     16789 ns/op	      1989 compressedSize/op	      80 B/op	       1 allocs/op
BenchmarkBuildWriteRequest/10_instances-s2-snappy-compat-32               	    1000	     84673 ns/op	      7797 compressedSize/op	      80 B/op	       1 allocs/op
BenchmarkBuildWriteRequest/100_instances-s2-snappy-compat-32               	    1000	    859306 ns/op	     77538 compressedSize/op	      80 B/op	       1 allocs/op
BenchmarkBuildWriteRequest/2_instances-s2-32                              	    1000	     16399 ns/op	      1727 compressedSize/op	      80 B/op	       1 allocs/op
BenchmarkBuildWriteRequest/10_instances-s2-32                             	    1000	     83344 ns/op	      6518 compressedSize/op	      80 B/op	       1 allocs/op
BenchmarkBuildWriteRequest/100_instances-s2-32                             	    1000	    862536 ns/op	     63792 compressedSize/op	      80 B/op	       1 allocs/op

Not much change in time to compress when we use s2's snappy compatible encodings, some decrease in end compressed size if we use s2's own encoding.

Then again for the new version of the proto:

goos: linux
goarch: amd64
pkg: github.com/prometheus/prometheus/storage/remote
cpu: AMD Ryzen 9 5950X 16-Core Processor
BenchmarkBuildMinimizedWriteRequest/2_instances-snappy-32         	    1000	     15719 ns/op	      1503 compressedSize/op	       0 B/op	       0 allocs/op
BenchmarkBuildMinimizedWriteRequest/10_instances-snappy-32        	    1000	     71484 ns/op	      5171 compressedSize/op	       0 B/op	       0 allocs/op
BenchmarkBuildMinimizedWriteRequest/100_instances-snappy-32       	    1000	    721519 ns/op	     46874 compressedSize/op	       0 B/op	       0 allocs/op
BenchmarkBuildMinimizedWriteRequest/2_instances-s2-snappy-compat-better-32         	    1000	     17233 ns/op	      1491 compressedSize/op	       0 B/op	       0 allocs/op
BenchmarkBuildMinimizedWriteRequest/10_instances-s2-snappy-compat-better-32        	    1000	     76338 ns/op	      5102 compressedSize/op	       0 B/op	       0 allocs/op
BenchmarkBuildMinimizedWriteRequest/100_instances-s2-snappy-compat-better-32       	    1000	    782606 ns/op	     45788 compressedSize/op	       0 B/op	       0 allocs/op
BenchmarkBuildMinimizedWriteRequest/2_instances-s2-snappy-compat-32                	    1000	     15552 ns/op	      1526 compressedSize/op	       0 B/op	       0 allocs/op
BenchmarkBuildMinimizedWriteRequest/10_instances-s2-snappy-compat-32               	    1000	     72550 ns/op	      5254 compressedSize/op	       0 B/op	       0 allocs/op
BenchmarkBuildMinimizedWriteRequest/100_instances-s2-snappy-compat-32              	    1000	    720062 ns/op	     48394 compressedSize/op	       0 B/op	       0 allocs/op
BenchmarkBuildMinimizedWriteRequest/2_instances-s2-32                              	    1000	     15446 ns/op	      1526 compressedSize/op	       0 B/op	       0 allocs/op
BenchmarkBuildMinimizedWriteRequest/10_instances-s2-32                             	    1000	     71853 ns/op	      5246 compressedSize/op	       0 B/op	       0 allocs/op
BenchmarkBuildMinimizedWriteRequest/100_instances-s2-32                            	    1000	    719349 ns/op	     48276 compressedSize/op	       0 B/op	       0 allocs/op

Again, not much change, but as we've mentioned before this benchmark isn't entirely representative of all the work we have to do in remote write, so lets look at the benchmark dashboards again.

2023-12-11-205954_2478x640_scrot
2023-12-11-210007_2473x910_scrot
2023-12-11-210017_2478x915_scrot
2023-12-11-210024_2479x463_scrot

In these screenshots:

  • the base referred to in panel names is the v2 proto format with golang/snappy encoding
  • I'll refer to the proto formats as v1 and v2
  • snappy is the base golang/snappy
  • s2 is s2's own format
  • the others are s2's snappy compatible format with various levels of compression; base, better, and best
  • there seems to be a memory leak in my usage of the s2 snappy compatible better version, it's using ~100% more memory than the others on the sending side, I'll look into that tomorrow
  • s2 snappy compat better and s2's own base format get roughly the same additional network bytes reduction but again s2 is incompatible with golang/snappy receivers so we'll ignore that format for now
  • s2 snappy compat base is an additional 15-25% network bytes reduction
  • with the v2 format both snappy and s2 snappy compat use 2-4% less cpu than v1 snappy on the sending side and 30-40% less cpu on the receiver than v1 with snappy,
  • v2 s2 snappy compat uses worst case scenario ~5% more cpu than v2 snappy but still is always using less cpu than v1 snappy on the sending side, recv side is consistently ~60% cpu reduction for both
  • v2 snappy compat also uses worse case ~7% more memory on the sending side than v2 snappy but still is always less than v1 snappy, and on the recv side they both use ~11% less memory than v1 snappy
  • with v2 s2 snappy compat we get ~68% network bytes reduction, with base snappy it's ~55%

@bwplotka
Copy link
Member

bwplotka commented Jun 21, 2024

The spec was released in the experimental state https://prometheus.io/docs/specs/remote_write_spec_2_0/

Rationales on technical details are in the proposal

Official comms & blog post will come, but it's rdy for early adoption. Prometheus support & compliance test will be released soon 🤗
Help wanted on the follow up items in this thread: #13105 (comment)

We will merge the proposal once the spec will become stable. Feel free to comment/give suggestions!

@bboreham
Copy link
Member

bboreham commented Aug 6, 2024

Linking #14220, which I read as "add the sender/receiver implementation of created_timestamp".

@jan--f jan--f removed this from Prometheus 3.0 Nov 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants