# [meta] Remote write 2.0 #13105
On the question of: do we need string interning for Remote Write 1.1? Can't we reduce the payload sizes by just moving from Snappy to other compression algorithms?

**The experiment**: a comparison of Remote Write v1 vs the WIP Remote Write v1.1 (string interning), using the e2e bench scripts from #13102, which spawn multiple sender->receiver pairs with different versions. Clarifications:
**The results**: the main takeaway is that, although just changing the current protocol from Snappy to (for instance) zstd already reduces the payload sizes significantly, string interning reduces the receiver's CPU significantly in addition to network usage. This is because the uncompressed payload is smaller and easier to decode, which makes the unmarshaling much faster and more memory efficient. On top of string interning, changing the compression algorithm can be considered to get even better network usage while maintaining the same CPU gains on the receiver (potentially; based on these results, this needs more digging).

**Extra notes**
Incoming wall of text, skip to the end if you just want the conclusion instead of the steps to get there. An update for everyone: we've essentially thrown out the map->proto based implementation because of resource usage concerns, especially on the receiving side.

**First minimized format**

To that end, I've got some more benchmark results. The first is a comparison of the initial map-based approach against a 'minimized' format, which packs each symbol's offset/length into a uint32. This is based on an experiment by @pracucci. As you can see, the wire bytes and receiver CPU usage were reduced even further. What was concerning was the increased CPU usage. The minimized format is as follows:
In the write request the symbols are now replaced with the following instead of a map.
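For illustration, here is a minimal Go sketch of that single-uint32 packing. The 12-bit-offset/20-bit-length split matches what's described below; which field occupies the high bits, and the helper names, are assumptions for illustration only:

```go
package main

import "fmt"

// packRef packs a symbol's offset and length into one uint32, as in the
// minimized format. Assumed layout: offset in the high 12 bits, length in
// the low 20 bits (the thread mentions 12 bits for offset, 20 for length).
func packRef(offset, length uint32) uint32 {
	return offset<<20 | length&0xFFFFF
}

// unpackRef reverses packRef.
func unpackRef(ref uint32) (offset, length uint32) {
	return ref >> 20, ref & 0xFFFFF
}

func main() {
	ref := packRef(3, 8) // symbol starting at byte 3, 8 bytes long
	off, n := unpackRef(ref)
	fmt.Println(off, n) // 3 8
}
```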
Later iterations replace this format, as described below.

**Minimized format optimizations**

@npazosmendez found documentation that when … With that change we saw sane CPU usage results. However, as I mentioned, this new format worked by packing both the offset and length of each symbol into a single uint32. The offset was packed into 12 bits and the length into 20 bits. This has obvious limitations: since remote write has no control or say over the number of labels a single metric can have, nor over the length of the symbols in the label set, we shouldn't rely on the limits that come with bit shifting/packing into smaller data sizes. However, it would be nice if we could still get the increased wire-bytes reduction that comes with such a change.

**Additional minimized format experiments**

Nico tried using two … The benchmark results are as follows. I was curious about the sender's side CPU usage and wire-bytes differences for other possible changes to the format. To that end, yesterday and today I've written some additional minimized formats to experiment with.

**Definitions**

From here on I'll refer to the formats as follows:
**Various min format benchmarks**

Note that I've modified the panel for the CPU usage diff to be a range-query timeseries panel. Comparing the CPU usage is a bit tricky; there's another comparison further down. For this set of experiments I also increased the batch size from the default to 5k, since this is more indicative of a large production workload. For the CPU usage, IMO the most interesting comparison is: for what % of the time range was the CPU usage of one of the new formats lower than that of the current base format? Roughly speaking, this makes it look like:
If we look at just the heap memory usage:
**Conclusion**

Based on the data in the previous section, I think the formats min-32bit-optimized and min-32-fixed are our two best options. The decision comes down to whether we want to use custom proto marshaling code. The modification from the generated code is fairly minimal, but IMO we would probably want to add a test that fails if (when we eventually replace gogoproto, for example) the custom marshaling is no longer more efficient than the generated version.
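A hedged sketch of what such a guard test could look like, comparing allocations of the two marshalers. The `OptimizedMarshal` method and `testWriteRequest` helper are assumed names, not the actual code:

```go
import "testing"

// Sketch: fail if the hand-written marshaler stops being cheaper (in
// allocations) than the generated gogoproto one. OptimizedMarshal and
// testWriteRequest are hypothetical names.
func TestOptimizedMarshalStillCheaper(t *testing.T) {
	req := testWriteRequest() // hypothetical helper building a representative request

	genAllocs := testing.AllocsPerRun(100, func() {
		_, _ = req.Marshal() // generated gogoproto marshaler
	})
	customAllocs := testing.AllocsPerRun(100, func() {
		_, _ = req.OptimizedMarshal() // custom marshaler
	})
	if customAllocs > genAllocs {
		t.Fatalf("custom marshaling allocates more than generated: %v > %v",
			customAllocs, genAllocs)
	}
}
```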
Great summary. To add on the marshaling code optimizations:
EDIT: I tried what I said above in the second item, using a plain … As I expected, the generated unmarshaling code does not allocate the extra memory the previous version did, so there's no need to tweak the code manually. Some quick benchmarks I ran show that the performance is the same as what @cstyan calls … Between these two options, I think it comes down to whether we prefer a weird message definition vs weird marshaling code. I personally lean towards the latter; a more efficient code generator may come in the future.
Thanks a lot! Epic work everyone! To me the benefits of … Before the final decision, I would love to try reproducing the results, especially without gogoproto (which is deprecated). I might have time for that next week 🤞🏽 BTW I asked some protobuf experts internally at Google; so far people are surprised this yields a better compression ratio than just compression. Waiting for more opinions, but that seems to be the reality (plus better decoding CPU usage).

**Other considerations**
PS: Maybe a silly question, and I know we tried a map, but did we try an array of strings (given uint32 refs) as a table?
Note that no interning + a more-compressing algorithm like zstd yields better compression than interning + snappy. However, interning + zstd can be even better. And there are also the CPU and memory benefits.
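For reference, a rough sketch of how such a size comparison can be done, assuming github.com/golang/snappy and github.com/klauspost/compress/zstd (the payload here is stand-in data, not a real write request):

```go
package main

import (
	"bytes"
	"fmt"

	"github.com/golang/snappy"
	"github.com/klauspost/compress/zstd"
)

func main() {
	// Stand-in for a marshaled WriteRequest; real payloads have far more
	// repeated label names/values, which is what interning removes.
	payload := bytes.Repeat([]byte(`{__name__="http_requests_total",job="api"} `), 200)

	sn := snappy.Encode(nil, payload)

	enc, err := zstd.NewWriter(nil)
	if err != nil {
		panic(err)
	}
	defer enc.Close()
	zs := enc.EncodeAll(payload, nil)

	fmt.Printf("raw=%dB snappy=%dB zstd=%dB\n", len(payload), len(sn), len(zs))
}
```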
I agree with this.
Almost. We've been experimenting with another format that @pracucci suggested, which makes the symbol table the following:
Yup, I expect us to actually use interning + zstd in the end and recommend that.
Ack.
My fault, the conversation is in lots of different places. The latest comparison results are here: npazosmendez#3 (comment). As @npazosmendez mentioned, we've most recently experimented with a few more changes to the format that produce even better results. Notably, in one we're changing the symbol table to be a byte slice instead of repeated uint32 or fixed32 values. In the other, the symbols table is still repeated fixed32 values, but those values are just the offsets, and the symbols string is now a byte slice of …
I think the issue was that just changing the compression uses more CPU. We can get comparable wire-bytes reduction results, but the CPU cost is too high. The likely cause of the results we're seeing with the new format + zstd is that we have way less data to compress overall than without the interning change.
We talked about this on Slack but I'll post my thoughts here in case anyone else is curious. In general, for me it boils down to this: I want to get these improvements (along with the metadata change and created timestamp) out into a Prometheus release ASAP behind a feature flag. I don't want this work to have to wait for a Prom v3.0 to go out. I know I probably should have a stronger opinion, and with the amount of changes it's more accurately a 2.0 than a 1.1, but I don't. As long as it can go into an official release behind a feature flag as soon as it's ready, and not have to wait for a Prom 3.0, I'm fine with calling it RW 1.1 or 2.0. TL;DR: we will be changing the format to include string interning and may still allow for alternative compression to be used.
Thanks! Just as a documentation piece: there is an elephant in the room to mention in the proposal and test one day--the Arrow format. Quite good work from Otel trying this out, but also a bit brave to make the protocol stateful; it's tricky to scale on the receiving end, so it's epic we have a stateless one here. I never played with Arrow, but it feels like a fully different protocol. Worth comparing at some point, e.g. when we integrate Prometheus to export data with Otel, but it looks extremely complex (just the metrics mapping is mind-blowing (unless it's Otel-specific complexity?)). Way more involved than our symbol table ideas. It's hard to argue that Arrow would be a good, immediately "quick to adopt" next version of remote write, given the epic efficiency wins we've got so far with the 1.1 or 2.0 remote write work; at least that's my understanding. Does it make sense? Happy to hear other opinions here (:
FYI, I am playing with non-gogoproto and array ideas. In the meanwhile, to keep the 1.1 version and make an incremental proto change instead, we could make 1.1 look similar to this:

```proto
message WriteRequest {
repeated TimeSeries timeseries = 1 [(gogoproto.nullable) = false];
reserved 2;
reserved 3;
// The symbols table. All symbols are concatenated strings. To read the symbols table, it's required
// to know the offset:length range of the actual symbol to read from this string.
string symbols = 4;
}
// TimeSeries represents samples and labels for a single time series.
message TimeSeries {
// For a timeseries to be valid, and for the samples and exemplars
// to be ingested by the remote system properly, the labels or label symbols field is required.
// Deprecated in 1.1, use interned label symbols instead.
repeated Label labels = 1 [(gogoproto.nullable) = false];
  // Sorted list of label name-value pair references. This list's length is always a multiple of 4,
// packing tuples of (label name offset, label name length, label value offset, label value length).
// Offsets point to the WriteRequest.symbols.
repeated uint32 label_symbols = 10;
// Sorted by time, oldest sample first.
repeated Sample samples = 2 [(gogoproto.nullable) = false];
repeated Exemplar exemplars = 3 [(gogoproto.nullable) = false];
repeated Histogram histograms = 4 [(gogoproto.nullable) = false];
Metadata metadata = 5;
// Optional created timestamp for the metric in ms format,
// if the first sample in samples does not contain 0 value.
// See model/timestamp/timestamp.go for conversion from time.Time
// to Prometheus timestamp.
int64 created_timestamp = 6;
}
```

It would allow backward and forward compatibility and a nice migration process (most of the types will be reused).
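To make the interned encoding concrete, here's a hedged sketch of how a receiver could resolve `label_symbols` against `WriteRequest.symbols`, following the comments in the message above (the function name and error handling are illustrative):

```go
import "fmt"

// resolveLabels turns (name offset, name length, value offset, value length)
// tuples from label_symbols into label pairs, reading from the request-wide
// symbols string. Sketch only; names are not from the actual code.
func resolveLabels(symbols string, labelSymbols []uint32) (map[string]string, error) {
	if len(labelSymbols)%4 != 0 {
		return nil, fmt.Errorf("label_symbols len %d is not a multiple of 4", len(labelSymbols))
	}
	labels := make(map[string]string, len(labelSymbols)/4)
	for i := 0; i < len(labelSymbols); i += 4 {
		nOff, nLen := int(labelSymbols[i]), int(labelSymbols[i+1])
		vOff, vLen := int(labelSymbols[i+2]), int(labelSymbols[i+3])
		if nOff+nLen > len(symbols) || vOff+vLen > len(symbols) {
			return nil, fmt.Errorf("symbol reference out of range")
		}
		labels[symbols[nOff:nOff+nLen]] = symbols[vOff : vOff+vLen]
	}
	return labels, nil
}
```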
Nice @bwplotka. But I'm thinking: protobuf (de)serialization backward/forward compatibility would not give us protocol backward/forward compatibility. For a request itself to be backwards/forwards compatible, a sender would need to fill in both `labels` and `label_symbols`. On the other hand, the benefit I do see with this is that a receiver can accept both 1.0 and 1.1 more easily without duplicating too many code paths. LMK what you think, I could also be missing something that you thought about.

Separate note: this is tangential to protocol negotiation. We did discuss making 1.1 receivers accept 1.0 to make migrations easier, though not necessarily through a compatible protobuf, but through HTTP headers or other means of negotiation. We intend to put together a design doc to discuss these options.
Yup, I think we are on the same page--essentially, for full compatibility we would need to duplicate data (which defeats the point of interning), and, let's say after some content negotiation, receivers might get errors on reading empty labels (e.g. when they forget to fix some parts of the code). I agree, we need content negotiation or at least some configuration. 👍🏽 I hadn't thought about it for a long time, but I think I still prefer reusing messages so far:
This is the most convincing point for me. Of course, we'd need to make sure sending a 1.1 request to an old 1.0 receiver works as expected: fail with a 4xx.
I wasn't thinking about the error message for the user. My reasoning was: why would we want to successfully deserialize a payload that we know is invalid? I would rather fail fast instead of relying on post-deserialization validation, which can be bug-prone. Now that I'm more convinced about your proposal, the answer to that (IMHO) is "1": reusing the struct will make things easier and cleaner in a number of other ways, and that benefit may be greater.
Very initial benchmark (MARSHAL only). (I added/cleaned some code and reused the newly proposed proto in https://github.com/prometheus/prometheus/tree/bwplotka/prw-initial, based on your amazing work in https://github.com/npazosmendez/prometheus/pull/3/files.) I wouldn't be too excited here as I am just learning how you optimized and implemented things so far (still learning), but reusing your benchmark quickly:
It feels like the …
**Bench script FYI**
If nothing else, it's motivation for trying []string at scale and on the receive side 🙈
And I felt the pain of custom marshal optimization code... adding fields to the proto alone is extremely error-prone 😱 But tests could quickly catch it! ;p I did the benchmark and thought "oh, it's so fast"... then realized I forgot to marshal the new repeated symbols... ;p
Yea. You can't do much without labels these days (e.g. lack of metric name), but maybe some receivers might ingest that and corrupt the DB? 🤔 Sounds quite bad indeed. The alternative to me is to call it 2.0 and create a truly different "package" version:
This would be great; the proto message would be much cleaner. I would try benchmarking this with the e2e-bench scripts for a better comparison: https://github.com/prometheus/prometheus/tree/remote-write-1.1/scripts/remotewrite11-bench . That would require the receiver end too, of course. If it looks better than the current minimized version (or even the same), I would definitely go with this one. Regarding the manual (un)marshaling optimizations:
Yea, there is also csproto on top of vtprotobuf; plus, moving off gogo can change things a lot. I plan to set up the Prometheus + receiver code and benchmark more. The soft deadline for this work could be the next DevSummit (next week Thursday) 🙈 No idea if that's viable. Otherwise the decision will be in January (:
Another concern that I have seems to not be mentioned anywhere: what will happen to the v1.0 version? Will implementors be able to choose whichever version they prefer, or only the newest one? Compression is nice, but it might not be feasible to implement on certain systems like microcontrollers. With very limited RAM it might be hard to build a symbol table or anything like that. Are you planning to implement an escape hatch for such small systems?
Yes, there will always be an option to negotiate/configure what protocol to use. However, for your microcontroller case, the new protocol will only be better. You can't partially decode proto anyway, and the symbols table makes the whole message significantly smaller, even without compression. This means less to allocate for the deserialized message.
> Is the current set of results here now accurate?

I tried vtproto and wasn't able to get it to perform as well as our current gogoproto format. There's likely still something I'm missing, but because we no longer have …
Personally, I don't really like the idea of maintaining this combined format. The code for buffering samples and building write requests is already pretty messy with the amount of "optional" fields we have, which is one thing we're reducing a bit in v1.1/v2. I also feel like the community's desire for fully backwards-compatible messages (i.e. sending all the data in all the fields) is going to be pretty low. People want less resource usage, not more.
I think the answer here should be defined within the spec, and IMO it should be failing with a consistent error message, something like "remote write proto format unrecognized/not correct". I agree with Nico's point that we should optimize the experience for the sender: they should receive an error back immediately if the receiver gets a RW proto format it doesn't support, and otherwise an error about invalid data if the unmarshaling failed.
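As a sketch of that sender experience on the receiver side (the Content-Type values used to distinguish the formats are purely hypothetical here; the actual negotiation mechanism was still being designed):

```go
import "net/http"

// Hypothetical receiver: reject unsupported remote write formats up front
// with a consistent 4xx, before attempting to unmarshal anything.
func remoteWriteHandler(w http.ResponseWriter, r *http.Request) {
	switch r.Header.Get("Content-Type") {
	case "application/x-protobuf": // existing 1.0 payload
		// ... decode prompb.WriteRequest, validate, ingest ...
	case "application/x-protobuf;proto=io.prometheus.write.v2.Request": // assumed v2 marker
		// ... decode the interned request, validate, ingest ...
	default:
		http.Error(w, "remote write proto format unrecognized/not correct",
			http.StatusUnsupportedMediaType)
	}
}
```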
The new format should end up being the same or better in all cases. IIRC, in our benchmarks for the sender's side we use more or less the same amount of CPU but less memory and fewer bytes over the wire. On the receiver's side, resource usage is down across the board.
Here are the benchmark results:

vtproto: …

gogoproto: …
Did some early benchmarks without any extra optimizations, just all plugins in their fastest possible mode. Using … Full results (sorry for the unsorted diff):
Benchmark: https://github.com/bwplotka/go-proto-bench/blob/main/prw/benchmark_test.go

**Summary**

My goal was to make sure that without gogoproto our new potential protocol with … I think it does not, so we should be able to proceed with that and see if we can have a decision at tomorrow's DevSummit 🤗 cc @cstyan @npazosmendez The decision to keep …
However, what's the final proto form with that []string idea? I would propose something like https://github.com/bwplotka/go-proto-bench/blob/main/proto/prw/v2testvtproto/write.proto, so PRW 2.0, with another package, e.g. … Do you want to clean up our https://docs.google.com/document/d/1PljkX3YLLT-4f7MqrLt7XCVPG3IsjRREzYrUzBxCPV0/edit proposal @cstyan to discuss the final details? 🤗
Had this all typed up last night but forgot to hit … As usual, I might have messed something up when hacking this together, and this isn't apples to apples since I'm still using the vtproto-generated proto code, but this is comparing building the write request via MarshalVT vs gogoproto.Marshal():
Today I've decided to benchmark just snappy and s2, for the following reason: I think we should replace base golang/snappy with s2's snappy-compatible packages now. Even without changing the underlying compression type, we can get some amount of network-bytes reduction and resource-usage reduction. We also won't be requiring remote write receivers to pull in a new dependency, which in some cases might not actually exist in their language. To that end, we can't use the base s2 encoding, but we can use its set of snappy-compatible compressions. The short version: s2 at its base takes Snappy's format and improves upon it to get even more compression in a similar format. Think of s2 as something in between base snappy and another compression library like zstd: slightly more CPU for a slightly better compression ratio. The snappy-compatible s2 encoding is still decodable by google/snappy, but is potentially slightly more compressed and requires fewer resources to decode (there's a small sketch of the swap at the end of this comment). First, let's look at base benchmarks; this is a comparison of building the current write request format (not minimized) and then compressing it with each option:
Not much change in time-to-compress when we use s2's snappy-compatible encodings, and some decrease in final compressed size if we use s2's own encoding. Then again, for the new version of the proto:
Again, not much change. But as we've mentioned before, this benchmark isn't entirely representative of all the work we have to do in remote write, so let's look at the benchmark dashboards again. In these screenshots:
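Finally, the drop-in swap mentioned above as a minimal sketch, assuming github.com/klauspost/compress/s2 (its snappy-compatible output stays decodable by github.com/golang/snappy):

```go
package main

import (
	"bytes"
	"fmt"

	"github.com/golang/snappy"
	"github.com/klauspost/compress/s2"
)

func main() {
	payload := bytes.Repeat([]byte("example remote write payload "), 100)

	// Sender side: s2's snappy-compatible encoder instead of snappy.Encode.
	encoded := s2.EncodeSnappy(nil, payload)

	// Receiver side: stock snappy can still decode it unchanged.
	decoded, err := snappy.Decode(nil, encoded)
	if err != nil {
		panic(err)
	}
	fmt.Println(bytes.Equal(decoded, payload), len(encoded)) // true <size>
}
```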
The spec was released in the experimental state: https://prometheus.io/docs/specs/remote_write_spec_2_0/. Rationales on technical details are in the proposal. Official comms & blog post will come, but it's ready for early adoption. Prometheus support & a compliance test will be released soon 🤗 We will merge the proposal once the spec becomes stable. Feel free to comment/give suggestions!
Linking #14220, which I read as "add the sender/receiver implementation of created_timestamp".
This issue serves as a public place to track and discuss the progress of remote write ~~v1.1~~ 2.0, which is described in this spec document and was first worked on here. Still in progress via #13052 and a number of other PRs.

UPDATE Dec 12, 2023: we're changing this to be 2.0; we don't intend for the protobuf format in the new version to be backwards compatible.
**Issues (as up-to-date as possible)**

**In progress**

**TODO**

- `TestSampleDelivery` to check for metadata in 2.0 proto #14414

**Complete**