-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added binary index header implementation. #1952
Conversation
8bde3c7
to
55f3cfe
Compare
1855315
to
ca2690d
Compare
52b01fd
to
2163a97
Compare
Depends on: #1952 Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
Enabled it also on all our tests. Depends on: #1952 Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
Enabled it also on all our tests. Depends on: #1952 Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
Enabled it also on all our tests. Depends on: #1952 Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
Enabled it also on all our tests. Depends on: #1952 Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
Enabled it also on all our tests. Depends on: #1952 Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
fb170a5
to
632b7cf
Compare
Enabled it also on all our tests. Depends on: #1952 Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
632b7cf
to
c3e3adc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for picking up the work on this! Unfortunately I didn't have enough time back in the day. Obviously I'm not an super expert on the TSDB format but the code seems clean and the edge cases were thought through AFAICT. Took me some time to wrap my head around all of this. Just some minor comments. The cherry on top of the pie would be to add some kind of flag to Thanos Compact
which would migrate all of the existing JSON files into the binary ones and improve the binary reader to try to download them from remote object storage. But I guess that could be added when we will move this from the experimental stage.
pkg/block/indexheader/header.go
Outdated
LookupSymbol(o uint32) (string, error) | ||
LabelValues(name string) []string | ||
|
||
// LabelValues returns all label values for given label name or error if not found. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment here seems a bit wrong: we return an error when an actual error had occurred, not when we haven't found any label values.
type BinaryTOC struct { | ||
// Symbols holds start to the same symbols section as index related to this index header. | ||
Symbols uint64 | ||
// PostingsTable holds start to the the same Postings Offset Table section as index related to this index header. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment should start here with PostingsOffsetTable
.
} | ||
// TODO(bwplotka): This is wrong, probably we have to sort. | ||
if lastKey != nil { | ||
prevRng.End = int64(off + 4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a lot of places we add 4
to different offsets. I guess that's because of how big crc32 is. I think the code's readability could be improved if we would move that constant 4
into a variable. crc32.Size
could probably be reused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, but it's not crc, but size of length of posting.
if len(key) != 2 { | ||
return errors.Errorf("unexpected key length for posting table %d", len(key)) | ||
} | ||
// TODO(bwplotka): This is wrong, probably we have to sort. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So should we sort this or not? 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not ;p
func compareIndexToHeader(t *testing.T, indexByteSlice index.ByteSlice, headerReader Reader) { | ||
indexReader, err := index.NewReader(indexByteSlice) | ||
testutil.Ok(t, err) | ||
defer func() { _ = indexReader.Close() }() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we check that this succeeds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, because it's a Prometheus indexReader and we are not testing that one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Magnificent job @bwplotka ! The change set LGTM. The only reason why I haven't approved is because I don't feel comfortable enough to deeply review BinaryReader
. I left few comments here and there, mainly questions.
pkg/store/bucket.go
Outdated
return nil, errors.Wrap(d.Err(), "read postings list") | ||
} | ||
// 4 for posting length, then n * 4, foreach each big endian posting. | ||
return b[:4+n*4], nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we double check there are actually 4+n*4
bytes in b
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice catch!
c := b[p.ptr.Start-start : p.ptr.End-start] | ||
|
||
_, fetchedPostings, err := r.dec.Postings(c) | ||
// index-header can estimate endings, which means we need to resize the endings. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't be the index-header reader responsability to return the right offsets, instead of having resizePostings()
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't due to arbitrary padding that can occur - it's fine though, overfetching is usually small.
We can only exactly resize once we know number of entries from fetched posting.
cur uint32 | ||
} | ||
|
||
// TODO(bwplotka): Expose those inside Prometheus. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
return br, nil | ||
} | ||
|
||
level.Warn(logger).Log("msg", "failed to read index-header from disk; recreating", "path", binfn, "err", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a warning? Looks more an Info()
to me, given we don't expect to find the index-header
on disk the first time a block is seen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Definitely! I would say even debug.
|
||
level.Warn(logger).Log("msg", "failed to read index-header from disk; recreating", "path", binfn, "err", err) | ||
|
||
if err := WriteBinary(ctx, bkt, id, binfn); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice tracking an histogram metric with the time it takes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm.. let's start with elapsed time in log line WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree
r.postings[key[0]] = []postingOffset{} | ||
if lastKey != nil { | ||
// Always include last value for each label name. | ||
r.postings[lastKey[0]] = append(r.postings[lastKey[0]], postingOffset{value: lastKey[1], tableOff: lastTableOff}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Despite the logic is correct, may be more clear if we also set lastKey = nil
after this line. Up to you.
} | ||
if i+1 == len(e) { | ||
for i := range tmpRngs { | ||
tmpRngs[i].End = r.indexLastPostingEnd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that when we'll load the postings, if we're reading the first one we're going to download from the bucket the entire postings section?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow ❤️ You just found a major bug here! Nice - I fixed it and made tests more strict! Awesome!
This PR adds index-header implementation based on [this design](https://thanos.io/proposals/201912_thanos_binary_index_header.md/) It adds a separate indexheader.Binary* structs and method allowing to build and read index-header in binary format. Size difference: 10k series for my autogenerated data it's 2.1x -rw-r--r-- 1 bwplotka bwplotka 6.1M Jan 10 13:20 index -rw-r--r-- 1 bwplotka bwplotka 23K Jan 10 13:20 index.cache.json -rw-r--r-- 1 bwplotka bwplotka 9.2K Jan 10 13:20 index-header For realistic block 8mln series, also similar gain. -rw-r--r-- 1 bwplotka bwplotka 1.9G Jan 10 13:29 index -rw-r--r-- 1 bwplotka bwplotka 287M Jan 10 13:29 index.cache.json -rw-r--r-- 1 bwplotka bwplotka 122M Jan 10 13:29 index-header NOTE: Size is smaller, but it's not what we are trying to optimize for. Nevertheless PostingOffsets and Symbols takes significant amount of bytes. The only downsides of size is the fact that to create such index-header we have to fetch those two parts ~60MB each from object storage. Idea for improvement if that will become a problem: Cache only 32th of the posting ranges and fetch gaps between on demand on query time (with some cache). Real time latencies for creation and loading (without network traffic): For 10k block it's similar for both (ms/micros), for 8mln we can spot the difference: index-header: * write 134.197732ms * read 415.971774ms index-cache.json: * write 6.712496338s * read 6.112222132s Before comparing I changed names to correlate tests: BenchmarkJSONReader-12-> BenchmarkRead-12 old BenchmarkBinaryReader-12 -> BenchmarkRead-12 new BenchmarkJSONWrite-12 -> BenchmarkWrite-12 old BenchmarkBinaryWrite-12 -> BenchmarkWrite-12 new benchmark old ns/op new ns/op delta BenchmarkRead-12 591780 66613 -88.74% BenchmarkWrite-12 2458454 6532651 +165.72% benchmark old allocs new allocs delta BenchmarkRead-12 2306 629 -72.72% BenchmarkWrite-12 1995 64 -96.79% benchmark old bytes new bytes delta BenchmarkRead-12 150904 32976 -78.15% BenchmarkWrite-12 161501 73412 -54.54% CPU time for smaller index file is interesting. Value is low anyway. Might be something to follow up. benchmark old ns/op new ns/op delta BenchmarkRead-12 7026290474 552913402 -92.13% BenchmarkWrite-12 6480769814 276441977 -95.73% benchmark old allocs new allocs delta BenchmarkRead-12 20100014 5501312 -72.63% BenchmarkWrite-12 18263356 64 -100.00% benchmark old bytes new bytes delta BenchmarkRead-12 1873789526 406021516 -78.33% BenchmarkWrite-12 2385193317 74187 -100.00% Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
Enabled it also on all our tests. Depends on: #1952 Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
c3e3adc
to
462d607
Compare
Thanks for reviews! All should be addressed.
I would say, remove all index-cache JSON file as we build all on-demand on store Gateway (: @pracucci kudos for finding quite a major bug, I think the Series() micro-benchmark would find this.. if we would have any (: TODO for next PR to iterate on index-header. |
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
return nil, errors.Wrap(err, "write index header") | ||
} | ||
|
||
level.Debug(logger).Log("msg", "built index-header file", "path", binfn, "elapsed", time.Since(start)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given we don't have any metric whenever an index-header
is created, if there's a bug which re-creates the index-header
for a block each time (because the reader detects it as corrupted) it would be very difficult to notice, unless you run Thanos with debug logging. Switching this log to Info
may help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. In this case we need a metric (:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will add that in next PR. (:
This feature really helped. Much better than mmapping the really large index json files that were being generated by our large prometheus cluster. Combined with the memcached index caching I can finally lower the memory requirements significantly in our clusters for store gateway. Hope to see more improvements. |
Btw did you run memcached as well? Any feedback on that part?
…On Fri, 28 Feb 2020, 18:00 Thomas Farvour, ***@***.***> wrote:
This feature really helped. Much better than mmapping the really large
index json files that were being generated by our large prometheus cluster.
Combined with the memcached index caching I can finally lower the memory
requirements significantly in our clusters for store gateway.
Hope to see more improvements.
—
You are receiving this because you modified the open/close state.
Reply to this email directly, view it on GitHub
<#1952?email_source=notifications&email_token=ABVA3O5DVHAHRQH5HHK5GH3RFFGMPA5CNFSM4KD55EY2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOENJSPRI#issuecomment-592652229>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABVA3O2D3XKE6EB6LNFMEO3RFFGMPANCNFSM4KD55EYQ>
.
|
Just to give some feedback: The memory consumption of our store gateway sank from 28,9 GB to 27,2 GB, which IMHO isn't such a huge improvement. But we still have no idea how the |
This PR adds index-header implementation based on this design.
It adds a separate
indexheader.Binary*
structs and methods allowing to build and read index-header in binary format.Fixes: #1711
Fixes: #1873
Fixes: #1750
Fixes: #1655
Fixes: #1471
Fixes: #1455
Fixes: #942
Fixes: #448
Fixes: #1750
cc @brian-brazil @codesome as you are super familiar with the TSDB format.
cc @pracucci
Stats
Size difference:
10k series for my autogenerated block:
For production block 8mln series, also similar gain.
NOTE: Size is smaller, but it's not what we are trying to optimize for. Nevertheless
PostingOffsets and Symbols takes a significant amount of bytes. The only downsides of size
is the fact that to create such index-header we have to fetch those two parts ~60MB each from object storage.
Idea for another improvement at some point would if it will become a problem: Cache only 32th of the posting ranges and fetch gaps between on demand
on query time (with some cache).
Real-time latencies for creation and loading (without network traffic):
For 10k block it's similar for both (ms/micros), for 8mln we can spot the difference:
index-header:
index-cache.json:
Go Benchmarks:
Before comparing I changed names to correlate tests with
benchcmp
:10k series block:
CPU time for smaller index file is interesting. Value is low anyway. Might be
something to follow up.
8mln series (index takes 2GB so not committed to git):
NOTE: Benchmarks for query time and CHANGELOG will go in next PR.
Signed-off-by: Bartlomiej Plotka bwplotka@gmail.com