Improve the performance of VAR_LIST storage layout #3093
Conversation
Codecov Report
Attention: Patch coverage is

@@ Coverage Diff @@
##           master    #3093      +/-   ##
==========================================
+ Coverage   91.89%   91.93%   +0.04%
==========================================
  Files        1169     1171       +2
  Lines       43774    43982     +208
==========================================
+ Hits        40227    40436     +209
+ Misses       3547     3546       -1
Let's also bump the last digit of the CMake project version.
Haven't finished reviewing all of these changes, I will continue later.
@@ -792,7 +792,8 @@ void Column::commitColumnChunkOutOfPlace(Transaction* transaction, node_group_id
     auto chunkMeta = getMetadata(nodeGroupIdx, transaction->getType());
     // TODO(Guodong): Should consider caching the scanned column chunk to avoid redundant
     // scans in the same transaction.
-    auto columnChunk = getEmptyChunkForCommit(chunkMeta.numValues + dstOffsets.size());
+    auto columnChunk =
+        getEmptyChunkForCommit(1.5 * std::bit_ceil(chunkMeta.numValues + dstOffsets.size()));
Why `1.5 *`? Are there performance implications behind this?
            startVarListOffset, endVarListOffset);
        varListColumnChunk->resetOffset();
    } else {
        varListColumnChunk->resizeDataColumnChunk(std::bit_ceil(resizeNumValues));
This code path is not well covered by tests. Can you add tests that first perform a bunch of in-place updates, then perform reads?
Let me take another look after you've addressed comments.
for (auto i = 0u; i < appendListNum; i++) {
    auto appendLen = varListChunk->getListLen(startSrcOffset + i);
    for (auto j = 0u; j < appendLen; j++) {
        dstOffsetsInDataColumn.push_back(dataColumnSize + appendListDataNum + j);
    }
    appendListDataNum += appendLen;
}
Why not just

for (auto i = 0u; i < appendListNum; i++) {
    auto appendLen = varListChunk->getListLen(startSrcOffset + i);
    for (auto j = 0u; j < appendLen; j++) {
        dstOffsetsInDataColumn.push_back(dataColumnSize++);
    }
}

btw, you can reuse this function under rel_table_data.h

inline void fillSequence(std::span<common::offset_t> offsets, common::offset_t startOffset) {
    for (auto i = 0u; i < offsets.size(); i++) {
        offsets[i] = i + startOffset;
    }
}
// At the end of the list data column, we need to add the original data column
// size to get the new offset.
// TODO(Jiamin): A better way is to store the offset in an offset column, just
// like the size column. Then we can reuse the prepareCommitForChunk interface
// for the offset column.
We should apply this optimization. You can separate it in another PR, but we should do it soon.
This PR improves the performance of the VAR_LIST storage layout introduced in #3060.
First, we now support in-place commits for VAR_LIST: the offset column, size column, and data column are committed separately. Previously, VAR_LIST did not support in-place commits, so every update of an item triggered an out-of-place commit: scanning the whole column, updating it, and writing the updated column back to persistent storage.
Second, the design of #3060 is good for write performance but bad for scan performance, since it can cause random access to the data column. If the offsets are not in ascending order, we have to scan the lists one by one. Worse, the data column grows with every update/insert. To balance write and read performance, we rewrite the whole VAR_LIST column in ascending order at some point: currently, when we perform an out-of-place commit and the size of the data column exceeds a threshold (currently capacity/2).
There is still room to optimize the VAR_LIST storage layout further, e.g. the tmpDataColumnChunk in VarListColumn::scan(Transaction* transaction, node_group_idx_t nodeGroupIdx, kuzu::storage::ColumnChunk* columnChunk, offset_t startOffset, offset_t endOffset).
I also ran a benchmark to test the performance of the current VAR_LIST storage layout. The dataset has one person(id int64, age int64[]) table and one knows(from person to person, length int64) table, with 10000000 nodes (around 1GB) and 100000000 edges (around 11GB). The results are as follows. The write performance has improved a lot. The read performance is similar to the previous implementation, except for scanning the whole edge table (possibly caused by the second bottleneck above).