Improve the performance of VAR_LIST storage layout #3093
Conversation
Codecov Report
Attention: Patch coverage is

@@ Coverage Diff @@
##           master    #3093      +/-   ##
==========================================
+ Coverage   91.89%   91.93%   +0.04%
==========================================
  Files        1169     1171       +2
  Lines       43774    43982     +208
==========================================
+ Hits        40227    40436     +209
+ Misses       3547     3546       -1
Let's also bump the last digit of the CMake project version.
Haven't finished reviewing all of these changes, I will continue later.
@@ -792,7 +792,8 @@ void Column::commitColumnChunkOutOfPlace(Transaction* transaction, node_group_id
     auto chunkMeta = getMetadata(nodeGroupIdx, transaction->getType());
     // TODO(Guodong): Should consider caching the scanned column chunk to avoid redundant
     // scans in the same transaction.
-    auto columnChunk = getEmptyChunkForCommit(chunkMeta.numValues + dstOffsets.size());
+    auto columnChunk =
+        getEmptyChunkForCommit(1.5 * std::bit_ceil(chunkMeta.numValues + dstOffsets.size()));
Why `1.5 *`? Are there performance implications behind this?
            startVarListOffset, endVarListOffset);
        varListColumnChunk->resetOffset();
    } else {
        varListColumnChunk->resizeDataColumnChunk(std::bit_ceil(resizeNumValues));
This code path is not well covered by tests. Can you add tests that first perform a bunch of in-place updates, then perform reads?
Let me take another look after you've addressed comments.
for (auto i = 0u; i < appendListNum; i++) {
    auto appendLen = varListChunk->getListLen(startSrcOffset + i);
    for (auto j = 0u; j < appendLen; j++) {
        dstOffsetsInDataColumn.push_back(dataColumnSize + appendListDataNum + j);
    }
    appendListDataNum += appendLen;
}
Why not just

for (auto i = 0u; i < appendListNum; i++) {
    auto appendLen = varListChunk->getListLen(startSrcOffset + i);
    for (auto j = 0u; j < appendLen; j++) {
        dstOffsetsInDataColumn.push_back(dataColumnSize++);
    }
}

btw, you can reuse this function under rel_table_data.h

inline void fillSequence(std::span<common::offset_t> offsets, common::offset_t startOffset) {
    for (auto i = 0u; i < offsets.size(); i++) {
        offsets[i] = i + startOffset;
    }
}
// At the end of the list data column, we need to add the original data column
// size to get the new offset.
// TODO(Jiamin): A better way is to store the offset in an offset column, just
// like the size column. Then we can reuse the prepareCommitForChunk interface
// for the offset column.
We should apply this optimization. You can separate it in another PR, but we should do it soon.
This PR improves the performance of the VAR_LIST storage layout introduced in #3060.
First, we now support in-place commits for VAR_LIST: the offset column, size column, and data column are committed separately. Previously, VAR_LIST did not support in-place commits, so every update of an item triggered an out-of-place commit: scanning the whole column, updating it, and writing the updated column back to persistent storage.
Second, the design of #3060 is good for write performance but bad for scan performance, since it can cause random access to the data column. If the offsets are not in ascending order, we have to scan the lists one by one. Worse, the data column grows with every update/insert. To balance write and read performance, we rewrite the whole VAR_LIST column in ascending order at some point: currently, when we perform an out-of-place commit and the size of the data column exceeds a threshold (currently capacity/2).
There is still room to optimize the VAR_LIST storage layout further, e.g. the tmpDataColumnChunk in VarListColumn::scan(Transaction* transaction, node_group_idx_t nodeGroupIdx, kuzu::storage::ColumnChunk* columnChunk, offset_t startOffset, offset_t endOffset).
I also ran a benchmark to test the performance of the current VAR_LIST storage layout. The dataset has one person(id int64, age int64[]) table and one knows(from person to person, length int64) table, with 10000000 nodes (around 1GB) and 100000000 edges (around 11GB). The results are as follows. The write performance has improved a lot. The read performance is similar to the previous implementation, except for scanning the whole edge table (possibly caused by the second bottleneck above).