Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Compression] ALP Compression (float/double) #9635

Merged
merged 47 commits into from
Jan 22, 2024
Merged
Show file tree
Hide file tree
Changes from 43 commits
Commits
Show all changes
47 commits
Select commit Hold shift + click to select a range
960cffa
[Compression] ALP First Unit Tests (Double, Float and Negative Numbers)
Aug 25, 2023
a9d436d
[Compression] Base ALP for Doubles and Floats
Aug 25, 2023
67e2249
[Compression] ALP as a type of compression
Aug 25, 2023
823f038
[Compression] A first version of ALP
Aug 30, 2023
70e25e8
[ALPRD] First implementation
Oct 21, 2023
c3f20e5
[ALPRD] First implementation
Oct 21, 2023
4e7f04a
[ALPRD] First implementation
Oct 21, 2023
53295f5
[ALP] Fixing sampling bug and DoubleToInt64 function
Oct 21, 2023
6da081e
[ALP] Fixing TODOs and removing code smells
Oct 21, 2023
f9d1a10
[ALP RD] Prettyfying code
Oct 23, 2023
982eb1a
[ALP] Prettyfying code
Oct 23, 2023
ee588ad
[Alp] Prettyfying code and fixing bad smells
Oct 23, 2023
54ba0c8
[Alp] Adding some tests
Oct 23, 2023
824cd40
[ALP] Adding Patas tests to ALP
Oct 24, 2023
980230c
[ALPRD] Adding Patas tests to ALPRD
Oct 24, 2023
942d6e7
[ALP] Skipping CI for now [ci skip]
Oct 24, 2023
2bf0b9d
Merge branch 'duckdb:main' into alp_compression
Oct 24, 2023
2963bd6
[ALP] Fixing clang-format
Oct 24, 2023
19643c6
[ALP] Fixing clang-format
Oct 24, 2023
b352160
[ALP] Fixing undefined references on Release version
Oct 24, 2023
28864dd
[ALP] Fixing math functions
Oct 24, 2023
018ac9e
[ALP] Including cmath
Oct 24, 2023
5ee73e4
[ALP] Including cmath
Oct 24, 2023
3b11610
[ALP] Small format fix
Oct 24, 2023
0d13662
[ALP] Fix bug in ALP and Windows format fix (hopefully)
Nov 6, 2023
6af5f61
Merge branch 'duckdb:main' into alp_compression
Nov 7, 2023
6caf2b1
[ALP] Fix bug in ALP RD size estimation
Nov 7, 2023
43f33a8
Merge branch 'duckdb:main' into alp_compression
Nov 7, 2023
84e1e80
[ALP] Commit to run Actions with tags
Nov 7, 2023
2ab5166
Merge branch 'duckdb:main' into alp_compression
Nov 8, 2023
a090daa
[ALP] Fix a bug that was making compression slower than intended
Nov 8, 2023
cefaf85
Merge branch 'duckdb:main' into alp_compression
Nov 8, 2023
33a0c02
[ALP] Adding ALP Licenses
Nov 9, 2023
ffeed5a
[ALP] Adding Benchmarks for ALP and ALPRD
Nov 10, 2023
0a04e64
[ALP] Refactoring code and implementing most review comments
Nov 24, 2023
a809e63
[ALP] Removing unused code and modularizing nulls replacement function
Nov 24, 2023
b11ea45
[ALP] Making dictionary not always same size to improve compression r…
Nov 28, 2023
86428a7
[ALP] Format fix
Nov 28, 2023
5259cb2
[ALP] Fix bug in early exit mechanism introduced with a refactoring
Nov 28, 2023
6e52821
[ALP] Fix repeated lines bug
Nov 28, 2023
5983a96
[ALP] Removed unused variable
Nov 28, 2023
83b7c68
[ALP] Including cmath
Nov 28, 2023
c265eba
[ALP] Removing FIXME comment
Nov 29, 2023
40fd3f1
[ALP] Implementing a struct for the Encoding Indices instead of a pair
Nov 29, 2023
63d3a5e
[ALP] Compression optimizations & other small changes
Jan 19, 2024
dcadf17
[ALP] Fixing overload to compress without nulls
Jan 19, 2024
5e90a1e
[ALP] Changing comment of constant to be more accurate
Jan 19, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
24 changes: 24 additions & 0 deletions benchmark/micro/compression/alp/alp_read.benchmark
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
# name: benchmark/micro/compression/alp/alp_read.benchmark
# description: Scanning a large amount of doubles
# group: [alp]

name Alp Scan
group alp
storage persistent
require parquet
require httpfs

load
DROP TABLE IF EXISTS integers;
PRAGMA force_compression='alp';
CREATE TABLE temperatures (
temperature DOUBLE
);
INSERT INTO temperatures SELECT temp FROM 'https://github.com/duckdb/duckdb-data/releases/download/v1.0/city_temperature.parquet' t(temp), range(28);
checkpoint;

run
select avg(temperature) from temperatures;

result I
56.028391124637494
16 changes: 16 additions & 0 deletions benchmark/micro/compression/alp/alp_read_best_case.benchmark
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
# name: benchmark/micro/compression/alp/alp_read_best_case.benchmark
# description: ALP best case scenario is when it founds low precision decimals within a limited absolute range
# group: [alp]

name Alp Scan
group alp
storage persistent

load
DROP TABLE IF EXISTS alp_random_doubles;
PRAGMA force_compression='alp';
create table alp_random_doubles as select round(random(), 1)::DOUBLE as data from range(200000000) tbl(i);
checkpoint;

run
select avg(data) from alp_random_doubles;
16 changes: 16 additions & 0 deletions benchmark/micro/compression/alp/alp_read_worst_case.benchmark
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
# name: benchmark/micro/compression/alp/alp_read_worst_case.benchmark
# description: ALP slowest scenario is when it founds high precision decimals. Here, ALP achieves no compression and everything is encoded as exception
# group: [alp]

name Alp Scan
group alp
storage persistent

load
DROP TABLE IF EXISTS alp_random_doubles;
PRAGMA force_compression='alp';
create table alp_random_doubles as select random()::DOUBLE as data from range(200000000) tbl(i);
checkpoint;

run
select avg(data) from alp_random_doubles;
21 changes: 21 additions & 0 deletions benchmark/micro/compression/alp/alp_store.benchmark
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
# name: benchmark/micro/compression/alp/alp_store.benchmark
# description: Scanning a large amount of doubles
# group: [alp]

name Alp Insert
group alp
storage persistent
require_reinit
require parquet
require httpfs

load
PRAGMA force_compression='alp';
DROP TABLE IF EXISTS temperatures;
CREATE TABLE temperatures (
temperature DOUBLE
);

run
INSERT INTO temperatures SELECT temp FROM 'https://github.com/duckdb/duckdb-data/releases/download/v1.0/city_temperature.parquet' t(temp), range(28);
Mytherin marked this conversation as resolved.
Show resolved Hide resolved
checkpoint;
16 changes: 16 additions & 0 deletions benchmark/micro/compression/alp/alp_store_best_case.benchmark
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
# name: benchmark/micro/compression/alp/alp_store_best_case.benchmark
# description: ALP best case scenario is when it founds low precision decimals within a limited absolute range.
# group: [alp]

name Alp Insert
group alp
storage persistent
require_reinit

load
PRAGMA force_compression='alp';
DROP TABLE IF EXISTS alp_random_doubles;

run
create table alp_random_doubles as select round(random(), 1)::DOUBLE as data from range(50000000) tbl(i);
checkpoint;
16 changes: 16 additions & 0 deletions benchmark/micro/compression/alp/alp_store_worst_case.benchmark
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
# name: benchmark/micro/compression/alp/alp_store_worst_case.benchmark
# description: ALP slowest scenario is when it founds high precision decimals. Here, ALP achieves no compression and everything is encoded as exception
# group: [alp]

name Alp Insert
group alp
storage persistent
require_reinit

load
PRAGMA force_compression='alp';
DROP TABLE IF EXISTS alp_random_doubles;

run
create table alp_random_doubles as select random()::DOUBLE as data from range(50000000) tbl(i);
checkpoint;
24 changes: 24 additions & 0 deletions benchmark/micro/compression/alprd/alprd_read.benchmark
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
# name: benchmark/micro/compression/alprd/alprd_read.benchmark
# description: Scanning a large amount of doubles
# group: [alprd]

name Alprd Scan
group alprd
storage persistent
require parquet
require httpfs

load
DROP TABLE IF EXISTS integers;
PRAGMA force_compression='alprd';
CREATE TABLE temperatures (
temperature DOUBLE
);
INSERT INTO temperatures SELECT temp FROM 'https://github.com/duckdb/duckdb-data/releases/download/v1.0/city_temperature.parquet' t(temp), range(28);
checkpoint;

run
select avg(temperature) from temperatures;

result I
56.028391124637494
16 changes: 16 additions & 0 deletions benchmark/micro/compression/alprd/alprd_read_best_case.benchmark
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
# name: benchmark/micro/compression/alprd/alprd_read_best_case.benchmark
# description: ALPRD best case scenario is when all the floats share their front bits
# group: [alprd]

name Alprd Scan
group alprd
storage persistent

load
DROP TABLE IF EXISTS alprd_random_doubles;
PRAGMA force_compression='alprd';
create table alprd_random_doubles as select (random() + 10)::DOUBLE as data from range(200000000) tbl(i);
checkpoint;

run
select avg(data) from alprd_random_doubles;
16 changes: 16 additions & 0 deletions benchmark/micro/compression/alprd/alprd_read_worst_case.benchmark
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
# name: benchmark/micro/compression/alprd/alprd_read_worst_case.benchmark
# description: ALPRD worst case scenario is when all the float have unique front bits. Multiplying by different powers of two ensures us unique front bits
# group: [alprd]

name Alprd Scan
group alprd
storage persistent

load
DROP TABLE IF EXISTS alprd_random_doubles;
PRAGMA force_compression='alprd';
create table alprd_random_doubles as select (random() * pow(2, (i % 1000)) * (CASE WHEN i%2=0 THEN 1 ELSE -1 END))::DOUBLE as data from range(200000000) tbl(i);
checkpoint;

run
select avg(data) from alprd_random_doubles;
21 changes: 21 additions & 0 deletions benchmark/micro/compression/alprd/alprd_store.benchmark
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
# name: benchmark/micro/compression/alprd/alprd_store.benchmark
# description: Scanning a large amount of doubles
# group: [alprd]

name Alprd Insert
group alprd
storage persistent
require_reinit
require parquet
require httpfs

load
PRAGMA force_compression='alprd';
DROP TABLE IF EXISTS temperatures;
CREATE TABLE temperatures (
temperature DOUBLE
);

run
INSERT INTO temperatures SELECT temp FROM 'https://github.com/duckdb/duckdb-data/releases/download/v1.0/city_temperature.parquet' t(temp), range(28);
checkpoint;
17 changes: 17 additions & 0 deletions benchmark/micro/compression/alprd/alprd_store_best_case.benchmark
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
# name: benchmark/micro/compression/alprd/alprd_store_best_case.benchmark
# description: ALPRD best case scenario is when all the floats share their front bits.
# group: [alprd]

name Alprd Insert
group alprd
storage persistent
require_reinit

load
DROP TABLE IF EXISTS alprd_random_doubles;
PRAGMA force_compression='alprd';
checkpoint;

run
create table alprd_random_doubles as select (random() + 10)::DOUBLE as data from range(50000000) tbl(i);
checkpoint;
17 changes: 17 additions & 0 deletions benchmark/micro/compression/alprd/alprd_store_worst_case.benchmark
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
# name: benchmark/micro/compression/alprd/alprd_store_worst_case.benchmark
# description: ALPRD worst case scenario is when all the float have unique front bits. Multiplying by different powers of two ensures us unique front bits
# group: [alprd]

name Alprd Insert
group alprd
storage persistent
require_reinit

load
DROP TABLE IF EXISTS alprd_random_doubles;
PRAGMA force_compression='alprd';
checkpoint;

run
create table alprd_random_doubles as select (random() * pow(2, (i % 1000)) * (CASE WHEN i%2=0 THEN 1 ELSE -1 END))::DOUBLE as data from range(50000000) tbl(i);
checkpoint;
10 changes: 10 additions & 0 deletions src/common/enum_util.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1033,6 +1033,10 @@ const char* EnumUtil::ToChars<CompressionType>(CompressionType value) {
return "COMPRESSION_CHIMP";
case CompressionType::COMPRESSION_PATAS:
return "COMPRESSION_PATAS";
case CompressionType::COMPRESSION_ALP:
return "COMPRESSION_ALP";
case CompressionType::COMPRESSION_ALPRD:
return "COMPRESSION_ALPRD";
case CompressionType::COMPRESSION_COUNT:
return "COMPRESSION_COUNT";
default:
Expand Down Expand Up @@ -1072,6 +1076,12 @@ CompressionType EnumUtil::FromString<CompressionType>(const char *value) {
if (StringUtil::Equals(value, "COMPRESSION_PATAS")) {
return CompressionType::COMPRESSION_PATAS;
}
if (StringUtil::Equals(value, "COMPRESSION_ALP")) {
return CompressionType::COMPRESSION_ALP;
}
if (StringUtil::Equals(value, "COMPRESSION_ALPRD")) {
return CompressionType::COMPRESSION_ALPRD;
}
if (StringUtil::Equals(value, "COMPRESSION_COUNT")) {
return CompressionType::COMPRESSION_COUNT;
}
Expand Down
8 changes: 8 additions & 0 deletions src/common/enums/compression_type.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,10 @@ CompressionType CompressionTypeFromString(const string &str) {
return CompressionType::COMPRESSION_CHIMP;
} else if (compression == "patas") {
return CompressionType::COMPRESSION_PATAS;
} else if (compression == "alp") {
return CompressionType::COMPRESSION_ALP;
} else if (compression == "alprd") {
return CompressionType::COMPRESSION_ALPRD;
} else {
return CompressionType::COMPRESSION_AUTO;
}
Expand Down Expand Up @@ -61,6 +65,10 @@ string CompressionTypeToString(CompressionType type) {
return "Chimp";
case CompressionType::COMPRESSION_PATAS:
return "Patas";
case CompressionType::COMPRESSION_ALP:
return "ALP";
case CompressionType::COMPRESSION_ALPRD:
return "ALPRD";
default:
throw InternalException("Unrecognized compression type!");
}
Expand Down
4 changes: 4 additions & 0 deletions src/function/compression_config.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,8 @@ static DefaultCompressionMethod internal_compression_methods[] = {
DictionaryCompressionFun::TypeIsSupported},
{CompressionType::COMPRESSION_CHIMP, ChimpCompressionFun::GetFunction, ChimpCompressionFun::TypeIsSupported},
{CompressionType::COMPRESSION_PATAS, PatasCompressionFun::GetFunction, PatasCompressionFun::TypeIsSupported},
{CompressionType::COMPRESSION_ALP, AlpCompressionFun::GetFunction, AlpCompressionFun::TypeIsSupported},
{CompressionType::COMPRESSION_ALPRD, AlpRDCompressionFun::GetFunction, AlpRDCompressionFun::TypeIsSupported},
{CompressionType::COMPRESSION_FSST, FSSTFun::GetFunction, FSSTFun::TypeIsSupported},
{CompressionType::COMPRESSION_AUTO, nullptr, nullptr}};

Expand Down Expand Up @@ -76,6 +78,8 @@ vector<reference<CompressionFunction>> DBConfig::GetCompressionFunctions(Physica
TryLoadCompression(*this, result, CompressionType::COMPRESSION_DICTIONARY, data_type);
TryLoadCompression(*this, result, CompressionType::COMPRESSION_CHIMP, data_type);
TryLoadCompression(*this, result, CompressionType::COMPRESSION_PATAS, data_type);
TryLoadCompression(*this, result, CompressionType::COMPRESSION_ALP, data_type);
TryLoadCompression(*this, result, CompressionType::COMPRESSION_ALPRD, data_type);
TryLoadCompression(*this, result, CompressionType::COMPRESSION_FSST, data_type);
return result;
}
Expand Down
2 changes: 2 additions & 0 deletions src/include/duckdb/common/enums/compression_type.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,8 @@ enum class CompressionType : uint8_t {
COMPRESSION_FSST = 7,
COMPRESSION_CHIMP = 8,
COMPRESSION_PATAS = 9,
COMPRESSION_ALP = 10,
COMPRESSION_ALPRD = 11,
COMPRESSION_COUNT // This has to stay the last entry of the type!
};

Expand Down
10 changes: 10 additions & 0 deletions src/include/duckdb/function/compression/compression.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -48,6 +48,16 @@ struct PatasCompressionFun {
static bool TypeIsSupported(PhysicalType type);
};

struct AlpCompressionFun {
static CompressionFunction GetFunction(PhysicalType type);
static bool TypeIsSupported(PhysicalType type);
};

struct AlpRDCompressionFun {
static CompressionFunction GetFunction(PhysicalType type);
static bool TypeIsSupported(PhysicalType type);
};

struct FSSTFun {
static CompressionFunction GetFunction(PhysicalType type);
static bool TypeIsSupported(PhysicalType type);
Expand Down
21 changes: 21 additions & 0 deletions src/include/duckdb/storage/compression/alp/algorithm/LICENSE
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023 CWI, Azim Afroozeh, Leonardo Xavier Kuffo Rivero

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Loading