Parquet Encryption #9392

Merged
merged 40 commits into from
Nov 8, 2023


lnkuiper
Contributor

This PR implements encryption for Parquet files as an experimental feature. We mostly follow the Parquet Modular Encryption specification, save for some details. For now, this means that our Parquet encryption is not compatible with that of, e.g., PyArrow, until the missing details are implemented.

Named encryption keys of 128, 192, or 256 bits can be added to a session; they are stored in memory:

PRAGMA add_parquet_key('key128', '0123456789112345');
PRAGMA add_parquet_key('key192', '012345678911234501234567');
PRAGMA add_parquet_key('key256', '01234567891123450123456789112345');
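To make the key sizes concrete, here is a minimal C++ sketch. It assumes the bytes of the key string are used directly as the AES key, which the 16-, 24-, and 32-character examples above suggest. `AesKeyBits` is a hypothetical helper for illustration, not part of DuckDB.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not DuckDB's API): returns the AES key size in bits
// for a raw key string, or throws if the length is not 16, 24, or 32 bytes.
inline std::size_t AesKeyBits(const std::string &key) {
	switch (key.size()) {
	case 16: // AES-128
	case 24: // AES-192
	case 32: // AES-256
		return key.size() * 8;
	default:
		throw std::invalid_argument("AES key must be 16, 24, or 32 bytes, got " +
		                            std::to_string(key.size()));
	}
}
```

Under that assumption, the three keys above map to 128, 192, and 256 bits, and any other length is rejected before it reaches the cipher.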

Files are then encrypted like so:

COPY tbl TO 'tbl.parquet' (ENCRYPTION_CONFIG {footer_key: 'key256'});

For now, we encrypt the footer and all columns using the footer_key. The Parquet specification allows encryption of individual columns with different keys, e.g.:

COPY tbl TO 'tbl.parquet' (ENCRYPTION_CONFIG {footer_key: 'key256',
                                              column_keys: {key256: ['col0', 'col1']}});

This will cause an error to be thrown (for now!):

Not implemented Error: Parquet encryption_config column_keys not yet implemented

The encrypted file can then be read like so:

COPY tbl FROM 'tbl.parquet' (ENCRYPTION_CONFIG {footer_key: 'key256'});
SELECT * FROM read_parquet('tbl.parquet', encryption_config={footer_key: 'key256'});

Encryption has some performance implications, of course. Without encryption, reading/writing the lineitem table from TPC-H at SF1, which is 6M rows and 15 columns, from/to a Parquet file takes 0.26s and 0.99s, respectively. With encryption, this takes 0.64s and 2.21s.
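For reference, the slowdown implied by those timings can be computed directly. The snippet below is only arithmetic on the numbers quoted above; `Slowdown` is a hypothetical name.

```cpp
// Ratio of encrypted to unencrypted wall-clock time.
constexpr double Slowdown(double plain_seconds, double encrypted_seconds) {
	return encrypted_seconds / plain_seconds;
}

// Timings from the lineitem (TPC-H SF1) measurement above.
constexpr double kReadSlowdown = Slowdown(0.26, 0.64);  // reading
constexpr double kWriteSlowdown = Slowdown(0.99, 2.21); // writing
```

That works out to roughly 2.5x for reading and 2.2x for writing on this one table.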

@github-actions github-actions bot marked this pull request as draft October 20, 2023 08:33
@lnkuiper lnkuiper marked this pull request as ready for review October 20, 2023 08:33
@Mytherin Mytherin added the Needs Documentation label Oct 24, 2023
@carlopi
Contributor

carlopi commented Oct 25, 2023

I checked locally, and it compiles with duckdb-wasm (I was unsure about the dependencies, and we should set up a proper CI job to do this). Yay!
Minor data point: the DuckDB binary size difference (while bundling Parquet, as per the default configuration) is 2%.

Contributor

@Tmonster Tmonster left a comment

I didn't look at the embedded libraries. I will most likely look at this again after reading the Parquet Modular Encryption specification.

extension/parquet/column_reader.cpp (outdated)
extension/parquet/include/parquet_crypto.hpp
}

if (transport_remaining != 0) {
	throw InvalidInputException("Encoded ciphertext length differs from actual ciphertext length");
Contributor

What does this error mean exactly? What is the difference between an encoded ciphertext and the actual ciphertext? I'm guessing it has something to do with encoding the text in base64 for transport?

Contributor Author

The length of the ciphertext (in bytes) is encoded right before the actual ciphertext. If, for some reason, the actual length differs from the encoded length, something is definitely wrong, and we have to throw an error.
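A minimal sketch of that check, for readers unfamiliar with the layout. It assumes a 4-byte little-endian length prefix (as the Parquet Modular Encryption specification describes) and a fully buffered input, whereas the PR tracks the remaining bytes while streaming. `CheckedCiphertextLength` is a hypothetical helper, not the function in `parquet_crypto.cpp`.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper: reads the 4-byte little-endian length prefix and checks
// that exactly that many bytes follow it. Returns the encoded length.
inline std::uint32_t CheckedCiphertextLength(const std::vector<std::uint8_t> &buffer) {
	constexpr std::size_t kLengthBytes = 4;
	if (buffer.size() < kLengthBytes) {
		throw std::invalid_argument("Buffer too small to contain a length prefix");
	}
	// Assemble the prefix byte by byte so the result does not depend on host endianness.
	const std::uint32_t encoded = static_cast<std::uint32_t>(buffer[0]) |
	                              (static_cast<std::uint32_t>(buffer[1]) << 8) |
	                              (static_cast<std::uint32_t>(buffer[2]) << 16) |
	                              (static_cast<std::uint32_t>(buffer[3]) << 24);
	const std::size_t actual = buffer.size() - kLengthBytes;
	if (encoded != actual) {
		throw std::invalid_argument("Encoded ciphertext length differs from actual ciphertext length");
	}
	return encoded;
}
```

So this has nothing to do with base64: a mismatch means the file is truncated, corrupted, or not laid out the way the reader expects.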

extension/parquet/parquet_reader.cpp (outdated)
@@ -395,7 +422,10 @@ void ParquetReader::InitializeSchema() {
 	auto file_meta_data = GetFileMetadata();

 	if (file_meta_data->__isset.encryption_algorithm) {
-		throw FormatException("Encrypted Parquet files are not supported");
+		if (file_meta_data->encryption_algorithm.__isset.AES_GCM_CTR_V1) {
Contributor

Not 100% sure how Parquet reading works, so this question might be totally wrong. It looks like you do this check for the footer as well. Is it possible to encrypt a Parquet file footer with one encryption standard but the rest of the file with another?

Contributor Author

Funky stuff like that is possible, but we only support the basics. If another algorithm is used, we throw an error here. If other parts are encrypted with a different algo, decryption will fail, and we will definitely throw an error elsewhere.
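As a rough illustration of "throw here if another algorithm is used", here is a sketch. The two enum values are the algorithms named by the Parquet Modular Encryption specification; which one is accepted is an assumption on my part (AES_GCM_V1), and `RequireSupportedCipher` is a hypothetical name rather than the code in this PR.

```cpp
#include <stdexcept>

// The two algorithms defined by Parquet Modular Encryption.
enum class ParquetCipher { AES_GCM_V1, AES_GCM_CTR_V1 };

// Hypothetical up-front check. ASSUMPTION: only AES_GCM_V1 is supported;
// anything else is rejected before any decryption is attempted.
inline void RequireSupportedCipher(ParquetCipher cipher) {
	if (cipher != ParquetCipher::AES_GCM_V1) {
		throw std::invalid_argument("Unsupported Parquet encryption algorithm, only AES_GCM_V1 is supported");
	}
}
```

Failing early on the declared algorithm gives a clearer error than letting decryption fail later on a tag mismatch.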

@github-actions github-actions bot marked this pull request as draft November 7, 2023 15:55
@lnkuiper lnkuiper marked this pull request as ready for review November 7, 2023 15:55
@Mytherin Mytherin merged commit 4915dd7 into duckdb:feature Nov 8, 2023
45 of 46 checks passed
@Mytherin
Collaborator

Mytherin commented Nov 8, 2023

Thanks!

@lnkuiper lnkuiper deleted the parquet_encryption branch November 24, 2023 13:37
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request Dec 11, 2023
Merge pull request duckdb/duckdb#9392 from lnkuiper/parquet_encryption
Merge pull request duckdb/duckdb#9461 from hawkfish/merge-sort-trees
Merge pull request duckdb/duckdb#8788 from alnkesq/capi_create_enum_type
Merge pull request duckdb/duckdb#9513 from Tmonster/5614-database-invalidated
Merge pull request duckdb/duckdb#9622 from Mytherin/typescoping
Merge pull request duckdb/duckdb#9615 from hawkfish/strptime-infinity
Merge pull request duckdb/duckdb#9627 from Mytherin/attachifnotexists
Merge pull request duckdb/duckdb#9648 from samansmink/add-keep-alive-toggle
Merge pull request duckdb/duckdb#9638 from taniabogatsch/bench-refactor
Merge pull request duckdb/duckdb#9651 from Mytherin/getenv
@carlopi carlopi added Needs Documentation and removed Needs Documentation labels Feb 10, 2024
@Youssef-Harby

Perhaps this is not the right place to ask, but can encryption be applied on a column-by-column basis, such that specific information is accessible only with a specific encryption key?
I mean per column, not only for the whole file.

Maybe something like attribute-based authorization, i.e. Attribute-Based Access Control (ABAC)?

@Tmonster
Contributor

Hi @Youssef-Harby,

Unfortunately, we do not support encryption on a column-by-column basis yet.
Feel free to take a look at the docs for any other questions! https://duckdb.org/docs/data/parquet/encryption

@alexandroid11

Hello,

I'm running a very simple test on the .10 version
(downloaded the latest version with winget install DuckDB.cli):

select * from md;
┌────────┐
│   Id   │
│ int32  │
├────────┤
│ 0 rows │
└────────┘
insert into md (Id) values (1);
insert into md (Id) values (2);
select * from md;
┌───────┐
│  Id   │
│ int32 │
├───────┤
│     1 │
│     2 │
└───────┘
PRAGMA add_parquet_key('key256', '01234567891123450123456789112345');
┌─────────┐
│ Success │
│ boolean │
├─────────┤
│ 0 rows  │
└─────────┘
COPY md TO 'tbl.parquet' (ENCRYPTION_CONFIG {footer_key: 'key256'});

Error: Invalid Error: Unable to generate random data

Not sure what I'm doing wrong.

@szarnyasg
Collaborator

@alexandroid11 this works fine for me on macOS:

create table md (id int32);
select * from md;
insert into md (Id) values (1);
insert into md (Id) values (2);
select * from md;
PRAGMA add_parquet_key('key256', '01234567891123450123456789112345');
COPY md TO 'tbl.parquet' (ENCRYPTION_CONFIG {footer_key: 'key256'});
PRAGMA add_parquet_key('key256', '01234567891123450123456789112345');
FROM read_parquet('tbl.parquet', encryption_config = {footer_key: 'key256'});
┌───────┐
│  id   │
│ int32 │
├───────┤
│     1 │
│     2 │
└───────┘

@szarnyasg
Collaborator

If you can reproduce the issue on a fresh setup, please feel free to open a new issue!

Labels: Needs Documentation, Ready For Review
7 participants