Parquet Encryption #9392
I checked locally, and it compiles with duckdb-wasm (I was unsure about the dependencies, and we should set up a proper CI job to check this). Yay!
Didn't look at the embedded libraries. Will also most likely look at this again after reading the Parquet Modular Encryption
}

if (transport_remaining != 0) {
    throw InvalidInputException("Encoded ciphertext length differs from actual ciphertext length");
What does this error mean exactly? What is the difference between an encoded ciphertext and the actual ciphertext? I'm guessing it has something to do with encoding the text in base64 for transport?
The length of the ciphertext (in bytes) is encoded right before the actual ciphertext. If, for some reason, the actual length differs from the encoded length, something is definitely wrong, and we have to throw an error.
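The check described above can be sketched in isolation. This is a minimal illustration, not DuckDB's implementation: the helper name `ReadLengthPrefixed` is hypothetical, and the 4-byte little-endian prefix is an assumption based on the Parquet Modular Encryption layout.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not DuckDB's API): validates a buffer laid out as
// [4-byte little-endian length][payload] and returns a copy of the payload.
std::vector<uint8_t> ReadLengthPrefixed(const std::vector<uint8_t> &buffer) {
    constexpr std::size_t kLengthBytes = 4; // assumed prefix width
    if (buffer.size() < kLengthBytes) {
        throw std::invalid_argument("Buffer too small to hold a length prefix");
    }
    // Decode the length that was written in front of the ciphertext.
    uint32_t encoded_length = 0;
    for (std::size_t i = 0; i < kLengthBytes; i++) {
        encoded_length |= static_cast<uint32_t>(buffer[i]) << (8 * i);
    }
    // Compare it against the number of bytes that actually follow the prefix.
    const std::size_t actual_length = buffer.size() - kLengthBytes;
    if (encoded_length != actual_length) {
        throw std::invalid_argument("Encoded ciphertext length differs from actual ciphertext length");
    }
    return std::vector<uint8_t>(buffer.begin() + kLengthBytes, buffer.end());
}
```

A truncated or padded buffer fails the comparison before any decryption is attempted, which is why the mismatch is treated as a hard error.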
@@ -395,7 +422,10 @@ void ParquetReader::InitializeSchema() {
    auto file_meta_data = GetFileMetadata();

    if (file_meta_data->__isset.encryption_algorithm) {
        throw FormatException("Encrypted Parquet files are not supported");
        if (file_meta_data->encryption_algorithm.__isset.AES_GCM_CTR_V1) {
Not 100% sure how parquet reading works, so this question might be totally wrong. It looks like you do this check for the footer as well. Is it possible to encrypt a parquet file footer with one encryption standard but the rest of the file with another?
Funky stuff like that is possible, but we only support the basics. If another algorithm is used, we throw an error here. If other parts are encrypted with a different algo, decryption will fail, and we will definitely throw an error elsewhere.
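The "reject anything but the basics" approach can be sketched as follows. This is a rough illustration only: the enum and the function `RequireSupportedCipher` are hypothetical, and treating AES_GCM_V1 as the accepted algorithm is an assumption inferred from the diff, not something confirmed in this thread.

```cpp
#include <stdexcept>

// The two algorithms named by the Parquet Modular Encryption specification.
enum class ParquetCipher { AES_GCM_V1, AES_GCM_CTR_V1 };

// Hypothetical check: accept one algorithm and fail early on anything else.
// Which algorithm is accepted here is an assumption made for illustration.
void RequireSupportedCipher(ParquetCipher cipher) {
    if (cipher != ParquetCipher::AES_GCM_V1) {
        throw std::invalid_argument("Unsupported Parquet encryption algorithm");
    }
}
```

Failing at schema initialization gives a clear message up front; mixed-algorithm files that slip past this check would still fail later, during decryption.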
Thanks!
Perhaps this is not the right place to ask, but I'm unsure. Can encryption be applied on a column-by-column basis, such that specific information is accessible only with a specific encryption key? I'm thinking of something like attribute-based authorization, i.e., Attribute-Based Access Control (ABAC).
Hi @Youssef-Harby, unfortunately we do not support encryption on a column-by-column basis yet.
Hello, I'm doing a very simple test on the .10 version:
select * from md;
Error: Invalid Error: Unable to generate random data
Not sure what I'm doing wrong.
@alexandroid11 this works fine for me on macOS:
create table md (id int32);
select * from md;
insert into md (Id) values (1);
insert into md (Id) values (2);
select * from md;
PRAGMA add_parquet_key('key256', '01234567891123450123456789112345');
COPY md TO 'tbl.parquet' (ENCRYPTION_CONFIG {footer_key: 'key256'});
PRAGMA add_parquet_key('key256', '01234567891123450123456789112345');
FROM read_parquet('tbl.parquet', encryption_config = {footer_key: 'key256'});
If you can reproduce the issue on a fresh setup, please feel free to open a new issue!
This PR implements encryption for Parquet files as an experimental feature. We mostly follow the Parquet Modular Encryption specification, save for some details. For now, this means that our Parquet encryption is not compatible with that of, e.g., PyArrow, until the missing details are implemented.
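Named encryption keys of 128, 192, or 256 bits can be added to a session; they are stored in memory. Using the statement from the example earlier in this thread:
PRAGMA add_parquet_key('key256', '01234567891123450123456789112345');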
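The key-size constraint can be sketched as a simple length check. The helper name `IsValidAesKeyLength` is hypothetical, and the sketch assumes the key is supplied as raw bytes, one byte per character; it is not taken from the DuckDB source.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if the raw key is 128, 192, or 256 bits long.
// Assumes one byte per character, i.e. the key is not hex- or base64-encoded.
bool IsValidAesKeyLength(const std::string &key) {
    const std::size_t bits = key.size() * 8;
    return bits == 128 || bits == 192 || bits == 256;
}
```

Under that assumption, the 32-character key `01234567891123450123456789112345` used in this thread is a 256-bit key, which matches its name, `key256`.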
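Files are then encrypted like so (again using the example from earlier in this thread):
COPY md TO 'tbl.parquet' (ENCRYPTION_CONFIG {footer_key: 'key256'});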
For now, we encrypt the footer and all columns using the footer_key. The Parquet specification allows encryption of individual columns with different keys, but specifying per-column keys will cause an error to be thrown (for now!).
The encrypted file can then be read like so:
FROM read_parquet('tbl.parquet', encryption_config = {footer_key: 'key256'});
Encryption has some performance implications, of course. Without encryption, reading/writing the lineitem table from TPC-H at SF1, which is 6M rows and 15 columns, from/to a Parquet file takes 0.26s and 0.99s, respectively. With encryption, this takes 0.64s and 2.21s, i.e., roughly 2.5x slower for reading and 2.2x slower for writing.