Support gcs:// and r2:// URLs to read data from GCS and R2 #9388
Conversation
test/sql/copy/s3/url_encode.test
Outdated
Catalog Error: Table with name gcs://test-bucket/whatever does not exist!

statement error
SELECT * FROM 'r2://test-bucket/whatever';
I wonder whether these tests are also executed on Windows; if they are, specifying the full path (including slashes) in the expected error might break when tested on Windows.
@Tishj I think this should work on Windows? The slashes in HTTP URLs are / regardless of platform.
@chrisiou I have another comment on these tests, though. The idea behind this test was to confirm that the generated HTTP URLs are correct. For example, when using the settings from source ./scripts/set_s3_test_server_variables.sh, we should be able to test this:
SELECT * FROM 'gcs://test-bucket/whatever.parquet';
Error: IO Error: Unable to connect to URL "http://test-bucket.duckdb-minio.com:9000/whatever.parquet": 404 (Not Found)
Note that the endpoint is now set to duckdb-minio.com:9000, as defined by the env variable from the script. Now when I run the following:
D set s3_endpoint='s3.some.random.endpoint.com';
D SELECT * FROM 'gcs://test-bucket/whatever.parquet';
Error: IO Error: Connection error for HTTP HEAD to 'http://test-bucket.s3.some.random.endpoint.com/whatever.parquet'
You can see that the error message shows we are constructing the correct URLs here. Could you rewrite the test in such a way that it confirms we are using the right default endpoint with each prefix?
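As a rough illustration of what such an endpoint-default test might look like, here is a sketch (mine, not the PR's actual test). It assumes the sqllogictest runner matches the expected error substring given after the ---- separator, that gcs:// defaults to storage.googleapis.com (an assumption), and that r2:// falls back to s3.amazonaws.com as stated in the PR description at the end of this thread:

# no s3_endpoint is set, so the per-prefix default host must appear in the constructed URL
statement error
SELECT * FROM 'gcs://test-bucket/whatever.parquet';
----
storage.googleapis.com

statement error
SELECT * FROM 'r2://test-bucket/whatever.parquet';
----
s3.amazonaws.com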
@samansmink you're right, I'm thinking of local filesystem paths, my bad!
Hey @chrisiou, thanks for the PR, looks good! I have added 2 small comments
extension/httpfs/s3fs.cpp
Outdated
@@ -193,7 +193,13 @@ S3AuthParams S3AuthParams::ReadFrom(FileOpener *opener, FileOpenerInfo &info) {
	if (FileOpener::TryGetCurrentSetting(opener, "s3_endpoint", value, info)) {
		endpoint = value.ToString();
	} else {
		endpoint = "s3.amazonaws.com";
		if (StringUtil::StartsWith(info.file_path, "gcs://")) {
The idea of this change is that when a user queries something like:
SELECT * FROM 'gcs://test-bucket/some/file';
we can automatically set the endpoint, as you are doing here. However, I think this codepath is currently never hit, because TryGetCurrentSetting now always returns true, setting the endpoint to an empty string in this case. Could you rewrite this in such a way that when the s3_endpoint setting does not exist or is empty, we set the default based on the prefix?
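To make the requested flow concrete, here is a minimal sketch under stated assumptions: the helper name DefaultEndpointForPrefix is hypothetical (not DuckDB's API), and storage.googleapis.com as the gcs:// default is my assumption; keeping s3.amazonaws.com for everything else follows the PR description.

#include <string>

// Hypothetical helper: an explicit, non-empty s3_endpoint setting always wins;
// otherwise fall back to a default derived from the URL prefix.
static std::string DefaultEndpointForPrefix(const std::string &file_path, const std::string &configured_endpoint) {
	if (!configured_endpoint.empty()) {
		return configured_endpoint;
	}
	if (file_path.rfind("gcs://", 0) == 0) {
		return "storage.googleapis.com"; // assumed GCS default endpoint
	}
	// r2:// (and s3://) keep s3.amazonaws.com for now; r2.cloudflarestorage.com
	// also needs an account_id, which the PR defers to a follow-up
	return "s3.amazonaws.com";
}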
Branch updated from 0ec975c to e51c7dd.
Hey @chrisiou! Could you mark the PR as ready for review? This will trigger the CI to run.
@Mytherin This PR is good to go!
Thanks!
Merge pull request duckdb/duckdb#9164 from Mause/feature/jdbc-uuid-param
Merge pull request duckdb/duckdb#9185 from pdet/adbc_07
Merge pull request duckdb/duckdb#9126 from Maxxen/parquet-kv-metadata
Merge pull request duckdb/duckdb#9123 from lnkuiper/parquet_schema
Merge pull request duckdb/duckdb#9086 from lnkuiper/json_inconsistent_structure
Merge pull request duckdb/duckdb#8977 from Tishj/python_readcsv_multi_v2
Merge pull request duckdb/duckdb#9279 from hawkfish/nsdate-cast
Merge pull request duckdb/duckdb#8851 from taniabogatsch/binary_lambdas
Merge pull request duckdb/duckdb#8983 from Maxxen/types/fixedsizelist
Merge pull request duckdb/duckdb#9318 from Maxxen/fix-unused
Merge pull request duckdb/duckdb#9220 from hawkfish/exclude
Merge pull request duckdb/duckdb#9230 from Maxxen/json-plan-serialization
Merge pull request duckdb/duckdb#9011 from Tmonster/add_create_statement_support_to_fuzzer
Merge pull request duckdb/duckdb#9400 from Maxxen/array-fixes
Merge pull request duckdb/duckdb#8741 from Tishj/python_import_cache_upgrade
Merge fixes
Merge pull request duckdb/duckdb#9395 from taniabogatsch/lambda-performance
Merge pull request duckdb/duckdb#9427 from Tishj/python_table_support_replacement_scan
Merge pull request duckdb/duckdb#9516 from carlopi/fixformat
Merge pull request duckdb/duckdb#9485 from Maxxen/fix-parquet-serialization
Merge pull request duckdb/duckdb#9388 from chrisiou/issue217
Merge pull request duckdb/duckdb#9565 from Maxxen/fix-array-vector-sizes
Merge pull request duckdb/duckdb#9583 from carlopi/feature
Merge pull request duckdb/duckdb#8907 from cryoEncryp/new-list-functions
Merge pull request duckdb/duckdb#8642 from Virgiel/capi-streaming-arrow
Merge pull request duckdb/duckdb#8658 from Tishj/pytype_optional
Merge pull request duckdb/duckdb#9040 from Light-City/feature/set_mg
This PR adds two new schemes for the s3 filesystem, gcs:// and r2://, so that data can be read from Google Cloud Storage and Cloudflare R2; see the usage sketch after this list.
- The default region is us-east-1. Relevant links for: s3, gcs and r2.
- For r2://, s3.amazonaws.com is kept as the default endpoint because otherwise (for the r2.cloudflarestorage.com endpoint) an account_id is additionally needed (this will be addressed in another PR).
- url_encode.test is modified to additionally test gcs and r2 URLs in the same way as the s3 ones.
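For context, a brief usage sketch of the two new schemes (bucket and object names are placeholders; credentials, region, and endpoint are assumed to be configured through the existing s3_* settings):

-- read a Parquet file from Google Cloud Storage
SELECT * FROM 'gcs://my-bucket/path/to/file.parquet';

-- read from Cloudflare R2 (endpoint/account_id handling per the note above)
SELECT * FROM 'r2://my-bucket/path/to/file.parquet';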