Support gcs:// and r2:// URLs to read data from GCS and R2 #9388

chrisiou · 2023-10-18T15:16:07Z

This PR adds two new schemes for the s3 filesystem:

The default region of all services is changed to us-east-1. Relevant links for: s3, gcs and r2
The gcs and s3 services have their corresponding endpoints as defaults.
The r2 gets s3.amazonaws.com as default because otherwise (for ther2.cloudflarestorage.com endpoint), it needs additionally an account_id (this will be addressed in another PR)
The gcs and r2 prefixes have been added wherever needed.
The url_encode.test is modified to additionally test gcs and r2 like s3 url.

Tishj · 2023-10-18T16:49:03Z

test/sql/copy/s3/url_encode.test

+Catalog Error: Table with name gcs://test-bucket/whatever does not exist!
+
+statement error
+SELECT * FROM 'r2://test-bucket/whatever';


I wonder if these tests are also executed on Windows, if they are then specifying the full path (including slashes) in the expected error might break when tested on Windows

@Tishj I think this should work on windows? the slashes in http urls are / regardless of platform?

@chrisiou I have another comment with these tests though. The idea behind this test was to confirm that the generated http urls are correct. So for example when using the settings from source ./scripts/set_s3_test_server_variables.sh, when should be able to test this:

SELECT * FROM 'gcs://test-bucket/whatever.parquet'; Error: IO Error: Unable to connect to URL "http://test-bucket.duckdb-minio.com:9000/whatever.parquet": 404 (Not Found)

Note that the endpoint is now set to duckdb-minio.com:9000 as defined by the env variable from the script. Now when I run the following:

D set s3_endpoint='s3.some.random.endpoint.com'; D SELECT * FROM 'gcs://test-bucket/whatever.parquet'; Error: IO Error: Connection error for HTTP HEAD to 'http://test-bucket.s3.some.random.endpoint.com/whatever.parquet'

You can see that the error message can show us we're constructing the correct urls here. Could you rewrite the test in such a way that this test confirm we are using the right endpoint default with each prefix?

@samansmink you're right, I'm thinking of local filesystem paths, my bad!

samansmink

Hey @chrisiou, thanks for the PR, looks good! I have added 2 small comments

samansmink · 2023-10-19T08:43:11Z

extension/httpfs/s3fs.cpp

@@ -193,7 +193,13 @@ S3AuthParams S3AuthParams::ReadFrom(FileOpener *opener, FileOpenerInfo &info) {
 	if (FileOpener::TryGetCurrentSetting(opener, "s3_endpoint", value, info)) {
 		endpoint = value.ToString();
 	} else {
-		endpoint = "s3.amazonaws.com";
+		if (StringUtil::StartsWith(info.file_path, "gcs://")) {


The idea of this change is that when a user queries somethings like:

SELECT * FROM 'gcs://test-bucket/some/file';

We can automatically set the endpoint as you are doing here. However I think this codepath is currently never hit, because the TryGetCurrentSetting now always returns true, setting the endpoint to an empty string in this case. Could you rewrite this in such a way that when the s3_endpoint does not exist or is empty, we set the default based on the prefix?

samansmink · 2023-10-19T08:49:07Z

test/sql/copy/s3/url_encode.test

+Catalog Error: Table with name gcs://test-bucket/whatever does not exist!
+
+statement error
+SELECT * FROM 'r2://test-bucket/whatever';


@Tishj I think this should work on windows? the slashes in http urls are / regardless of platform?

@chrisiou I have another comment with these tests though. The idea behind this test was to confirm that the generated http urls are correct. So for example when using the settings from source ./scripts/set_s3_test_server_variables.sh, when should be able to test this:

SELECT * FROM 'gcs://test-bucket/whatever.parquet'; Error: IO Error: Unable to connect to URL "http://test-bucket.duckdb-minio.com:9000/whatever.parquet": 404 (Not Found)

Note that the endpoint is now set to duckdb-minio.com:9000 as defined by the env variable from the script. Now when I run the following:

D set s3_endpoint='s3.some.random.endpoint.com'; D SELECT * FROM 'gcs://test-bucket/whatever.parquet'; Error: IO Error: Connection error for HTTP HEAD to 'http://test-bucket.s3.some.random.endpoint.com/whatever.parquet'

You can see that the error message can show us we're constructing the correct urls here. Could you rewrite the test in such a way that this test confirm we are using the right endpoint default with each prefix?

extension/httpfs/s3fs.cpp

samansmink · 2023-10-26T07:57:25Z

Hey @chrisiou! Could you mark the PR as ready for review? this will trigger the CI to run

…ements to test the generated urls

samansmink · 2023-11-06T10:19:44Z

@Mytherin This PR is good to go!

Mytherin · 2023-11-06T12:18:02Z

Thanks!

Merge pull request duckdb/duckdb#9164 from Mause/feature/jdbc-uuid-param Merge pull request duckdb/duckdb#9185 from pdet/adbc_07 Merge pull request duckdb/duckdb#9126 from Maxxen/parquet-kv-metadata Merge pull request duckdb/duckdb#9123 from lnkuiper/parquet_schema Merge pull request duckdb/duckdb#9086 from lnkuiper/json_inconsistent_structure Merge pull request duckdb/duckdb#8977 from Tishj/python_readcsv_multi_v2 Merge pull request duckdb/duckdb#9279 from hawkfish/nsdate-cast Merge pull request duckdb/duckdb#8851 from taniabogatsch/binary_lambdas Merge pull request duckdb/duckdb#8983 from Maxxen/types/fixedsizelist Merge pull request duckdb/duckdb#9318 from Maxxen/fix-unused Merge pull request duckdb/duckdb#9220 from hawkfish/exclude Merge pull request duckdb/duckdb#9230 from Maxxen/json-plan-serialization Merge pull request duckdb/duckdb#9011 from Tmonster/add_create_statement_support_to_fuzzer Merge pull request duckdb/duckdb#9400 from Maxxen/array-fixes Merge pull request duckdb/duckdb#8741 from Tishj/python_import_cache_upgrade Merge fixes Merge pull request duckdb/duckdb#9395 from taniabogatsch/lambda-performance Merge pull request duckdb/duckdb#9427 from Tishj/python_table_support_replacement_scan Merge pull request duckdb/duckdb#9516 from carlopi/fixformat Merge pull request duckdb/duckdb#9485 from Maxxen/fix-parquet-serialization Merge pull request duckdb/duckdb#9388 from chrisiou/issue217 Merge pull request duckdb/duckdb#9565 from Maxxen/fix-array-vector-sizes Merge pull request duckdb/duckdb#9583 from carlopi/feature Merge pull request duckdb/duckdb#8907 from cryoEncryp/new-list-functions Merge pull request duckdb/duckdb#8642 from Virgiel/capi-streaming-arrow Merge pull request duckdb/duckdb#8658 from Tishj/pytype_optional Merge pull request duckdb/duckdb#9040 from Light-City/feature/set_mg

chrisiou marked this pull request as draft October 18, 2023 15:16

samansmink self-requested a review October 18, 2023 15:20

Tishj reviewed Oct 18, 2023

View reviewed changes

chrisiou marked this pull request as ready for review October 19, 2023 08:18

samansmink reviewed Oct 19, 2023

View reviewed changes

chrisiou closed this Oct 23, 2023

chrisiou force-pushed the issue217 branch from c33442f to 812c4a9 Compare October 23, 2023 09:08

samansmink reopened this Oct 23, 2023

github-actions bot marked this pull request as draft October 23, 2023 09:25

chrisiou force-pushed the issue217 branch 2 times, most recently from 0ec975c to e51c7dd Compare October 23, 2023 14:32

chrisiou marked this pull request as ready for review October 24, 2023 08:51

carlopi reviewed Oct 25, 2023

View reviewed changes

extension/httpfs/s3fs.cpp Outdated Show resolved Hide resolved

github-actions bot marked this pull request as draft October 25, 2023 13:47

chrisiou marked this pull request as ready for review October 26, 2023 08:18

chrisiou closed this Oct 26, 2023

chrisiou reopened this Oct 26, 2023

chrisiou added 4 commits November 3, 2023 13:25

Support gcs:// and r2:// URLs to read data from GCS and R2

2d11a28

code format changes

51b9221

Change the r2 default endpoint to be the same as s3 and add some stat…

12e9095

…ements to test the generated urls

Modify to use the explicit prefixes on file handling in S3FileSystem

b959f7a

chrisiou force-pushed the issue217 branch from d23882c to b959f7a Compare November 3, 2023 12:33

github-actions bot marked this pull request as draft November 3, 2023 12:34

chrisiou marked this pull request as ready for review November 3, 2023 12:34

Mytherin merged commit 66b583b into duckdb:feature Nov 6, 2023
45 checks passed

chrisiou deleted the issue217 branch November 6, 2023 13:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support gcs:// and r2:// URLs to read data from GCS and R2 #9388

Support gcs:// and r2:// URLs to read data from GCS and R2 #9388

chrisiou commented Oct 18, 2023 •

edited

Loading

Tishj Oct 18, 2023

samansmink Oct 19, 2023

Tishj Oct 19, 2023

samansmink left a comment

samansmink Oct 19, 2023

samansmink Oct 19, 2023

samansmink commented Oct 26, 2023

samansmink commented Nov 6, 2023

Mytherin commented Nov 6, 2023

Support gcs:// and r2:// URLs to read data from GCS and R2 #9388

Support gcs:// and r2:// URLs to read data from GCS and R2 #9388

Conversation

chrisiou commented Oct 18, 2023 • edited Loading

Tishj Oct 18, 2023

Choose a reason for hiding this comment

samansmink Oct 19, 2023

Choose a reason for hiding this comment

Tishj Oct 19, 2023

Choose a reason for hiding this comment

samansmink left a comment

Choose a reason for hiding this comment

samansmink Oct 19, 2023

Choose a reason for hiding this comment

samansmink Oct 19, 2023

Choose a reason for hiding this comment

samansmink commented Oct 26, 2023

samansmink commented Nov 6, 2023

Mytherin commented Nov 6, 2023

chrisiou commented Oct 18, 2023 •

edited

Loading