Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow fsspec URLs in open_(mf)dataset #4823

Merged
merged 24 commits into from
Feb 16, 2021
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
Only implement for zarr
  • Loading branch information
Martin Durant committed Jan 28, 2021
commit c40ce4e036a3674c4dbb196de67d3126f7156934
26 changes: 12 additions & 14 deletions xarray/backends/api.py
Original file line number Diff line number Diff line change
Expand Up @@ -871,26 +871,24 @@ def open_mfdataset(
.. [2] http://xarray.pydata.org/en/stable/dask.html#chunking-and-performance
"""
if isinstance(paths, str):
if is_remote_uri(paths):
from fsspec import open_files
from fsspec.core import get_fs_token_paths
if is_remote_uri(paths) and engine == "zarr":
try:
from fsspec.core import get_fs_token_paths
except ImportError as e:
raise ImportError(
"The use of remote URLs for opening zarr requires the package fsspec"
) from e

so = kwargs.get("backend_kwargs", {}).get("storage_options", {})
# zarr requires mappers, other backends may work with file-like objects
# get_fs_token_paths also allows arguments embedded in URLs
fs, _, _ = get_fs_token_paths(
paths,
mode="rb",
storage_options=so,
storage_options=kwargs.get("backend_kwargs", {}).get(
"storage_options", {}
),
expand=False,
)
if engine == "zarr":
paths = fs.glob(fs._strip_protocol(paths))
paths = [fs.get_mapper(path) for path in paths]
else:
if not paths.startswith("http"):
# pass through HTTP unchanged for backward compatibility
paths = open_files(paths, mode="rb", **so)
paths = fs.glob(fs._strip_protocol(paths)) # finds directories
paths = [fs.get_mapper(path) for path in paths]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit tricky. This assumes the backend is to want a mapper object (as the zarr backend does). But, what if the glob returns a list of netcdf files? Wouldn't we want a list of file objects?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, this is my comment for "should we actually special case zarr". It could make files - for now it would just error. We don't have tests for this, though, but now might be the time to start.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now tracking with the comments above, I think we have two options:

  1. inject some backend specific logic in the api here to make a decision about what sort of object to return (if engine=='zarr', return mapper; else return file_obj)
  2. only support globbing remote zarr stores

(1) seems to be the more reasonable thing to do here but is slightly less principled as we've been working to cleanly separate the api from the backends.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose a third alternative might be to pass the paths through, and create mappers in the zarr backend (will re-instantiate the FS, but that's fine) and add the opening of remote files into each of the other backends that can handle it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have essentially done 1), but excluded HTTP for the non-zarr path, because it has a special place for some backends (dap...). In any case, I don't suppose anyone is using globbing with http, since it's generally unreliable.

else:
paths = sorted(glob(_normalize_path(paths)))
else:
Expand Down
4 changes: 3 additions & 1 deletion xarray/tests/test_backends.py
Original file line number Diff line number Diff line change
Expand Up @@ -3046,9 +3046,11 @@ def test_open_mfdataset(self):
def test_open_mfdataset_no_files(self):
pytest.importorskip("aiobotocore")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this means this test is always skipped. Should this be added to some of the environments?


# glob is attempted as of #4823, but finds no files
with raises_regex(OSError, "no files"):
# glob is attempted as of #4823, but finds no files
open_mfdataset("http://some/remote/uri")
with raises_regex(OSError, "no files"):
open_mfdataset("http://some/remote/uri", engine="zarr")

def test_open_mfdataset_2d(self):
original = Dataset({"foo": (["x", "y"], np.random.randn(10, 8))})
Expand Down