xray is a Python package for working with aligned sets of homogeneous, n-dimensional arrays. It implements flexible array operations and dataset manipulation for in-memory datasets within the Common Data Model widely used for self-describing scientific data (netCDF, OpenDAP, etc.).
Warning: xray is still in its early development phase. Expect the API to change.
- Extended array objects (XArray and DatasetArray) that are compatible with NumPy's ndarray and ufuncs but that keep ancillary variables and metadata intact.
- Flexible array broadcasting based on dimension names and coordinate indices.
- Lazily load arrays from netCDF files on disk or OpenDAP URLs.
- Flexible split-apply-combine functionality with the array groupby method (patterned after pandas).
- Fast label-based indexing and (limited) time-series functionality built on pandas.
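
To make "broadcasting based on dimension names" concrete, here is a minimal sketch written with plain NumPy. It does not use xray's API: `broadcast_by_name` is a hypothetical helper, and the dimension names `time` and `space` are made up for illustration. The idea is that arrays are lined up by the names of their dimensions rather than by axis position, so the axis order of the inputs stops mattering.

```python
import numpy as np


def broadcast_by_name(a, a_dims, b, b_dims):
    """Hypothetical helper (not part of xray): align two arrays by dimension name.

    Returns both arrays reshaped so that ordinary NumPy broadcasting applies,
    plus the list of dimension names of the result.
    """
    # Union of dimension names, keeping first-seen order.
    all_dims = list(a_dims) + [d for d in b_dims if d not in a_dims]

    def expand(arr, dims):
        dims = list(dims)
        # Dimensions this array actually has, in the order of the result.
        present = [d for d in all_dims if d in dims]
        # Reorder the existing axes to match that order.
        arr = np.transpose(arr, [dims.index(d) for d in present])
        sizes = dict(zip(present, arr.shape))
        # Insert length-1 axes for dimensions the array lacks.
        return arr.reshape([sizes.get(d, 1) for d in all_dims])

    return expand(a, a_dims), expand(b, b_dims), all_dims


# A 1-D array over "time" combined with a 1-D array over "space".
a = np.arange(3.0)
b = np.array([10.0, 20.0])
a2, b2, dims = broadcast_by_name(a, ("time",), b, ("space",))
total = a2 + b2  # 2-D result with dimensions ["time", "space"]

# The same data stored with different axis orders still lines up.
x = np.arange(6.0).reshape(3, 2)  # dimensions ("time", "space")
y = x.T                           # dimensions ("space", "time")
x2, y2, _ = broadcast_by_name(x, ("time", "space"), y, ("space", "time"))
```

With positional broadcasting, `x + y` above would raise a shape error; matching on names turns it into an ordinary element-wise sum.
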
- Provide a data analysis toolkit as fast and powerful as pandas but designed for working with datasets of aligned, homogeneous N-dimensional arrays.
- Whenever possible, build on top of and interoperate with pandas and the rest of the awesome scientific Python stack.
- Provide a uniform API for loading and saving scientific data in a variety of formats (including streaming data).
- Use metadata according to conventions when appropriate, but don't strictly enforce them. Conflicting attributes (e.g., units) should be silently dropped instead of causing errors. The onus is on the user to make sure that operations make sense.
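
The "silently drop conflicting attributes" goal can be sketched as a small merge function. This is an illustration only, not xray's implementation: `merge_attributes` is a hypothetical name, and keeping attributes that appear in only one input is an assumption, since the text above specifies only what happens on conflict.

```python
def merge_attributes(first, second):
    """Hypothetical helper: merge two attribute dicts, dropping conflicts.

    Assumption: an attribute found in only one input is kept. An attribute
    found in both is kept only if the two values are equal.
    """
    merged = {}
    keys = list(first) + [k for k in second if k not in first]
    for key in keys:
        if key in first and key in second and first[key] != second[key]:
            # Conflict (e.g., differing units): drop silently, do not raise.
            continue
        merged[key] = first[key] if key in first else second[key]
    return merged


temperature_attrs = {"units": "K", "long_name": "temperature"}
other_attrs = {"units": "degC", "source": "model"}
merged_attrs = merge_attributes(temperature_attrs, other_attrs)
```

Here `units` disagrees between the two inputs and is left out of the result, while the non-conflicting attributes survive. Whether the resulting operation is physically meaningful is left to the user, as the goal states.
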
- Iris (supported by the UK Met Office) is a similar package designed for working with geophysical datasets in Python. Iris provided much of the inspiration for xray (e.g., xray's DatasetArray is largely based on the Iris Cube), but it has several limitations that led us to build xray instead of extending Iris:
  - Iris has essentially one first-class object (the Cube) on which it attempts to build all functionality (Coord supports a much more limited set of functionality). xray has its equivalent of the Cube (the DatasetArray object), but it is only a thin wrapper on the more primitive building blocks of Dataset and Array objects.
  - Iris has a strict interpretation of CF conventions, which, although a principled choice, we have found to be impractical for everyday use. With Iris, every quantity has physical (SI) units, all coordinates have cell bounds, and all metadata (units, cell bounds and other attributes) is required to match before merging or performing operations on multiple cubes. This means that a lot of time with Iris is spent figuring out why cubes are incompatible and explicitly removing possibly conflicting metadata.
  - Iris can be slow and complex. Strictly interpreting metadata requires a lot of work and (in our experience) makes it difficult to build mental models of how Iris functions work. Moreover, it means that a lot of logic (e.g., constraint handling) uses non-vectorized operations. For example, extracting all times within a range can be surprisingly slow (e.g., 0.3 seconds vs 3 milliseconds in xray to select along a time dimension with 10000 elements).
- pandas is fast and powerful but oriented around working with tabular datasets. pandas has experimental N-dimensional panels, but they don't support aligned math with other objects. We believe the DatasetArray/Cube model is better suited to working with scientific datasets. We use pandas internally in xray to support fast indexing.
- netCDF4-python provides xray's primary interface for working with netCDF and OpenDAP datasets.
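
The difference between vectorized and non-vectorized selection along a time dimension, mentioned in the Iris comparison above, can be illustrated directly with pandas, which xray uses internally for indexing. This sketch uses only pandas and NumPy, not xray's own API, and it makes no attempt to reproduce the timings quoted above. The variable names and the date range are made up for the example.

```python
import numpy as np
import pandas as pd

# A time dimension with 10000 daily steps and one value per step.
times = pd.date_range("2000-01-01", periods=10000, freq="D")
values = pd.Series(np.arange(10000.0), index=times)

# Vectorized, label-based selection: the sorted DatetimeIndex lets pandas
# locate both endpoints without visiting every element. Label slices
# include both endpoints.
selected = values.loc["2005-01-01":"2005-12-31"]

# Non-vectorized equivalent: test each timestamp one by one in Python.
start, end = pd.Timestamp("2005-01-01"), pd.Timestamp("2005-12-31")
selected_loop = [v for t, v in zip(times, values.to_numpy()) if start <= t <= end]
```

Both approaches return the same values. The label-based slice avoids a Python-level loop over all 10000 timestamps, which is the kind of operation that gets slow when constraint handling is not vectorized.
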