HertugHelms/xarray

xray: extended arrays for working with scientific datasets in Python

xray is a Python package for working with aligned sets of homogeneous, n-dimensional arrays. It implements flexible array operations and dataset manipulation for in-memory datasets within the Common Data Model widely used for self-describing scientific data (netCDF, OpenDAP, etc.).

Warning: xray is still in its early development phase. Expect the API to change.

Main Features

  • Extended array objects (XArray and DatasetArray) that are compatible with NumPy's ndarray and ufuncs but that keep ancillary variables and metadata intact.
  • Flexible array broadcasting based on dimension names and coordinate indices.
  • Lazily load arrays from netCDF files on disk or OpenDAP URLs.
  • Flexible split-apply-combine functionality with the array groupby method (patterned after pandas).
  • Fast label-based indexing and (limited) time-series functionality built on pandas.
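
To make broadcasting by dimension name concrete, here is a minimal sketch written against plain NumPy. It does not use xray's API: `broadcast_by_name` is a hypothetical helper, and the dimension names (`time`, `x`, `y`) are made up for illustration. The idea is that arrays are aligned by the names of their axes rather than by axis position.

```python
import numpy as np


def broadcast_by_name(a, a_dims, b, b_dims):
    """Align two arrays by dimension name so NumPy can broadcast them.

    Hypothetical helper for illustration only; not part of xray.
    """
    # Ordered union of dimension names: a's dims first, then new ones from b.
    out_dims = list(a_dims) + [d for d in b_dims if d not in a_dims]

    def expand(arr, dims):
        # Reorder the existing axes to follow the order of out_dims ...
        order = sorted(range(len(dims)), key=lambda i: out_dims.index(dims[i]))
        arr = np.transpose(arr, order)
        # ... then add length-1 axes for dimensions this array lacks.
        sizes = iter(arr.shape)
        full_shape = [next(sizes) if d in dims else 1 for d in out_dims]
        return arr.reshape(full_shape)

    return expand(a, a_dims), expand(b, b_dims), out_dims


# Same dimensions stored in a different axis order.
a = np.arange(12).reshape(3, 4)   # dims: ("time", "x")
b = np.arange(12).reshape(4, 3)   # dims: ("x", "time")
a2, b2, dims = broadcast_by_name(a, ("time", "x"), b, ("x", "time"))
total = a2 + b2                   # aligned by name, not by position

# A dimension that only one operand has becomes a new output axis.
c = np.ones(2)                    # dims: ("y",)
a3, c3, dims3 = broadcast_by_name(a, ("time", "x"), c, ("y",))
product = a3 * c3
```

With positional broadcasting, `a + b` would fail because the shapes `(3, 4)` and `(4, 3)` do not match; matching on names removes the need to transpose by hand.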

Design Goals

  • Provide a data analysis toolkit as fast and powerful as pandas but designed for working with datasets of aligned, homogeneous N-dimensional arrays.
  • Whenever possible, build on top of and interoperate with pandas and the rest of the awesome scientific Python stack.
  • Provide a uniform API for loading and saving scientific data in a variety of formats (including streaming data).
  • Use metadata according to conventions when appropriate, but don't strictly enforce them. Conflicting attributes (e.g., units) should be silently dropped instead of causing errors. The onus is on the user to make sure that operations make sense.
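
The last goal, dropping conflicting attributes instead of raising, can be sketched as follows. `merge_attributes` is a hypothetical helper, not xray's implementation, and it rests on one assumption the text above does not settle: an attribute present on only one operand is kept, and only attributes whose values disagree are dropped.

```python
def merge_attributes(*attr_dicts):
    """Combine attribute dicts, silently dropping keys whose values conflict.

    Hypothetical helper for illustration only; not part of xray.
    Assumes attribute values are plain Python objects comparable with `!=`.
    """
    merged = {}
    conflicting = set()
    for attrs in attr_dicts:
        for key, value in attrs.items():
            if key in conflicting:
                # Already known to conflict; never re-add it.
                continue
            if key in merged and merged[key] != value:
                # Disagreement between operands: drop the key, no error.
                del merged[key]
                conflicting.add(key)
            else:
                merged[key] = value
    return merged


kelvin = {"units": "K", "long_name": "air temperature"}
celsius = {"units": "degC", "long_name": "air temperature"}
result = merge_attributes(kelvin, celsius)
```

Here `units` disagrees between the two inputs and is dropped, while `long_name` survives. Whether adding a Kelvin array to a Celsius array makes sense is left to the user.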

Prior Art

  • Iris (supported by the UK Met Office) is a similar package designed for working with geophysical datasets in Python. Iris provided much of the inspiration for xray (e.g., xray's DatasetArray is largely based on the Iris Cube), but it has several limitations that led us to build xray instead of extending Iris:
    1. Iris has essentially one first-class object (the Cube) on which it attempts to build all functionality (Coord supports a much more limited set of functionality). xray has its equivalent of the Cube (the DatasetArray object), but it is only a thin wrapper on the more primitive building blocks of Dataset and Array objects.
    2. Iris has a strict interpretation of CF conventions, which, although a principled choice, we have found to be impractical for everyday use. With Iris, every quantity has physical (SI) units, all coordinates have cell-bounds, and all metadata (units, cell-bounds and other attributes) is required to match before merging or doing operations on multiple cubes. This means that a lot of time with Iris is spent figuring out why cubes are incompatible and explicitly removing possibly conflicting metadata.
    3. Iris can be slow and complex. Strictly interpreting metadata requires a lot of work and, in our experience, makes it difficult to build a mental model of how Iris functions work. Moreover, it means that a lot of logic (e.g., constraint handling) uses non-vectorized operations. For example, extracting all times within a range can be surprisingly slow (e.g., 0.3 seconds in Iris vs 3 milliseconds in xray to select along a time dimension with 10000 elements).
  • pandas is fast and powerful but oriented around working with tabular datasets. pandas has experimental N-dimensional panels, but they don't support aligned math with other objects. We believe the DatasetArray/Cube model is better suited to working with scientific datasets. We use pandas internally in xray to support fast indexing.
  • netCDF4-python provides xray's primary interface for working with netCDF and OpenDAP datasets.
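
The contrast between vectorized, label-based selection and element-by-element filtering mentioned above can be illustrated with pandas alone. This is a sketch under assumptions: it uses neither xray nor Iris, the hourly time axis and the date range are made up, and it does not reproduce the timings quoted above.

```python
import numpy as np
import pandas as pd

# A time axis with 10000 hourly steps, matching the size mentioned above.
times = pd.date_range("2000-01-01", periods=10000, freq=pd.Timedelta(hours=1))
values = np.arange(times.size, dtype=float)
series = pd.Series(values, index=times)

start = pd.Timestamp("2000-03-01 00:00")
stop = pd.Timestamp("2000-03-31 23:00")

# Vectorized, label-based selection: slicing a sorted index by label
# includes both endpoints and avoids a Python-level loop.
fast = series.loc[start:stop]

# Element-by-element equivalent, which checks every timestamp in Python.
slow = [v for t, v in zip(times, values) if start <= t <= stop]
```

Both approaches select the same elements; the label-based slice can locate the endpoints in the sorted index instead of testing each timestamp in turn.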
