Using MetPy to split up testing/training/validation xarray datasets for Machine Learning #3579

Open
@ThomasMGeo

Description

What should we add?

Creating testing/training/validation datasets is a key step in machine learning workflows. Usually for Climate/Weather ML analysis, we split these datasets on a time dimension.

Scikit-learn has a function that does this for 2D arrays and pandas DataFrames, but it can't split xarray datasets.

Improvements on the scikit-learn implementation:

  1. Built for xarray datasets
  2. Can create a validation dataset (a third dataset) in a single call instead of two separate splitting steps
  3. Can split datasets up in a useful way for time series analysis (do not split up datasets randomly for time series analysis!)
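To make the three points above concrete, here is a minimal sketch of what a chronological three-way split could look like. The function name `split_by_time`, its `fractions` and `dim` parameters, and the toy `t2m` dataset are all hypothetical and only for illustration; nothing like this exists in MetPy today. It uses plain xarray indexing and does no shuffling, unlike scikit-learn's `train_test_split` (presumably the function referred to above), which shuffles by default.

```python
import numpy as np
import pandas as pd
import xarray as xr


def split_by_time(ds, fractions=(0.7, 0.15, 0.15), dim="time"):
    """Hypothetical helper: chronological train/validation/test split.

    Assumes `dim` is a sortable coordinate of `ds` and that `fractions`
    (train, validation, test) sum to 1.
    """
    if not np.isclose(sum(fractions), 1.0):
        raise ValueError("fractions must sum to 1")

    ds = ds.sortby(dim)  # make sure samples are in chronological order
    n = ds.sizes[dim]
    train_end = int(round(n * fractions[0]))
    val_end = train_end + int(round(n * fractions[1]))

    # Contiguous blocks, no shuffling: earliest data trains, latest data tests
    train = ds.isel({dim: slice(0, train_end)})
    val = ds.isel({dim: slice(train_end, val_end)})
    test = ds.isel({dim: slice(val_end, None)})
    return train, val, test


# Toy dataset: 100 daily time steps on a 4-point latitude axis
times = pd.date_range("2020-01-01", periods=100, freq="D")
ds = xr.Dataset(
    {"t2m": (("time", "lat"), np.random.rand(100, 4))},
    coords={"time": times, "lat": [10.0, 20.0, 30.0, 40.0]},
)

train, val, test = split_by_time(ds)
print(train.sizes["time"], val.sizes["time"], test.sizes["time"])
```

Splitting on contiguous blocks keeps every validation and test sample later in time than the training data, which is the point of item 3. A real implementation would probably also want to accept explicit date boundaries rather than only fractions.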

Big questions:

  1. Where should this go?
  2. Can we use xarray.Dataset.metpy.parse_cf() in a smart way to pull the time dimension automagically? This might not be required anyway.
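On the second question, one possible fallback that does not depend on MetPy at all is to guess the time dimension from CF-style attributes or from the coordinate dtype. The sketch below is an assumption about how that could work, not an existing API: `find_time_dim` is a hypothetical name, and the rules it applies (an `axis` attribute of `T`, a `standard_name` of `time`, or a `datetime64` dtype) are a simplification of what CF parsing actually does. With MetPy, the identified time coordinate should instead be reachable through the accessor (for example `.metpy.time` on a DataArray) after parsing.

```python
import numpy as np
import pandas as pd
import xarray as xr


def find_time_dim(ds):
    """Hypothetical helper: guess the name of the time dimension of `ds`."""
    for name in ds.dims:
        if name not in ds.coords:
            continue  # dimension without a coordinate variable
        coord = ds.coords[name]
        # CF-style metadata first, then fall back to the dtype
        if coord.attrs.get("axis") == "T":
            return name
        if coord.attrs.get("standard_name") == "time":
            return name
        if np.issubdtype(coord.dtype, np.datetime64):
            return name
    raise ValueError("no time dimension found")


# Toy dataset whose time dimension is not literally called "time"
example = xr.Dataset(
    {"t2m": (("valid_time", "lat"), np.zeros((5, 2)))},
    coords={
        "valid_time": pd.date_range("2021-06-01", periods=5, freq="D"),
        "lat": [0.0, 1.0],
    },
)
time_dim = find_time_dim(example)
print(time_dim)
```

Letting the user pass the dimension name explicitly, with detection only as a default, would keep the splitting function usable on datasets without CF metadata.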

Reference

No response
