Prefect YAML

A package to run Prefect with a YAML configuration. For further details, please refer to the documentation.

Installation

Install this via pip (or your favourite package manager):

pip install prefect-yaml

Usage

Run the prefect-yaml command-line tool with the specified configuration file.

For example, the following YAML configuration is located in examples/simple_config.yaml.

metadata:
  output:
    directory: .output

task:
  task_a:
    caller: math:fabs
    parameters:
      - -9.0
    output:
      format: json
  task_b:
    caller: math:sqrt
    parameters:
      - !data task_a
    output:
      directory: null
  task_c:
    caller: math:fsum
    parameters:
      - [!data task_b, 1]

Run the following command to write the task outputs to the directory .output under the current working directory.

prefect-yaml -c examples/simple_config.yaml

The output directory contains all the task outputs in the specified format.

% tree .output
.output
├── task_a.json
└── task_c.pickle

0 directories, 2 files

The expected behavior is to

  1. run task_a to dump the value fabs(-9.0) to the output directory in JSON format,
  2. run task_b to compute the value sqrt(9.0), taking its input from the output of task_a, and
  3. run task_c to dump the value fsum([3.0, 1.0]) to the output directory in pickle format.

Because the output directory of task_b is overridden with null, the output of task_b is not written to disk and is passed to task_c in memory. Also, because task_c does not specify an output format, its output is dumped in the default format (pickle).
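
To make the data flow concrete, the three tasks can be reproduced in plain Python. This is only a sketch: `resolve_caller` is a hypothetical helper that assumes a `module:function` caller string is split on the colon and imported, which is not necessarily how prefect-yaml resolves callers internally.

```python
import importlib


def resolve_caller(caller: str):
    """Hypothetical helper: turn a "module:function" string into a callable."""
    module_name, func_name = caller.split(":", 1)
    return getattr(importlib.import_module(module_name), func_name)


# Mirror the three tasks in examples/simple_config.yaml.
task_a = resolve_caller("math:fabs")(-9.0)         # fabs(-9.0)
task_b = resolve_caller("math:sqrt")(task_a)       # sqrt of task_a's output
task_c = resolve_caller("math:fsum")([task_b, 1])  # fsum([task_b, 1])

print(task_a, task_b, task_c)  # prints "9.0 3.0 4.0"
```

The sketch only reproduces the values. It does not write task_a.json or task_c.pickle, which is the part prefect-yaml handles according to the output settings.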

For further details, please see the configuration section in the documentation.

Configuration

The output section defines how a task writes and loads its return value. The output section in metadata applies to all tasks globally, while the output section in each task overrides the global parameters.
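
The override rule can be pictured as a dictionary merge in which task-level keys win over global ones. The sketch below assumes that behaviour, together with a built-in default format of pickle. `DEFAULTS` and `effective_output` are hypothetical names, not part of the prefect-yaml API.

```python
# Assumed built-in default: the README states that pickle is the default format.
DEFAULTS = {"format": "pickle"}


def effective_output(global_output: dict, task_output: dict) -> dict:
    """Hypothetical helper: later dictionaries override earlier ones."""
    return {**DEFAULTS, **global_output, **task_output}


# Global settings from the metadata section of examples/simple_config.yaml.
global_output = {"directory": ".output"}

task_a = effective_output(global_output, {"format": "json"})
task_b = effective_output(global_output, {"directory": None})
task_c = effective_output(global_output, {})

for name, output in [("task_a", task_a), ("task_b", task_b), ("task_c", task_c)]:
    print(name, output)
```

With the values from examples/simple_config.yaml, task_a resolves to JSON in .output, task_b has no directory and therefore stays in memory, and task_c falls back to pickle in .output. This matches the two files in the tree listing shown in the Usage section.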

For further details, please see the documentation for parameter definitions in each section.

Output

Two output formats are built in: pickle (the default) and JSON. Users can also define their own output format.

For example, if you would like to use pandas to load and dump parquet files with the pyarrow engine by default, you can define the configuration as follows.

metadata:
  format: parquet
  dump-caller: object.to_parquet
  dump-parameters:
    engine: pyarrow
  load-caller: pandas:read_parquet
  load-parameters:
    engine: pyarrow

All output parameters, such as the directory, dumpers, and loaders, can be overridden at the task level. You can also specify which tasks are exported to the output directory, while the outputs of the other tasks are only passed to downstream tasks in memory.

For further details, please see the output section in the documentation.

Roadmap

The project is still under development. Most of the basic features are available, and the following features are coming soon:

  • Multi-cloud storage support
  • Subtask support in each task

Contributing

Contributions at all levels are welcome. Please refer to the contributing section for development and release guidelines.