Package to run Prefect with YAML configuration. For further details, please refer to the documentation.
Install this via pip (or your favourite package manager):
pip install prefect-yaml
Run the prefect-yaml command-line tool with the specified configuration file.
For example, the following YAML configuration is located in examples/simple_config.yaml.
metadata:
  output:
    directory: .output
task:
  task_a:
    caller: math:fabs
    parameters:
      - -9.0
    output:
      format: json
  task_b:
    caller: math:sqrt
    parameters:
      - !data task_a
    output:
      directory: null
  task_c:
    caller: math:fsum
    parameters:
      - [!data task_b, 1]
Run the following command to write all the task outputs to the directory .output under the current working directory.
prefect-yaml -c examples/simple_config.yaml
The output directory contains all the task outputs in the specified format.
% tree .output
.output
├── task_a.json
└── task_c.pickle
0 directories, 2 files
The expected behavior is to
- run task_a to dump the value fabs(-9.0) to the output directory in JSON format,
- run task_b to get the value sqrt(9.0) (from the output of task_a),
- run task_c to dump the value fsum([3.0, 1.0]) to the output directory in pickle format.
As the output directory in task_b is overridden as null, the output of task_b is passed to task_c in memory and is not written to disk. Also, the output format in task_c is not specified, so its output is dumped in the default format (pickle).
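For orientation, the three tasks amount to the following plain-Python computation. This is only a sketch of the arithmetic and of the two files that end up on disk; it does not use prefect-yaml or Prefect, and the helper name run_example as well as the exact way the files are written are assumptions made for illustration.

```python
import json
import math
import pickle
import tempfile
from pathlib import Path


def run_example(output_dir: Path) -> dict:
    """Plain-Python stand-in for the three tasks (hypothetical helper)."""
    # task_a: math.fabs(-9.0), written to the output directory as JSON.
    task_a = math.fabs(-9.0)
    (output_dir / "task_a.json").write_text(json.dumps(task_a))

    # task_b: math.sqrt of task_a's value; "directory: null" means it is
    # never written to disk and only handed on in memory.
    task_b = math.sqrt(task_a)

    # task_c: math.fsum([task_b, 1]), written in the default pickle format.
    task_c = math.fsum([task_b, 1])
    with open(output_dir / "task_c.pickle", "wb") as handle:
        pickle.dump(task_c, handle)

    return {"task_a": task_a, "task_b": task_b, "task_c": task_c}


# A temporary directory stands in for .output so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    results = run_example(Path(tmp))
    written = sorted(path.name for path in Path(tmp).iterdir())

print(results)  # {'task_a': 9.0, 'task_b': 3.0, 'task_c': 4.0}
print(written)  # ['task_a.json', 'task_c.pickle']
```

Only task_a and task_c appear in the directory listing, which matches the tree output shown earlier.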
For further details, please see the section configuration in the documentation.
The output section defines how a task writes and loads its return value. The output section in metadata applies to all tasks globally, while the output section in an individual task overrides the global parameters for that task.
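The override rule can be pictured as a dictionary merge in which task-level keys win over global ones. The sketch below is an illustration only: the function name effective_output is hypothetical, the shallow merge is an assumption, and it is not taken from the prefect-yaml source.

```python
from typing import Optional


def effective_output(global_output: dict, task_output: Optional[dict]) -> dict:
    """Combine global and task-level output settings (hypothetical helper)."""
    merged = dict(global_output)       # start from the metadata-level settings
    merged.update(task_output or {})   # task-level keys replace global ones
    return merged


# Global settings from the simple example, plus the default pickle format.
global_output = {"directory": ".output", "format": "pickle"}

task_a = effective_output(global_output, {"format": "json"})
task_b = effective_output(global_output, {"directory": None})
task_c = effective_output(global_output, None)  # no task-level output section

print(task_a)  # {'directory': '.output', 'format': 'json'}
print(task_b)  # {'directory': None, 'format': 'pickle'}
print(task_c)  # {'directory': '.output', 'format': 'pickle'}
```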
For further details, please see the documentation for parameter definitions in each section.
The built-in output formats are pickle (the default) and JSON, and users can also define their own output format.
For example, if you would like to use pandas to load and dump parquet files with the pyarrow engine by default, you can define the configuration as below.
metadata:
  format: parquet
  dump-caller: object.to_parquet
  dump-parameters:
    engine: pyarrow
  load-caller: pandas:read_parquet
  load-parameters:
    engine: pyarrow
All output parameters, such as the directory, dumper and loader, can be overridden at the task level. You can also specify which tasks are exported to the output directory, while the outputs of the others are only passed to downstream tasks in memory.
For further details, please see the output section in the documentation.
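Caller strings such as math:fabs and pandas:read_parquet have the form module:function. One way to turn such a string into a Python callable is sketched below with importlib. The helper resolve_caller is hypothetical and is not necessarily how prefect-yaml resolves callers internally; the object.to_parquet form used for dump-caller is not covered here.

```python
import importlib
import math


def resolve_caller(spec: str):
    """Resolve a 'module:function' string to a callable (hypothetical helper)."""
    module_name, _, attribute = spec.partition(":")
    if not module_name or not attribute:
        raise ValueError(f"expected 'module:function', got {spec!r}")
    module = importlib.import_module(module_name)  # e.g. the math module
    return getattr(module, attribute)              # e.g. math.fabs


fabs = resolve_caller("math:fabs")
value = fabs(-9.0)

print(fabs is math.fabs)  # True
print(value)              # 9.0
```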
The project is still under development. The basic features are mostly available, while the following features are coming soon:
- Multi-cloud storage support
- Subtask support in each task
Contributions at all levels are welcome. Please refer to the contributing section for development and release guidelines.