This is a WIP! Please contribute back if you have ideas. You can track the associated issue here.
A machine learning training system can easily be modeled as a state machine.
While often, this is more of a DAG (E.G. query_data
-> feature_engineer
-> train_model
-> evaluate_model
), there
are some interesting reasons this as a state machine.
If, rather than treating train
as a action, you treat train_epoch
as an action, you can utilize the abstractions
burr represents to get the following benefits:
You can store the current params + best model (or pointers to them). Then go back to the failure point. This allows you to recover from failure in a generic way.
You can have a condition that checks training completeness based on metric history, which is different from the training logic itself. This allows you to easily swap in and out termination conditions (testing them against each other or running different ones in different scenarios), as well as easy reuse.
Similar to hooks in pytorch lightning, you can use Burr hooks to log metrics, visualize, etc... at each step. This allows you to have a more granular view of training that updates live (obviuosly depending on the UI/model-monitoring you're using.)
While some models are trained in a single-shot and shipped to production, many require human input. Burr can be used to express training, then checkpoint/pause the state while a human is evaluating it, and have their input (e.g. go/no-go) passed in as an input parameter.
Note that this still requires a scheduling tool (say a task executor that runs until the next human input is needed), but that task executor does not need to be complicated (all it needs to do is run a job when prompted, and possibly on a chron schedule).
Similar to epoch training, hyperparameter training can be modeled as a state machine. The system makes a decision about what to look for next based on the prior results. Furthermore, one can model a step that handles multiple jobs, keeping them alive/ launching new ones as they complete (based on state).
For a sketch of the latter see application.py & .