Bacon is a framework for orchestrating machine learning experiments on AWS. The stack consists of:
- Airflow service on ECS Fargate
- ECS autoscaling group on which to run Weights & Biases (W&B) hyperparameter sweeps
- API to trigger parallelized W&B sweep runs
Sweeps are run in parallel over eight workers per EC2 instance. Choose between `c5.9xlarge` (default) and `p3.8xlarge` instances for training.
Prerequisites:

- AWS account and user with appropriate permissions and credentials
- (Optional) Running On-Demand P instances vCPU quota >= 32 (if using the `p3.8xlarge` instance type)
- (Optional) An EC2 key pair with which to connect to the sweep task autoscaling group
- Node.js and npm (the present package was developed against versions `v14.18.1` and `v8.3.0`, respectively)
- Python `virtualenv` module and an accessible `python3.9` distribution
- An AWS Secrets Manager secret named `WandbApiTokenSecret` with a key-value pair `WandbApiKey: <key>`, where `<key>` is your W&B access token (one way to create this secret is sketched after this list)
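If you don't already have the secret, one way to create it is with the AWS CLI (a sketch; replace `<key>` with your actual W&B access token):

```sh
$ aws secretsmanager create-secret \
    --name WandbApiTokenSecret \
    --secret-string '{"WandbApiKey": "<key>"}'
```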
To set up a virtual environment and install Python dependencies:

```sh
$ make install
```
The Images stack provisions CodeBuild projects that build the requisite Docker images. The following command deploys the Images stack:

```sh
$ make deploy-images [env=<value>]
```
The default value for `env` is `staging`.
Note: To deploy a Bacon stack with `env=<env>`, you must first deploy the respective `bacon-<env>-images` stack.
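For example, to build images for a hypothetical `production` environment:

```sh
$ make deploy-images env=production
```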
The following command deploys the main Bacon stack:

```sh
$ make deploy [contextVar=<value>...]
```
where `contextVar` belongs to the following options:

- `env`: Stack environment -- suffixed to stack name (default: `staging`)
- `sweepTaskImageTag`: Tag of sweep task image (default: example UNet sub-module gitsha)
- `airflowImageTag`: Tag of Airflow image (default: present gitsha)
- `sweepTaskInstanceType`: Instance type for sweep task autoscaling group (default: `c5.9xlarge`)
- `numSweepTasks`: Number of sweep tasks to run (default: `8`)
- `maxNumInstances`: Max number of instances to run in autoscaling group (default: `1`)
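An example invocation that trains on GPU instances (the values here are illustrative):

```sh
$ make deploy env=production sweepTaskInstanceType=p3.8xlarge maxNumInstances=2
```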
To run a W&B hyperparameter sweep experiment:

- Navigate to the Airflow UI using the load balancer DNS CloudFormation output
- Trigger the `sweep_dag` with a sweep experiment config
- Monitor tasks on the `SweepCluster` ECS cluster and training data in the W&B sweep console, whose URL can be obtained from the `init_sweep` Airflow task logs
A sweep experiment config passed to the `sweep_dag` trigger should contain the following fields:

- `experiment_id` (string): ID for the sweep experiment
- `n_runs_per_task` (int, optional): Number of runs to conduct per task (default: 10)
- `sweep_config` (object): A W&B sweep config specification

NOTE: Leave the `sweep_config.command` field unset; it will be set by the `sweep_dag`.
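For illustration, a minimal config might look like the following (all names and values are hypothetical; the `sweep_config` body follows the standard W&B sweep configuration schema):

```json
{
  "experiment_id": "unet-lr-sweep-001",
  "n_runs_per_task": 10,
  "sweep_config": {
    "method": "random",
    "metric": { "name": "val_loss", "goal": "minimize" },
    "parameters": {
      "learning_rate": { "min": 0.0001, "max": 0.1 },
      "batch_size": { "values": [16, 32, 64] }
    }
  }
}
```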
`example/unet` is an example project that uses Bacon to orchestrate experiments.
is an example project that uses Bacon to orchestrate experiments.
It consists of:
- UNet implementation and training procedure
- Sweep experiment Docker image
- Example experiment configuration
The sweep experiment Docker image's entrypoint is a shell script that invokes the training procedure via the W&B sweep agent.
The sweep agent pulls run parameters from the W&B server for the sweep that was initialized by the `init_sweep` Airflow task.
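As a rough sketch (not the actual entrypoint script; `SWEEP_ID` and `N_RUNS_PER_TASK` are assumed here to arrive as environment variables), the pattern is:

```sh
#!/bin/sh
# Start a W&B sweep agent that pulls hyperparameters from the W&B server
# and invokes the training command for each run, exiting after the
# configured number of runs.
wandb agent --count "$N_RUNS_PER_TASK" "$SWEEP_ID"
```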
To run the example on your Bacon stack:
- Navigate to the sweep DAG in the Airflow console
- Trigger the DAG with `example/unet/exp/config.json` as the DAG `conf`
You can obtain the W&B sweep URL from the `init_sweep` Airflow task logs and then observe the experiment in the W&B console.
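If you have shell access to an Airflow environment, the same trigger can also be issued from the Airflow CLI rather than the UI (a sketch; your deployment may only expose the UI):

```sh
$ airflow dags trigger sweep_dag --conf "$(cat example/unet/exp/config.json)"
```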
The Makefile provides the following targets:

- `venv`: Set up a virtual environment; requires the `virtualenv` Python module installed, as well as an accessible Python 3.9 distribution
- `install`: Activate virtual environment and install Python dependencies
- `image-airflow`: Build Airflow service image
- `image-registrar`: Build registrar function image
- `deploy-images`: Deploy Images stack; see Deployment section
- `deploy`: Deploy Bacon stack; see Deployment section
- `test-dag`: Activate virtual environment and validate the `sweep_dag`
- `clean`: Clean the virtual environment and `dist` output
Upcoming features:
- CLI for triggering sweep experiments
- Auto-generated experiment IDs
- Model deployment support downstream of the sweep tasks