Our artifacts build on the Pollux (OSDI 2021) artifact release. The supplied artifacts contain scripts to reproduce both the simulator-based and the physical-cluster experiments. For the simulator-based experiments, use the code in `sia-simulator/`; scripts to reproduce the results in Table 3, along with instructions for running the simulator, can be found in `sia-simulator/README.md`.

Note: We implement Sia using AdaptDL. Specifically, we make minimal changes to Pollux's OSDI 2021 artifact release in the following files to support GPU heterogeneity, using adaptdl's mechanisms to communicate scheduler hints from the client side; these hints are then picked up by the Sia scheduler to realize heterogeneity-aware, goodput-optimized cluster scheduling. The Sia scheduler is referred to as `mip` in the codebase for the physical-cluster experiments.
- `run_workload.py`: adds support for the Sia scheduler under the temporary name `mip`.
- `adaptdl/adaptdl/`: added support for multiple GPU types in `env.py`, `scheduler_hints.py`, `_metrics.py`, and `torch/data.py`. Also added support for per-GPU-type batch-size limits in `torch/data.py` (the data loader for the adaptdl client).
- `adaptdl/adaptdl/torch/seed/`: 1-GPU profiles for each job on each GPU type, used to generate the batch-size limits and the initial bootstrap throughput models in Sia (see the illustrative sketch after this list).
- `benchmark/clear_state.sh`: added the ability to clear adaptdl state (including checkpoints). Note: you will have to modify this script to point to the k8s PVC that stores adaptdl state.
- `benchmark/models/*/Dockerfile`: upgraded the Docker images to PyTorch 1.9 + cu11.1 to support A100s. Also upgraded Ubuntu from 18.04 to 20.04.
- `benchmark/models/*/*.py`: modified launch scripts to hint the app type (cifar10, imagenet, etc.). Only the `mip`/Sia scheduler uses these hints; other schedulers (pollux, gavel) ignore them.
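For context, here is a minimal sketch of the kind of per-GPU-type limits the seed profiles yield; the dictionary layout, GPU-type names, and values are placeholders for illustration, not the artifact's actual format in `adaptdl/adaptdl/torch/seed/` or `torch/data.py`.

```python
# Illustrative only: the kind of per-GPU-type limits derived from the 1-GPU
# seed profiles and consumed by the modified data loader (torch/data.py).
# GPU-type names and numbers below are placeholders, not the real schema.
BATCH_SIZE_LIMITS = {
    "cifar10":  {"a100": 1024, "v100": 512, "k80": 128},
    "imagenet": {"a100": 256,  "v100": 128, "k80": 32},
}

def max_local_batch_size(app: str, gpu_type: str) -> int:
    """Per-GPU-type cap on a job's local batch size."""
    return BATCH_SIZE_LIMITS[app][gpu_type]
```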
To run a physical-cluster experiment with a given scheduler:

- Pick a scheduler (say `gavel`). Modify `sched/adaptdl_sched/allocator.py` to set the `POLICY` variable to the scheduler name, and set the scheduler round duration via `_sched_trigger_interval` in the `AdaptDLAllocator` class (a minimal sketch of this edit appears after this list).
- Run `make deploy` to deploy the scheduler and the adaptdl server to k8s.
- `cd benchmark` and run `python run_workload.py POLICY phoebe_workload.csv` to start the job-submission script.
- In another terminal, run `python run_monitor.py --output=JSON_LOG_FILE --verbose` to log job and scheduler state every 15 seconds. We use this `JSON_LOG_FILE` to generate the plots in the paper.
- Once the job-submission script finishes, run `kubectl get adaptdljob` to check whether all jobs completed successfully (the last column should say `Successful`, not `Failed`/`Waiting`).
- Average JCT is the mean of the job runtimes in the JSON log file (you can also look at `kubectl get adaptdljob` to see individual runtimes), and makespan is the time taken for all jobs to complete (see the log-parsing sketch after this list).
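Below is a minimal sketch of the scheduler-selection edit, assuming only the names mentioned above (`POLICY`, `_sched_trigger_interval`, `AdaptDLAllocator`); the surrounding code in `allocator.py` is not reproduced, and the 60-second round duration is just a placeholder.

```python
# sched/adaptdl_sched/allocator.py -- illustrative excerpt only.
POLICY = "gavel"  # set to the scheduler to run, e.g. "pollux", "gavel", or "mip" (Sia)

class AdaptDLAllocator:
    # Scheduler round duration in seconds (placeholder value): how often
    # the scheduler is triggered to recompute allocations.
    _sched_trigger_interval = 60
```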
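For convenience, here is one way to compute average JCT and makespan from the monitor output. The JSON schema assumed here (a list of snapshots, each with a `jobs` dict whose entries carry `submission_time` and `completion_time` in seconds) is purely illustrative and may not match what `run_monitor.py` actually writes, so adapt the field names to your log.

```python
import json

# Post-process the monitor log to get average JCT and makespan.
# NOTE: the field names below are assumptions for illustration -- inspect
# your JSON_LOG_FILE and adjust them to its actual schema.
with open("JSON_LOG_FILE") as f:
    snapshots = json.load(f)

jobs = snapshots[-1]["jobs"]  # final snapshot: every job's terminal state

runtimes = [j["completion_time"] - j["submission_time"] for j in jobs.values()]
avg_jct = sum(runtimes) / len(runtimes)

# Makespan: first submission to last completion across all jobs.
makespan = (max(j["completion_time"] for j in jobs.values())
            - min(j["submission_time"] for j in jobs.values()))

print(f"Average JCT: {avg_jct:.1f} s   Makespan: {makespan:.1f} s")
```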
To port the artifact to your own cluster, you will likely need to modify the following:

- `adaptdl/run_placement.py`: generates profiles for all jobs used in this artifact. You will need to modify this depending on your cluster size and the placements that need to be profiled (most likely just the number of GPUs per node and the number of nodes).
- `adaptdl/adaptdl/scheduler_hints.py`: `NODE_TO_CLUSTER_MAP` dictates the mapping between node names and GPU types. You will need to modify this to reflect your cluster's mapping (see the illustrative sketch after this list).
- `benchmark/models/*/Dockerfile`: change the Docker image to point to your Docker registry.
- `benchmark/models/yolov3/config/yolov3_config_voc.py`: change `DATA_PATH` to point to your dataset path.
- `benchmark/models/yolov3/utils/datasets.py`: check `img_path` and ensure that it matches your dataset path.
- `benchmark/pvc.cephfs.yaml`: configure the PVC as required. We use local-flash network-attached storage (the `hot` storage class in our k8s cluster).
- `benchmark/run_workload.py`: set `--repository` to point to your Docker repository, and check any mounts you might need to change (look for `mounts.append` in the script).
- `sched/adaptdl_sched/cluster_config.py`: modify the cluster configuration to reflect your cluster's GPU types and their counts, along with node names and the mapping of node names to node IDs.
- `sched/adaptdl_sched/policy/unaware_pollux.py`: if your cluster contains different GPU types with different numbers of GPUs per node for each type, follow the template in `optimize()` under `DEBUG_PHOEBE` to homogenize your cluster so that Unaware-Pollux sees a single number of GPUs per node (it is agnostic to GPU type anyway). This is not required for Sia, which handles heterogeneity natively.
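For orientation, here is a minimal sketch of the node-to-GPU-type mapping and cluster configuration described above. Aside from `NODE_TO_CLUSTER_MAP`, the variable names, node names, GPU types, and counts are assumptions for illustration; mirror the structure already present in the two files.

```python
# adaptdl/adaptdl/scheduler_hints.py -- node name -> GPU type
# (node names and GPU types below are placeholders for your cluster).
NODE_TO_CLUSTER_MAP = {
    "node-0": "a100",
    "node-1": "a100",
    "node-2": "v100",
    "node-3": "k80",
}

# sched/adaptdl_sched/cluster_config.py -- GPU types, per-node GPU counts, and
# node-name-to-node-ID mapping. These variable names are assumptions; keep
# whatever names the existing file uses.
GPU_TYPES = ["a100", "v100", "k80"]
GPUS_PER_NODE = {"a100": 8, "v100": 4, "k80": 8}
NODE_IDS = {name: idx for idx, name in enumerate(NODE_TO_CLUSTER_MAP)}
```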