This is the code repository for the deep learning job scheduling paper titled 'Liquid: Intelligent Resource Requirement Estimation and Scheduling for Deep Learning Jobs on Distributed GPU Clusters'.
The project is based on Docker.
- OS: CentOS Linux release 7.6.1810
- Nvidia Driver 410.129
- CUDA 10.0
- Docker 19.03
- Nvidia-docker 2.2.2
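To compare a host against the versions listed above, a small shell helper can be used. This is only a sketch: it assumes the listed versions are treated as minimums (the list itself does not say so), that `sort -V` is available, and the function names `version_ge` and `check_docker` are hypothetical.

```shell
# version_ge A B: succeed when version A is greater than or equal to version B.
# Relies on `sort -V` (version sort), which GNU coreutils provides.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_docker MIN: succeed when the Docker daemon version is at least MIN.
# `docker version --format '{{.Server.Version}}'` prints the daemon version.
check_docker() {
  installed="$(docker version --format '{{.Server.Version}}' 2>/dev/null)" || return 1
  version_ge "$installed" "$1"
}

# Example (not run here):
#   check_docker 19.03 && echo "Docker version OK"
```

The same `version_ge` helper can be applied to the driver or CUDA version strings once they have been read from the host.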
# on master node
docker swarm init
# Add other nodes to the cluster
docker swarm join --token A-LONG-TOKEN-STRING-HERE 192.168.0.1:2377
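# Remove a worker node from the cluster (run on that node)
docker swarm leave
# Remove a manager node from the cluster (--force is required on a manager)
docker swarm leave --force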
docker network create --driver overlay --attachable yao-net
# docker network create --driver overlay --attachable --opt encrypted yao-net
Note: if the containers cannot communicate across nodes, try removing the --opt encrypted option.
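Running the network creation command a second time fails because the network already exists. A minimal sketch of an idempotent wrapper is shown below; the function name `ensure_network` is hypothetical and not part of this repository.

```shell
# ensure_network NAME: create an attachable overlay network unless it already exists.
# `docker network inspect` exits non-zero when the network is not found.
ensure_network() {
  name="$1"
  if docker network inspect "$name" >/dev/null 2>&1; then
    echo "network $name already exists"
  else
    docker network create --driver overlay --attachable "$name"
  fi
}

# Example (run on a swarm manager node, not run here):
#   ensure_network yao-net
```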
Liquid-docs/sbin/run_hdfs.sh
Liquid-docs/sbin/run_glusterfs.sh
Liquid-docs/sbin/run_agent_helper.sh
Liquid-docs/sbin/run_agent.sh
Liquid-docs/sbin/start_agent_master.sh
Liquid-docs/sbin/start_mysql.sh
Liquid-docs/sbin/run_optimizer.sh
Liquid-docs/sbin/start_scheduler.sh
Liquid-docs/sbin/start_redis.sh
Liquid-docs/sbin/start_portal.sh
Liquid-docs/sbin/start_gitea.sh
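The scripts listed above can also be run from a small wrapper that stops at the first failure. This is a sketch under assumptions: it treats the listed order as the start order (the list does not state this), it assumes the scripts are executable, and the function name `run_all` is hypothetical.

```shell
# run_all DIR SCRIPT...: run each script from DIR in the given order.
# Stops and returns non-zero as soon as one script fails.
run_all() {
  dir="$1"
  shift
  for script in "$@"; do
    echo "==> $script"
    "$dir/$script" || { echo "failed: $script" >&2; return 1; }
  done
}

# Example (not run here), using the first scripts from the list above:
#   run_all Liquid-docs/sbin run_hdfs.sh run_glusterfs.sh run_agent_helper.sh
```

Stopping at the first failure keeps later services from starting against a dependency that did not come up.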
Visit http://YOUR_IP/install.php