Running DeepMind in EENet cluster
First you need to have a private-public key pair. If you don't have one, here are the instructions for creating one in Windows and Ubuntu.
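On Linux or macOS, a key pair can be generated with ssh-keygen; a minimal sketch (the file name and comment e-mail are placeholders, adjust as you like):

```shell
# Generate a 4096-bit RSA key pair; -f sets the output file, -C adds a comment.
ssh-keygen -t rsa -b 4096 -C "you@example.com" -f ~/.ssh/id_rsa_eenet

# The file to upload later is the PUBLIC half, i.e. the .pub file:
cat ~/.ssh/id_rsa_eenet.pub
```

On Windows, PuTTYgen produces an equivalent key pair.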
Then go to https://taat.grid.ee/ and choose "Log in via: TAAT" in the top right corner. NB! Only people studying at Estonian universities can create this account. After logging in, you will at some point be prompted for SSH keys. Upload your public key.
Finally, try to log in to juur.grid.eenet.ee using SSH (e.g. PuTTY in Windows). NB! Your login name is probably something like hpc_yourfirstname.
The Virtual Python Environment builder (virtualenv) allows us to install missing Python packages locally, without waiting for the cluster's sysadmins.
Download the latest virtualenv source package from https://pypi.python.org/pypi/virtualenv#downloads and untar it:
$ wget https://pypi.python.org/packages/source/v/virtualenv/virtualenv-1.11.6.tar.gz
$ tar xzf virtualenv-1.11.6.tar.gz
Create a new virtual environment in ~/python:
$ cd virtualenv-1.11.6
$ python2.7 virtualenv.py ~/python
New python executable in /home/hpc_tambet/python/bin/python2.7
Also creating executable in /home/hpc_tambet/python/bin/python
Installing setuptools, pip...done.
NB! Make sure you use the python2.7 command, because the default is Python 2.6.
Now you activate the virtual environment with:
$ source ~/python/bin/activate
NB! You must use the source command, otherwise environment variables defined in the script don't take effect in the current shell.
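To see why, here is a quick illustration (the demo script path is arbitrary): a script executed normally runs in a subshell, so its exports vanish when it exits; sourcing it runs it in the current shell.

```shell
# A demo script that only exports a variable:
cat > /tmp/setvar.sh <<'EOF'
export DEMO_VAR=hello
EOF

bash /tmp/setvar.sh          # runs in a subshell, the export is lost
echo "${DEMO_VAR:-unset}"    # prints: unset

source /tmp/setvar.sh        # runs in the current shell
echo "$DEMO_VAR"             # prints: hello
```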
Now install required Python packages:
$ pip install numpy scipy pillow
The installed libjpeg-turbo package is missing some of the features needed by cuda-convnet2. That's why we need to compile it ourselves.
First download libjpeg-turbo from http://libjpeg-turbo.virtualgl.org/ and untar it:
$ wget http://skylink.dl.sourceforge.net/project/libjpeg-turbo/1.3.1/libjpeg-turbo-1.3.1.tar.gz
$ tar xzf libjpeg-turbo-1.3.1.tar.gz
Configure, make and install:
$ cd libjpeg-turbo-1.3.1
$ ./configure --prefix=$HOME/Libraries/libjpeg
$ make
$ make install
cuda-convnet2 is the convolutional neural network implementation we are using.
First clone cuda-convnet2 code to a local directory:
$ git clone https://code.google.com/p/cuda-convnet2/
$ cd cuda-convnet2
Now you have to fix a few paths in the makefiles.
build.sh
Line 35:
export NUMPY_INCLUDE_PATH=/usr/lib64/python2.7/site-packages/numpy/core/include/numpy
Line 38:
export ATLAS_LIB_PATH=/usr/lib64/atlas
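If numpy is installed somewhere else (for example inside the virtualenv), its include directory can be located programmatically instead of hard-coding it. Note that build.sh expects the numpy subdirectory beneath the path that numpy reports; a sketch:

```shell
# numpy.get_include() prints the parent include dir;
# build.sh wants the .../numpy directory below it.
NUMPY_BASE=$(python -c "import numpy; print(numpy.get_include())")
export NUMPY_INCLUDE_PATH=$NUMPY_BASE/numpy
echo $NUMPY_INCLUDE_PATH
```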
cudaconvnet/Makefile
You need to include your compiled libjpeg-turbo here.
Lines 72-73:
LDFLAGS += -lpthread -ljpeg -lpython$(PYTHON_VERSION) -L../util -lutilpy -L../nvmatrix -lnvmatrix -L../cudaconv3 -lcudaconv -lcublas -Wl,-rpath=./util -Wl,-rpath=./nvmatrix -Wl,-rpath=./cudaconv3 -L$(HOME)/Libraries/libjpeg/lib
INCLUDES := -I$(CUDA_INC_PATH) -I $(CUDA_SDK_PATH)/common/inc -I./include -I$(PYTHON_INCLUDE_PATH) -I$(NUMPY_INCLUDE_PATH) -I$(HOME)/Libraries/libjpeg/include
make-data/pyext/Makefile
The make-data.py script makes use of the OpenCV library, which is already installed on the cluster, but in a non-standard location.
Line 27:
LINK_LIBS := -L$(CUDA_INSTALL_PATH)/lib64 `pkg-config --libs python` -lpthread -L/usr/local/opencv/2.4.5/lib64 -lopencv_core -lopencv_ml -lopencv_imgproc -lopencv_highgui
Line 29:
INCLUDES += -I$(PYTHON_INCLUDE_PATH) -I/usr/local/opencv/2.4.5/include
Additionally, you need to change two lines in cudaconv3/src/weight_acts.cu to make it work with the 4 channels we use to represent 4 frames.
Line 2023:
assert(numGroups > 1 || (numImgColors > 0 && (numImgColors <= 4 || numImgColors % 16 == 0)));
Line 2059:
if (numFilterColors > 4) {
Finally, build cuda-convnet2 using build.sh:
$ ./build.sh
To make those libraries discoverable at run-time, you need to add them to .bash_profile (or .bashrc):
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/opencv/2.4.5/lib64:$HOME/Libraries/libjpeg/lib:$HOME/cuda-convnet2/util:$HOME/cuda-convnet2/nvmatrix:$HOME/cuda-convnet2/cudaconv3
export PYTHONPATH=$HOME/cuda-convnet2
You need to log out and in again for these changes to take effect.
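After logging back in, a quick sanity check (assuming the exports above are in your profile):

```shell
# Print each LD_LIBRARY_PATH entry on its own line for easier inspection:
echo "$LD_LIBRARY_PATH" | tr ':' '\n'

# PYTHONPATH should contain the cuda-convnet2 checkout:
echo "$PYTHONPATH"
```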
Generating batches from ImageNet dataset (doesn't need GPU):
$ python make-data.py --src-dir /storage/hpc_kristjan/ImageNet/ --tgt-dir /storage/hpc_tambet/cuda-convnet2
Run cuda-convnet2 on ImageNet batches using two Nvidia Tesla K20m GPUs:
$ srun --partition=long --gres=gpu:2 --constraint=K20 --mem=12000 python convnet.py --data-path /storage/hpc_kristjan/cuda-convnet2 --train-range 0-417 --test-range 1000-1016 --save-path /storage/hpc_tambet/tmp --epochs 90 --layer-def layers/layers-imagenet-2gpu-data.cfg --layer-params layers/layer-params-imagenet-2gpu-data.cfg --data-provider image --inner-size 224 --gpu 0,1 --mini 128 --test-freq 201 --color-noise 0.1
srun is the Slurm command to run a command on some node in the cluster. Arguments:

- --partition=long - run this job in the partition for "long" jobs. The default partition is "short", which is limited to 30 minutes.
- --gres=gpu:2 - require two GPUs from the node.
- --constraint=K20 - those GPUs must be Nvidia Tesla K20s.
- --mem=12000 - the memory limit for the job is 12GB.

convnet.py arguments are described here.
Simpler and faster example that does only one epoch on one batch, using only one GPU:
$ srun --gres=gpu:1 --constraint=K20 --mem=12000 python convnet.py --data-path /storage/hpc_kristjan/cuda-convnet2 --train-range 0 --test-range 1000 --save-path /storage/hpc_tambet/tmp --epochs 1 --layer-def layers/layers-imagenet-1gpu.cfg --layer-params layers/layer-params-imagenet-1gpu.cfg --data-provider image --inner-size 224 --gpu 0 --mini 128 --test-freq 201 --color-noise 0.1
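For long runs it may be more convenient to submit the job as a batch script with sbatch instead of holding an interactive srun session; a sketch with the same flags as the two-GPU example above (the script name is arbitrary):

```shell
cat > train.sbatch <<'EOF'
#!/bin/bash
#SBATCH --partition=long
#SBATCH --gres=gpu:2
#SBATCH --constraint=K20
#SBATCH --mem=12000
python convnet.py --data-path /storage/hpc_kristjan/cuda-convnet2 \
    --train-range 0-417 --test-range 1000-1016 \
    --save-path /storage/hpc_tambet/tmp --epochs 90 \
    --layer-def layers/layers-imagenet-2gpu-data.cfg \
    --layer-params layers/layer-params-imagenet-2gpu-data.cfg \
    --data-provider image --inner-size 224 --gpu 0,1 \
    --mini 128 --test-freq 201 --color-noise 0.1
EOF

sbatch train.sbatch   # output goes to slurm-<jobid>.out by default
```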
Now finally we get to the DeepMind part. Clone DeepMind repository to a local directory:
$ git clone https://github.com/kristjankorjus/Replicating-DeepMind.git
$ cd Replicating-DeepMind
DeepMind comes with the Arcade Learning Environment (ALE) included. First you have to turn off SDL and compile ALE.
To turn off SDL, change line 7 in libraries/ale/makefile:
USE_SDL := 0
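If you prefer not to edit the makefile by hand, the same change can be made with sed (assuming the line currently reads `USE_SDL := 1`):

```shell
sed -i 's/^USE_SDL *:= *1/USE_SDL := 0/' libraries/ale/makefile
grep '^USE_SDL' libraries/ale/makefile   # should now print: USE_SDL := 0
```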
Then compile it:
$ cd libraries/ale
$ make
Congratulations, now you can finally run DeepMind!
$ cd src
$ srun --gres=gpu:1 --constraint=K20 python main.py --gpu 0 --save-path /tmp
Basically, main.py takes the same arguments as convnet.py above, except that some of the arguments are irrelevant and others have default values.