Running DeepMind on the EENet cluster

Create a user account

First you need a public/private SSH key pair. If you don't have one, here are the instructions for creating one on Windows and Ubuntu.
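
For example, on Ubuntu a key pair can be generated with ssh-keygen (a minimal sketch; the key type and file name below are just the usual defaults, adjust as you like):

$ ssh-keygen -t rsa -f ~/.ssh/id_rsa

The public key to upload is then ~/.ssh/id_rsa.pub; on Windows, PuTTYgen produces an equivalent pair.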

Then go to https://taat.grid.ee/ and choose Log in via: TAAT in the top right corner. NB! Only people studying at Estonian universities can create this account. After logging in, you are at some point prompted for SSH keys; upload your public key.

Finally, try to log in to juur.grid.eenet.ee using SSH (e.g. PuTTY on Windows). NB! Your login name is probably something like hpc_yourfirstname.
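
Assuming the default key location, logging in from a Linux machine looks something like this (hpc_yourfirstname is just the example login name from above):

$ ssh hpc_yourfirstname@juur.grid.eenet.ee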

Install Virtual Python Environment builder

The Virtual Python Environment builder (virtualenv) allows us to install missing Python packages locally, without waiting for the cluster sysadmins.

Download the latest virtualenv source package from https://pypi.python.org/pypi/virtualenv#downloads and untar it:

$ wget https://pypi.python.org/packages/source/v/virtualenv/virtualenv-1.11.6.tar.gz
$ tar xzf virtualenv-1.11.6.tar.gz

Create a new virtual environment in ~/python:

$ cd virtualenv-1.11.6
$ python2.7 virtualenv.py ~/python
New python executable in /home/hpc_tambet/python/bin/python2.7
Also creating executable in /home/hpc_tambet/python/bin/python
Installing setuptools, pip...done.

NB! Make sure you use the python2.7 command, because the default Python on the cluster is 2.6.

Now activate the virtual environment with:

$ source ~/python/bin/activate

NB! You must use the source command; otherwise the environment variables defined in the script don't take effect in the current shell.
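
A quick sanity check is to see which python the shell now picks up; it should print a path under ~/python, e.g. something like /home/hpc_yourfirstname/python/bin/python (the exact path depends on your home directory):

$ which python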

Now install the required Python packages:

$ pip install numpy scipy pillow
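
To verify the packages were installed into the virtual environment, you can try importing them (a simple check, nothing cluster-specific; Pillow is imported as PIL):

$ python -c "import numpy, scipy, PIL; print 'OK'"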

Build libjpeg-turbo

The installed libjpeg-turbo package is missing some of the features needed by cuda-convnet2, so we need to compile it ourselves.

First download libjpeg-turbo from http://libjpeg-turbo.virtualgl.org/ and untar it:

$ wget http://skylink.dl.sourceforge.net/project/libjpeg-turbo/1.3.1/libjpeg-turbo-1.3.1.tar.gz
$ tar xzf libjpeg-turbo-1.3.1.tar.gz

Configure, make and install:

$ cd libjpeg-turbo-1.3.1
$ ./configure --prefix=$HOME/Libraries/libjpeg
$ make
$ make install
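
If the build succeeded, the libraries now live under the prefix given to configure; you should see libjpeg.so and libturbojpeg.so among the files (exact file names vary by version):

$ ls $HOME/Libraries/libjpeg/lib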

Build cuda-convnet2

cuda-convnet2 is the convolutional neural network implementation we are using.

First clone cuda-convnet2 code to a local directory:

$ git clone https://code.google.com/p/cuda-convnet2/
$ cd cuda-convnet2

Now you have to fix a few paths in the makefiles.

build.sh

Line 35:

export NUMPY_INCLUDE_PATH=/usr/lib64/python2.7/site-packages/numpy/core/include/numpy

Line 38:

export ATLAS_LIB_PATH=/usr/lib64/atlas
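
If you prefer editing from the command line, something like the following sed sketch would apply the two changes, assuming lines 35 and 38 of build.sh are plain export assignments of these variables (check the result afterwards):

$ sed -i 's|^export NUMPY_INCLUDE_PATH=.*|export NUMPY_INCLUDE_PATH=/usr/lib64/python2.7/site-packages/numpy/core/include/numpy|' build.sh
$ sed -i 's|^export ATLAS_LIB_PATH=.*|export ATLAS_LIB_PATH=/usr/lib64/atlas|' build.sh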

cudaconvnet/Makefile

You need to include your compiled libjpeg-turbo here.

Lines 72-73:

LDFLAGS   += -lpthread -ljpeg -lpython$(PYTHON_VERSION) -L../util -lutilpy -L../nvmatrix -lnvmatrix -L../cudaconv3 -lcudaconv -lcublas -Wl,-rpath=./util -Wl,-rpath=./nvmatrix -Wl,-rpath=./cudaconv3 -L$(HOME)/Libraries/libjpeg/lib
INCLUDES      := -I$(CUDA_INC_PATH) -I $(CUDA_SDK_PATH)/common/inc -I./include -I$(PYTHON_INCLUDE_PATH) -I$(NUMPY_INCLUDE_PATH) -I$(HOME)/Libraries/libjpeg/include

make-data/pyext/Makefile

The make-data.py script makes use of the OpenCV library, which is already installed on the cluster, but in a non-standard location.

Line 27:

LINK_LIBS := -L$(CUDA_INSTALL_PATH)/lib64 `pkg-config --libs python` -lpthread -L/usr/local/opencv/2.4.5/lib64 -lopencv_core -lopencv_ml -lopencv_imgproc -lopencv_highgui

Line 29:

INCLUDES += -I$(PYTHON_INCLUDE_PATH) -I/usr/local/opencv/2.4.5/include

Additionally, you need to change two lines in cudaconv3/src/weight_acts.cu to make it work with the 4 channels we use to represent 4 frames.

Line 2023:

assert(numGroups > 1 || (numImgColors > 0 && (numImgColors <= 4 || numImgColors % 16 == 0)));

Line 2059:

if (numFilterColors > 4) {

Finally, build cuda-convnet2 by running build.sh:

$ ./build.sh

Modify paths

To make those libraries discoverable at run time, you need to add them to your .bash_profile (or .bashrc):

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/opencv/2.4.5/lib64:$HOME/Libraries/libjpeg/lib:$HOME/cuda-convnet2/util:$HOME/cuda-convnet2/nvmatrix:$HOME/cuda-convnet2/cudaconv3
export PYTHONPATH=$HOME/cuda-convnet2

You need to log out and in again for these changes to take effect.
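
Alternatively, you can apply the changes to the current shell right away by sourcing the file (new login shells will still pick it up automatically):

$ source ~/.bash_profile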

Test cuda-convnet2

Generate batches from the ImageNet dataset (this step doesn't need a GPU):

$ python make-data.py --src-dir /storage/hpc_kristjan/ImageNet/ --tgt-dir /storage/hpc_tambet/cuda-convnet2

Run cuda-convnet2 on ImageNet batches using two Nvidia Tesla K20m GPUs:

$ srun --partition=long --gres=gpu:2 --constraint=K20 --mem=12000 python convnet.py --data-path /storage/hpc_kristjan/cuda-convnet2 --train-range 0-417 --test-range 1000-1016 --save-path /storage/hpc_tambet/tmp  --epochs 90 --layer-def layers/layers-imagenet-2gpu-data.cfg --layer-params layers/layer-params-imagenet-2gpu-data.cfg --data-provider image --inner-size 224 --gpu 0,1 --mini 128 --test-freq 201 --color-noise 0.1

srun is the Slurm command for running a job on some node of the cluster. Arguments:

  • --partition=long - run this job in the partition for "long" jobs. The default partition is "short", which is limited to 30 minutes.
  • --gres=gpu:2 - require two GPUs on the node.
  • --constraint=K20 - those GPUs must be Nvidia Tesla K20s.
  • --mem=12000 - the memory limit for the job is 12 GB.
  • convnet.py arguments are described here.
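
Assuming the standard Slurm client tools are available on the login node, you can keep an eye on your jobs and cancel one if needed:

$ squeue -u $USER
$ scancel JOBID

The first command lists your running and queued jobs; the second cancels a job using the id shown by squeue.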

A simpler and faster example that does only one epoch on one batch, using only one GPU:

$ srun --gres=gpu:1 --constraint=K20 --mem=12000 python convnet.py --data-path /storage/hpc_kristjan/cuda-convnet2 --train-range 0 --test-range 1000 --save-path /storage/hpc_tambet/tmp  --epochs 1 --layer-def layers/layers-imagenet-1gpu.cfg --layer-params layers/layer-params-imagenet-1gpu.cfg --data-provider image --inner-size 224 --gpu 0 --mini 128 --test-freq 201 --color-noise 0.1

Download and build DeepMind

Now we finally get to the DeepMind part. Clone the DeepMind repository to a local directory:

$ git clone https://github.com/kristjankorjus/Replicating-DeepMind.git
$ cd Replicating-DeepMind

DeepMind comes with the Arcade Learning Environment (ALE) included. First you have to turn off SDL and compile ALE.

To turn off SDL, change line 7 in libraries/ale/makefile:

USE_SDL     := 0

Then compile it:

$ cd libraries/ale
$ make

Congratulations, now you can finally run DeepMind!

$ cd src
$ srun --gres=gpu:1 --constraint=K20 python main.py --gpu 0 --save-path /tmp

Basically, main.py takes the same arguments as convnet.py above; some of them are irrelevant here and others have default values.