This directory contains our TF implementation of Transformer-XL. Note that our state-of-the-art results reported in the paper were obtained by training the model on a large-scale TPU cluster, and our GPU codebase currently does not support distributed training. Here we provide two sets of hyperparameters and scripts:
- `*large_tpu.sh` are for the SoTA setting on TPUs. These are exactly the commands we used to obtain our best results.
- `*base_gpu.sh` are for the base models which can be run on a few GPUs.
- Python 2.7
- TensorFlow 1.12.0
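A minimal environment setup might look like the following sketch; it assumes a working `virtualenv` and a local Python 2.7 interpreter, and the environment name is just an example:

```bash
# Sketch of a Python 2.7 + TensorFlow 1.12.0 environment (names are illustrative)
virtualenv -p python2.7 txl-env
source txl-env/bin/activate
pip install tensorflow-gpu==1.12.0   # or tensorflow==1.12.0 for CPU-only experiments
```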
(a) Set your own `DATA_ROOT` in `sota/download.sh` (defaults to `./`), which will be the root directory of the downloaded models.

(b) Then, download the model & data with `bash sota/download.sh`. After downloading, the expected directory structure is as follows:
    pretrained_xl
      tf_enwik8/
        data/
          cache.pkl
          corpus-info.json
        model/
          checkpoint
          model.ckpt*
      tf_wt103/
        ...
      ...
Note: we include preprocessed data in the download files to make sure the same vocabulary is used. Please see the code `tf/data_utils.py` to understand the data structure.
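As a quick sanity check after downloading (assuming the default `DATA_ROOT=./`, so that `pretrained_xl/` sits in the current directory), the expected files can be listed directly:

```bash
# Verify that the pretrained enwik8 model and its preprocessed data are in place
ls ./pretrained_xl/tf_enwik8/data    # expect: cache.pkl  corpus-info.json
ls ./pretrained_xl/tf_enwik8/model   # expect: checkpoint  model.ckpt*
```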
To replicate the SoTA results, modify and run the evaluation scripts as follows:

- enwik8: modify the script `sota/enwik8.sh` accordingly (see below and the example after this list)
  - set `DATA_ROOT` to the same folder used in the download step (defaults to `./`)
  - set `TEST_NUM_CORE` (number of GPUs to use): we recommend 2 GPUs => about 60 mins
  - run the script: `bash sota/enwik8.sh`
- lm1b: modify the script `sota/lm1b.sh` accordingly (see below)
  - set `DATA_ROOT` to the same folder used in the download step (defaults to `./`)
  - set `TEST_NUM_CORE` (number of GPUs to use): we recommend 1 GPU => less than 5 mins
  - run the script: `bash sota/lm1b.sh`
- wt103: modify the script `sota/wt103.sh` accordingly (see below)
  - set `DATA_ROOT` to the same folder used in the download step (defaults to `./`)
  - set `TEST_NUM_CORE` (number of GPUs to use): we recommend 1 GPU => less than 5 mins
  - run the script: `bash sota/wt103.sh`
- text8: modify the script `sota/text8.sh` accordingly (see below)
  - set `DATA_ROOT` to the same folder used in the download step (defaults to `./`)
  - set `TEST_NUM_CORE` (number of GPUs to use): we recommend 2 GPUs => about 60 mins
  - run the script: `bash sota/text8.sh`
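For concreteness, a typical edit to `sota/enwik8.sh` might look like the sketch below. The exact variable layout inside the script may differ, so treat the values as illustrative:

```bash
# Inside sota/enwik8.sh (illustrative values)
DATA_ROOT=./        # the same folder used by sota/download.sh
TEST_NUM_CORE=2     # evaluate with 2 GPUs (about 60 mins for enwik8)
```

After saving the script, run `bash sota/enwik8.sh` as usual.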
We used 32, 32, 64, and 512 TPU cores for training our best models on enwik8, text8, wt103, and lm1b respectively. The training time for each model ranges from 2 to 5 days.
To train Transformer-XL from scratch, first download the raw data: `bash getdata.sh`

For `dataset` in `[enwik8, lm1b, wt103, text8]`:
- check out `scripts/dataset_base_gpu.sh` for GPU training and evaluation
- check out `scripts/dataset_large_tpu.sh` for TPU training and evaluation
NOTE: The preprocessing for GPU and TPU is different, so you have to run them separately.
GPU:
- create training and validation data: `bash scripts/dataset_base_gpu.sh train_data`
- create test data: `bash scripts/dataset_base_gpu.sh test_data`
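For example, substituting `dataset = enwik8` into the pattern above gives:

```bash
# Create enwik8 tfrecords for GPU training and evaluation
bash scripts/enwik8_base_gpu.sh train_data
bash scripts/enwik8_base_gpu.sh test_data
```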
TPU:
- Set the Google Storage URLs in `scripts/dataset_large_tpu.sh` (see the example after this list):
  - `GSDATA`: data URL
  - `GSEXP`: experiment URL
- create training and validation data: `bash scripts/dataset_large_tpu.sh train_data`
- create test data: `bash scripts/dataset_large_tpu.sh test_data`
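For instance, the two URLs could point at your own bucket; the bucket name below is a placeholder:

```bash
# Inside scripts/dataset_large_tpu.sh (placeholder bucket name)
GSDATA=gs://YOUR_BUCKET/transformer-xl/data   # data URL: where the tfrecords are stored
GSEXP=gs://YOUR_BUCKET/transformer-xl/exp     # experiment URL: where checkpoints go

# then create the TPU tfrecords
bash scripts/dataset_large_tpu.sh train_data
bash scripts/dataset_large_tpu.sh test_data
```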
Training base models on GPUs:
- Modify the configurations in `scripts/dataset_base_gpu.sh` according to your needs.
- Run `bash scripts/dataset_base_gpu.sh train`.
- If enough resources are available, increase the model sizes (e.g., `N_LAYER`, `D_MODEL`, `D_EMBED`, `D_HEAD`, `D_INNER`) so that they are closer to the values defined in `scripts/dataset_large_tpu.sh`. Likewise, when resources are limited, decrease the model sizes. It is recommended to ensure that `D_MODEL == D_EMBED` and `D_MODEL == N_HEAD x D_HEAD`. When the model sizes increase, remember to increase `warmup_steps` accordingly to alleviate optimization difficulties (see the example configuration after this list).
- Adjust the `NUM_CORE` parameter to reflect the number of GPUs to use.
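As a reference point, a set of model sizes that satisfies both constraints might look like this; the numbers are illustrative examples, not the shipped defaults:

```bash
# Illustrative model-size settings inside scripts/dataset_base_gpu.sh
N_LAYER=12
D_MODEL=512
D_EMBED=512      # keep D_MODEL == D_EMBED
N_HEAD=8
D_HEAD=64        # keep D_MODEL == N_HEAD x D_HEAD (8 x 64 = 512)
D_INNER=2048
NUM_CORE=4       # number of GPUs to use
```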
Training larger models on TPUs:
- Modify the configurations in `scripts/dataset_large_tpu.sh` according to your needs.
- Run `bash scripts/dataset_large_tpu.sh train`.
Evaluating base models on GPUs:
- `bash scripts/dataset_base_gpu.sh eval --eval_ckpt_path PATH_TO_CKPT`

Evaluating larger models on TPUs:
- `bash scripts/dataset_large_tpu.sh eval --eval_ckpt_path PATH_TO_CKPT`
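For example, evaluating a GPU-trained enwik8 base model might look like the following; the checkpoint path is hypothetical, so point `--eval_ckpt_path` at a checkpoint produced by your own training run:

```bash
# Substituting dataset = enwik8; replace the checkpoint path with your own
bash scripts/enwik8_base_gpu.sh eval --eval_ckpt_path EXP_DIR/enwik8/model.ckpt-400000
```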