Combines
https://github.com/NVIDIA/Megatron-LM/commits/master (commit 0399d32c75b4719c89b91c18a173d05936112036)
and
https://github.com/nyu-mll/jiant/commits/master (commit 14d9e3d294b6cb4a29b70325b2b993d5926fe668)
to get a BERT repo from pretraining to evaluation.
To run pretraining on Slurm (from outside this directory):
srun --gres=gpu:1 -c 8 --mem=12G -p gpu bash Megatron-LM/scripts/pretrain_bert.sh > output.txt &
To run evaluation on Slurm (from outside this directory):
srun --gres=gpu:1 -c 8 --mem=12G -p gpu python3 -m Megatron-LM.evaluate.main --config_file test_bert.conf &
We officially support only Python 3.6.
To use this repo, please install the latest supported version of PyTorch with GPU support.
Additionally, part of this codebase uses tensorflow-cpu to load data from TFRecords. We recommend creating a virtual environment (to avoid breaking existing TensorFlow installations) and installing the packages listed in requirements.txt.
python -m pip install virtualenv
virtualenv bert_env
source bert_env/bin/activate
pip install -r requirements.txt
We've provided 4 scripts that pretrain BERT. All saved checkpoints can be used for finetuning according to existing implementations. Use the --save flag to save model checkpoints.
bash scripts/pretrain_bert.sh
This script runs single-GPU BERT pretraining and is mainly intended for debugging.
To use this script, provide your --train-data in loose JSON format, with one JSON object per line. The text of each JSON dictionary should be stored under the key given by --text-key.
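As a minimal sketch of the loose JSON layout described above, the following writes and reads a small corpus with one JSON object per line. The helper names `write_loose_json` and `read_loose_json` and the file name `corpus.json` are hypothetical and not part of this repo; the key `text` matches the `--text-key text` setting used in the script.

```python
import json
import tempfile
from pathlib import Path


def write_loose_json(docs, path, text_key="text"):
    """Write one JSON object per line, storing each document under text_key."""
    with open(path, "w", encoding="utf-8") as f:
        for doc in docs:
            # json.dumps escapes embedded newlines, so each record stays on one line.
            f.write(json.dumps({text_key: doc}) + "\n")


def read_loose_json(path, text_key="text"):
    """Yield the text field of each line-delimited JSON record."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)[text_key]


# Hypothetical two-document corpus written to a temporary directory.
docs = ["First document.", "Second document,\nwith a line break."]
path = Path(tempfile.mkdtemp()) / "corpus.json"
write_loose_json(docs, path)
```

Reading the file back with `list(read_loose_json(path))` should return the original documents, which is a quick way to check a dataset before pointing --train-data at it.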
python pretrain_bert.py \
--batch-size 4 \
--tokenizer-type BertWordPieceTokenizer \
--cache-dir temp_cache_dir \
--tokenizer-model-type bert-large-uncased \
--vocab-size 30522 \
--train-data wikipedia \
--presplit-sentences \
--loose-json \
--text-key text \
--split 1000,1,1 \
--lazy-loader \
--max-preds-per-seq 80 \
--seq-length 512 \
--max-position-embeddings 512 \
--num-layers 24 \
--hidden-size 1024 \
--intermediate-size 4096 \
--num-attention-heads 16 \
--hidden-dropout 0.1 \
--attention-dropout 0.1 \
--train-iters 1000000 \
--lr 0.0001 \
--lr-decay-style linear \
--lr-decay-iters 990000 \
--warmup .01 \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--fp16 \
--fp32-layernorm \
--fp32-embedding \
--hysteresis 2 \
--num-workers 2
bash scripts/pretrain_bert_distributed.sh
To use this script, follow the same data preparation procedure as in the earlier sections. This script uses the PyTorch distributed launcher to launch distributed training. Multinode training can therefore be achieved by setting the environment variables for the env:// init method appropriately. See the official PyTorch documentation for a description of these environment variables. By default, multinode training uses the nccl distributed backend.
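The environment variables read by the env:// init method are MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK. The sketch below shows how they could be derived for one process in a multinode job. The helper `env_init_vars`, the address, the port and the job shape are all hypothetical values chosen for illustration; in practice the PyTorch launcher sets most of these for you.

```python
import os


def env_init_vars(master_addr, master_port, nnodes, nproc_per_node, node_rank, local_rank):
    """Return the variables that init_method='env://' reads for one process."""
    world_size = nnodes * nproc_per_node
    # Global rank: processes are numbered node by node.
    rank = node_rank * nproc_per_node + local_rank
    return {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        "WORLD_SIZE": str(world_size),
        "RANK": str(rank),
    }


# Hypothetical job: 2 nodes with 8 GPUs each; this process is GPU 3 on node 1.
env = env_init_vars("10.0.0.1", 6000, nnodes=2, nproc_per_node=8, node_rank=1, local_rank=3)
os.environ.update(env)

# With these variables set, a training process would typically call:
#   torch.distributed.init_process_group(backend="nccl", init_method="env://")
```

MASTER_ADDR and MASTER_PORT must be identical on every node and point at the node hosting rank 0.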
bash scripts/pretrain_bert_tfrecords_distributed.sh
This script takes advantage of TensorFlow BERT's create_pretraining_data.py script to pre-cache the dataset in the TFRecord format. To convert the data to PyTorch tensors, we use a TFRecordDataset and TensorFlow eager mode to turn the TFRecords into NumPy matrices before loading them into PyTorch GPU tensors. This greatly reduces the overhead of data processing and speeds up training. Pass a whitespace-separated list of TFRecord paths to --train-data and enable the --use-tfrecords flag. Multinode training can be achieved as described in the previous section.
bash scripts/pretrain_bert_sentencepiece.sh
This script runs BERT pretraining with a sentencepiece tokenizer. If no sentencepiece tokenizer exists at --tokenizer-path, one will be trained automatically. The sentencepiece tokenizer can also be used with the previous scripts (NOTE: sentencepiece training can only happen during single-GPU pretraining). The file <--tokenizer-path>.vocab can be used with create_pretraining_data.py to make a TFRecord dataset with the given tokenization.
We recommend following the Wikipedia data extraction process specified by Google Research: "the recommended pre-processing is to download the latest dump, extract the text with WikiExtractor.py, and then apply any necessary cleanup to convert it into plain text."
We recommend using the --json argument when using WikiExtractor, which will dump the Wikipedia data into loose JSON format (one JSON object per line), making it more manageable and readily consumable by our codebase. We recommend further preprocessing this JSON dataset with nltk punctuation standardization and presplitting each document into newline-separated sentences. This can be done with the provided script ./scripts/presplit_sentences_json.py and will allow for faster data processing at training time. Pretraining with presplit data should be run with the --presplit-sentences flag, as shown above.
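To illustrate what presplitting does to a single record, here is a rough sketch that rewrites the text field as newline-separated sentences. It is not the provided presplit_sentences_json.py script: it swaps in a naive regular expression for the nltk-based processing, and the function name `presplit_record` is hypothetical.

```python
import json
import re

# Naive stand-in for a real sentence tokenizer (an assumption for this sketch):
# split on whitespace that follows '.', '!' or '?'.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def presplit_record(line, text_key="text"):
    """Rewrite one loose JSON line so its text holds one sentence per line."""
    record = json.loads(line)
    sentences = [s.strip() for s in _SENTENCE_END.split(record[text_key]) if s.strip()]
    record[text_key] = "\n".join(sentences)
    return json.dumps(record)


line = json.dumps({"text": "BERT is a language model. It is pretrained on text! Is it large?"})
out = json.loads(presplit_record(line))
```

A regular expression like this will mis-split abbreviations such as "e.g.", which is one reason to prefer a trained sentence tokenizer for real data.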
Once the JSON dataset is ready, make sure to set its path on line 27 of data_utils/corpora.py.
If your system is memory-limited, we also recommend running pretraining with the --lazy-loader argument, as we've done. After the dataset has been preprocessed once, this allows it to be loaded lazily from disk rather than stored in memory.
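The general idea behind lazy loading can be sketched as follows: scan the file once to record where each record starts, then seek to a record only when it is requested. This is an illustrative sketch under that assumption, not the implementation behind --lazy-loader; the class name `LazyJsonLines` and the sample file are hypothetical.

```python
import json
import tempfile
from pathlib import Path


class LazyJsonLines:
    """Index line offsets once, then read individual records from disk on demand."""

    def __init__(self, path, text_key="text"):
        self.path = path
        self.text_key = text_key
        self.offsets = []
        # One pass over the file: keep byte offsets only, never the text itself.
        with open(path, "rb") as f:
            while True:
                pos = f.tell()
                line = f.readline()
                if not line:
                    break
                if line.strip():
                    self.offsets.append(pos)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        # Seek straight to the requested record and decode just that line.
        with open(self.path, "rb") as f:
            f.seek(self.offsets[idx])
            return json.loads(f.readline().decode("utf-8"))[self.text_key]


# Hypothetical three-record corpus in loose JSON format.
path = Path(tempfile.mkdtemp()) / "corpus.json"
path.write_text(
    '{"text": "doc zero"}\n{"text": "doc one"}\n{"text": "doc two"}\n',
    encoding="utf-8",
)
dataset = LazyJsonLines(path)
```

Memory use then scales with the number of records (one integer offset each) rather than with the size of the corpus, at the cost of a disk read per access.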