ruGPT3-(Small, Medium, Large, XL)

This repository contains a set of autoregressive transformer language models trained on a large Russian-language dataset.

The Russian GPT-3 models (ruGPT3XL, ruGPT3Large, ruGPT3Medium, ruGPT3Small) were trained with a sequence length of 2048 and use sparse and dense attention blocks. We also provide a Russian GPT-2 large model (ruGPT2Large) trained with a sequence length of 1024.

We suggest using ruGPT2Large or ruGPT3XL because these models are well tested and achieve the best perplexity.

Usage examples are described in detail here.

Note: If you cannot download a checkpoint, try adding it to your Google Drive as described in this issue.

Setup

Setup ruGPT3XL

Details of setting up the XL model are described on a separate page here.

Setup ruGPT3Large

This model reuses code from the Microsoft fork of Megatron-LM. It supports Python 3.6 only.

To use this repo, please install the latest version of PyTorch with CUDA support.

This codebase also uses tensorflow-cpu to (optionally) load TFRecords data for GPT training. We recommend creating a virtual environment (to avoid breaking existing TensorFlow installations) and installing the packages from our requirements.txt.

python -m pip install virtualenv
virtualenv gpt_env
source gpt_env/bin/activate
pip install -r requirements.txt

To use sparse attention blocks, you should additionally install torch-blocksparse:

source gpt_env/bin/activate
pip install torch-blocksparse

Torch-Blocksparse depends on CUDA 10.1 and the Triton language compiler, which requires llvm-9.

Setup ruGPT3Medium

For this model you can use the Megatron-LM code in our repo or the transformers interface. Accordingly, follow the setup instructions for ruGPT2Large or ruGPT3Large.

Setup ruGPT3Small

For this model you can use the Microsoft Megatron-LM code in our repo or the transformers interface. Accordingly, follow the setup instructions for ruGPT2Large or ruGPT3Large.

Setup ruGPT2Large

This model is smaller and was trained with transformers==v2.8.0. To install, run:

pip install transformers

Pretraining

All pretraining was done on Nvidia Tesla V100-SXM3 32 GB GPUs on the Christofari cluster. The details of pretraining for each model follow.

Pretraining ruGPT3XL

The model was trained with a sequence length of 512 using Deepspeed and Megatron code by the SberDevices team, on an 80B-token dataset for 4 epochs. After that, the model was finetuned for 1 epoch with a sequence length of 2048.
Note: the model has sparse attention blocks.

Total training time was around 10 days on 256 GPUs.
Final perplexity on test set is 12.05.
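
For readers comparing the perplexity figures reported in this section, here is a minimal sketch of how perplexity is commonly computed. It assumes the usual definition (the exponential of the mean per-token negative log-likelihood, using natural logarithms); this README does not state the exact evaluation protocol, and the function name `perplexity` is hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities.

    Assumption: perplexity = exp(mean negative log-likelihood per token).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every token is as uncertain
# as a uniform choice among 4 tokens.
example = perplexity([math.log(0.25)] * 4)
```

Under this definition, lower values mean the model assigns higher probability to the test text.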

🤗HuggingFace model card link.

See more details here.

Pretraining ruGPT3Large

The model was trained with a sequence length of 1024 using the transformers library by the SberDevices team on 80B tokens for 3 epochs. After that, the model was finetuned for 1 epoch with a sequence length of 2048. Note: to load a transformers checkpoint, use --load-openai.

Total training time was around 14 days on 128 GPUs for the 1024 context and a few days on 16 GPUs for the 2048 context.
Final perplexity on test set is 13.6.

You can obtain this model from GDrive, Yandex.Disk, or GDrive option-2, or use transformers with the model name sberbank-ai/rugpt3large_based_on_gpt2 (see usage for details).

🤗HuggingFace model card link

Pretraining ruGPT3Medium

The model was trained with a sequence length of 1024 using the transformers library by the SberDevices team on 80B tokens for 3 epochs. After that, the model was finetuned on the 2048 context.

Total training time was around 16 days on 64 GPUs.
Final perplexity on test set is 17.4.

You can obtain this model from GDrive, Yandex.Disk, or GDrive option-2, or use transformers with the model name sberbank-ai/rugpt3medium_based_on_gpt2 (see usage for details).

🤗HuggingFace model card link

Pretraining ruGPT3Small

The model was trained with a sequence length of 1024 using transformers by the SberDevices team on 80B tokens for around 3 epochs. After that, the model was finetuned on the 2048 context.

Total training time was around one week on 32 GPUs.

You can obtain this model from GDrive or use transformers with the model name sberbank-ai/rugpt3small_based_on_gpt2 (see usage for details).

🤗HuggingFace model card link

Pretraining ruGPT2Large

The model was trained with a sequence length of 1024 using transformers by the SberDevices team on 170 GB of data, on 64 GPUs for 3 weeks.

You can obtain this model from GDrive, Yandex.Disk, or GDrive option-2, or use transformers with the model name sberbank-ai/rugpt2large (see usage for details).

🤗HuggingFace model card link
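
To compare the training budgets quoted above, the approximate durations and GPU counts can be converted to GPU-days. This is only back-of-the-envelope arithmetic on the rounded figures from this section: it assumes "one week" means 7 days and "3 weeks" means 21 days, and it leaves out the ruGPT3Large 2048-context stage because its duration is given only as "a few days". The names `RUNS` and `gpu_days` are hypothetical.

```python
# (approximate days, number of GPUs), taken from the pretraining notes above.
RUNS = {
    "ruGPT3XL": (10, 256),
    "ruGPT3Large (1024-context stage only)": (14, 128),
    "ruGPT3Medium": (16, 64),
    "ruGPT3Small": (7, 32),      # assumption: "one week" = 7 days
    "ruGPT2Large": (21, 64),     # assumption: "3 weeks" = 21 days
}


def gpu_days(days, gpus):
    """Rough compute budget: wall-clock days multiplied by GPU count."""
    return days * gpus


budget = {name: gpu_days(days, gpus) for name, (days, gpus) in RUNS.items()}

for name, value in budget.items():
    print(f"{name}: ~{value} GPU-days")
```

These numbers are rough, since the source durations are themselves stated as "around" values.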

Usage

Usage ruGPT3XL

See all the details here or run the example in Google Colab.

Usage ruGPT3Large

We provide two scripts for the ruGPT3Large model: one for pretraining and one for generation. Model checkpoints are saved and loaded with --save and --load.

Finetuning

Data preparation

We support three file formats for training, but all of them require preprocessing. First, place your training data in a loose JSON format, with one JSON object containing a text sample per line. For example:

{"src": "KISH", "text": "Как же джокер ты хитер", "type": "Ru", "id": "0", "title": "First Part"}
{"src": "The Internet", "text": "Ты удачи приговор", "type": "Ru", "id": "42", "title": "Second Part"}

The name of the JSON text field can be changed with the --text-key flag. The other metadata is optional and is not used in training.
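
The loose JSON layout described above can be produced with the standard library alone. The following is a minimal sketch, not part of the repository: the helper names `write_loose_json` and `read_loose_json` are hypothetical, and the default field name `"text"` is taken from the example lines above.

```python
import json


def write_loose_json(samples, path, text_key="text"):
    """Write one JSON object per line; every sample must carry the text field."""
    with open(path, "w", encoding="utf-8") as f:
        for sample in samples:
            if text_key not in sample:
                raise KeyError(f"sample is missing the {text_key!r} field")
            # ensure_ascii=False keeps Cyrillic text readable in the file.
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")


def read_loose_json(path, text_key="text"):
    """Read the file back and return only the text of each sample."""
    texts = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                texts.append(json.loads(line)[text_key])
    return texts
```

For example, `write_loose_json([{"text": "Как же джокер ты хитер", "id": "0"}], "train.jsonl")` writes a one-line file in this layout; extra metadata fields are preserved but optional.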

Running script

bash ./scripts/pretrain_ruGPT3Large.sh

This script runs ruGPT3Large pretraining on a single GPU. The script contains the commands for running on Christofari:

MP_SIZE=1
NUM_GPUS_PER_WORKER=1

mpirun --np ${NUM_GPUS_PER_WORKER} python pretrain_megatron.py \
       --train-data /home/jovyan/data/train.jsonl \
       --valid-data /home/jovyan/data/valid.jsonl \
       --test-data /home/jovyan/data/valid.jsonl \
       --save /home/jovyan/ruGPT3Large/checkpoints_${now}_${host} \
       --load /home/jovyan/ruGPT3Large \
       --tensorboard-dir /home/jovyan/ruGPT3Large/runs_${now}_${host} \
       --save-interval 500 \
       --eval-interval 500 \
       --log-interval 100 \
       --model-parallel-size ${MP_SIZE} \
       --num-layers 24 \
       --hidden-size 1536 \
       --num-attention-heads 16 \
       --seq-length 2048 \
       --max-position-embeddings 2048 \
       --vocab-size 50257 \
       --batch-size 1 \
       --train-iters 200000 \
       --distributed-backend nccl \
       --lr 0.00015 \
       --lr-decay-style cosine \
       --weight-decay 1e-2 \
       --clip-grad 1.0 \
       --warmup .01 \
       --fp16 \
       --lazy-loader \
       --checkpoint-activations \
       --loose-json \
       --text-key text \
       --tokenizer-path /home/jovyan/ruGPT3Large \
       --tokenizer-type GPT2BPETokenizer \
       --finetune

Or you can use the transformers interface:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("sberbank-ai/rugpt3large_based_on_gpt2")

model = AutoModel.from_pretrained("sberbank-ai/rugpt3large_based_on_gpt2")
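
As a sanity check on the architecture flags passed to pretrain_megatron.py above (--num-layers 24, --hidden-size 1536, --vocab-size 50257, --max-position-embeddings 2048), the model size can be estimated with a common rule of thumb. This sketch assumes a standard GPT-2 style block (MLP inner dimension of 4 times the hidden size), output embeddings tied to the input embeddings, and it ignores biases and layer-norm weights; the function name `approx_gpt_params` is hypothetical.

```python
def approx_gpt_params(num_layers, hidden_size, vocab_size, max_positions):
    """Rough parameter count for a GPT-2 style decoder.

    Per block: about 4*h^2 for attention (Q, K, V and output projections)
    plus about 8*h^2 for the MLP, assuming an inner dimension of 4*h.
    Biases and layer-norm parameters are ignored.
    """
    per_layer = 12 * hidden_size ** 2
    embeddings = (vocab_size + max_positions) * hidden_size
    return num_layers * per_layer + embeddings


n_params = approx_gpt_params(
    num_layers=24, hidden_size=1536, vocab_size=50257, max_positions=2048
)
print(f"~{n_params / 1e6:.0f}M parameters")
```

Under these assumptions the estimate comes out at roughly 760 million parameters.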

Text Generation

bash ./scripts/generate_ruGPT3Large.sh

Starts an interactive terminal session that generates text either conditionally or unconditionally depending on what the user enters into the prompt.
The script is capable of top-K and top-P sampling as specified by the appropriate variables within the script.
Example of generation:

Context: на словах ты лев толстой
ruGPT3Large: а в сущности, - ты тоже не дурак, просто так же, как и твой человек, то есть твоя "жизнь", а также как и ты думаешь по-настоящему "ты" и есть твои "жизнь" или "выбор" в отношении твоего положения.

Context: как же джокер ты хитер
ruGPT3Large: или автор книги по бизнесу!

Example of generation in Google Colab
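
The top-K and top-P (nucleus) sampling mentioned above can be illustrated in a few lines of plain Python. This is a generic sketch of the technique, not the code used by the generation scripts: the function names and the order of operations (top-K first, then top-P on the renormalized distribution) are assumptions made for illustration.

```python
import math
import random


def top_k_top_p_filter(logits, top_k=0, top_p=1.0):
    """Return {token_id: probability} after top-K and top-P filtering."""
    # Numerically stable softmax over the raw logits.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Token ids sorted from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-K: keep only the K most likely tokens (0 disables the filter).
    if top_k > 0:
        order = order[:top_k]
    mass = sum(probs[i] for i in order)

    # Top-P: keep the smallest prefix whose cumulative probability,
    # measured on the renormalized distribution, reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break

    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


def sample_token(logits, top_k=0, top_p=1.0, rng=random):
    """Draw one token id from the filtered distribution."""
    dist = top_k_top_p_filter(logits, top_k, top_p)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]
```

Smaller `top_k` or `top_p` values restrict sampling to the most likely tokens, which tends to give more conservative text; `top_k=1` reduces to greedy decoding.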

Usage ruGPT3Medium

You can run the Megatron script with the --load-openai option or use the transformers interface:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("sberbank-ai/rugpt3medium_based_on_gpt2")

model = AutoModel.from_pretrained("sberbank-ai/rugpt3medium_based_on_gpt2")

Text Generation

bash ./scripts/generate_ruGPT3Medium.sh

Starts an interactive terminal session that generates text either conditionally or unconditionally depending on what the user enters into the prompt.
The script is capable of top-K and top-P sampling as specified by the appropriate variables within the script.
Example of generation:

Context >>> На словах ты Лев Толстой, а на деле
ruGPT: На словах ты Лев Толстой, а на деле я — Лев Давидович Троцкий, — сказал я. — Так что мы еще посмотрим

Context: как же джокер ты хитер
ruGPT: как же джокер ты хитер, в этой игре
 - Я не злодей, просто хотел узнать, можно ли узнать о чём?

Usage ruGPT3Small

You can run the Megatron script with the --load-openai option or use the transformers interface:

from transformers import AutoTokenizer, AutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained("sberbank-ai/rugpt3small_based_on_gpt2")

model = AutoModelWithLMHead.from_pretrained("sberbank-ai/rugpt3small_based_on_gpt2")

Text Generation

bash ./scripts/generate_ruGPT3Small.sh

Starts an interactive terminal session that generates text either conditionally or unconditionally depending on what the user enters into the prompt.
The script is capable of top-K and top-P sampling as specified by the appropriate variables within the script.
Example of generation:

Context >>> На словах ты Лев Толстой, а на деле
ruGPT: На словах ты Лев Толстой, а на деле – Толстой, – с улыбкой заметил Николай, – я вижу, что ты прав.

– А вот это – другое дело, – сказал Лев Толстой, – это дело другое.

– Да, да, – согласился Николай, – я прав.

– А вот что, Лев Николаевич, – сказал Лев Толстой, – я думаю, что в этом отношении у меня нет оснований сомневаться в твоей правоте.

Example of finetuning on essays and generation in Google Colab

Usage ruGPT2Large

We provide two scripts, based on the original transformers code, for pretraining and generation with ruGPT2Large.

Finetuning

Data preparation

The model accepts raw text files as input.

Running script

bash ./scripts/pretrain_ruGPT2Large.sh

This script runs ruGPT2Large pretraining on a single GPU. The script contains the command for running on Christofari:

python pretrain_transformers.py \
    --output_dir=/home/jovyan/rugpt2large/checkpoints_"${now}"_"${host}" \
    --model_type=gpt2 \
    --model_name_or_path=/home/jovyan/gpt2_large_bbpe_v50 \
    --do_train \
    --train_data_file=/home/jovyan/data/train.txt \
    --do_eval \
    --eval_data_file=/home/jovyan/data/valid.txt \
    --fp16

Or use the transformers interface:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("sberbank-ai/rugpt2large")

model = AutoModel.from_pretrained("sberbank-ai/rugpt2large")

Text Generation

bash ./scripts/generate_ruGPT2Large.sh

Starts an interactive terminal session that generates text either conditionally or unconditionally depending on what the user enters into the prompt.
The script is capable of top-K and top-P sampling as specified by the appropriate variables within the script.
Example of generation:

Context: На словах ты Лев Толстой, а на деле
ruGPT: На словах ты Лев Толстой, а на деле – козел!» – так я про себя подумал, но решил не отвечать. Я встал, поклонился
