See also:
- DPKD: A simple improvement of MiniLLM using DPO.
- MiniPLM: Knowledge distillation for pre-training language models.
pip3 install git+https://github.com/t1101675/transformers@minillm
pip3 install torch
pip3 install deepspeed
pip3 install numerize
pip3 install rouge-score
pip3 install torchtyping
pip3 install rich
pip3 install accelerate
pip3 install datasets
pip3 install peft
or
bash install.sh
Our data and pre-trained models are uploaded to our HuggingFace repo. We modified the transformers code base to support model (tensor) parallelism and teacher-mixed sampling. The modified lines are wrapped with
# ### MiniLLM BEGIN ###
... SOME NEW CODES ...
# ### MiniLLM END ###
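If you want to inspect all of the modifications, one convenient way (a sketch; the install location of the patched transformers depends on your environment) is to search for these markers in the installed package:

```bash
# Resolve where the patched transformers package is installed, then list
# every MiniLLM-modified region by searching for the BEGIN marker.
TRANSFORMERS_DIR=$(python3 -c "import transformers, os; print(os.path.dirname(transformers.__file__))")
grep -rn "### MiniLLM BEGIN ###" "${TRANSFORMERS_DIR}"
```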
- The training/evaluation instruction-response data before processing can be downloaded from the following links: dolly, self-inst, vicuna, sinst, and uinst
huggingface-cli download MiniLLM/dolly --repo-type dataset --local-dir /PATH_TO/LMOps/minillm/data/dolly/
huggingface-cli download MiniLLM/self-inst --repo-type dataset --local-dir /PATH_TO/LMOps/minillm/data/self-inst/
huggingface-cli download MiniLLM/Vicuna --repo-type dataset --local-dir /PATH_TO/LMOps/minillm/data/vicuna/
huggingface-cli download MiniLLM/sinst --repo-type dataset --local-dir /PATH_TO/LMOps/minillm/data/sinst/
huggingface-cli download MiniLLM/uinst --repo-type dataset --local-dir /PATH_TO/LMOps/minillm/data/uinst/
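Equivalently, a small loop can fetch all five datasets into the layout above (a convenience sketch; it assumes the lowercased repo name matches the target directory, as in the commands above):

```bash
# Download the five instruction-response datasets into data/<name>/ (lowercased).
for name in dolly self-inst Vicuna sinst uinst; do
    dir=$(echo "${name}" | tr '[:upper:]' '[:lower:]')
    huggingface-cli download "MiniLLM/${name}" --repo-type dataset \
        --local-dir "/PATH_TO/LMOps/minillm/data/${dir}/"
done
```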
- (Optional) The plain-text corpus $\mathcal{D}_\text{PT}$ can be downloaded from the HuggingFace datasets repository. For reproducibility, we recommend using the following preprocessed data.
- The processed data can be downloaded from the following links: dolly, openwebtext (Optional), roberta-corpus (Optional).
huggingface-cli download MiniLLM/dolly-processed --repo-type dataset --local-dir /PATH_TO/LMOps/minillm/processed_data/dolly/
huggingface-cli download MiniLLM/openwebtext-processed --repo-type dataset --local-dir /PATH_TO/LMOps/minillm/processed_data/openwebtext/gpt2/512/10M/ # Optional
huggingface-cli download MiniLLM/roberta-corpus-processed --repo-type dataset --local-dir /PATH_TO/LMOps/minillm/processed_data/openwebtext/ # Optional
bash scripts/gpt2/tools/process_data_dolly.sh /PATH_TO/LMOps/minillm # Process Dolly Train / Validation Data
bash scripts/opt/tools/process_data_dolly.sh /PATH_TO/LMOps/minillm # Process Dolly Train / Validation Data
bash scripts/llama/tools/process_data_dolly.sh /PATH_TO/LMOps/minillm # Process Dolly Train / Validation Data
Get the plain-text corpus:
python3 tools/get_openwebtext.py
This script replaces consecutive \n characters in each document with a special token "<@x(x!>" and writes each OpenWebText document on a single line, which is convenient for parallel processing. In data/openwebtext/data.txt, we give an example of the resulting format. You can follow this format to prepare other corpora beyond OpenWebText.
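For reference, here is a minimal sketch of converting another corpus into this one-document-per-line format before tokenization. It assumes GNU tr/sed and a layout of one raw document per .txt file, which are illustrative assumptions; the corpus name and paths are placeholders.

```bash
# Convert a directory of raw documents (assumed: one document per .txt file)
# into the one-document-per-line format: runs of newlines inside a document
# are replaced by the special token "<@x(x!>".
mkdir -p data/my_corpus
for f in /PATH_TO/raw_docs/*.txt; do
    # squeeze runs of newlines into a single \x01, then map \x01 to the token
    tr -s '\n' '\001' < "$f" | sed 's/\x01/<@x(x!>/g; s/<@x(x!>$//'
    echo    # terminate the line: one document per line
done > data/my_corpus/data.txt
```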
Tokenize the data and store them in binary files:
bash scripts/gpt2/tools/process_data_pretrain.sh /PATH_TO/LMOps/minillm # Process OpenWebText Train / Validation Data
bash scripts/opt/tools/process_data_pretrain.sh /PATH_TO/LMOps/minillm # Process RoBERTa Corpus Train / Validation Data
bash scripts/llama/tools/process_data_pretrain.sh /PATH_TO/LMOps/minillm # Process RoBERTa Corpus Train / Validation Data
- The pre-trained models (MiniLLM and the baselines) can be found in this collection.
To run fine-tuning or standard KD baselines, you need to download the model checkpoints from the Hugging Face Model Hub and put them in checkpoints/. For example, for gpt2-large, you can download the model from this link and put it in checkpoints/gpt2-large.
huggingface-cli download gpt2 --repo-type model --local-dir /PATH_TO/LMOps/minillm/checkpoints/gpt2-base
huggingface-cli download gpt2-medium --repo-type model --local-dir /PATH_TO/LMOps/minillm/checkpoints/gpt2-medium
huggingface-cli download gpt2-large --repo-type model --local-dir /PATH_TO/LMOps/minillm/checkpoints/gpt2-large
huggingface-cli download gpt2-xl --repo-type model --local-dir /PATH_TO/LMOps/minillm/checkpoints/gpt2-xlarge
Alternatively, you can change the CKPT variable in each script to the corresponding model name so that Transformers downloads the base model automatically. For example, setting CKPT="gpt2-large" in scripts/gpt2/sft/sft_large.sh downloads the gpt2-large base model from the HuggingFace model hub.
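The relevant line in a training script then looks like one of the following (a sketch of the two options just described; BASE_PATH stands for the /PATH_TO/LMOps/minillm argument passed to the script and is an assumption about the script's internals):

```bash
# Option 1: point CKPT at a locally downloaded checkpoint
CKPT="${BASE_PATH}/checkpoints/gpt2-large"
# Option 2: use the model name so Transformers downloads it automatically
CKPT="gpt2-large"
```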
NOTE:
- LLaMA models require a license and cannot be downloaded directly.
- If you want to use model parallelism for training, it is recommended to download the models to checkpoints/ because you need to run tools/convert_mp.py to change their model parallel sizes (see the next section).
If the model is too large to fit on your GPUs, you can increase or decrease its tensor parallel size with:
python3 tools/convert_mp.py \
--input_path results/llama/train/minillm/7B-init-13B-sft \
--source_mp_size 1 \
--target_mp_size 4 \
--model_type llama # choose from opt and llama
To use the model with model parallelism, we provide two example scripts for training and evaluation.
NOTE: Model parallelism is not applied to GPT-2 because these models are generally small enough to fit on common GPUs.
bash scripts/gpt2/eval/run_eval.sh /PATH_TO/LMOps/minillm
bash scripts/opt/eval/run_eval.sh /PATH_TO/LMOps/minillm
bash scripts/llama/eval/run_eval.sh /PATH_TO/LMOps/minillm
We provide example commands for GPT-2 models. Similar scripts for other model families can be found in scripts/opt and scripts/llama. All our experiments are conducted on 16 × 32GB V100 GPUs, which can be reduced for small models.
Some large models require a tensor parallel size of 4, which is set in the scripts with the --model-parallel and --model-parallel-size options.
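In the scripts that use tensor parallelism, these options are appended to the launcher arguments in the same OPTS+= style used elsewhere in the scripts (a sketch; the exact surrounding lines differ per script):

```bash
# enable tensor parallelism with 4-way partitioning for the larger models
OPTS+=" --model-parallel"
OPTS+=" --model-parallel-size 4"
```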
The final checkpoints are selected by the Rouge-L scores.
bash scripts/gpt2/sft/sft_xlarge.sh /PATH_TO/LMOps/minillm
bash scripts/gpt2/sft/sft_base.sh /PATH_TO/LMOps/minillm
bash scripts/gpt2/sft/sft_medium.sh /PATH_TO/LMOps/minillm
bash scripts/gpt2/sft/sft_large.sh /PATH_TO/LMOps/minillm
bash scripts/gpt2/kd/kd_base.sh /PATH_TO/LMOps/minillm
bash scripts/gpt2/kd/kd_medium.sh /PATH_TO/LMOps/minillm
bash scripts/gpt2/kd/kd_large.sh /PATH_TO/LMOps/minillm
Generate and process responses with the teacher:
bash scripts/gpt2/tools/generate_data_seqkd.sh /PATH_TO/LMOps/minillm
bash scripts/gpt2/tools/process_pseudo_data_seqkd.sh /PATH_TO/LMOps/minillm
Fine-tune the model with SeqKD:
bash scripts/gpt2/seqkd/seqkd_base.sh /PATH_TO/LMOps/minillm
bash scripts/gpt2/seqkd/seqkd_medium.sh /PATH_TO/LMOps/minillm
bash scripts/gpt2/seqkd/seqkd_large.sh /PATH_TO/LMOps/minillm
The final checkpoints are selected by the validation loss.
bash scripts/gpt2/sft/sft_base.sh /PATH_TO/LMOps/minillm
bash scripts/gpt2/sft/sft_medium.sh /PATH_TO/LMOps/minillm
bash scripts/gpt2/sft/sft_large.sh /PATH_TO/LMOps/minillm
The final checkpoints are selected by the Rouge-L scores.
bash scripts/gpt2/minillm/train_base_xl.sh /PATH_TO/LMOps/minillm
bash scripts/gpt2/minillm/train_medium_xl.sh /PATH_TO/LMOps/minillm
bash scripts/gpt2/minillm/train_large_xl.sh /PATH_TO/LMOps/minillm
For the data we use:
- PROMPT_DATA_DIR is the SFT data ($\mathcal{D}$, Dolly), which is required.
- LM_DATA_DIR is the plain-text corpus ($\mathcal{D}_\text{PT}$), which is optional. See minillm/scripts/gpt2/minillm/train_base_xl_no_pt.sh for training without LM_DATA_DIR (by just commenting out the OPTS+=" --lm-data-dir ${LM_DATA_DIR}" line; a sketch follows below).
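A sketch of the relevant lines (the variable values, BASE_PATH, and the --prompt-data-dir option name are assumptions for illustration; check the actual script for the exact spelling):

```bash
# data used by MiniLLM training
PROMPT_DATA_DIR="${BASE_PATH}/processed_data/dolly/"                 # required: SFT data (Dolly)
LM_DATA_DIR="${BASE_PATH}/processed_data/openwebtext/gpt2/512/10M/"  # optional: plain-text corpus
OPTS+=" --prompt-data-dir ${PROMPT_DATA_DIR}"
# OPTS+=" --lm-data-dir ${LM_DATA_DIR}"   # comment this line out to train without the PT corpus
```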
Multi-node training is launched with deepspeed. We provide an example script in scripts/llama/sft/sft_7B_mn.sh for multi-node training. Compared to the single-node scripts, some of the DISTRIBUTED_ARGS are changed, and you need to specify a hostfile like configs/hostfiles/node_0_1 to tell the script which nodes to use. For more information, please refer to HuggingFace's tutorial.
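The hostfile follows DeepSpeed's standard format: one line per node with the hostname and its number of GPU slots. A minimal example (hostnames and slot counts are placeholders; substitute your own):

```bash
# create a two-node hostfile for deepspeed (placeholder hostnames, 8 GPUs per node)
cat > configs/hostfiles/node_0_1 << 'EOF'
worker-0 slots=8
worker-1 slots=8
EOF
```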
@inproceedings{minillm,
title={MiniLLM: Knowledge Distillation of Large Language Models},
author={Gu, Yuxian and Dong, Li and Wei, Furu and Huang, Minlie},
booktitle={Proceedings of ICLR},
year={2024}
}