Qichen Ye, Junling Liu, Dading Chong, Peilin Zhou, Yining Hua, Andrew Liu, Xuxin Cheng
Integrating large language models (LLMs) into healthcare holds great potential but faces challenges. Directly pre-training LLMs for domains like medicine is resource-heavy and sometimes infeasible, and relying solely on Supervised Fine-tuning (SFT) can result in overconfident predictions that fail to tap domain-specific insights. To address these challenges, we present a multi-stage training method combining Domain-specific Continued Pre-training (DCPT), SFT, and Direct Preference Optimization (DPO). A notable contribution of our study is the introduction of a 3GB Chinese Medicine (ChiMed) dataset encompassing medical question answering, plain texts, knowledge graphs, and dialogues, segmented into the three training stages. The medical LLM trained with our pipeline, Qilin-Med, exhibits significant performance boosts.
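For readers unfamiliar with the third stage: DPO optimizes the policy directly on preference pairs, with no separate reward model. The standard objective from the DPO paper (Rafailov et al., 2023) is

```math
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
-\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
\left[ \log \sigma\!\left(
  \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right) \right]
```

where $y_w$ and $y_l$ are the preferred and rejected responses for prompt $x$, $\pi_{\mathrm{ref}}$ is the frozen SFT model, and $\beta$ controls how far the policy $\pi_\theta$ may drift from it.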
We release the following resources:
- training scripts
- training data
- models
Our ChiMed dataset is available here. It contains the following three parts (the loading sketch after the list shows the assumed record formats):
- ChiMed-Pretrain (`cpt.txt`)
- ChiMed-SFT (`sft.jsonl`)
- ChiMed-DPO (`dpo.json`)
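Purely for orientation, loading the three splits could look like the following sketch; the SFT and DPO field names here are assumptions, so check the released files for the actual schema.

```python
# Minimal sketch of loading the three ChiMed splits (field names are assumed).
import json

# ChiMed-Pretrain: plain text, treated here as one document per line.
with open("data/pretrain/cpt.txt", encoding="utf-8") as f:
    pretrain_docs = [line.strip() for line in f if line.strip()]

# ChiMed-SFT: JSON Lines, assumed to hold one instruction-response pair per line.
with open("data/sft/sft.jsonl", encoding="utf-8") as f:
    sft_pairs = [json.loads(line) for line in f]

# ChiMed-DPO: assumed to be a JSON array of preference records
# (a prompt plus a preferred and a rejected response).
with open("data/dpo/dpo.json", encoding="utf-8") as f:
    dpo_triples = json.load(f)

print(len(pretrain_docs), len(sft_pairs), len(dpo_triples))
```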
Put the ChiMed-Pretrain data (i.e., `cpt.txt`) at `data/pretrain/`, then run the following script.
```bash
bash run_pt.sh
```
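`run_pt.sh` is the released entry point; purely as a sketch of what the DCPT stage amounts to, a minimal continued pre-training loop over `cpt.txt` with Hugging Face `transformers` could look like this (the Baichuan-7B base and all hyperparameters are assumptions, not the script's actual settings):

```python
# Minimal DCPT sketch: continued causal-LM pre-training on cpt.txt.
# Not the released run_pt.sh; base model and hyperparameters are assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "baichuan-inc/Baichuan-7B"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base, trust_remote_code=True)

# Treat each line of cpt.txt as one training document.
dataset = load_dataset("text", data_files={"train": "data/pretrain/cpt.txt"})
tokenized = dataset["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="checkpoints/Qilin-Med-Pretrained",
                           per_device_train_batch_size=1, num_train_epochs=1),
    train_dataset=tokenized,
    # mlm=False gives standard next-token (causal LM) labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```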
Put the ChiMed-SFT data (i.e., `sft.jsonl`) at `data/sft/`, then run the following script.
```bash
bash run_sft.sh
```
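Since the later merge step implies the SFT stage trains a PEFT (LoRA) adapter rather than full weights, a minimal sketch of that stage might look like this; the `instruction`/`output` schema, LoRA target modules, and hyperparameters are all assumptions:

```python
# Minimal LoRA SFT sketch. Not the released run_sft.sh; the sft.jsonl field
# names, LoRA target modules, and hyperparameters are assumptions.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "checkpoints/Qilin-Med-Pretrained"  # output of the DCPT stage
tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base, trust_remote_code=True)

# Wrap the base model with trainable low-rank adapters. "W_pack" is Baichuan's
# fused QKV projection; LLaMA-style models would use e.g. ["q_proj", "v_proj"].
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                  target_modules=["W_pack"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

dataset = load_dataset("json", data_files={"train": "data/sft/sft.jsonl"})
tokenized = dataset["train"].map(
    # Assumed schema: concatenate instruction and response into one sequence.
    lambda ex: tokenizer(ex["instruction"] + "\n" + ex["output"],
                         truncation=True, max_length=1024),
    remove_columns=dataset["train"].column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="checkpoints/qilin-med-sft-lora",
                           per_device_train_batch_size=1, num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("checkpoints/qilin-med-sft-lora")  # adapter weights only
```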
Put the ChiMed-DPO data (i.e., `dpo.json`) at `data/dpo/`. Then use `scripts/merge_peft_adapter.py` to merge the SFT adapter into `Qilin-Med-Pretrained` and put the resulting model at `checkpoints/Qilin-Med-SFT-merged` (a sketch of this merge step follows below).
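For orientation, the merge step with `peft` generally amounts to loading the adapter on top of the base model and folding the low-rank updates into its weights; see `scripts/merge_peft_adapter.py` for the actual implementation. The adapter path below is the hypothetical one from the SFT sketch above.

```python
# Sketch of the adapter merge; see scripts/merge_peft_adapter.py in this repo
# for the actual arguments. The adapter path is a hypothetical name.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "checkpoints/Qilin-Med-Pretrained", trust_remote_code=True)
# Load the SFT LoRA adapter on top of the base model, then fold the low-rank
# updates into the base weights to get a plain standalone checkpoint.
merged = PeftModel.from_pretrained(
    base, "checkpoints/qilin-med-sft-lora").merge_and_unload()
merged.save_pretrained("checkpoints/Qilin-Med-SFT-merged")

tokenizer = AutoTokenizer.from_pretrained(
    "checkpoints/Qilin-Med-Pretrained", trust_remote_code=True)
tokenizer.save_pretrained("checkpoints/Qilin-Med-SFT-merged")
```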
Finally, run the following script.

```bash
bash run_dpo.sh
```
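As a rough sketch of what the DPO stage does, here is a minimal setup with `trl`'s `DPOTrainer` (v0.7-era API; newer releases move `beta` into a `DPOConfig`). The `prompt`/`chosen`/`rejected` field names are assumptions about `dpo.json`:

```python
# Minimal DPO sketch with trl's DPOTrainer (v0.7-era API). Not the released
# run_dpo.sh; field names and hyperparameters are assumptions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

path = "checkpoints/Qilin-Med-SFT-merged"
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(path, trust_remote_code=True)      # policy
ref_model = AutoModelForCausalLM.from_pretrained(path, trust_remote_code=True)  # frozen reference

# Assumed schema: each record holds "prompt", "chosen", and "rejected" strings.
dataset = load_dataset("json", data_files={"train": "data/dpo/dpo.json"})["train"]

trainer = DPOTrainer(
    model,
    ref_model,
    args=TrainingArguments(output_dir="checkpoints/Qilin-Med-DPO",
                           per_device_train_batch_size=1, num_train_epochs=1,
                           remove_unused_columns=False),
    beta=0.1,  # controls how far the policy may drift from the reference
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
```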
If you find our work helpful, please cite:

```bibtex
@misc{ye2023qilinmed,
      title={Qilin-Med: Multi-stage Knowledge Injection Advanced Medical Large Language Model},
      author={Qichen Ye and Junling Liu and Dading Chong and Peilin Zhou and Yining Hua and Andrew Liu},
      year={2023},
      eprint={2310.09089},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
Many thanks to the following awesome works!