WeST

We Speech Transcript: LLM-based speech recognition/transcription in 300 lines of code.

Details

Motivated by SLAM-ASR and LLaMA 3.1, our model consists of an LLM, a Speech Encoder, and a Projector (the speech adapter in LLaMA 3.1). Only the projector is trainable; a minimal sketch of the setup follows the list below.

WeST Model

  • LLM: any causal LLM, e.g. LLaMA, Qwen, etc.
  • Speech Encoder: e.g. Whisper.
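
A minimal sketch of the frozen-encoder, frozen-LLM, trainable-projector setup, assuming Hugging Face transformers. The SpeechProjector class, its stride, and the model identifiers are illustrative assumptions, not the repo's actual code:

import torch.nn as nn
from transformers import AutoModelForCausalLM, WhisperModel

class SpeechProjector(nn.Module):
    """Conv1d projector: downsamples encoder frames and maps them to the LLM width."""
    def __init__(self, enc_dim, llm_dim, stride=4):
        super().__init__()
        self.conv = nn.Conv1d(enc_dim, llm_dim, kernel_size=stride, stride=stride)

    def forward(self, feats):            # feats: (batch, frames, enc_dim)
        feats = feats.transpose(1, 2)    # -> (batch, enc_dim, frames)
        feats = self.conv(feats)         # -> (batch, llm_dim, frames // stride)
        return feats.transpose(1, 2)     # -> (batch, frames // stride, llm_dim)

llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B-Instruct")
encoder = WhisperModel.from_pretrained("openai/whisper-tiny").encoder
projector = SpeechProjector(encoder.config.d_model, llm.config.hidden_size)

# Freeze everything except the projector.
for module in (llm, encoder):
    for p in module.parameters():
        p.requires_grad = False

The projected speech frames are then prepended to the prompt embeddings of the LLM, which predicts the transcript autoregressively.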

Install

pip install -r requirements.txt

Data Preparation

The training data (train.jsonl) and test data (test.jsonl) should be prepared in JSONL format, with a wav path and a txt transcript in each line. Here is an example:

{"wav": "/data/BAC009S0764W0121.wav", "txt": "甚至出现交易几乎停滞的情况"}
{"wav": "/data/BAC009S0764W0122.wav", "txt": "一二线城市虽然也处于调整中"}

Training

torchrun --standalone --nnodes=1 --nproc_per_node=8 train.py \
    --llm_model_name_or_path Qwen2-1.5B-Instruct \
    --whisper_model_name_or_path tiny \
    --data_path train.jsonl \
    --bf16 True \
    --output_dir Qwen-1.5B-Instruct-whisper-tiny \
    --num_train_epochs 5 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 100 \
    --save_total_limit 10 \
    --learning_rate 3e-4 \
    --weight_decay 0.01 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "none" \
    --model_max_length 512 \
    --gradient_checkpointing \
    --dataloader_num_workers 4 \
    --dataloader_prefetch_factor 10 \
    --deepspeed ds_config_zero3.json

Decoding

python recognize.py \
    --llm_model_name_or_path Qwen2-1.5B-Instruct \
    --whisper_model_name_or_path tiny \
    --projector_model_path Qwen-1.5B-Instruct-whisper-tiny/checkpoint-600/model.safetensors \
    --data_path test.jsonl \
    --result_path result.txt
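
The CER numbers reported below are character error rates: the character-level edit distance between each hypothesis and its reference, summed over the test set and divided by the total reference length. A minimal sketch of that metric, assuming hypothesis/reference strings have already been loaded and paired (how result.txt is parsed is left out, since its exact layout is not documented here):

def edit_distance(ref, hyp):
    # Single-row Levenshtein distance over characters.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)]

def cer(refs, hyps):
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return errors / sum(len(r) for r in refs)

# One substitution in a 13-character reference -> CER ≈ 0.077
print(cer(["甚至出现交易几乎停滞的情况"], ["甚至出现交易几乎停止的情况"]))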

Results

LibriSpeech (TODO)

AIShell

Different LLM

| Exp | LLM        | Speech Encoder     | Projector     | CER  |
|-----|------------|--------------------|---------------|------|
| 1   | Qwen2 0.5B | Whisper Large 1.5G | Conv1d 12.07M | 9.77 |
| 2   | Qwen2 1.5B | Whisper Large 1.5G | Conv1d 13.32M | 7.45 |
| 3   | Qwen2 7B   | Whisper Large 1.5G | Conv1d 17.32M | 5.55 |

Different Speech Encoder

| Exp | LLM        | Speech Encoder     | Projector     | CER   |
|-----|------------|--------------------|---------------|-------|
| 1   | Qwen2 1.5B | Whisper tiny 39M   | Conv1d 4.5M   | 35.82 |
| 2   | Qwen2 1.5B | Whisper small 244M | Conv1d 7.3M   | 12.41 |
| 3   | Qwen2 1.5B | Whisper Large 1.5G | Conv1d 13.32M | 7.45  |

Training Loss

Different Decoding Beam

Based on Qwen2 1.5B + Whisper Large 1.5G.

| beam_size | 1    | 3    | 5    | 8    | 10   |
|-----------|------|------|------|------|------|
| CER       | 7.45 | 6.82 | 6.84 | 6.83 | 6.87 |
