[Original Repo] [Example Colab]
This setup finetunes the English model on a small part of the LibriSpeech dataset; alternatively, you can finetune the Vietnamese model on the Vivos dataset. To finetune on another dataset or another language, check "dataset.py". You can also change the hyperparameters by using a different setup file based on "config/vn_base_example.yaml". The path to the config file must be defined in .env.
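As a rough illustration of how the config path could be looked up in .env, here is a minimal standard-library sketch. The key name `CONFIG_PATH` and the helper `parse_env` are hypothetical, chosen only for this example; check .env.copy for the variable name the repo actually expects.

```python
from pathlib import Path


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


if __name__ == "__main__":
    env = parse_env(Path(".env").read_text(encoding="utf-8"))
    # CONFIG_PATH is an assumed key name, not taken from the repo.
    print(env.get("CONFIG_PATH", "config/vn_base_example.yaml"))
```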
In an experiment on Vietnamese with the Vivos dataset, the WER of the base Whisper model dropped from 45.56% to 24.27% after finetuning for 5 epochs.
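For reference, WER (word error rate) is the word-level edit distance between the reference and the predicted transcript, divided by the number of reference words. The sketch below is a plain-Python illustration of that definition; it is not the code in "evaluate_wer.py", and it assumes whitespace tokenization with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the ref words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (ref word missing from hyp)
                curr[j - 1] + 1,     # insertion (extra hyp word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

In the usage line, the hypothesis has one substitution ("sat" to "sit") and one deletion ("the"), so the result is 2 errors over 6 reference words. Because insertions count as errors, WER can exceed 100%.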
Python version: 3.8
Setup:
pip install -r requirements.txt
cp .env.copy .env
To finetune the model on Vietnamese, run these commands to download and unpack the dataset:
python data/download_data_vivos.py
tar -xvf vivos.tar.gz vivos
mv vivos data
Run the demo page with the command below; downloading the model will take a while:
streamlit run interface.py
To finetune (speech-to-text task only):
python finetune.py
To finetune Whisper for both tasks, STT and translation (e.g. using the Google API to translate Vietnamese text to English), see the example dataset at link
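Whisper selects the task through special tokens at the start of the decoder sequence: `<|transcribe|>` for STT and `<|translate|>` for translation into English, each preceded by the source-language token. The sketch below shows how one audio clip could yield one training example per task. The record layout (`audio`, `prefix`, `text`) and the helper names are hypothetical and are not the format of the repo's example dataset.

```python
def decoder_prefix(language: str, task: str, timestamps: bool = False) -> str:
    """Build the Whisper special-token prefix for a language and task."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task!r}")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return "".join(tokens)


def build_examples(audio_path: str, source_text: str, english_text: str,
                   language: str = "vi") -> list:
    """Return two records for one clip: transcription and translation."""
    return [
        {"audio": audio_path,
         "prefix": decoder_prefix(language, "transcribe"),
         "text": source_text},
        {"audio": audio_path,
         "prefix": decoder_prefix(language, "translate"),
         "text": english_text},
    ]


if __name__ == "__main__":
    # The path and sentences here are made-up placeholders.
    for record in build_examples("clip_0001.wav", "xin chào", "hello"):
        print(record)
```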
To evaluate the model:
python evaluate_wer.py
To run inference:
You can record your own audio file and convert it from speech to text using "record.py" and "inference.py".
- Add python argument parser and refactor code
- Add dockerfile for deploy
- Add Vietnamese Text normalization / Postprocessing
- Add streamlit interface to record and inference