Unofficial PyTorch implementation of the paper. Most of the code is based on Link
- LibriTTS dataset (train-clean-100 and train-clean-360) is used.
- Sampling rate is set to 22050Hz (default).
- Clone this repository.
- Install the Python requirements. Please refer to requirements.txt
- Run
python prepare_align.py --data_path [LibriTTS DATAPATH]
for some preparations. (You can change the sampling rate by adding --resample_rate [SR])
- Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences.
  1-1. Download MFA following the installation command on its website.
  1-2. Run the commands below.
$ conda activate aligner
$ mfa model download acoustic english_us_arpa
$ mfa align ......LibriTTS/wav22 lexicon.txt english_us_arpa .........LibriTTS/Textgrid
- Run
python preprocess.py
(Check the input and output data paths before running.)
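The MFA alignments are time intervals per phoneme, and TTS preprocessing typically turns them into per-phoneme mel-frame counts. The sketch below illustrates that conversion only; it is not taken from preprocess.py. The function name `intervals_to_durations` and the hop length of 256 samples are assumptions for illustration, while 22050 Hz matches the default sampling rate stated above.

```python
from typing import List, Tuple


def intervals_to_durations(
    intervals: List[Tuple[float, float]],
    sampling_rate: int = 22050,  # default sampling rate from this README
    hop_length: int = 256,       # assumed hop size, not confirmed by the repository
) -> List[int]:
    """Convert (start_sec, end_sec) phoneme intervals into mel-frame counts."""
    durations = []
    for start, end in intervals:
        # Round the boundaries (not the lengths) so that consecutive
        # intervals add up to the total number of frames without drift.
        start_frame = round(start * sampling_rate / hop_length)
        end_frame = round(end * sampling_rate / hop_length)
        durations.append(end_frame - start_frame)
    return durations


# Hypothetical alignment for a three-phoneme utterance (times in seconds).
example = [(0.0, 0.1), (0.1, 0.25), (0.25, 0.4)]
durations = intervals_to_durations(example)
print(durations)
```

Rounding boundaries instead of individual lengths keeps the summed durations equal to the frame index of the last boundary, which is what a length regulator needs to match the mel-spectrogram length.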
python train.py --data_path [Preprocessed LibriTTS DATAPATH]
- Mel generation
python synthesize.py --checkpoint_path [CKPT PATH] --ref_audio [REF AUDIO PATH]
- Waveform generation (uses HiFi-GAN)
cd hifi-gan
python inference_e2e.py --checkpoint_file [VOCODER CKPT PATH]
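The two synthesis steps can be chained from a small script. The following is a minimal sketch that only uses the flags listed above; the helper names `build_synthesis_commands` and `run_synthesis` and the example paths are hypothetical, and the vocoder script may need further arguments that are not documented here. By default it performs a dry run and only prints the commands.

```python
import subprocess
from typing import List, Optional


def build_synthesis_commands(
    checkpoint_path: str, ref_audio: str, vocoder_checkpoint: str
) -> List[List[str]]:
    """Build the argv lists for mel generation and waveform generation."""
    mel_cmd = [
        "python", "synthesize.py",
        "--checkpoint_path", checkpoint_path,
        "--ref_audio", ref_audio,
    ]
    vocoder_cmd = [
        "python", "inference_e2e.py",
        "--checkpoint_file", vocoder_checkpoint,
    ]
    return [mel_cmd, vocoder_cmd]


def run_synthesis(
    commands: List[List[str]], vocoder_dir: str = "hifi-gan", dry_run: bool = True
) -> None:
    """Run mel generation in the repository root, then the vocoder in its folder."""
    working_dirs: List[Optional[str]] = [None, vocoder_dir]
    for cmd, cwd in zip(commands, working_dirs):
        if dry_run:
            print(f"[{cwd or '.'}] {' '.join(cmd)}")
        else:
            # check=True stops the pipeline if a step fails.
            subprocess.run(cmd, cwd=cwd, check=True)


# Hypothetical paths, shown only to illustrate the call.
commands = build_synthesis_commands("ckpt/model.pth", "ref.wav", "ckpt/generator.pth")
run_synthesis(commands)
```

Because the vocoder step runs inside hifi-gan, a relative vocoder checkpoint path is resolved from that directory, so an absolute path is the safer choice.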