PyTorch implementation of: Rhythm-Flexible Voice Conversion without Parallel Data Using Cycle-GAN over Phoneme Posteriorgram Sequences
- This repo is NOT completed yet
- This repo is NOT completed yet
- This repo is NOT completed yet
- Please open a new issue if you find something weird or not working, thanks!
Samples can be found here; the corresponding experiment is described in Section 5.3 of the paper. Only the conventional and proposed methods are compared here.
Python: '3.5.2'
Numpy: '1.16.2'
PyTorch: '0.4.1'
Montreal Forced Aligner: '1.1.0'
- Download and decompress the VCTK corpus
- Put the text files and audio files under the same directory, then run rename.sh
- Run align_VCTK.sh to get the alignment results
- Set path info in config/config.yaml
- Run preprocess.py to generate acoustic features with the corresponding phone labels (see the sketch below)
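The alignment step yields phone intervals in time, while the acoustic features are frame-based, so the labels have to be expanded to one phone id per frame. A minimal sketch of that idea (not the repo's actual preprocess.py; the interval format, frame shift, and phone inventory here are illustrative assumptions):

```python
# Sketch: expand MFA-style phone intervals into one label per acoustic frame.
# The frame shift (hop_s) and phone-to-id table are illustrative assumptions.
import numpy as np

def intervals_to_frame_labels(intervals, n_frames, hop_s=0.0125, phone2id=None):
    """intervals: list of (start_sec, end_sec, phone) taken from an MFA TextGrid."""
    phone2id = phone2id or {}
    labels = np.zeros(n_frames, dtype=np.int64)   # default id 0 ('sp' / padding)
    for start, end, phone in intervals:
        lo = int(round(start / hop_s))
        hi = min(int(round(end / hop_s)), n_frames)
        labels[lo:hi] = phone2id.get(phone, 0)    # unknown phones also fall back to 0
    return labels

# Example: 1 s of audio with a 12.5 ms hop -> 80 frames
frame_labels = intervals_to_frame_labels(
    [(0.0, 0.30, "sil"), (0.30, 0.55, "AH0"), (0.55, 1.00, "T")],
    n_frames=80,
    phone2id={"sil": 1, "AH0": 2, "T": 3},
)
```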
- All hyperparameters are listed in this .yaml file
- Training of all modules can be done by calling main.py with different arguments.
usage: main.py [-h] [--config CONFIG]
[--seed SEED] [--train | --test]
[--ppr | --ppts | --uppt]
[--spk_id SPK_ID] [--A_id A_ID] [--B_id B_ID]
[--pre_train]
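For reference, the options above correspond to a parser roughly like the following sketch (an illustration consistent with the usage string, not necessarily identical to the parser in main.py):

```python
# Sketch of an argument parser matching the usage shown above.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--config", default="config/config.yaml")
parser.add_argument("--seed", type=int, default=0)

mode = parser.add_mutually_exclusive_group(required=True)
mode.add_argument("--train", action="store_true")
mode.add_argument("--test", action="store_true")

module = parser.add_mutually_exclusive_group(required=True)
module.add_argument("--ppr", action="store_true")
module.add_argument("--ppts", action="store_true")
module.add_argument("--uppt", action="store_true")

parser.add_argument("--spk_id")       # speaker for --ppts
parser.add_argument("--A_id")         # source speaker for --uppt
parser.add_argument("--B_id")         # target speaker for --uppt
parser.add_argument("--pre_train", action="store_true")

args = parser.parse_args()
```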
- The detailed usages of each module are listed below.
- The paths for logging and model saving should be specified in the config file first.
python3 main.py --config [path-to-config] --train --ppr
python3 main.py --config [path-to-config] --test --ppr
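PPR here denotes the phoneme recognizer whose per-frame softmax outputs form the phoneme posteriorgrams (PPGs) consumed by the later stages. A minimal sketch of that idea, with placeholder feature dimension, hidden size, and phone count (not the repo's actual architecture):

```python
# Sketch: frame-level phoneme classifier; its softmax output over phoneme
# classes is the phoneme posteriorgram (PPG). All dimensions are assumptions.
import torch
import torch.nn as nn

class PPR(nn.Module):
    def __init__(self, feat_dim=39, hidden=256, n_phones=70):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_phones)

    def forward(self, x):                       # x: (batch, time, feat_dim)
        h, _ = self.rnn(x)
        return self.out(h)                      # (batch, time, n_phones)

model = PPR()
feats = torch.randn(4, 200, 39)                 # batch of acoustic feature sequences
labels = torch.randint(0, 70, (4, 200))         # frame-level phone ids from the aligner
loss = nn.CrossEntropyLoss()(model(feats).transpose(1, 2), labels)
ppg = torch.softmax(model(feats), dim=-1)       # per-frame posteriors = PPG
```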
python3 main.py --config [path-to-config] --train --ppts \
    --spk_id [which-speaker-to-train]
python3 main.py --config [path-to-config] --test --ppts \
    --spk_id [which-speaker-to-test]
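PPTS is trained per speaker (hence --spk_id); conceptually it decodes PPG sequences back into that speaker's acoustic features. A minimal sketch under that assumption, with placeholder dimensions and a simple L1 reconstruction loss (not the repo's actual model):

```python
# Sketch: per-speaker decoder from PPG frames to acoustic frames.
# Dimensions, layers, and the loss are illustrative assumptions.
import torch
import torch.nn as nn

class PPTS(nn.Module):
    def __init__(self, n_phones=70, hidden=256, acoustic_dim=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_phones, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, acoustic_dim),
        )

    def forward(self, ppg):                     # ppg: (batch, time, n_phones)
        return self.net(ppg)                    # (batch, time, acoustic_dim)

decoder = PPTS()
ppg = torch.rand(4, 200, 70)
target = torch.randn(4, 200, 80)                # the speaker's acoustic features
loss = nn.L1Loss()(decoder(ppg), target)
```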
python3 main.py --config [path-to-config] --train --uppt \
    --pre_train --A_id [src-speaker] --B_id [tgt-speaker]
- If A_id and B_id are both set to "all", then data from two groups of fast and slow speakers, instead of two single speakers, will be used for pre-training.
- Ex.
... --A_id all --B_id all
python3 main.py --config [path-to-config] --train --uppt \
    --A_id [src-speaker] --B_id [tgt-speaker]
python3 main.py --config [path-to-config] --test --uppt \
    --A_id [src-speaker] --B_id [tgt-speaker]
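UPPT is where the Cycle-GAN over PPG sequences comes in: two generators map PPG sequences between speakers A and B, trained with adversarial and cycle-consistency losses. The sketch below shows only the loss structure of one training step; the frame-wise toy generators/discriminators and all hyperparameters are placeholders, not the actual UPPT architecture:

```python
# Minimal sketch of one CycleGAN training step over PPG sequences.
# G_ab/G_ba, D_a/D_b and all hyperparameters are illustrative placeholders.
import torch
import torch.nn as nn

n_phones, hidden = 70, 128

def make_g():  # toy frame-wise generator producing PPG-like outputs
    return nn.Sequential(nn.Linear(n_phones, hidden), nn.ReLU(),
                         nn.Linear(hidden, n_phones), nn.Softmax(dim=-1))

def make_d():  # toy frame-wise discriminator
    return nn.Sequential(nn.Linear(n_phones, hidden), nn.ReLU(),
                         nn.Linear(hidden, 1))

G_ab, G_ba, D_a, D_b = make_g(), make_g(), make_d(), make_d()
opt_g = torch.optim.Adam(list(G_ab.parameters()) + list(G_ba.parameters()), lr=2e-4)
opt_d = torch.optim.Adam(list(D_a.parameters()) + list(D_b.parameters()), lr=1e-4)
adv, l1, lam_cyc = nn.BCEWithLogitsLoss(), nn.L1Loss(), 10.0

ppg_a = torch.rand(4, 200, n_phones)  # batch of PPG sequences from speaker A
ppg_b = torch.rand(4, 200, n_phones)  # batch of PPG sequences from speaker B

# Generator update: fool both discriminators + keep cycle consistency.
fake_b, fake_a = G_ab(ppg_a), G_ba(ppg_b)
pred_fb, pred_fa = D_b(fake_b), D_a(fake_a)
g_loss = (adv(pred_fb, torch.ones_like(pred_fb)) +
          adv(pred_fa, torch.ones_like(pred_fa)) +
          lam_cyc * (l1(G_ba(fake_b), ppg_a) + l1(G_ab(fake_a), ppg_b)))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# Discriminator update: real PPGs vs. (detached) generated PPGs.
pred_ra, pred_rb = D_a(ppg_a), D_b(ppg_b)
pred_fa, pred_fb = D_a(fake_a.detach()), D_b(fake_b.detach())
d_loss = (adv(pred_ra, torch.ones_like(pred_ra)) + adv(pred_fa, torch.zeros_like(pred_fa)) +
          adv(pred_rb, torch.ones_like(pred_rb)) + adv(pred_fb, torch.zeros_like(pred_fb)))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()
```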
python3 star_main.py --config [path-to-config] --train --uppt --pre_train
python3 star_main.py --config [path-to-config] --train --uppt
python3 star_main.py --config [path-to-config] --test --uppt \
    --tgt_id [tgt-speaker]
- The phoneme 'spn' means unknown in MFA, so it is currently mapped to id 0 together with 'sp' (see the snippet below).
- Is padding with 'sp' a good choice? Or maybe 'sil'?
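For illustration, the mapping described above could look like the snippet below (the phone inventory is a made-up subset):

```python
# 'sp' (short pause) and 'spn' (unknown/spoken noise in MFA) share id 0,
# which is also used for padding. The inventory here is a placeholder.
phones = ["sp", "sil", "AA1", "AE1", "T"]      # 'spn' intentionally absent
phone2id = {p: i for i, p in enumerate(phones)}
phone2id["spn"] = phone2id["sp"]               # map unknowns to id 0

assert phone2id["spn"] == phone2id["sp"] == 0
```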
- Add a logging method to the solver, removing the add-summary redundancy in both train and eval
- Implement the whole conversion pipeline, adding functions to load models from a specified path at inference time
- StarGAN inference