
Add Multi-GPU training, Griffin-Lim vocoder, safely restore checkpoints, pathlib, multiple hparams files #126

Merged
merged 36 commits on Sep 8, 2019

Conversation

@TheButlah
Contributor

@TheButlah commented Aug 7, 2019

Follows up on PR #109.

Added the following features:

  • Multi-GPU training
  • Added Griffin-Lim as a baseline vocoder
  • Restoring checkpoints now also accounts for optimizer state, making it safe to save and resume training sessions (see the sketch after this list)
  • The hparams module can be specified via a command-line argument, to make it easy to test different sets of hyperparameters
  • The vocoder and Tacotron both use gradient clipping by default now, and it can be configured in hparams
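
A rough sketch of what "safe" checkpointing with optimizer state looks like. The helper names here are illustrative, not the exact functions added in this PR:

```python
import torch

def save_checkpoint(path, model, optimizer):
    # Persist both model weights and optimizer state (e.g. Adam moments),
    # so a restored session continues training exactly where it left off.
    torch.save({
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict(),
    }, path)

def restore_checkpoint(path, model, optimizer):
    # Load onto whatever device the model already lives on.
    device = next(model.parameters()).device
    checkpoint = torch.load(path, map_location=device)
    model.load_state_dict(checkpoint['model'])
    optimizer.load_state_dict(checkpoint['optimizer'])
```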

Refactored the following:

  • The codebase now uses pathlib, making it more platform-independent and nicer to use
  • The Paths module is clearer and less ambiguous thanks to pathlib; there is no longer any need to worry about whether paths should end in / (a short sketch follows this list)
  • More values that belong in PyTorch buffers are now registered as buffers
  • Checkpoint saving and restoring has been completely reworked: responsibility for training-related state is explicitly delegated to the training scripts, and model-related state to the models
  • Specifying training vs. eval mode now follows PyTorch conventions
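
For illustration, a minimal sketch of the pathlib style used by the new Paths module. The attribute names are examples, not necessarily the exact ones in `paths.py`:

```python
from pathlib import Path

class Paths:
    def __init__(self, data_path, voc_id, tts_id):
        # The / operator joins components, so trailing separators never matter
        # and the same code works on Windows and POSIX systems.
        self.base = Path(__file__).parent.expanduser().resolve()
        self.data = Path(data_path).expanduser().resolve()
        self.mel = self.data / 'mel'
        self.quant = self.data / 'quant'
        self.voc_checkpoints = self.base / 'checkpoints' / f'{voc_id}.wavernn'
        self.tts_checkpoints = self.base / 'checkpoints' / f'{tts_id}.tacotron'
```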

Fixed the following bugs:

  • An issue where saving spectrograms via matplotlib used an interactive backend, which required X11 to be running
  • Missing imports
  • Various other bugs

Please vet this code! It's not yet thoroughly tested.

Another caveat to note: the notebooks are now very outdated. I haven't touched them, and they appear to use their own code, so none of this should affect them.

TheButlah and others added 21 commits July 25, 2019 11:03
+ Added `voc_clip_grad_norm` hparam
+ Enabled gradient clipping in vocoder
+ Added warning when the gradient is NaN in vocoder and synthesizer
+ Added workaround for broken nn.DataParallel in `utils.__init__.py`
* Refactored `tacotron.py` and `fatchord_version.py` to store step
  in a PyTorch buffer instead of a gradient-less parameter
* Refactored training code for WaveRNN and Tacotron to automagically
  use data parallelism in the presence of multiple GPUs.
! Note that your batch size must be divisible by the number of GPUs
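
Roughly, the pattern this commit introduces looks like the following. The helper names and the default clip value are illustrative, not the exact training-loop code:

```python
import math
import torch
from torch import nn

def wrap_data_parallel(model):
    # Use every visible GPU; the batch is split along dim 0, so the batch
    # size must be divisible by the number of GPUs.
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)
    return model

def clipped_step(model, optimizer, loss, clip_grad_norm=4.0):
    # Backward pass with gradient clipping and a NaN warning.
    optimizer.zero_grad()
    loss.backward()
    grad_norm = nn.utils.clip_grad_norm_(model.parameters(), clip_grad_norm)
    if math.isnan(float(grad_norm)):
        print('WARNING: gradient norm was NaN')
    optimizer.step()
```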

Made the models' `num_params()` actually useful when `print_out==False`

Fixed a bunch of my own bugs, now saving optimizer state

Fixed bug due to model.load() placing tensors on CPU.

* Made model.load() always map devices to the device of the model's
  parameters instead of always to CPU.

Fixed bug with improper comparison `device == torch.device('cuda')`

* Fixed a bug where checking whether a tensor was on CUDA was done
  by comparing against the CUDA device object. That is not the correct
  way to check; it's unclear how it even worked before (see the sketch below).
…ing.py
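
For reference, a small sketch contrasting the fragile check with a robust one:

```python
import torch

t = torch.zeros(1, device='cuda' if torch.cuda.is_available() else 'cpu')

# Fragile: a tensor on 'cuda:0' compares unequal to the bare 'cuda' device.
on_gpu_fragile = t.device == torch.device('cuda')

# Robust: inspect the device type (or simply use t.is_cuda).
on_gpu = t.device.type == 'cuda'
```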

Moved decode_mu_law() above xfade_and_unfold()

Fix gta bug in gen_wavernn.py

Fix target/overlap in quick_start.py - now definable through argparse

Typo in _flatten_parameters

Add newlines to optimiser loading display
+ Added type annotation for WaveRNN and Tacotron in train and
  generate files for each model
* Fixed missing import for numpy in `fatchord_version.py` and
  `deepmind_version.py`

Fixed bugs in multi-gpu tacotron where tensors were on wrong device

- Removed get_r and set_r
* Moved ownership of self.r buffer from Tacotron to Decoder
* Now using property decorator in Tacotron for self.r
* Made all buffers reside on the same device as the parameters, to fix an
  issue where nn.parallel.gather expected a tensor to be on a particular
  GPU but it was on a different device (CPU) instead.
* Updated gen_tacotron.py and train_tacotron.py to use self.r
  property instead of getter and setter methods
* Made all returned values from forward() tensors to ensure that
  nn.parallel.gather works properly
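
A stripped-down illustration of the buffer-plus-property pattern described above. The class and attribute layout is simplified, not the real model code:

```python
import torch
from torch import nn

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        # r lives in a registered buffer: it follows .to(device), is saved in
        # the state_dict, but is never treated as a trainable parameter.
        self.register_buffer('r', torch.tensor(1, dtype=torch.long))

class TacotronLike(nn.Module):
    def __init__(self):
        super().__init__()
        self.decoder = Decoder()

    @property
    def r(self):
        return self.decoder.r.item()

    @r.setter
    def r(self, value):
        # Keep the buffer on whatever device it already occupies.
        self.decoder.r = torch.tensor(value, dtype=torch.long,
                                      device=self.decoder.r.device)
```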

Made preprocess.py use a configurable number of workers

Made self.r in Tacotron default to 1
* Now using pathlib in `paths.py`, so all properties are `Path`
  objects instead of strings.
* Changed `Paths.data` to be independent of `hparams.voc_id`, since the
  vocoder id shouldn't matter for Tacotron and the dataset should
  therefore be independent of it

Refactored preprocessing code to use pathlib

* Refactored `utils/files.py` to use `Path.rglob()`
* Refactored `preprocess.py` to use Path objects
* Refactored most files to use a name other than `id`
  for filename ids, as it shadows the builtin `id()`
  function and caused hard-to-debug errors.
* Fixed issue where `Paths.base` was the utils folder
* Refactored `utils/dataset.py` to use Path objects
* Added type annotation to `utils/recipes/files.py`
* Renamed `get_tts_dataset()` to `get_tts_datasets()`
  to be same as `get_voc_datasets()`
* Refactored `models/*` to use pathlib
+ Added `time` import in `train_tacotron.py`
* Modified `utils/display.py` to only use X11 when necessary.
  This allows training the model on servers where X11 is either
  not installed or not working (e.g. due to improperly forwarded X11);
  see the sketch after this commit log.
- Removed checkpoint and restore from `models/tacotron.py`
+ Added `save_checkpoint()` and `restore_checkpoint()` in
  `train_tacotron.py`
! Not yet finished with the refactoring! Code is breaking.
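
The standard way to avoid the X11 requirement mentioned above is to select a non-interactive matplotlib backend before pyplot is imported. A minimal sketch follows; the helper name and figure styling are illustrative, not the exact `utils/display.py` code:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend: renders to files, no display/X11 needed
import matplotlib.pyplot as plt

def save_spectrogram(mel, path):
    # Render a mel spectrogram to an image file without opening a window.
    fig, ax = plt.subplots(figsize=(12, 6))
    ax.imshow(mel, aspect='auto', origin='lower', interpolation='none')
    fig.savefig(path, bbox_inches='tight')
    plt.close(fig)
```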
@TheButlah
Contributor Author

TheButlah commented Aug 7, 2019

One caveat that I should note: For some reason, I don't always experience speedups when attempting to run on multiple GPUs. I think this might just be because my GPUs are fast and the batch size is relatively small, so the overhead of the data parallel code is hiding any benefits one might see.

This isn't really an issue; just be sure to compare single-GPU vs. multi-GPU on your system to see which trains faster for you. Obviously, if your batch doesn't fit into VRAM, you will need to use multiple GPUs.

Controlling which GPUs are used is easily done via the CUDA_VISIBLE_DEVICES environment variable.

For example, to run on all available GPUs, you would do:
python train_wavernn.py

But to run on GPUs 0 and 1, do:
CUDA_VISIBLE_DEVICES=0,1 python train_wavernn.py

As per #107, to run on CPU only, do:
python train_wavernn.py --force_cpu

If you don't have GPUs in the first place, no special steps are required as the default mode is to use all GPUs present - if no GPUs are present, it doesn't try to use CUDA.

@fatchord
Owner

fatchord commented Aug 9, 2019

This looks awesome @TheButlah - I'm still training models. It'll just take a bit of time.

@TheButlah
Contributor Author

@fatchord Glad to help!

@TheButlah
Contributor Author

TheButlah commented Aug 9, 2019

I just completed a training session of WaveRNN (without Taco) and even after all 1 million steps, I still get bad results. It seems like every epoch it alternates between good results and bad results.

Here are some samples. Note that I have modified some hparams, which may be the cause of all of this.
Results.zip

@fatchord how many iterations did it take you to get good results? I just did the 1 million iters you specified in hparams.py, with your batch size.

Let me know if you think this is a bug in the patch, or in my hyperparameters.

@TheButlah
Contributor Author

Across the board, whether it's your parameters or mine, I see poor results. I would greatly appreciate any insights you have on why the results seem low quality - I'm out of ideas. I've looked through the codebase but I'm not really seeing anything.

@TheButlah
Contributor Author

Archive.zip
Here are some outputs of Tacotron trained with your hparams, using Griffin-Lim as a vocoder. They seem more reasonable but still low quality. Are these in line with the results you got?

@TheButlah
Contributor Author

TheButlah commented Aug 10, 2019

Update: Added a tts_stop_threshold hparam that controls the threshold at which Tacotron stops generating. This value was originally hardcoded at -3.9, but I've made it -3.4 because that seemed to be better. Also made --force_train in Tacotron actually force training.

As far as the convergence issues are concerned, I've come to the conclusion that I don't think the code is bugged. I think this is a hyperparameter issue. I have retrained using your hparams with your pretrained model. They are about equivalent at 180K steps, when r was equal to 2. However, training for the full schedule that you specified in hparams.py causes dramatically worse results. I suspect the reason is one or both of two possibilities:

  1. Training for too many iterations causes overfitting
  2. Setting r to 1 might need even more iterations to properly converge, or it's just not a good choice. The original Tacotron paper used r=2, after all.

I don't know which of those two is true. I'm curious to see what your results are when you retrain. Let me know :)

Note that all of this investigation has focused on Tacotron only; I have not vetted WaveRNN yet, but I don't see why the same wouldn't apply to it too.

@fatchord
Owner

@TheButlah On the wavernn side - I suspect that the gradient clipping is not ideal and not really needed either since I've never seen the gradient explode with that model.

On the tacotron side - what's the best-sounding Griffin-Lim sample you have?

@TheButlah
Contributor Author

TheButlah commented Aug 12, 2019

@fatchord It sounds about equivalent to yours for Tacotron (again, I haven't vetted WaveRNN, but it shares a lot of its code with Tacotron, so it's probably working too), which means the fork should be working. Here are some Griffin-Lim samples generated with your original Tacotron pretrained model (no retraining):
Author Hparams Original.zip

And here is the same hparams trained to the same number of iters, but retrained on my fork. It sounds equivalent, which indicates that the fork is working:
Author Hparams Retrained (same training).zip

However, the quality significantly degrades when continuing to train on your hparams for the full 350 iters. This indicates either that the hparams in the last segment of the training schedule are not ideal, or that the model just needs even more training time to converge under those hparams:
Authors Hparams (longer training and lower r).zip

@TheButlah
Contributor Author

@fatchord Yep, I just took a look at the results from continuing to train over the weekend, and the version where r is reduced to 1 fails to converge even at 467K iters, whereas the version where I keep r=2 continues to improve (at 296K iters so far).

I would suggest updating the learning schedule to change r=1 in the last session to r=2. This works dramatically better in my tests.

I will start training WaveRNN today or tomorrow, so I can confirm the correctness of the vocoder soon too.

@TheButlah
Contributor Author

TheButlah commented Aug 14, 2019

@fatchord I have retrained both WaveRNN and Tacotron and have confirmed that they work. Here are the results when using the Tacotron inference script with WaveRNN as the vocoder; I've included both Griffin-Lim and WaveRNN output for the sake of comparison. Tacotron was trained with r=2 in the last session, for a total of 376K iters. WaveRNN has not yet finished training, at 650K iters so far (not even 2/3 trained!) on ground-truth spectrograms (not using --gta).

These results should prove that this fork is tested and ready to merge in. Let me know if you have any questions or need help with my changes!

@TheButlah changed the title from "Add Multi-GPU training, safely restore checkpoints, pathlib, multiple hparams files" to "Add Multi-GPU training, Griffin-Lim vocoder, safely restore checkpoints, pathlib, multiple hparams files" on Aug 14, 2019
@TheButlah
Contributor Author

Update: finished training WaveRNN. Results are good. I did notice that the quality seems to deteriorate a little at random depending on which checkpoint I'm looking at. I think this is because the learning rate should be decreased later over the course of training. I trained to 1050K steps with the regular 1e-4 learning rate, and will train for another 200K steps with a 1e-5 learning rate. The results at 1.1M steps already sound good, so this was probably a good call.

It would probably make sense to use training sessions like Tacotron uses, but I'm not going to add that for now because it's easy enough to find a checkpoint that sounds good, and the deterioration in quality is pretty minimal.

Again, let me know if you have any questions about the patch, it is ready to be merged in whenever.

@fatchord
Owner

@TheButlah Wow, that's amazing, thanks - can I get the weights to try out?

@fatchord
Owner

Btw - sorry I haven't been more active on this - work has been full-on the last couple of weeks.

@TheButlah
Contributor Author

TheButlah commented Aug 20, 2019

Hi @fatchord! No problem, it's your repo after all, and I've already got everything working on my end, so it's not like you're holding me up :)

Unfortunately, I cannot release my trained models, but you should be able to use the edits I made to the hparams on my fork to get results equivalent to what I posted for Tacotron. If you also apply the trick I mentioned for WaveRNN (training with a lower learning rate after training with the lr in hparams.py), you should be able to reproduce my results.

One way or another though, the fork should work 100% as it stands and is backwards compatible with your pretrained models.

@TheButlah
Contributor Author

@fatchord any updates on the status of the merge?

@fatchord
Owner

fatchord commented Sep 4, 2019

@TheButlah Sorry about the delay - I only got around to looking at it last night. There are a couple of small things I'd like to change - I want to put the new checkpointing functions into utils, as there's duplicate code in the training files. Other than that - just small things and it's good to merge.

I'll create a new branch from your fork and get working on that tomorrow morning. Hopefully we can get this merged by the weekend.

@fatchord
Owner

fatchord commented Sep 5, 2019

Ok I did a little bit there. Have a look and see if that makes sense. There might be a few more things - I'll have another look tomorrow morning.

@TheButlah
Contributor Author

Looks good. I don't have my development environment set up right now for this project, so I'll take your word for it that it works. I did make one change to actually use the real r value in quick_start.py.

@fatchord
Owner

fatchord commented Sep 8, 2019

I think it's good to merge now. @TheButlah Thanks again for your awesome contribution.

@fatchord fatchord merged commit 12922a7 into fatchord:master Sep 8, 2019
datitran pushed a commit to as-ideas/ForwardTacotron that referenced this pull request Mar 9, 2020
Add Multi-GPU training, Griffin-Lim vocoder, safely restore checkpoints, pathlib, multiple hparams files
geneing pushed a commit to geneing/WaveRNN that referenced this pull request Mar 14, 2020
Add Multi-GPU training, Griffin-Lim vocoder, safely restore checkpoints, pathlib, multiple hparams files