
Add Multi-GPU training, Griffin-Lim vocoder, safely restore checkpoints, pathlib, multiple hparams files #126

Merged
merged 36 commits on Sep 8, 2019

Conversation

@TheButlah
Contributor

@TheButlah commented Aug 7, 2019

Follows up on PR #109.

Added the following features:

  • Multi-GPU training
  • Added Griffin-Lim as a baseline vocoder
  • Restoring checkpoints now also accounts for optimizer state, making it safe to save and resume training sessions (see the sketch after this list)
  • The hparams module can be specified via a command-line argument, to make it easy to test different sets of hyperparameters
  • The vocoder and Tacotron both use gradient clipping by default now, and it can be configured in hparams
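
A rough sketch of what "safe" checkpointing with optimizer state looks like. The helper names here are illustrative, not the exact functions added in this PR:

```python
import torch

def save_checkpoint(path, model, optimizer):
    # Persist both model weights and optimizer state (e.g. Adam moments),
    # so a restored session continues training exactly where it left off.
    torch.save({
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict(),
    }, path)

def restore_checkpoint(path, model, optimizer):
    # Load onto whatever device the model already lives on.
    device = next(model.parameters()).device
    checkpoint = torch.load(path, map_location=device)
    model.load_state_dict(checkpoint['model'])
    optimizer.load_state_dict(checkpoint['optimizer'])
```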

Refactored the following:

  • The codebase now uses pathlib, making it more platform-independent and nicer to use
  • The Paths module is clearer and less ambiguous thanks to pathlib; there is no longer any need to worry about whether paths should end in / (a short sketch follows this list)
  • More values that belong in PyTorch buffers are now registered as buffers
  • Checkpoint saving and restoring has been completely reworked: responsibility for training-related state is explicitly delegated to the training scripts, and model-related state to the models
  • Specifying training vs. eval mode now follows PyTorch conventions
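
For illustration, a minimal sketch of the pathlib style used by the new Paths module. The attribute names are examples, not necessarily the exact ones in `paths.py`:

```python
from pathlib import Path

class Paths:
    def __init__(self, data_path, voc_id, tts_id):
        # The / operator joins components, so trailing separators never matter
        # and the same code works on Windows and POSIX systems.
        self.base = Path(__file__).parent.expanduser().resolve()
        self.data = Path(data_path).expanduser().resolve()
        self.mel = self.data / 'mel'
        self.quant = self.data / 'quant'
        self.voc_checkpoints = self.base / 'checkpoints' / f'{voc_id}.wavernn'
        self.tts_checkpoints = self.base / 'checkpoints' / f'{tts_id}.tacotron'
```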

Fixed the following bugs:

  • An issue where saving spectrograms via matplotlib used an interactive backend, which required X11 to be running
  • Missing imports
  • Various other bugs

Please vet this code! It's not yet thoroughly tested.

Another caveat to note: the notebooks are now very outdated. I haven't touched them, and they appear to use their own code, so none of this should affect them.

TheButlah and others added 21 commits July 25, 2019 11:03
+ Added `voc_clip_grad_norm` hparam
+ Enabled gradient clipping in vocoder
+ Added warning when the gradient is NaN in vocoder and synthesizer
+ Added workaround for broken nn.DataParallel in `utils.__init__.py`
* Refactored `tacotron.py` and `fatchord_version.py` to store step
  in a PyTorch buffer instead of a gradient-less parameter
* Refactored training code for WaveRNN and Tacotron to automagically
  use data parallelism in the presence of multiple GPUs.
! Note that your batch size must be divisible by the number of GPUs
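
Roughly, the pattern this commit introduces looks like the following. The helper names and the default clip value are illustrative, not the exact training-loop code:

```python
import math
import torch
from torch import nn

def wrap_data_parallel(model):
    # Use every visible GPU; the batch is split along dim 0, so the batch
    # size must be divisible by the number of GPUs.
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)
    return model

def clipped_step(model, optimizer, loss, clip_grad_norm=4.0):
    # Backward pass with gradient clipping and a NaN warning.
    optimizer.zero_grad()
    loss.backward()
    grad_norm = nn.utils.clip_grad_norm_(model.parameters(), clip_grad_norm)
    if math.isnan(float(grad_norm)):
        print('WARNING: gradient norm was NaN')
    optimizer.step()
```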

Made the models' `num_params()` actually useful when `print_out==False`

Fixed a bunch of my own bugs, now saving optimizer state

Fixed bug due to model.load() placing tensors on CPU.

* Made model.load() always map devices to the device of the model's
  parameters instead of always to CPU.

Fixed bug with improper comparison `device == torch.device('cuda')`

* Fixed a bug where checking whether a tensor was on CUDA was done
  by comparing against the CUDA device object. That is not the correct
  way to check; it's unclear how it even worked before (see the sketch below).
…ing.py
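
For reference, a small sketch contrasting the fragile check with a robust one:

```python
import torch

t = torch.zeros(1, device='cuda' if torch.cuda.is_available() else 'cpu')

# Fragile: a tensor on 'cuda:0' compares unequal to the bare 'cuda' device.
on_gpu_fragile = t.device == torch.device('cuda')

# Robust: inspect the device type (or simply use t.is_cuda).
on_gpu = t.device.type == 'cuda'
```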

Moved decode_mu_law() above xfade_and_unfold()

Fix gta bug in gen_wavernn.py

Fix target/overlap in quick_start.py - now definable through argparse

Typo in _flatten_parameters

Add newlines to optimiser loading display
+ Added type annotation for WaveRNN and Tacotron in train and
  generate files for each model
* Fixed missing import for numpy in `fatchord_version.py` and
  `deepmind_version.py`

Fixed bugs in multi-gpu tacotron where tensors were on wrong device

- Removed get_r and set_r
* Moved ownership of self.r buffer from Tacotron to Decoder
* Now using property decorator in Tacotron for self.r
* Made all buffers reside on the same device as the parameters, to fix an
  issue where nn.parallel.gather expected a tensor to be on a particular
  GPU but it was on a different device (CPU) instead.
* Updated gen_tacotron.py and train_tacotron.py to use self.r
  property instead of getter and setter methods
* Made all returned values from forward() tensors to ensure that
  nn.parallel.gather works properly
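
A stripped-down illustration of the buffer-plus-property pattern described above. The class and attribute layout is simplified, not the real model code:

```python
import torch
from torch import nn

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        # r lives in a registered buffer: it follows .to(device), is saved in
        # the state_dict, but is never treated as a trainable parameter.
        self.register_buffer('r', torch.tensor(1, dtype=torch.long))

class TacotronLike(nn.Module):
    def __init__(self):
        super().__init__()
        self.decoder = Decoder()

    @property
    def r(self):
        return self.decoder.r.item()

    @r.setter
    def r(self, value):
        # Keep the buffer on whatever device it already occupies.
        self.decoder.r = torch.tensor(value, dtype=torch.long,
                                      device=self.decoder.r.device)
```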

Made preprocess.py use a configurable number of workers

Made self.r in Tacotron default to 1
* Now using pathlib in `paths.py`, so all properties are `Path`
  objects instead of strings.
* Changed `Paths.data` to be independent of `hparams.voc_id`, since the
  vocoder id shouldn't matter for Tacotron and the dataset should
  therefore be independent of it

Refactored preprocessing code to use pathlib

* Refactored `utils/files.py` to use `Path.rglob()`
* Refactored `preprocess.py` to use Path objects
* Refactored most files to use a name other than `id`
  for filename ids, as it shadows the builtin `id()`
  function and caused hard-to-debug errors.
* Fixed issue where `Paths.base` was the utils folder
* Refactored `utils/dataset.py` to use Path objects
* Added type annotation to `utils/recipes/files.py`
* Renamed `get_tts_dataset()` to `get_tts_datasets()`
  to be same as `get_voc_datasets()`
* Refactored `models/*` to use pathlib
+ Added `time` import in `train_tacotron.py`
* Modified `utils/display.py` to only use X11 when necessary.
  This allows training the model on servers where X11 is either
  not installed or not working (e.g. due to improperly forwarded X11);
  see the sketch after this commit log.
- Removed checkpoint and restore from `models/tacotron.py`
+ Added `save_checkpoint()` and `restore_checkpoint()` in
  `train_tacotron.py`
! Not yet finished with the refactoring! Code is breaking.
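
The standard way to avoid the X11 requirement mentioned above is to select a non-interactive matplotlib backend before pyplot is imported. A minimal sketch follows; the helper name and figure styling are illustrative, not the exact `utils/display.py` code:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend: renders to files, no display/X11 needed
import matplotlib.pyplot as plt

def save_spectrogram(mel, path):
    # Render a mel spectrogram to an image file without opening a window.
    fig, ax = plt.subplots(figsize=(12, 6))
    ax.imshow(mel, aspect='auto', origin='lower', interpolation='none')
    fig.savefig(path, bbox_inches='tight')
    plt.close(fig)
```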
@TheButlah
Contributor Author

TheButlah commented Aug 7, 2019

One caveat that I should note: For some reason, I don't always experience speedups when attempting to run on multiple GPUs. I think this might just be because my GPUs are fast and the batch size is relatively small, so the overhead of the data parallel code is hiding any benefits one might see.

This isn't really an issue; just be sure to compare single-GPU vs. multi-GPU on your system to see which trains faster for you. Obviously, if your batch doesn't fit into VRAM, you will need to use multiple GPUs.

Controlling which GPUs are used is easily done via the CUDA_VISIBLE_DEVICES environment variable.

For example, to run on all available GPUs, you would do:
python train_wavernn.py

But to run on GPUs 0 and 1, do:
CUDA_VISIBLE_DEVICES=0,1 python train_wavernn.py

As per #107, to run on CPU only, do:
python train_wavernn.py --force_cpu

If you don't have GPUs in the first place, no special steps are required as the default mode is to use all GPUs present - if no GPUs are present, it doesn't try to use CUDA.

@fatchord
Owner

fatchord commented Aug 9, 2019

This looks awesome @TheButlah - I'm still training models. It'll just take a bit of time.

@TheButlah
Contributor Author

@fatchord Glad to help!

@TheButlah
Contributor Author

TheButlah commented Aug 9, 2019

I just completed a training session of WaveRNN (without Taco) and even after all 1 million steps, I still get bad results. It seems like every epoch it alternates between good results and bad results.

Here are some samples. Note that I have modified some hparams, which may be the cause of all of this.
Results.zip

@fatchord how many iterations did it take you to get good results? I just did the 1 million iters you specified in hparams.py, with your batch size.

Let me know if you think this is a bug in the patch, or in my hyperparameters.

@TheButlah
Contributor Author

Across the board, whether it's your parameters or mine, I see poor results. I would greatly appreciate any insights you have on why the results seem low quality - I'm out of ideas. I've looked through the codebase but I'm not really seeing anything.

@TheButlah
Contributor Author

Archive.zip
Here are some outputs of Tacotron trained with your hparams, using Griffin-Lim as a vocoder. They seem more reasonable but still low quality. Are these in line with the results you got?

@TheButlah
Contributor Author

TheButlah commented Aug 10, 2019

Update: Added a tts_stop_threshold hparam that controls the threshold at which Tacotron stops generating. This value was originally hardcoded at -3.9, but I've made it -3.4 because that seemed to be better. Also made --force_train in Tacotron actually force training.

As far as the convergence issues are concerned, I've come to the conclusion that I don't think the code is bugged. I think this is a hyperparameter issue. I have retrained using your hparams with your pretrained model. They are about equivalent at 180K steps, when r was equal to 2. However, training for the full schedule that you specified in hparams.py causes dramatically worse results. I suspect the reason is one or both of two possibilities:

  1. Training for too many iterations causes overfitting
  2. Setting r to 1 might need even more iterations to properly converge, or it's just not a good choice. The original Tacotron paper used r=2, after all.

I don't know which of those two is true. I'm curious to see what your results are when you retrain. Let me know :)

Note that all of this investigation has focused on Tacotron only; I have not vetted WaveRNN yet, but I don't see why the same wouldn't apply to it too.

@fatchord
Owner

@TheButlah On the wavernn side - I suspect that the gradient clipping is not ideal and not really needed either since I've never seen the gradient explode with that model.

On the tacotron side - what's the best-sounding Griffin-Lim sample you have?

@TheButlah
Contributor Author

TheButlah commented Aug 12, 2019

@fatchord It sounds about equivalent to yours for Tacotron (again, I haven't vetted WaveRNN, but it shares a lot of its code with Tacotron, so it's probably working too), which means the fork should be working. Here are some Griffin-Lim samples generated with your original Tacotron pretrained model (no retraining):
Author Hparams Original.zip

And here is the same hparams trained to the same number of iters, but retrained on my fork. It sounds equivalent, which indicates that the fork is working:
Author Hparams Retrained (same training).zip

However, the quality significantly degrades when continuing to train on your hparams for the full 350 iters. This indicates either that the hparams in the last segment of the training schedule are not ideal, or that the model just needs even more training time to converge under those hparams:
Authors Hparams (longer training and lower r).zip

@TheButlah
Contributor Author

@fatchord Yep, I just took a look at the results from continuing to train over the weekend, and the version where r is reduced to 1 fails to converge even at 467K iters, whereas the version where I keep r=2 continues to improve (at 296K iters so far).

I would suggest updating the learning schedule to change r=1 in the last session to r=2. This works dramatically better in my tests.

I will start training WaveRNN today or tomorrow, so I can confirm the correctness of the vocoder soon too.

@TheButlah
Contributor Author

TheButlah commented Aug 14, 2019

@fatchord I have retrained both WaveRNN and Tacotron and have confirmed that they work. Here are the results when using the Tacotron inference script with WaveRNN as the vocoder; I've included both Griffin-Lim and WaveRNN output for the sake of comparison. Tacotron was trained with r=2 in the last session, for a total of 376K iters. WaveRNN has not yet finished training, at 650K iters so far (not even 2/3 trained!) on ground-truth spectrograms (not using --gta).

These results should prove that this fork is tested and ready to merge in. Let me know if you have any questions or need help with my changes!

@TheButlah changed the title from "Add Multi-GPU training, safely restore checkpoints, pathlib, multiple hparams files" to "Add Multi-GPU training, Griffin-Lim vocoder, safely restore checkpoints, pathlib, multiple hparams files" on Aug 14, 2019
@TheButlah
Contributor Author

Update: finished training WaveRNN. Results are good. I did notice that the quality seems to deteriorate a little at random depending on which checkpoint I'm looking at. I think this is because the learning rate should be decreased later over the course of training. I trained to 1050K steps with the regular 1e-4 learning rate, and will train for another 200K steps with a 1e-5 learning rate. The results at 1.1M steps already sound good, so this was probably a good call.

It would probably make sense to use training sessions like Tacotron uses, but I'm not going to add that for now because it's easy enough to find a checkpoint that sounds good, and the deterioration in quality is pretty minimal.

Again, let me know if you have any questions about the patch, it is ready to be merged in whenever.

@fatchord
Owner

@TheButlah Wow, that's amazing, thanks - can I get the weights to try out?

@fatchord
Owner

Btw - sorry I haven't been more active on this - work has been full-on the last couple of weeks.

@TheButlah
Contributor Author

TheButlah commented Aug 20, 2019

Hi @fatchord! No problem, it's your repo after all, and I've already got everything working on my end, so it's not like you're holding me up :)

Unfortunately, I cannot release my trained models, but you should be able to use the edits I made to the hparams on my fork to get results equivalent to what I posted for Tacotron. If you also apply the trick I mentioned for WaveRNN (training with a lower learning rate after training with the lr in hparams.py), you should be able to reproduce my results.

One way or another though, the fork should work 100% as it stands and is backwards compatible with your pretrained models.

@TheButlah
Contributor Author

@fatchord any updates on the status of the merge?

@fatchord
Owner

fatchord commented Sep 4, 2019

@TheButlah Sorry about the delay - I only got around to looking at it last night. There are a couple of small things I'd like to change - I want to put the new checkpointing functions into utils, as there's duplicate code in the training files. Other than that - just small things and it's good to merge.

I'll create a new branch from your fork and get working on that tomorrow morning. Hopefully we can get this merged by the weekend.

@fatchord
Owner

fatchord commented Sep 5, 2019

Ok I did a little bit there. Have a look and see if that makes sense. There might be a few more things - I'll have another look tomorrow morning.

@TheButlah
Contributor Author

Looks good. I don't have my development environment set up right now for this project, so I'll take your word for it that it works. I did make one change to actually use the real r value in quick_start.py.

@fatchord
Owner

fatchord commented Sep 8, 2019

I think it's good to merge now. @TheButlah Thanks again for your awesome contribution.

@fatchord fatchord merged commit 12922a7 into fatchord:master Sep 8, 2019
datitran pushed a commit to as-ideas/ForwardTacotron that referenced this pull request Mar 9, 2020
Add Multi-GPU training, Griffin-Lim vocoder, safely restore checkpoints, pathlib, multiple hparams files
geneing pushed a commit to geneing/WaveRNN that referenced this pull request Mar 14, 2020
Add Multi-GPU training, Griffin-Lim vocoder, safely restore checkpoints, pathlib, multiple hparams files