Add Multi-GPU training, Griffin-Lim vocoder, safely restore checkpoints, pathlib, multiple hparams files #126
Conversation
+ Added `voc_clip_grad_norm` hparam
+ Enabled gradient clipping in vocoder
+ Added warning when gradient is nan in vocoder and synthesizer
+ Added workaround for broken nn.DataParallel in `utils.__init__.py`
* Refactored `tacotron.py` and `fatchord_version.py` to store step in a PyTorch buffer instead of a gradient-less parameter
* Refactored training code for WaveRNN and Tacotron to automagically use data parallelism in the presence of multiple GPUs
! Note that your batch size must be divisible by the number of GPUs

Made models' `num_params()` actually useful if `print_out==False`

Fixed a bunch of my own bugs; now saving optimizer state

Fixed bug due to `model.load()` placing tensors on CPU
* Made `model.load()` always map devices to the device of the model's parameters instead of always to CPU

Fixed bug with improper comparison `device == torch.device('cuda')`
* Fixed a bug where checking whether a tensor was on CUDA was done by comparing against the cuda device; this is not the correct way of doing it, and I'm not sure how it even worked before
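To make the buffer-vs-parameter and device-mapping points above concrete, here is a minimal, self-contained sketch (not the actual WaveRNN/Tacotron code): a step counter stored in a registered buffer, a `load()` that maps checkpoints onto the model's own device, and the correct way to test whether a tensor is on CUDA.

```python
import torch
import torch.nn as nn


class StepCounterModel(nn.Module):
    """Toy model illustrating the changes above; class and method names are illustrative."""

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 10)
        # A buffer moves with .to(device) and is saved in the state_dict,
        # but is never touched by the optimizer (unlike a Parameter).
        self.register_buffer('step', torch.zeros(1, dtype=torch.long))

    def get_step(self):
        return self.step.item()

    def load(self, path):
        # Map the checkpoint onto whichever device the model already lives on,
        # instead of unconditionally mapping to CPU.
        device = next(self.parameters()).device
        self.load_state_dict(torch.load(path, map_location=device))


model = StepCounterModel()
# Correct way to check where a tensor lives: inspect .is_cuda (or .device.type)
# rather than comparing against torch.device('cuda'), which compares unequal
# to concrete devices such as 'cuda:0'.
print(next(model.parameters()).is_cuda)
```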
…ing.py

Move decode_mu_law() above xfade_and_unfold()

Fix gta bug in gen_wavernn.py

Fix target/overlap in quick_start.py - now definable through argparse

Typo in _flatten_parameters

Add newlines to optimiser loading display
+ Added type annotation for WaveRNN and Tacotron in the train and generate files for each model
* Fixed missing import for numpy in `fatchord_version.py` and `deepmind_version.py`

Fixed bugs in multi-GPU Tacotron where tensors were on the wrong device
- Removed get_r and set_r
* Moved ownership of the self.r buffer from Tacotron to Decoder
* Now using the property decorator in Tacotron for self.r
* Made all buffers reside on the same device as the parameters, to fix an issue where nn.parallel.gather expected a tensor to be on a particular GPU but it was on a different device (CPU) instead
* Updated gen_tacotron.py and train_tacotron.py to use the self.r property instead of getter and setter methods
* Made all returned values from forward() tensors to ensure that nn.parallel.gather works properly

Made preprocess.py use a configurable number of workers

Made self.r in Tacotron default to 1
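A minimal sketch of the `self.r` ownership change described above, assuming simplified class shapes rather than the real model definitions: the reduction factor lives as a buffer on the Decoder, and Tacotron exposes it through a property instead of `get_r`/`set_r`.

```python
import torch
import torch.nn as nn


class Decoder(nn.Module):
    """Illustrative stand-in for the Tacotron decoder."""

    def __init__(self, r=1):
        super().__init__()
        # The reduction factor is a buffer: saved with the state_dict and moved
        # by .to(device) along with the parameters, which keeps nn.parallel.gather happy.
        self.register_buffer('r', torch.tensor(r, dtype=torch.long))


class Tacotron(nn.Module):
    def __init__(self):
        super().__init__()
        self.decoder = Decoder()

    @property
    def r(self):
        return self.decoder.r.item()

    @r.setter
    def r(self, value):
        # Keep dtype and device of the existing buffer.
        self.decoder.r = self.decoder.r.new_tensor(value)


model = Tacotron()
model.r = 2          # replaces the old get_r()/set_r() pair
print(model.r)       # -> 2
```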
* Now using pathlib in `paths.py`, so all properties are `Path` objects instead of strings
* Changed `Paths.data` to be independent of `hparams.voc_id`, as it doesn't make sense for the vocoder id to matter for Tacotron, and hence the dataset should be independent of it

Refactored preprocessing code to use pathlib
* Refactored `utils/files.py` to use `Path.rglob()`
* Refactored `preprocess.py` to use Path objects
* Refactored most files to use a name other than `id` for filename ids, as it shadows the builtin `id()` function and caused hard-to-debug errors
* Fixed issue where `Paths.base` was the utils folder
* Refactored `utils/dataset.py` to use Path objects
* Added type annotation to `utils/recipes/files.py`
* Renamed `get_tts_dataset()` to `get_tts_datasets()` to match `get_voc_datasets()`
* Refactored `models/*` to use pathlib
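For illustration, a small pathlib-flavoured sketch of the ideas above; the class and attribute names are simplified assumptions, not the repo's exact `Paths` API.

```python
from pathlib import Path


def get_files(path, extension='.wav'):
    """List files recursively; rglob yields Path objects instead of strings."""
    return list(Path(path).expanduser().resolve().rglob(f'*{extension}'))


class Paths:
    """Minimal sketch of a pathlib-based paths container."""

    def __init__(self, data_path, tts_id):
        # Base is the project directory, not the utils folder.
        self.base = Path(__file__).parent.expanduser().resolve()
        self.data = Path(data_path).expanduser().resolve()
        self.tts_checkpoints = self.base / 'checkpoints' / f'{tts_id}.tts'

    def create(self):
        for p in (self.data, self.tts_checkpoints):
            p.mkdir(parents=True, exist_ok=True)
```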
+ Added `time` import in `train_tacotron.py`
* Modified `utils/display.py` to only use X11 when necessary. This will allow training the model on servers where X11 is either not installed or not working (e.g. due to improperly forwarded X11)
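The X11 change can be summarised with a sketch like the following, which is an assumption about the approach rather than a copy of `utils/display.py`: select a non-interactive matplotlib backend when no display is available, so plots can still be written to disk on headless servers.

```python
import os

import matplotlib

# If there is no display (e.g. a headless server with no working X11 forwarding),
# fall back to the non-interactive Agg backend before pyplot is imported.
if os.environ.get('DISPLAY', '') == '':
    matplotlib.use('Agg')

import matplotlib.pyplot as plt


def save_attention(attn, path):
    # Plotting still works headlessly because figures are saved to disk.
    plt.figure(figsize=(12, 6))
    plt.imshow(attn, interpolation='nearest', aspect='auto')
    plt.savefig(f'{path}.png', bbox_inches='tight')
    plt.close()
```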
- Removed checkpoint and restore from `models/tacotron.py`
+ Added `save_checkpoint()` and `restore_checkpoint()` in `train_tacotron.py`
! Not yet finished with the refactoring! Code is breaking.
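A hedged sketch of what `save_checkpoint()`/`restore_checkpoint()` might look like; file names and signatures here are illustrative, not the exact functions added to `train_tacotron.py`. The key point is that model weights and optimizer state are saved and restored together, on the model's own device.

```python
import torch


def save_checkpoint(model, optimizer, checkpoint_dir, name='latest'):
    # checkpoint_dir is assumed to be a pathlib.Path.
    torch.save(model.state_dict(), checkpoint_dir / f'{name}_weights.pyt')
    torch.save(optimizer.state_dict(), checkpoint_dir / f'{name}_optim.pyt')


def restore_checkpoint(model, optimizer, checkpoint_dir, name='latest'):
    # Map everything onto the device the model already lives on.
    device = next(model.parameters()).device
    model.load_state_dict(
        torch.load(checkpoint_dir / f'{name}_weights.pyt', map_location=device))
    optimizer.load_state_dict(
        torch.load(checkpoint_dir / f'{name}_optim.pyt', map_location=device))
```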
One caveat that I should note: for some reason, I don't always see speedups when running on multiple GPUs. I think this might just be because my GPUs are fast and the batch size is relatively small, so the overhead of the data-parallel code hides any benefit one might see. This isn't really an issue; just be sure to compare single-GPU vs. multi-GPU training on your system to see which trains faster for you. Obviously, if your batch doesn't fit into VRAM, you will need to use multiple GPUs. Controlling which GPUs are used is easy: you can run on all available GPUs, restrict training to GPUs 0 and 1, or, as per #107, run on CPU only (see the sketch below). If you don't have GPUs in the first place, no special steps are required, as the default mode is to use all GPUs present; if no GPUs are present, it doesn't try to use CUDA.
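A hedged sketch of GPU selection, assuming the standard `CUDA_VISIBLE_DEVICES` mechanism; the specific commands and values here are illustrative, not quotes from this PR.

```python
import os

# Restrict PyTorch to GPUs 0 and 1; must be set before CUDA is initialised.
# Shell equivalents (illustrative):
#   CUDA_VISIBLE_DEVICES=0,1 python train_tacotron.py   # GPUs 0 and 1 only
#   CUDA_VISIBLE_DEVICES=""  python train_tacotron.py   # force CPU (as per #107)
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device, torch.cuda.device_count() if device.type == 'cuda' else 0)
```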
This looks awesome @TheButlah - I'm still training models. It'll just take a bit of time.
@fatchord Glad to help!
I just completed a training session of WaveRNN (without Taco) and even after all 1 million steps, I still get bad results. It seems like every epoch it alternates between good results and bad results. Here are some samples. Note that I have modified some hparams, which may be the cause of all of this. @fatchord how many iterations did it take you to get good results? I just did the 1 million iters you specified in hparams.py, with your batch size. Let me know if you think this is a bug in the patch, or in my hyperparameters.
Across the board, whether it's your parameters or mine, I see poor results. I would greatly appreciate any insights you have on why the results seem to be of low quality - I'm out of ideas. I've looked through the codebase but I'm not really seeing anything.
Archive.zip
Update: As far as the convergence issues are concerned, I've come to the conclusion that the code isn't bugged; I think this is a hyperparameter issue. I have retrained using your hparams with your pretrained model, and they are about equivalent at 180K steps.
I don't know which of those two is true. I'm curious to see what your results are when you retrain. Let me know :) Note that all of this investigation has focused on Tacotron only; I have not vetted WaveRNN yet, but I don't see why it wouldn't apply to it too.
@TheButlah On the WaveRNN side - I suspect that the gradient clipping is not ideal and not really needed either, since I've never seen the gradient explode with that model. On the Tacotron side - what's the best-sounding Griffin-Lim sample you have?
@fatchord It sounds about equivalent to yours for Tacotron (again, I haven't vetted WaveRNN, but it shares a lot of its code with Tacotron so it's probably working too), which means the fork should be working. Here are some Griffin-Lim samples generated with your original Tacotron pretrained model (no retraining). And here are the same hparams trained to the same number of iters, but retrained on my fork. It sounds equivalent, which indicates that the fork is working. However, the quality significantly degrades when continuing to train on your hparams for the full 350 iters. This indicates that either the hparams in the last segment of the training schedule are not ideal, or the model just needs even more time training to converge under those hparams.
@fatchord Yep, I just took a look at the results from continuing to train over the weekend, and the version where r is reduced to 1 fails to converge even at 467K iters, whereas the version where I keep r=2 continues to improve (at 296K iters so far). I would suggest updating the learning schedule to use r=2 in the last session instead of r=1; this works dramatically better in my tests (see the sketch below for what that change might look like). I will start training WaveRNN today or tomorrow, so I can confirm the correctness of the vocoder soon too.
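For clarity, the suggested change might look roughly like this in `hparams.py`; the tuple layout (r, learning rate, end step, batch size) and the numbers are illustrative rather than the exact values in the repo.

```python
# Progressive Tacotron training schedule: (r, learning rate, end step, batch size).
# Keep r=2 in the final session instead of dropping to r=1.
tts_schedule = [(7, 1e-3,  10_000, 32),
                (5, 1e-4, 100_000, 32),
                (2, 1e-4, 180_000, 16),
                (2, 1e-4, 350_000, 8)]   # was (1, 1e-4, 350_000, 8)
```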
@fatchord I have retrained both WaveRNN and Tacotron and have confirmed that they work. Here are the results when using the Tacotron inference script with WaveRNN as the vocoder; WaveRNN has not yet finished training. I've included both Griffin-Lim and WaveRNN for the sake of comparison. Tacotron was trained with r=2 in the last session, for a total of 376K iters. WaveRNN has not yet finished training, at 650K iters (not even 2/3 trained!) on ground-truth spectrograms (not using --gta). These results should prove that this fork is tested and ready to merge in. Let me know if you have any questions or need help with my changes!
Update: finished training WaveRNN. Results are good. I did notice that the quality seems to randomly deteriorate a little depending on which checkpoint I'm looking at. I think this is because the learning rate should be decreased later over the course of training. I trained to 1050K steps with the regular 1e-4 learning rate, and will train for another 200k steps with a 1e-5 learning rate. The results at 1.1M steps already sound good, so this was probably a good call. It would probably make sense to use training sessions like Tacotron does, but I'm not going to add that for now because it's easy enough to find a checkpoint that sounds good, and the deterioration in quality is pretty minimal. Again, let me know if you have any questions about the patch; it is ready to be merged in whenever.
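A minimal sketch of the two-stage learning rate described above; the dummy model stands in for WaveRNN, and the names are illustrative rather than the repo's training API.

```python
import torch
import torch.nn as nn

# Main run uses the regular 1e-4 rate.
model = nn.Linear(10, 10)          # placeholder for the restored WaveRNN model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# ...after restoring the ~1.05M-step checkpoint (model + optimizer state),
# drop the learning rate for the final ~200k steps:
for group in optimizer.param_groups:
    group['lr'] = 1e-5
```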
@TheButlah Wow, that's amazing, thanks - can I get the weights to try out?
Btw - sorry I haven't been more active on this - work has been full-on the last couple of weeks.
Hi @fatchord! No problem, it's your repo after all, and I've already got everything working on my end, so it's not like you're holding me up :) I cannot release my trained models, unfortunately, but you should be able to just use the edits I made to the hparams on my fork to get results equivalent to what I posted for Tacotron, and if you apply the trick I mentioned for WaveRNN (the lower learning rate after training with the lr in hparams.py), you should be able to reproduce my results. One way or another, the fork should work 100% as it stands and is backwards compatible with your pretrained models.
@fatchord any updates on the status of the merge?
@TheButlah Sorry about the delay - I only got around to looking at it last night. There are a couple of small things I'd like to change - I want to put the new checkpointing functions into utils, as there's duplicate code in the training files. Other than that, just small things and it's good to merge. I'll create a new branch from your fork and get working on that tomorrow morning. Hopefully we can get this merged by the weekend.
Ok, I did a little bit there. Have a look and see if that makes sense. There might be a few more things - I'll have another look tomorrow morning.
Looks good. I don't have my development environment set up right now for this project, so I'll take your word for it that it works. I did make one change to actually use the real r value in quick_start.py.
I think it's good to merge now. @TheButlah Thanks again for your awesome contribution.
Add Multi-GPU training, Griffin-Lim vocoder, safely restore checkpoints, pathlib, multiple hparams files
Follows up on PR #109.
Added the following features:
- The `hparams` module can be specified via an argument, to facilitate testing with different sets of hyperparameters (see the sketch after this list)

Refactored the following:

Fixed the following bugs:

Please vet this code! It's not yet thoroughly tested.

Another caveat to note: the notebooks are now very, very outdated. I haven't touched them, and they appear to use their own code, so none of this should affect them.
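As a rough illustration of the hparams-file feature listed above, loading a hyperparameter module chosen on the command line could look like this; the `--hp_file` flag name and the loading mechanism are assumptions, not necessarily the exact interface this PR adds.

```python
import argparse
import importlib.util

parser = argparse.ArgumentParser()
parser.add_argument('--hp_file', default='hparams.py',
                    help='Path to the hyperparameters file to use')
args = parser.parse_args()

# Load the chosen file as a module named "hparams".
spec = importlib.util.spec_from_file_location('hparams', args.hp_file)
hp = importlib.util.module_from_spec(spec)
spec.loader.exec_module(hp)

# Any attribute defined in the chosen hparams file is now available on hp.
print(getattr(hp, 'sample_rate', None))
```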