PyTorch + Catalyst implementation of Looking to Listen at a Cocktail Party.
This repository handles the training process. For inference, check out the GUI wrapper: SpeechSeparationUI in PyQt.
This repository has been merged with asteroid as a recipe.
-
Computation
We ran this program on two GPUs: a 1050 Mobile and a Tesla V100. We did not run any formal benchmarks, but the V100 was roughly 400x faster. Training time also depends on how much data you download. Any server-grade GPU should be sufficient.
-
Storage
This program generates a lot of files (downloaded and otherwise). Each audio file is 96 KiB in size. For 7k unique audio clips at a 70/30 train/validation split, it occupied ~120 GiB of storage space. Hence, plan for a minimum of 1 TB if you download more audio clips.
-
Memory
A minimum of 4 GB of VRAM is required, which can handle a batch size of 2. At a batch size of 20 on two GPUs, it occupied 16 GiB of VRAM on each GPU.
./setup.sh && ./install.sh
-
Set up the directory structure
./setup.sh
-
Install dependencies
pip install -r requirements.txt
Additional dependencies:
i. ffmpeg ii. libav-tools iii. youtube-dl iv. sox
-
Install
./install.sh
During inference
from src import generate_audio, load_model
Run all these files as scripts.
cd src/loader
NOTE: Make sure the AVSpeech dataset is in the data/audio_visual/ folder. Downloading requires a Google account.
python3 download.py
Videos can be longer than 3 seconds. Hence, extract multiple audio clips from a single video file.
python3 extract_audio.py
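As a rough illustration of the splitting step, the sketch below computes the start and end offsets of consecutive 3-second segments for a video of a given duration. The function name `segment_offsets` and the decision to drop a trailing partial segment are assumptions made for illustration; they are not taken from extract_audio.py.

```python
def segment_offsets(duration_s, clip_len_s=3.0):
    """Return (start, end) pairs in seconds, covering full-length clips only."""
    if clip_len_s <= 0:
        raise ValueError("clip_len_s must be positive")
    # Assumption: a trailing segment shorter than clip_len_s is dropped.
    n_clips = int(duration_s // clip_len_s)
    return [(i * clip_len_s, (i + 1) * clip_len_s) for i in range(n_clips)]


if __name__ == "__main__":
    # A 10-second video yields three full 3-second segments.
    print(segment_offsets(10.0))
```

Each pair could then be passed to an audio tool such as ffmpeg or sox as a start offset and duration.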
Synthetically mix clean audio. This can take a lot of disk space: approximately 96 KiB per file. The total number of files can be as large as C(total_files, input_audio_size), i.e. the number of ways to choose input_audio_size clips out of total_files, for each of train and val.
python3 audio_mixer_generator.py
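To get a feel for how quickly the mixed files can grow, the sketch below estimates an upper bound on the number of mixtures and their total size. It assumes mixtures are unordered combinations of distinct clips, that every mixed file is about 96 KiB, and that `input_audio_size` is 2 by default; `mixture_storage` is a hypothetical helper, not part of this repository, and the actual script may generate fewer files than this bound.

```python
import math


def mixture_storage(total_files, input_audio_size=2, file_size_kib=96):
    """Return (number of mixtures, total size in GiB) as an upper bound."""
    # Number of ways to choose input_audio_size clips out of total_files.
    n_mixtures = math.comb(total_files, input_audio_size)
    # 1 GiB = 1024 * 1024 KiB.
    size_gib = n_mixtures * file_size_kib / (1024 ** 2)
    return n_mixtures, size_gib


if __name__ == "__main__":
    n, gib = mixture_storage(1000)
    print(f"{n} mixtures, about {gib:.1f} GiB")
```

Because the count grows quadratically for two-speaker mixtures, it is worth running an estimate like this for the train and val splits separately before starting the mixer.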
Generating synthetically mixed audio at a high rate (100+ files per second) produces a lot of empty audio files. Hence, we need to remove the empty audio files.
python3 remove_empty_audio.py
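The idea behind this cleanup can be sketched as follows. This is not the code in remove_empty_audio.py: it assumes that "empty" means a zero-byte file and that the mixed audio uses a `.wav` extension, and `find_empty_files` is a hypothetical helper.

```python
from pathlib import Path


def find_empty_files(directory, pattern="*.wav"):
    """Return a sorted list of zero-byte files under directory matching pattern."""
    # Assumption: an "empty" audio file is one with a size of 0 bytes.
    return sorted(
        p for p in Path(directory).rglob(pattern)
        if p.is_file() and p.stat().st_size == 0
    )


def remove_empty_files(directory, pattern="*.wav"):
    """Delete the zero-byte files and return how many were removed."""
    empty = find_empty_files(directory, pattern)
    for path in empty:
        path.unlink()
    return len(empty)
```

Listing the files with `find_empty_files` before calling `remove_empty_files` gives a chance to check what would be deleted.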
Paths differ between src and src/loader. Both directories contain files that need to manipulate the data/ directory. Hence, create a copy with the correct paths in src/loader/.
python3 transform_df.py
Create video embeddings from all the video files. This also records which videos are corrupted. Corrupted videos include those in which no face was detected.
python3 generate_video_embedding.py
Hence, remove corrupted video frames as well.
python3 remove_corrupt.py
Cache all the spectrograms. This takes a lot of storage: tens to hundreds of GB.
python3 convert_to_spec.py
python3 train.py --bs 20 --workers 4 --cuda True
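For readers unfamiliar with this style of command line, the sketch below shows one way flags like `--bs`, `--workers` and `--cuda True` could be parsed with argparse. It is a hypothetical illustration, not the parser in train.py: the defaults and the `str2bool` helper are assumptions. The helper is there because `type=bool` in argparse would treat the string "False" as true.

```python
import argparse


def str2bool(value):
    """Convert a command-line string such as 'True' or 'false' to a bool."""
    lowered = value.lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    # Hypothetical parser; the default values are assumptions.
    parser = argparse.ArgumentParser(description="Hypothetical training CLI")
    parser.add_argument("--bs", type=int, default=2, help="batch size")
    parser.add_argument("--workers", type=int, default=0, help="data loader workers")
    parser.add_argument("--cuda", type=str2bool, default=False, help="train on the GPU")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--bs", "20", "--workers", "4", "--cuda", "True"])
    print(args.bs, args.workers, args.cuda)
```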
Unfortunately, we could not train on a bigger dataset.
- Looking to Listen at a Cocktail Party: https://arxiv.org/abs/1804.03619
- Discriminative Loss: https://arxiv.org/abs/1502.04149
- PyTorch: https://pytorch.org
- Catalyst: https://github.com/catalyst-team/catalyst
- mir_eval: https://github.com/craffel/mir_eval
- pysndfx: https://github.com/carlthome/python-audio-effects/tree/master/pysndfx