
Cpc vox populi (#965)
Summary:
**IMPORTANT: Please do not create a Pull Request without creating an issue first.** Changes *must* be discussed.

**Original Issue**: #957

`closes #[issue 957]`

This is a patched version of Chaitanya Talnikar's implementation of masked_cpc: we needed to include pre-training on the VoxPopuli dataset.

### Test Plan (required)

Fine-tuning with Common Voice Latvian

After downloading Common Voice:
```
export COMMON_VOICE_DIR=[Path to the parent directory containing all Common Voice subsets]
export WAV2LETTERDIR=[Path to wav2letter root directory]
cd prepare_data
bash build_cc_data.sh lv
```

You should get the following output:
```
Building /private/home/mriviere/Common_voices/lv/lv_grapheme.tokens
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5098/5098 [00:12<00:00, 419.89it/s]
5098 files found out of 5098
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5098/5098 [01:20<00:00, 63.61it/s]
59 speakers found
Building /private/home/mriviere/Common_voices/lv/dev.lst
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1125/1125 [00:00<00:00, 1403.27it/s]
1125 files found out of 1125
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1125/1125 [00:17<00:00, 65.65it/s]
3 speakers found
Building /private/home/mriviere/Common_voices/lv/test.lst
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1629/1629 [00:01<00:00, 1563.26it/s]
1629 files found out of 1629
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1629/1629 [00:26<00:00, 62.50it/s]
54 speakers found
Building /private/home/mriviere/Common_voices/lv/train.lst
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2336/2336 [00:01<00:00, 1679.79it/s]
2336 files found out of 2336
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2336/2336 [00:34<00:00, 68.01it/s]
2 speakers found
```
Download and uncompress the checkpoint from https://dl.fbaipublicfiles.com/voxpopuli/wav2letter_100k_small.tar.gz

To fine-tune the model:

```
cd scripts_voxpopuli
bash train_lang.sh PATH_DIR_CHECKPOINT lv
```

Pull Request resolved: #965

Reviewed By: xuqiantong

Differential Revision: D28806871

Pulled By: tlikhomanenko

fbshipit-source-id: 18c91869708fb2ee94a864722ef35fa460a7c2ae
Morgane Rivière authored and facebook-github-bot committed Jun 1, 2021
1 parent 614110f commit 04b0305
Showing 23 changed files with 5,419 additions and 0 deletions.
1 change: 1 addition & 0 deletions recipes/CMakeLists.txt
@@ -3,6 +3,7 @@ cmake_minimum_required(VERSION 3.10)
# Scripts which are common for our recipes
add_subdirectory(${CMAKE_CURRENT_LIST_DIR}/utilities/convlm_serializer)
add_subdirectory(${CMAKE_CURRENT_LIST_DIR}/self_training/pseudo_labeling)
add_subdirectory(${CMAKE_CURRENT_LIST_DIR}/joint_training_vox_populi/cpc)

# TODO: fix
# add_subdirectory(${PROJECT_SOURCE_DIR}/local_prior_match)
147 changes: 147 additions & 0 deletions recipes/joint_training_vox_populi/README.md
@@ -0,0 +1,147 @@
# VoxPopuli: Wav2letter checkpoints

Wav2letter checkpoints from the [VoxPopuli paper](https://arxiv.org/abs/2101.00390), along with code to load them. They correspond to the wav2vec implementation described in https://arxiv.org/abs/2011.00093.

The VoxPopuli dataset can be directly downloaded from the [VoxPopuli repository](https://github.com/facebookresearch/voxpopuli/).

The code included in this folder is a patched version of the original code developed by
[Chaitanya Talnikar](https://scholar.google.com/citations?user=aHLUKlQAAAAJ) from the [wav2letter team](https://github.com/facebookresearch/wav2letter/tree/masked_cpc/recipes/joint_training), extended to include the pre-training.

## Flashlight version

For the stable release, you can always refer to this flashlight commit: https://github.com/flashlight/flashlight/commit/8f7af9ec1188bfd7050c47abfac528d21650890f

## Loading the checkpoint

Wav2letter small wav2vec model: https://dl.fbaipublicfiles.com/voxpopuli/vox_populi_100k_500iters.tar.gz
[Deprecated checkpoint: https://dl.fbaipublicfiles.com/voxpopuli/wav2letter_100k_small.tar.gz]

Our checkpoint uses `fl::ext::Serializer`. The items are saved in the following order:

```
filename,
FL_APP_ASR_VERSION, // std::string
config, // std::unordered_map<std::string, std::string>
network, // fl::Sequential
criterion, // CPCCriterion (Subclass of fl::app::asr::SequenceCriterion) : unsupervised CPC criterion
criterion2, // fl::app::asr::SequenceCriterion : supervised CTC criterion
netoptim, // fl::FirstOrderOptimizer : Optimizer for the unsupervised loss (adam)
netoptim2, // fl::FirstOrderOptimizer : Optimizer for the supervised loss (adam)
critoptim, // fl::FirstOrderOptimizer
critoptim2 // fl::FirstOrderOptimizer
```
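
For illustration, the first items can be reloaded with `fl::ext::Serializer::load`, which deserializes them back in the order they were saved. A minimal sketch, assuming the checkpoint file is named `model.bin` (a placeholder; the header path may also vary across flashlight versions):

```
// Minimal loading sketch: "model.bin" is a placeholder filename.
#include <memory>
#include <string>
#include <unordered_map>

#include "flashlight/ext/common/Serializer.h" // path may vary by version

std::shared_ptr<fl::Sequential> loadPretrained(const std::string& path) {
  std::string version;
  std::unordered_map<std::string, std::string> config;
  std::shared_ptr<fl::Sequential> network;
  // Items are read back in the order listed above, so it is fine to
  // stop after the ones you need.
  fl::ext::Serializer::load(path, version, config, network);
  return network;
}
```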

The network consists of a base feature network topped with a classifier.
To use it for fine-tuning, you need to load the network without its last layer:

```
void LoadFeatures(
    std::shared_ptr<fl::Sequential> net0,
    std::shared_ptr<fl::Sequential> net) {
  // Copy every layer of net0 except the last one (the classifier) into net.
  auto modules_0 = net0->modules();
  int n_layers = static_cast<int>(modules_0.size()) - 1;
  for (int i = 0; i < n_layers; i++) {
    net->add(modules_0[i]);
  }
}
```
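
For instance, a fine-tuning network could then be assembled as follows (a sketch only: `loadPretrained` is the helper sketched above, and the 768 feature dimension and token count are placeholders):

```
// Sketch: assemble a fine-tuning network. The 768 feature dimension and
// nTokens value are placeholders for your actual model and token set.
auto pretrained = loadPretrained("PATH_DIR_CHECKPOINT/model.bin");
auto finetuned = std::make_shared<fl::Sequential>();
LoadFeatures(pretrained, finetuned); // everything but the old classifier
int nTokens = 32; // placeholder: size of your token set
finetuned->add(fl::Linear(768, nTokens)); // new task-specific output layer
```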


## Building the Common Voice manifest files

First, download the datasets you're interested in from the [Common Voice website](https://commonvoice.mozilla.org/en/datasets).
Uncompress the data and copy it into $COMMON_VOICE_DIR/$LANG. You should get the following structure:
```
[COMMON_VOICE_DIR]
└── [LANG]/
    ├── clips/
    │   └── *.mp3 files
    ├── dev.tsv
    ├── invalidated.tsv
    ├── other.tsv
    ├── test.tsv
    ├── train.tsv
    ├── validated.tsv
    └── reported.tsv (as of Corpus 5.0)
```

Then run the following command:
```
export COMMON_VOICE_DIR=[path to the Common Voice directory described above]
cd prepare_data
bash build_cc_data.sh $LANG
```
The script will produce the manifest files associated with the validated, train, dev, and test sets, as well as the lexicon and token files.
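
For illustration, a wav2letter lexicon maps each word to its token spelling, one word per line; with grapheme tokens it looks roughly like this (a hypothetical Latvian excerpt, not the script's exact output):
```
labdien l a b d i e n
paldies p a l d i e s
```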

## Fine-tuning the model

A training script is available in the scripts folder. To use it, run:
```
export COMMON_VOICE_DIR=[path to the Common Voice directory described above]
export WAV2LETTERDIR=[path to the wav2letter root directory]
bash train_lang.sh $DIR_CHECKPOINT $LANG
```

Here, $DIR_CHECKPOINT is the directory where you uncompressed the checkpoint and $LANG is the language you want to train your model on.

## Decoder

A decoding script is also available; it runs the decoding on the dev subset.
```
export COMMON_VOICE_DIR=[path to the Common Voice directory described above]
export WAV2LETTERDIR=[path to the wav2letter root directory]
bash decode_lang.sh $DIR_CHECKPOINT $LANG
```

## Results

Performance on Common Voice without a language model (old checkpoint version, used to check for non-regression):

| Language | Fine-tuning size | Dev (CER / WER) | Test (CER / WER) |
| -------- |:----------------:|:---------------:|:----------------:|
| De       | 314h             | 3.83 / 15.0     | 4.70 / 17.0      |
| Es       | 203h             | 3.49 / 10.7     | 4.04 / 11.9      |
| Fr       | 364h             | 4.9 / 16.9      | 5.89 / 18.8      |

Performance on Common Voice using a language model built from Common Voice data (excluding dev / test):

| Language | Fine-tuning size | Dev (CER / WER) | Test (CER / WER) |
| -------- |:----------------:|:---------------:|:----------------:|
| De       | 314h             | 2.36 / 6.76     | 2.98 / 7.82      |
| Es       | 203h             | 3.11 / 8.93     | 3.60 / 10.0      |
| Fr       | 364h             | 2.73 / 8.31     | 3.57 / 9.56      |

## Pretrain a model

To pretrain a model yourself, you can run ```sh_voxpopuli/pretrain.sh```.
First, prepare an .lst file listing all of the audio sequences you intend to use for pre-training; let's call it ```unlabelled.lst```.
In this file, each sequence can be given an arbitrary transcription: it doesn't matter for unsupervised training.
For example, ```unlabelled.lst``` could look like this:
```
ID0 PATH_SEQUENCE0 SIZE_0_MS i love potatoes
ID1 PATH_SEQUENCE1 SIZE_1_MS i love potatoes
ID2 PATH_SEQUENCE2 SIZE_2_MS i love potatoes
ID3 PATH_SEQUENCE3 SIZE_3_MS i love potatoes
ID4 PATH_SEQUENCE4 SIZE_4_MS i love potatoes
ID5 PATH_SEQUENCE5 SIZE_5_MS i love potatoes
```
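
As a rough illustration, such a manifest can be generated with a small helper like the sketch below. Everything here is an assumption: the `.mp3` extension, the command-line layout, and especially the fixed 1000 ms duration, which should be replaced with each file's real length:

```
// Sketch: build an unlabelled .lst manifest from a directory of audio files.
// The ".mp3" extension and the 1000 ms duration are placeholders.
#include <filesystem>
#include <fstream>
#include <iostream>

int main(int argc, char** argv) {
  if (argc != 3) {
    std::cerr << "usage: build_lst AUDIO_DIR OUT_LST\n";
    return 1;
  }
  namespace fs = std::filesystem;
  std::ofstream out(argv[2]);
  int id = 0;
  for (const auto& entry : fs::recursive_directory_iterator(argv[1])) {
    if (entry.path().extension() != ".mp3") {
      continue;
    }
    // Columns: ID, path, size in ms, (dummy) transcription.
    // Replace the fixed 1000 ms with the real duration of each file.
    out << "ID" << id++ << ' ' << entry.path().string() << ' ' << 1000
        << " i love potatoes\n";
  }
  return 0;
}
```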

You will also need to provide the script with a validation set, a token file, and a valid lexicon.
If you are running the pre-training fully unsupervised (the default option), the kind of tokens and the lexicon don't matter; you just need to provide valid files for wav2letter.

You can also add some supervision to the pretraining procedure, as shown in https://arxiv.org/abs/2011.00093.
In this case you will need to build a different .lst file with labelled data and make sure that your lexicon and token files are appropriate.

See ```sh_voxpopuli/pretrain.sh``` for more details.

## Citation

```
@misc{wang2021voxpopuli,
title={VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation},
author={Changhan Wang and Morgane Rivière and Ann Lee and Anne Wu and Chaitanya Talnikar and Daniel Haziza and Mary Williamson and Juan Pino and Emmanuel Dupoux},
year={2021},
eprint={2101.00390},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
47 changes: 47 additions & 0 deletions recipes/joint_training_vox_populi/cpc/CMakeLists.txt
@@ -0,0 +1,47 @@
cmake_minimum_required(VERSION 3.10)

#project(wav2letter++-experimental-cpc)


#find_package(flashlight REQUIRED)
set(common
CPCCriterion.cpp
CPCSpecAugment.cpp
SequentialBuilder.cpp
TransformerCPC.cpp)

# ------------------------- Dedicated SSL Train executable --------------------------
add_executable(
Train_cpc
${common}
Train.cpp
)

target_link_libraries(
Train_cpc
flashlight::flashlight-app-asr
flashlight::flashlight-app-common
)

add_executable(
Decode_cpc
${common}
Decode.cpp
)

target_link_libraries(
Decode_cpc
flashlight::flashlight-app-asr
flashlight::flashlight-app-common
)

#add_executable(
# Test_cpc
# ${common}
# Test.cpp
# )

#target_link_libraries(
# Test_cpc
# flashlight::flashlight-app-asr
#)