
Cpc vox populi (#965)
Summary:
**IMPORTANT: Please do not create a Pull Request without creating an issue first.** Changes *must* be discussed.

**Original Issue**: #957

`closes #[issue 957]`

This is a patched version of Chaitanya Talnikar's implementation of masked_cpc: we needed to include pre-training on the VoxPopuli dataset.

### Test Plan (required)

Fine-tuning with Common Voice Latvian

After downloading Common Voice:
```
export COMMON_VOICE_DIR=[Path to the parent directory containing all Common Voice subsets]
export WAV2LETTERDIR=[Path to wav2letter root directory]
cd prepare_data
bash build_cc_data.sh lv
```

You should get the following output:
```
Building /private/home/mriviere/Common_voices/lv/lv_grapheme.tokens
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5098/5098 [00:12<00:00, 419.89it/s]
5098 files found out of 5098
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5098/5098 [01:20<00:00, 63.61it/s]
59 speakers found
Building /private/home/mriviere/Common_voices/lv/dev.lst
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1125/1125 [00:00<00:00, 1403.27it/s]
1125 files found out of 1125
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1125/1125 [00:17<00:00, 65.65it/s]
3 speakers found
Building /private/home/mriviere/Common_voices/lv/test.lst
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1629/1629 [00:01<00:00, 1563.26it/s]
1629 files found out of 1629
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1629/1629 [00:26<00:00, 62.50it/s]
54 speakers found
Building /private/home/mriviere/Common_voices/lv/train.lst
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2336/2336 [00:01<00:00, 1679.79it/s]
2336 files found out of 2336
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2336/2336 [00:34<00:00, 68.01it/s]
2 speakers found
```
Download and uncompress the checkpoint from https://dl.fbaipublicfiles.com/voxpopuli/wav2letter_100k_small.tar.gz

To fine-tune the model:

```
cd scripts_voxpopuli
bash train_lang.sh PATH_DIR_CHECKPOINT lv
```

Pull Request resolved: #965

Reviewed By: xuqiantong

Differential Revision: D28806871

Pulled By: tlikhomanenko

fbshipit-source-id: 18c91869708fb2ee94a864722ef35fa460a7c2ae
Morgane Rivière authored and facebook-github-bot committed Jun 1, 2021
1 parent 614110f commit 04b0305
Showing 23 changed files with 5,419 additions and 0 deletions.
1 change: 1 addition & 0 deletions recipes/CMakeLists.txt
@@ -3,6 +3,7 @@ cmake_minimum_required(VERSION 3.10)
# Scripts which are common for our recipes
add_subdirectory(${CMAKE_CURRENT_LIST_DIR}/utilities/convlm_serializer)
add_subdirectory(${CMAKE_CURRENT_LIST_DIR}/self_training/pseudo_labeling)
add_subdirectory(${CMAKE_CURRENT_LIST_DIR}/joint_training_vox_populi/cpc)

# TODO: fix
# add_subdirectory(${PROJECT_SOURCE_DIR}/local_prior_match)
147 changes: 147 additions & 0 deletions recipes/joint_training_vox_populi/README.md
@@ -0,0 +1,147 @@
# VoxPopuli: Wav2letter checkpoints

Wav2letter checkpoints from the [VoxPopuli paper](https://arxiv.org/abs/2101.00390), along with code to load them. They correspond to the wav2vec implementation described in https://arxiv.org/abs/2011.00093.

The VoxPopuli dataset can be directly downloaded from the [VoxPopuli repository](https://github.com/facebookresearch/voxpopuli/).

The code included in this folder is a patched version of the original code developed by
[Chaitanya Talnikar](https://scholar.google.com/citations?user=aHLUKlQAAAAJ) from the [wav2letter team](https://github.com/facebookresearch/wav2letter/tree/masked_cpc/recipes/joint_training), extended to include the pre-training.

## Flashlight version

For the stable release, you can always refer to this flashlight commit: https://github.com/flashlight/flashlight/commit/8f7af9ec1188bfd7050c47abfac528d21650890f

## Loading the checkpoint

Wav2letter small wav2vec model: https://dl.fbaipublicfiles.com/voxpopuli/vox_populi_100k_500iters.tar.gz
[Deprecated checkpoint: https://dl.fbaipublicfiles.com/voxpopuli/wav2letter_100k_small.tar.gz]

Our checkpoint uses `fl::ext::Serializer`. The items are saved in the following order:

```
filename,
FL_APP_ASR_VERSION, // std::string
config, // std::unordered_map<std::string, std::string>
network, // fl::Sequential
criterion, // CPCCriterion (Subclass of fl::app::asr::SequenceCriterion) : unsupervised CPC criterion
criterion2, // fl::app::asr::SequenceCriterion : supervised CTC criterion
netoptim, // fl::FirstOrderOptimizer : Optimizer for the unsupervised loss (adam)
netoptim2, // fl::FirstOrderOptimizer : Optimizer for the supervised loss (adam)
critoptim, // fl::FirstOrderOptimizer
critoptim2 // fl::FirstOrderOptimizer
```
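
For illustration, the first items can be reloaded with `fl::ext::Serializer::load`, which deserializes them back in the order they were saved. A minimal sketch, assuming the checkpoint file is named `model.bin` (a placeholder; the header path may also vary across flashlight versions):

```
// Minimal loading sketch: "model.bin" is a placeholder filename.
#include <memory>
#include <string>
#include <unordered_map>

#include "flashlight/ext/common/Serializer.h" // path may vary by version

std::shared_ptr<fl::Sequential> loadPretrained(const std::string& path) {
  std::string version;
  std::unordered_map<std::string, std::string> config;
  std::shared_ptr<fl::Sequential> network;
  // Items are read back in the order listed above, so it is fine to
  // stop after the ones you need.
  fl::ext::Serializer::load(path, version, config, network);
  return network;
}
```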

The network consists of a base feature network topped with a classifier.
To use it for fine-tuning, you need to load the network without its last layer:

```
void LoadFeatures(
    std::shared_ptr<fl::Sequential> net0,
    std::shared_ptr<fl::Sequential> net) {
  // Copy every layer of net0 except the last one (the classifier) into net.
  auto modules_0 = net0->modules();
  int n_layers = static_cast<int>(modules_0.size()) - 1;
  for (int i = 0; i < n_layers; i++) {
    net->add(modules_0[i]);
  }
}
```
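
For instance, a fine-tuning network could then be assembled as follows (a sketch only: `loadPretrained` is the helper sketched above, and the 768 feature dimension and token count are placeholders):

```
// Sketch: assemble a fine-tuning network. The 768 feature dimension and
// nTokens value are placeholders for your actual model and token set.
auto pretrained = loadPretrained("PATH_DIR_CHECKPOINT/model.bin");
auto finetuned = std::make_shared<fl::Sequential>();
LoadFeatures(pretrained, finetuned); // everything but the old classifier
int nTokens = 32; // placeholder: size of your token set
finetuned->add(fl::Linear(768, nTokens)); // new task-specific output layer
```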


## Building the Common Voice manifest files

First, download the datasets you're interested in from the [Common Voice website](https://commonvoice.mozilla.org/en/datasets).
Uncompress the data and copy it into $COMMON_VOICE_DIR/$LANG. You should get the following structure:
```
[COMMON_VOICE_DIR]
└── [LANG]/
    ├── clips/
    │   └── *.mp3 files
    ├── dev.tsv
    ├── invalidated.tsv
    ├── other.tsv
    ├── test.tsv
    ├── train.tsv
    ├── validated.tsv
    └── reported.tsv (as of Corpus 5.0)
```

Then run the following command:
```
export COMMON_VOICE_DIR=[path to the Common Voice directory described above]
cd prepare_data
bash build_cc_data.sh $LANG
```
The script will produce the manifest files associated with the validated, train, dev, and test sets, as well as the lexicon and token files.
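
For illustration, a wav2letter lexicon maps each word to its token spelling, one word per line; with grapheme tokens it looks roughly like this (a hypothetical Latvian excerpt, not the script's exact output):
```
labdien l a b d i e n
paldies p a l d i e s
```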

## Fine-tuning the model

A training script is available in the scripts folder. To use it, run:
```
export COMMON_VOICE_DIR=[path to the Common Voice directory described above]
export WAV2LETTERDIR=[path to the wav2letter root directory]
bash train_lang.sh $DIR_CHECKPOINT $LANG
```

Here, $DIR_CHECKPOINT is the directory where you uncompressed the checkpoint and $LANG is the language you want to train your model on.

## Decoder

A decoding script is also available; it runs the decoding on the dev subset.
```
export COMMON_VOICE_DIR=[path to the Common Voice directory described above]
export WAV2LETTERDIR=[path to the wav2letter root directory]
bash decode_lang.sh $DIR_CHECKPOINT $LANG
```

## Results

Performance on Common Voice without a language model (old checkpoint version, used to check for non-regression):

| Language | Fine-tuning size | Dev (CER / WER) | Test (CER / WER) |
| -------- |:----------------:|:---------------:|:----------------:|
| De       | 314h             | 3.83 / 15.0     | 4.70 / 17.0      |
| Es       | 203h             | 3.49 / 10.7     | 4.04 / 11.9      |
| Fr       | 364h             | 4.9 / 16.9      | 5.89 / 18.8      |

Performance on Common Voice using a language model built from Common Voice data (excluding dev / test):

| Language | Fine-tuning size | Dev (CER / WER) | Test (CER / WER) |
| -------- |:----------------:|:---------------:|:----------------:|
| De       | 314h             | 2.36 / 6.76     | 2.98 / 7.82      |
| Es       | 203h             | 3.11 / 8.93     | 3.60 / 10.0      |
| Fr       | 364h             | 2.73 / 8.31     | 3.57 / 9.56      |

## Pretrain a model

To pretrain a model yourself, you can run ```sh_voxpopuli/pretrain.sh```.
First, prepare an .lst file listing all of the audio sequences you intend to use for pre-training; let's call it ```unlabelled.lst```.
In this file, each sequence can be given an arbitrary transcription: it doesn't matter for unsupervised training.
For example, ```unlabelled.lst``` could look like this:
```
ID0 PATH_SEQUENCE0 SIZE_0_MS i love potatoes
ID1 PATH_SEQUENCE1 SIZE_1_MS i love potatoes
ID2 PATH_SEQUENCE2 SIZE_2_MS i love potatoes
ID3 PATH_SEQUENCE3 SIZE_3_MS i love potatoes
ID4 PATH_SEQUENCE4 SIZE_4_MS i love potatoes
ID5 PATH_SEQUENCE5 SIZE_5_MS i love potatoes
```
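
As a rough illustration, such a manifest can be generated with a small helper like the sketch below. Everything here is an assumption: the `.mp3` extension, the command-line layout, and especially the fixed 1000 ms duration, which should be replaced with each file's real length:

```
// Sketch: build an unlabelled .lst manifest from a directory of audio files.
// The ".mp3" extension and the 1000 ms duration are placeholders.
#include <filesystem>
#include <fstream>
#include <iostream>

int main(int argc, char** argv) {
  if (argc != 3) {
    std::cerr << "usage: build_lst AUDIO_DIR OUT_LST\n";
    return 1;
  }
  namespace fs = std::filesystem;
  std::ofstream out(argv[2]);
  int id = 0;
  for (const auto& entry : fs::recursive_directory_iterator(argv[1])) {
    if (entry.path().extension() != ".mp3") {
      continue;
    }
    // Columns: ID, path, size in ms, (dummy) transcription.
    // Replace the fixed 1000 ms with the real duration of each file.
    out << "ID" << id++ << ' ' << entry.path().string() << ' ' << 1000
        << " i love potatoes\n";
  }
  return 0;
}
```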

You will also need to provide the script with a validation set, a token file, and a valid lexicon.
If you are running the pre-training fully unsupervised (the default option), the kind of tokens and the lexicon don't matter; you just need to provide valid files for wav2letter.

You can also add some supervision to the pretraining procedure, as shown in https://arxiv.org/abs/2011.00093.
In this case you will need to build a different .lst file with labelled data and make sure that your lexicon and token files are appropriate.

See ```sh_voxpopuli/pretrain.sh``` for more details.

## Citation

```
@misc{wang2021voxpopuli,
title={VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation},
author={Changhan Wang and Morgane Rivière and Ann Lee and Anne Wu and Chaitanya Talnikar and Daniel Haziza and Mary Williamson and Juan Pino and Emmanuel Dupoux},
year={2021},
eprint={2101.00390},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
47 changes: 47 additions & 0 deletions recipes/joint_training_vox_populi/cpc/CMakeLists.txt
@@ -0,0 +1,47 @@
cmake_minimum_required(VERSION 3.10)

#project(wav2letter++-experimental-cpc)


#find_package(flashlight REQUIRED)
set(common
CPCCriterion.cpp
CPCSpecAugment.cpp
SequentialBuilder.cpp
TransformerCPC.cpp)

# ------------------------- Dedicated SSL Train executable --------------------------
add_executable(
Train_cpc
${common}
Train.cpp
)

target_link_libraries(
Train_cpc
flashlight::flashlight-app-asr
flashlight::flashlight-app-common
)

add_executable(
Decode_cpc
${common}
Decode.cpp
)

target_link_libraries(
Decode_cpc
flashlight::flashlight-app-asr
flashlight::flashlight-app-common
)

#add_executable(
# Test_cpc
# ${common}
# Test.cpp
# )

#target_link_libraries(
# Test_cpc
# flashlight::flashlight-app-asr
#)