adding colab and small fixes for multilingual model recipe (#1003)
Summary:

**Original Issue**: n/a


adding colab and small fixes for multilingual model recipe

### Test Plan (required)
n/a

Pull Request resolved: #1003

Differential Revision: D34379954

Pulled By: vineelpratap

fbshipit-source-id: e032cad89808ee19467fcff112b3fda601894d15
lorenlugosch authored and facebook-github-bot committed Mar 2, 2022
1 parent 0393ac7 commit 004e3cc
Showing 3 changed files with 2,230 additions and 79 deletions.
30 changes: 17 additions & 13 deletions recipes/mling_pl/README.md
@@ -2,16 +2,16 @@

Semi-supervised learning through pseudo-labeling has become a staple of state-of-the-art monolingual speech recognition systems. In this work, we extend pseudo-labeling to massively multilingual speech recognition with 60 languages. We propose a simple pseudo-labeling recipe that works well even with low-resource languages: train a supervised multilingual model, fine-tune it with semi-supervised learning on a target language, generate pseudo-labels for that language, and train a final model using pseudo-labels for all languages, either from scratch or by fine-tuning. Experiments on the labeled Common Voice and unlabeled VoxPopuli datasets show that our recipe can yield a model with better performance for many languages that also transfers well to LibriSpeech.

- We provide are pretrained models and a script to run inference on a sample audio file.
+ We provide our pretrained models and a script to run inference on a sample audio file.

## Inference

#### Step 1:
- Download the pretrained model and tokens file
+ Download the pretrained model and tokens file:

| Model | Arch | Link |
| - | - | - |
- Large | model_with_externally_controlled_reshaping_big_lid.cpp | https://dl.fbaipublicfiles.com/wav2letter/mling_pl/checkpoint_large.bin
+ Large | mling_large.cpp | https://dl.fbaipublicfiles.com/wav2letter/mling_pl/checkpoint_cv_finetune.bin

Tokens file : https://dl.fbaipublicfiles.com/wav2letter/mling_pl/tokens-all.lst
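
For convenience, a minimal sketch of fetching these files with `wget` (the `mling_pl/` directory name is an arbitrary choice, not part of the recipe):

```
# download the Large checkpoint and the tokens file
mkdir -p mling_pl && cd mling_pl
wget https://dl.fbaipublicfiles.com/wav2letter/mling_pl/checkpoint_cv_finetune.bin
wget https://dl.fbaipublicfiles.com/wav2letter/mling_pl/tokens-all.lst
```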

@@ -20,7 +20,7 @@ Tokens file : https://dl.fbaipublicfiles.com/wav2letter/mling_pl/tokens-all.lst
Install flashlight (https://github.com/flashlight/flashlight) with the ASR app flag `FL_BUILD_APP_ASR=ON`, using commit id `8f7af9ec1188bfd7050c47abfac528d21650890f`; a build sketch follows.
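
A minimal sketch of that build, assuming a machine with a working flashlight backend (only the pinned commit and `FL_BUILD_APP_ASR=ON` come from this recipe; the rest is generic CMake usage):

```
# build flashlight's ASR tools at the pinned commit
git clone https://github.com/flashlight/flashlight.git
cd flashlight
git checkout 8f7af9ec1188bfd7050c47abfac528d21650890f
mkdir -p build && cd build
cmake .. -DFL_BUILD_APP_ASR=ON   # add backend flags as your setup requires
make -j$(nproc)
```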

#### Step 3:
- Prepare a file with the list of audio files in this format
+ Prepare a file with the list of audio files in this format:
```
0 <path_to_file1> <duration1>
1 <path_to_file2> <duration2>
```
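
One hedged way to generate such a list is with `soxi` from the sox package (assuming `.wav` inputs under `audio/`; the durations below are in milliseconds, the usual flashlight list convention — an assumption to verify against your build):

```
# write "index path duration" lines, one per audio file
i=0
for f in audio/*.wav; do
  ms=$(soxi -D "$f" | awk '{printf "%.1f", $1 * 1000}')
  echo "$i $f $ms"
  i=$((i + 1))
done > audio_list.txt
```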

#### Step 4:

- Run inference using the following command from flashlight build directory
+ Run inference using the following command from the flashlight build directory:

```
bin/asr/fl_asr_test \
- --test <audio_file_list> \
- --am <path_to_model_checkpoint.bin> \
- --arch <path_to_model_arch.so> \
- --tokens <path_to_tokens_file/tokens-all.lst> \
- --datadir '' \
- --emission_dir '' \
+ --test=<audio_file_list> \
+ --am=<path_to_model_checkpoint.bin> \
+ --arch=<path_to_model_arch.so> \
+ --tokens=<path_to_tokens_file/tokens-all.lst> \
+ --lexicon=lexicon.txt \
+ --datadir='' \
+ --emission_dir='' \
--show
```

To compile `*.cpp` architectures into `*.so`, run the cmake/make build in flashlight with the `-DFL_PLUGIN_MODULE_SRC_PATH=path/to/*.cpp` flag, as sketched below.
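
A sketch of that rebuild, run from the `build` directory of Step 2 (the architecture path is illustrative):

```
# compile the arch plugin; the resulting .so lands in the build tree
cmake .. -DFL_BUILD_APP_ASR=ON \
      -DFL_PLUGIN_MODULE_SRC_PATH=/path/to/mling_large.cpp
make -j$(nproc)
```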

+ A lexicon file is required for inference, but because we use greedy decoding, the lexicon isn't actually used. You can create a dummy lexicon using this command: `echo 'a a |' > lexicon.txt`

+ A Colab notebook with an example of using the model can be found in this repo.
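
Putting the steps together, a hedged end-to-end invocation (`audio_list.txt`, `mling_pl/`, and `mling_large.so` are illustrative names from the sketches above, not files shipped with the recipe):

```
echo 'a a |' > lexicon.txt   # dummy lexicon; greedy decoding ignores it
bin/asr/fl_asr_test \
  --test=audio_list.txt \
  --am=mling_pl/checkpoint_cv_finetune.bin \
  --arch=mling_large.so \
  --tokens=mling_pl/tokens-all.lst \
  --lexicon=lexicon.txt \
  --datadir='' \
  --emission_dir='' \
  --show
```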

## Citation
```
@article{lugosch2021pseudo,
title={Pseudo-Labeling for Massively Multilingual Speech Recognition},
author={Lugosch, Loren and Likhomanenko, Tatiana and Synnaeve, Gabriel and Collobert, Ronan},
- journal={arXiv preprint arXiv:2111.00161},
- year={2021}
+ journal={ICASSP},
+ year={2022}
}
```
(second changed file: the model architecture `.cpp`; filename not shown in this view)
@@ -59,63 +59,17 @@ class myModel : public fl::Container {
const std::vector<fl::Variable>& input) override {
auto out = input[0];
auto xSizes = input[1].array();
- float reshaping_factor = 1;
- if (input.size() > 2) {
-   reshaping_factor = af::sum<float>(input[2].array());
- }
float dp = -1;
- if (input.size() > 3) {
-   dp = af::sum<float>(input[3].array());
+ if (input.size() > 2) {
+   dp = af::sum<float>(input[2].array());
}
// expected input dims T x C x 1 x B
out = convFrontend_->forward(out);
- ///////// reshape ////////
- int time_dim = 1, feat_dim = 0, other_dim = 3, batch_dim = 2;
- int old_B = out.dims(batch_dim);
- int old_T = out.dims(time_dim);
- int new_B = old_B;
- int new_T = old_T;
- int T_padded = old_T;
- if (reshaping_factor != 1) {
-   new_T = ceil(reshaping_factor * old_T);
-   new_T += old_B -
-       (new_T % old_B); // add this chunk so that new_T is divisible by old_B
-   new_B = ceil((float)(old_B * old_T) / (float)new_T);
-   T_padded = (new_B * new_T) / old_B;
-   std::vector<std::pair<int, int>> pad_amount;
-   pad_amount.push_back(std::make_pair(0, 0));
-   pad_amount.push_back(std::make_pair(0, T_padded - old_T));
-   pad_amount.push_back(std::make_pair(0, 0));
-   pad_amount.push_back(std::make_pair(0, 0));
-   out = fl::padding(out, pad_amount, 0.0);
-   out = fl::reorder(out, time_dim, batch_dim, feat_dim, other_dim);
-   time_dim = 0, feat_dim = 2, other_dim = 3, batch_dim = 1;
-   auto new_out_dims = out.dims();
-   new_out_dims[time_dim] = new_T;
-   new_out_dims[batch_dim] = new_B;
-   out = fl::moddims(out, new_out_dims);
-   out = fl::reorder(out, feat_dim, time_dim, batch_dim, other_dim);
-   // std::cout << "(reshaping)\n";
- } else {
-   // std::cout << "(not reshaping)\n";
- }
- // std::cout << "old_B: " << old_B << "\n";
- // std::cout << "old_T: " << old_T << "\n";
- // std::cout << "new_B: " << new_B << "\n";
- // std::cout << "new_T: " << new_T << "\n";
- // std::cout << "T_padded: " << T_padded << "\n";
- if (T_padded * old_B != new_T * new_B) {
-   std::cout << "error, T_padded * old_B != new_T * new_B\n";
-   exit(0);
- }
- //////////////////////////
- af::array inputNotPaddedSize(1, old_B, 1, 1);
- for (int bIdx = 0; bIdx < old_B; bIdx++) {
-   inputNotPaddedSize(0, bIdx, 0, 0) = old_T;
- } // TODO: use actual xSizes here
- auto padMask = af::iota(af::dim4(T_padded, 1), af::dim4(1, old_B)) <
-     af::tile(inputNotPaddedSize, T_padded, 1);
- padMask = af::moddims(padMask, af::dim4(new_T, new_B, 1, 1));
+ int T = out.dims(0), B = out.dims(3);
+ auto inputMaxSize = af::tile(af::max(xSizes), 1, B);
+ af::array inputNotPaddedSize = af::ceil(xSizes * T / inputMaxSize);
+ auto padMask = af::iota(af::dim4(T, 1), af::dim4(1, B)) <
+     af::tile(inputNotPaddedSize, T, 1);
for (int trIdx = 0; trIdx < transformers_.size(); trIdx++) {
// NOTE: not required for inference
// if (dp >= 0) {
@@ -124,19 +78,7 @@ class myModel : public fl::Container {
// }
out = transformers_[trIdx]->forward({out, fl::noGrad(padMask)}).front();
}
- ///////// reshape ////////
- if (reshaping_factor != 1) {
-   time_dim = 1, feat_dim = 0, other_dim = 3, batch_dim = 2;
-   out = fl::reorder(out, time_dim, batch_dim, feat_dim, other_dim);
-   time_dim = 0, feat_dim = 2, other_dim = 3, batch_dim = 1;
-   auto new_tr_out_dims = out.dims();
-   new_tr_out_dims[time_dim] = T_padded;
-   new_tr_out_dims[batch_dim] = old_B;
-   out = fl::moddims(out, new_tr_out_dims);
-   out = fl::reorder(out, feat_dim, time_dim, batch_dim, other_dim);
-   out = out(af::span, af::seq(old_T), af::span, af::span);
- }
- //////////////////////////

auto ctc_head_out = linear_->forward(out);
auto LID_head_out = LID_head_->forward(out);
LID_head_out = fl::mean(LID_head_out.as(f32), std::vector<int>{1}).as(f32);
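
For reference, the replacement padding-mask logic in the hunk above amounts to the following (a reading of the added lines, not text from the commit; here x_b is the raw input size of batch element b, T the encoder output length, and B the batch size):

```
\text{padMask}_{t,b} = \mathbf{1}\!\left[\, t < \left\lceil \frac{x_b \, T}{\max_{b'} x_{b'}} \right\rceil \,\right],
\qquad 0 \le t < T,\; 0 \le b < B
```

That is, each utterance's true length is rescaled from input frames to encoder frames, so the transformer layers ignore padded positions without the old externally controlled reshaping.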
(third changed file — the Colab notebook — too large to render here)
