adding colab and small fixes for multilingual model recipe (#1003)
Summary:

**Original Issue**: n/a


adding colab and small fixes for multilingual model recipe

### Test Plan (required)
n/a

Pull Request resolved: #1003

Differential Revision: D34379954

Pulled By: vineelpratap

fbshipit-source-id: e032cad89808ee19467fcff112b3fda601894d15
lorenlugosch authored and facebook-github-bot committed Mar 2, 2022
1 parent 0393ac7 commit 004e3cc
Showing 3 changed files with 2,230 additions and 79 deletions.
30 changes: 17 additions & 13 deletions recipes/mling_pl/README.md
@@ -2,16 +2,16 @@

Semi-supervised learning through pseudo-labeling has become a staple of state-of-the-art monolingual speech recognition systems. In this work, we extend pseudo-labeling to massively multilingual speech recognition with 60 languages. We propose a simple pseudo-labeling recipe that works well even with low-resource languages: train a supervised multilingual model, fine-tune it with semi-supervised learning on a target language, generate pseudo-labels for that language, and train a final model using pseudo-labels for all languages, either from scratch or by fine-tuning. Experiments on the labeled Common Voice and unlabeled VoxPopuli datasets show that our recipe can yield a model with better performance for many languages that also transfers well to LibriSpeech.

- We provide are pretrained models and a script to run inference on a sample audio file.
+ We provide our pretrained models and a script to run inference on a sample audio file.

## Inference

#### Step 1:
- Download the pretrained model and tokens file
+ Download the pretrained model and tokens file:

| Model | Arch | Link |
| - | - | - |
- Large | model_with_externally_controlled_reshaping_big_lid.cpp | https://dl.fbaipublicfiles.com/wav2letter/mling_pl/checkpoint_large.bin
+ Large | mling_large.cpp | https://dl.fbaipublicfiles.com/wav2letter/mling_pl/checkpoint_cv_finetune.bin

Tokens file : https://dl.fbaipublicfiles.com/wav2letter/mling_pl/tokens-all.lst
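
For convenience, a minimal sketch of fetching these files with `wget` (the `mling_pl/` directory name is an arbitrary choice, not part of the recipe):

```
# download the Large checkpoint and the tokens file
mkdir -p mling_pl && cd mling_pl
wget https://dl.fbaipublicfiles.com/wav2letter/mling_pl/checkpoint_cv_finetune.bin
wget https://dl.fbaipublicfiles.com/wav2letter/mling_pl/tokens-all.lst
```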

@@ -20,7 +20,7 @@ Tokens file : https://dl.fbaipublicfiles.com/wav2letter/mling_pl/tokens-all.lst
Install flashlight (https://github.com/flashlight/flashlight) with the ASR app flag `FL_BUILD_APP_ASR=ON`, using commit id `8f7af9ec1188bfd7050c47abfac528d21650890f`; a build sketch follows.
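
A minimal sketch of that build, assuming a machine with a working flashlight backend (only the pinned commit and `FL_BUILD_APP_ASR=ON` come from this recipe; the rest is generic CMake usage):

```
# build flashlight's ASR tools at the pinned commit
git clone https://github.com/flashlight/flashlight.git
cd flashlight
git checkout 8f7af9ec1188bfd7050c47abfac528d21650890f
mkdir -p build && cd build
cmake .. -DFL_BUILD_APP_ASR=ON   # add backend flags as your setup requires
make -j$(nproc)
```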

#### Step 3:
- Prepare a file with the list of audio files in this format
+ Prepare a file with the list of audio files in this format:
```
0 <path_to_file1> <duration1>
1 <path_to_file2> <duration2>
```
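
One hedged way to generate such a list is with `soxi` from the sox package (assuming `.wav` inputs under `audio/`; the durations below are in milliseconds, the usual flashlight list convention — an assumption to verify against your build):

```
# write "index path duration" lines, one per audio file
i=0
for f in audio/*.wav; do
  ms=$(soxi -D "$f" | awk '{printf "%.1f", $1 * 1000}')
  echo "$i $f $ms"
  i=$((i + 1))
done > audio_list.txt
```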

#### Step 4:

- Run inference using the following command from flashlight build directory
+ Run inference using the following command from the flashlight build directory:

```
bin/asr/fl_asr_test \
- --test <audio_file_list> \
- --am <path_to_model_checkpoint.bin> \
- --arch <path_to_model_arch.so> \
- --tokens <path_to_tokens_file/tokens-all.lst> \
- --datadir '' \
- --emission_dir '' \
+ --test=<audio_file_list> \
+ --am=<path_to_model_checkpoint.bin> \
+ --arch=<path_to_model_arch.so> \
+ --tokens=<path_to_tokens_file/tokens-all.lst> \
+ --lexicon=lexicon.txt \
+ --datadir='' \
+ --emission_dir='' \
--show
```

To compile `*.cpp` architectures into `*.so`, run the cmake/make build in flashlight with the `-DFL_PLUGIN_MODULE_SRC_PATH=path/to/*.cpp` flag, as sketched below.
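
A sketch of that rebuild, run from the `build` directory of Step 2 (the architecture path is illustrative):

```
# compile the arch plugin; the resulting .so lands in the build tree
cmake .. -DFL_BUILD_APP_ASR=ON \
      -DFL_PLUGIN_MODULE_SRC_PATH=/path/to/mling_large.cpp
make -j$(nproc)
```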

+ A lexicon file is required for inference, but because we use greedy decoding, the lexicon isn't actually used. You can create a dummy lexicon using this command: `echo 'a a |' > lexicon.txt`

+ A Colab notebook with an example of using the model can be found in this repo.
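
Putting the steps together, a hedged end-to-end invocation (`audio_list.txt`, `mling_pl/`, and `mling_large.so` are illustrative names from the sketches above, not files shipped with the recipe):

```
echo 'a a |' > lexicon.txt   # dummy lexicon; greedy decoding ignores it
bin/asr/fl_asr_test \
  --test=audio_list.txt \
  --am=mling_pl/checkpoint_cv_finetune.bin \
  --arch=mling_large.so \
  --tokens=mling_pl/tokens-all.lst \
  --lexicon=lexicon.txt \
  --datadir='' \
  --emission_dir='' \
  --show
```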

## Citation
```
@article{lugosch2021pseudo,
title={Pseudo-Labeling for Massively Multilingual Speech Recognition},
author={Lugosch, Loren and Likhomanenko, Tatiana and Synnaeve, Gabriel and Collobert, Ronan},
- journal={arXiv preprint arXiv:2111.00161},
- year={2021}
+ journal={ICASSP},
+ year={2022}
}
```
(second changed file: the model architecture `.cpp`; filename not shown in this view)
@@ -59,63 +59,17 @@ class myModel : public fl::Container {
const std::vector<fl::Variable>& input) override {
auto out = input[0];
auto xSizes = input[1].array();
- float reshaping_factor = 1;
- if (input.size() > 2) {
-   reshaping_factor = af::sum<float>(input[2].array());
- }
float dp = -1;
- if (input.size() > 3) {
-   dp = af::sum<float>(input[3].array());
+ if (input.size() > 2) {
+   dp = af::sum<float>(input[2].array());
}
// expected input dims T x C x 1 x B
out = convFrontend_->forward(out);
- ///////// reshape ////////
- int time_dim = 1, feat_dim = 0, other_dim = 3, batch_dim = 2;
- int old_B = out.dims(batch_dim);
- int old_T = out.dims(time_dim);
- int new_B = old_B;
- int new_T = old_T;
- int T_padded = old_T;
- if (reshaping_factor != 1) {
-   new_T = ceil(reshaping_factor * old_T);
-   new_T += old_B -
-       (new_T % old_B); // add this chunk so that new_T is divisible by old_B
-   new_B = ceil((float)(old_B * old_T) / (float)new_T);
-   T_padded = (new_B * new_T) / old_B;
-   std::vector<std::pair<int, int>> pad_amount;
-   pad_amount.push_back(std::make_pair(0, 0));
-   pad_amount.push_back(std::make_pair(0, T_padded - old_T));
-   pad_amount.push_back(std::make_pair(0, 0));
-   pad_amount.push_back(std::make_pair(0, 0));
-   out = fl::padding(out, pad_amount, 0.0);
-   out = fl::reorder(out, time_dim, batch_dim, feat_dim, other_dim);
-   time_dim = 0, feat_dim = 2, other_dim = 3, batch_dim = 1;
-   auto new_out_dims = out.dims();
-   new_out_dims[time_dim] = new_T;
-   new_out_dims[batch_dim] = new_B;
-   out = fl::moddims(out, new_out_dims);
-   out = fl::reorder(out, feat_dim, time_dim, batch_dim, other_dim);
-   // std::cout << "(reshaping)\n";
- } else {
-   // std::cout << "(not reshaping)\n";
- }
- // std::cout << "old_B: " << old_B << "\n";
- // std::cout << "old_T: " << old_T << "\n";
- // std::cout << "new_B: " << new_B << "\n";
- // std::cout << "new_T: " << new_T << "\n";
- // std::cout << "T_padded: " << T_padded << "\n";
- if (T_padded * old_B != new_T * new_B) {
-   std::cout << "error, T_padded * old_B != new_T * new_B\n";
-   exit(0);
- }
- //////////////////////////
- af::array inputNotPaddedSize(1, old_B, 1, 1);
- for (int bIdx = 0; bIdx < old_B; bIdx++) {
-   inputNotPaddedSize(0, bIdx, 0, 0) = old_T;
- } // TODO: use actual xSizes here
- auto padMask = af::iota(af::dim4(T_padded, 1), af::dim4(1, old_B)) <
-     af::tile(inputNotPaddedSize, T_padded, 1);
- padMask = af::moddims(padMask, af::dim4(new_T, new_B, 1, 1));
+ int T = out.dims(0), B = out.dims(3);
+ auto inputMaxSize = af::tile(af::max(xSizes), 1, B);
+ af::array inputNotPaddedSize = af::ceil(xSizes * T / inputMaxSize);
+ auto padMask = af::iota(af::dim4(T, 1), af::dim4(1, B)) <
+     af::tile(inputNotPaddedSize, T, 1);
for (int trIdx = 0; trIdx < transformers_.size(); trIdx++) {
// NOTE: not required for inference
// if (dp >= 0) {
@@ -124,19 +78,7 @@ class myModel : public fl::Container {
// }
out = transformers_[trIdx]->forward({out, fl::noGrad(padMask)}).front();
}
- ///////// reshape ////////
- if (reshaping_factor != 1) {
-   time_dim = 1, feat_dim = 0, other_dim = 3, batch_dim = 2;
-   out = fl::reorder(out, time_dim, batch_dim, feat_dim, other_dim);
-   time_dim = 0, feat_dim = 2, other_dim = 3, batch_dim = 1;
-   auto new_tr_out_dims = out.dims();
-   new_tr_out_dims[time_dim] = T_padded;
-   new_tr_out_dims[batch_dim] = old_B;
-   out = fl::moddims(out, new_tr_out_dims);
-   out = fl::reorder(out, feat_dim, time_dim, batch_dim, other_dim);
-   out = out(af::span, af::seq(old_T), af::span, af::span);
- }
- //////////////////////////

auto ctc_head_out = linear_->forward(out);
auto LID_head_out = LID_head_->forward(out);
LID_head_out = fl::mean(LID_head_out.as(f32), std::vector<int>{1}).as(f32);
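
For reference, the replacement padding-mask logic in the hunk above amounts to the following (a reading of the added lines, not text from the commit; here x_b is the raw input size of batch element b, T the encoder output length, and B the batch size):

```
\text{padMask}_{t,b} = \mathbf{1}\!\left[\, t < \left\lceil \frac{x_b \, T}{\max_{b'} x_{b'}} \right\rceil \,\right],
\qquad 0 \le t < T,\; 0 \le b < B
```

That is, each utterance's true length is rescaled from input frames to encoder frames, so the transformer layers ignore padded positions without the old externally controlled reshaping.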
(third changed file — the Colab notebook — too large to render here)
