# MLS release

Summary: title

Reviewed By: vineelpratap

Differential Revision: D25506325

fbshipit-source-id: 2b788c95c7cfa75fb3b45d447b575085ec2088b7

Commit e3ea940 (1 parent: c61fac3), authored by xuqiantong and committed by facebook-github-bot on Dec 11, 2020.

Showing 17 changed files with 482 additions and 0 deletions.
**recipes/mls/README.md** (81 additions)
# Multilingual LibriSpeech (MLS)

The Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. It is derived from read audiobooks on LibriVox and covers 8 languages: English, German, Dutch, Spanish, French, Italian, Portuguese, and Polish. It is released on [OpenSLR](http://openslr.org/).

This directory contains the released pretrained monolingual models and the steps to reproduce our results.


## Dependencies

- [flashlight](https://github.com/facebookresearch/flashlight)


## Tokens and Lexicons

| Language | Token Set | Train Lexicon | Joint Lexicon (Train + GB) |
|:----------:|:-------------------------------------------------------------------------:|:----------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------:|
| English | [TOKEN](s3://dl.fbaipublicfiles.com/wav2letter/mls/english/tokens.txt) | [Lexicon](s3://dl.fbaipublicfiles.com/wav2letter/mls/english/train_lexicon.txt) | [Lexicon](s3://dl.fbaipublicfiles.com/wav2letter/mls/english/joint_lexicon.txt) |
| German | [TOKEN](s3://dl.fbaipublicfiles.com/wav2letter/mls/german/tokens.txt) | [Lexicon](s3://dl.fbaipublicfiles.com/wav2letter/mls/german/train_lexicon.txt) | [Lexicon](s3://dl.fbaipublicfiles.com/wav2letter/mls/german/joint_lexicon.txt) |
| Dutch | [TOKEN](s3://dl.fbaipublicfiles.com/wav2letter/mls/dutch/tokens.txt) | [Lexicon](s3://dl.fbaipublicfiles.com/wav2letter/mls/dutch/train_lexicon.txt) | [Lexicon](s3://dl.fbaipublicfiles.com/wav2letter/mls/dutch/joint_lexicon.txt) |
| French | [TOKEN](s3://dl.fbaipublicfiles.com/wav2letter/mls/french/tokens.txt) | [Lexicon](s3://dl.fbaipublicfiles.com/wav2letter/mls/french/train_lexicon.txt) | [Lexicon](s3://dl.fbaipublicfiles.com/wav2letter/mls/french/joint_lexicon.txt) |
| Spanish | [TOKEN](s3://dl.fbaipublicfiles.com/wav2letter/mls/spanish/tokens.txt) | [Lexicon](s3://dl.fbaipublicfiles.com/wav2letter/mls/spanish/train_lexicon.txt) | [Lexicon](s3://dl.fbaipublicfiles.com/wav2letter/mls/spanish/joint_lexicon.txt) |
| Italian | [TOKEN](s3://dl.fbaipublicfiles.com/wav2letter/mls/italian/tokens.txt) | [Lexicon](s3://dl.fbaipublicfiles.com/wav2letter/mls/italian/train_lexicon.txt) | [Lexicon](s3://dl.fbaipublicfiles.com/wav2letter/mls/italian/joint_lexicon.txt) |
| Portuguese | [TOKEN](s3://dl.fbaipublicfiles.com/wav2letter/mls/portuguese/tokens.txt) | [Lexicon](s3://dl.fbaipublicfiles.com/wav2letter/mls/portuguese/train_lexicon.txt) | [Lexicon](s3://dl.fbaipublicfiles.com/wav2letter/mls/portuguese/joint_lexicon.txt) |
| Polish | [TOKEN](s3://dl.fbaipublicfiles.com/wav2letter/mls/polish/tokens.txt) | [Lexicon](s3://dl.fbaipublicfiles.com/wav2letter/mls/polish/train_lexicon.txt) | [Lexicon](s3://dl.fbaipublicfiles.com/wav2letter/mls/polish/joint_lexicon.txt) |
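Each train lexicon maps a word to its token-level spelling. As an illustration only (the file layout below is an assumption based on common wav2letter conventions, not read from these files), a small helper that emits entries in the usual "word, tab, space-separated character tokens, terminated by the word separator `|`" form:

```python
# Hypothetical sketch of a wav2letter-style lexicon entry builder.
# Assumed format: word<TAB>c h a r s |  (the trailing "|" is the
# word-separator token, matching --wordseparator=| in the train configs).
def lexicon_entry(word, word_separator="|"):
    tokens = " ".join(list(word) + [word_separator])
    return f"{word}\t{tokens}"

entries = [lexicon_entry(w) for w in ["hello", "world"]]
print("\n".join(entries))
```

A joint lexicon would simply contain entries for the union of the train and LM (GB) vocabularies in this same format.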


## Pre-trained acoustic models

| Language | Architecture | Acoustic Model |
|:----------:|:----------------------------------------------------------------------:|:---------------------------------------------------------------------:|
| English | [Arch](s3://dl.fbaipublicfiles.com/wav2letter/mls/english/arch.txt) | [Model](s3://dl.fbaipublicfiles.com/wav2letter/mls/english/am.bin) |
| German | [Arch](s3://dl.fbaipublicfiles.com/wav2letter/mls/german/arch.txt) | [Model](s3://dl.fbaipublicfiles.com/wav2letter/mls/german/am.bin) |
| Dutch | [Arch](s3://dl.fbaipublicfiles.com/wav2letter/mls/dutch/arch.txt) | [Model](s3://dl.fbaipublicfiles.com/wav2letter/mls/dutch/am.bin) |
| French | [Arch](s3://dl.fbaipublicfiles.com/wav2letter/mls/french/arch.txt) | [Model](s3://dl.fbaipublicfiles.com/wav2letter/mls/french/am.bin) |
| Spanish | [Arch](s3://dl.fbaipublicfiles.com/wav2letter/mls/spanish/arch.txt) | [Model](s3://dl.fbaipublicfiles.com/wav2letter/mls/spanish/am.bin) |
| Italian | [Arch](s3://dl.fbaipublicfiles.com/wav2letter/mls/italian/arch.txt) | [Model](s3://dl.fbaipublicfiles.com/wav2letter/mls/italian/am.bin) |
| Portuguese | [Arch](s3://dl.fbaipublicfiles.com/wav2letter/mls/portuguese/arch.txt) | [Model](s3://dl.fbaipublicfiles.com/wav2letter/mls/portuguese/am.bin) |
| Polish | [Arch](s3://dl.fbaipublicfiles.com/wav2letter/mls/polish/arch.txt) | [Model](s3://dl.fbaipublicfiles.com/wav2letter/mls/polish/am.bin) |


## Pre-trained language models

The `5-gram_lm.arpa` file in each tarball should be used to decode the corresponding acoustic model. For faster loading, you may convert the ARPA files into binary format following the steps [here](https://kheafield.com/code/kenlm/estimation/).

| Language | Language Model |
|:----------:|:--------------------------------------------------------------------:|
| English | [Model](https://dl.fbaipublicfiles.com/mls/mls_lm_english.tar.gz) |
| German | [Model](https://dl.fbaipublicfiles.com/mls/mls_lm_german.tar.gz) |
| Dutch | [Model](https://dl.fbaipublicfiles.com/mls/mls_lm_dutch.tar.gz) |
| French | [Model](https://dl.fbaipublicfiles.com/mls/mls_lm_french.tar.gz) |
| Spanish | [Model](https://dl.fbaipublicfiles.com/mls/mls_lm_spanish.tar.gz) |
| Italian | [Model](https://dl.fbaipublicfiles.com/mls/mls_lm_italian.tar.gz) |
| Portuguese | [Model](https://dl.fbaipublicfiles.com/mls/mls_lm_portuguese.tar.gz) |
| Polish | [Model](https://dl.fbaipublicfiles.com/mls/mls_lm_polish.tar.gz) |
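The ARPA-to-binary conversion mentioned above is done with KenLM's `build_binary` tool. A minimal sketch, assuming the tarball was extracted to a directory named `mls_lm_english` (a hypothetical path):

```shell
LM_DIR=mls_lm_english   # assumed extraction directory of the tarball
if command -v build_binary >/dev/null 2>&1; then
  # KenLM's converter: reads an ARPA LM, writes a binary image that
  # loads much faster at decode time.
  build_binary "$LM_DIR/5-gram_lm.arpa" "$LM_DIR/5-gram_lm.bin"
else
  echo "KenLM build_binary not on PATH; skipping conversion"
fi
```

The resulting `.bin` file can then be passed to the decoder via `--lm` in place of the ARPA file.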


## Usage

### Training
```
[...]/flashlight/build/bin/asr/fl_asr_train train --flagsfile=train/<lang>.cfg --minloglevel=0 --logtostderr=1
```

### Decoding
```
[...]/flashlight/build/bin/asr/fl_asr_decode --flagsfile=decode/<lang>.cfg
```

## Citation

```
@article{Pratap2020MLSAL,
title={MLS: A Large-Scale Multilingual Dataset for Speech Research},
author={Vineel Pratap and Qiantong Xu and Anuroop Sriram and Gabriel Synnaeve and Ronan Collobert},
journal={ArXiv},
year={2020},
volume={abs/2012.03411}
}
```

NOTE: We made a few updates to the MLS dataset after our INTERSPEECH paper was submitted, to include more hours of audio and to improve the quality of the transcripts. To avoid the confusion of having multiple versions, we are making **only** one release, with all the improvements included. For accurate dataset statistics and baselines, please refer to the arXiv paper above.
**recipes/mls/decode/Dutch.cfg** (11 additions)
```
--am=[...]/am.bin
--lm=[...]/5-gram_lm.arpa
--lexicon=[...]/joint_lexicon.txt
--datadir=[...]
--test=test.lst
--emission_dir=''
--lmweight=1.37
--wordscore=-0.72
--beamsize=1500
--beamthreshold=100
--beamsizetoken=50
```
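The `--lmweight` and `--wordscore` flags above control how the lexicon beam-search decoder ranks hypotheses. As a rough model (an assumption about the scoring rule, stated here for intuition rather than taken from flashlight's code), a hypothesis score combines the acoustic score, the weighted LM score, and a per-word bonus or penalty. A toy Python sketch with fabricated log-probabilities:

```python
# Toy illustration (not flashlight's actual implementation) of how
# --lmweight and --wordscore combine when ranking decoder hypotheses.
def hypothesis_score(am_logprob, lm_logprob, n_words,
                     lmweight=1.37, wordscore=-0.72):  # Dutch values above
    return am_logprob + lmweight * lm_logprob + wordscore * n_words

# Two fabricated 3-word hypotheses: a better LM score can outweigh
# a slightly worse acoustic score.
h1 = hypothesis_score(-10.0, -4.0, 3)
h2 = hypothesis_score(-11.0, -2.5, 3)
best = "h1" if h1 > h2 else "h2"
print(best)
```

This is why the tuned `lmweight`/`wordscore` values differ per language: they rebalance the acoustic model against each language's LM.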
**recipes/mls/decode/English.cfg** (11 additions)
```
--am=[...]/am.bin
--lm=[...]/5-gram_lm.arpa
--lexicon=[...]/joint_lexicon.txt
--datadir=[...]
--test=test.lst
--emission_dir=''
--lmweight=0.92
--wordscore=1.04
--beamsize=1500
--beamthreshold=100
--beamsizetoken=50
```
**recipes/mls/decode/French.cfg** (11 additions)
```
--am=[...]/am.bin
--lm=[...]/5-gram_lm.arpa
--lexicon=[...]/joint_lexicon.txt
--datadir=[...]
--test=test.lst
--emission_dir=''
--lmweight=1.58
--wordscore=1.36
--beamsize=1500
--beamthreshold=100
--beamsizetoken=50
```
**recipes/mls/decode/German.cfg** (11 additions)
```
--am=[...]/am.bin
--lm=[...]/5-gram_lm.arpa
--lexicon=[...]/joint_lexicon.txt
--datadir=[...]
--test=test.lst
--emission_dir=''
--lmweight=1.03
--wordscore=-0.19
--beamsize=1500
--beamthreshold=100
--beamsizetoken=50
```
**recipes/mls/decode/Italian.cfg** (11 additions)
```
--am=[...]/am.bin
--lm=[...]/5-gram_lm.arpa
--lexicon=[...]/joint_lexicon.txt
--datadir=[...]
--test=test.lst
--emission_dir=''
--lmweight=2.82
--wordscore=-1.28
--beamsize=1500
--beamthreshold=100
--beamsizetoken=50
```
**recipes/mls/decode/Polish.cfg** (11 additions)
```
--am=[...]/am.bin
--lm=[...]/5-gram_lm.arpa
--lexicon=[...]/joint_lexicon.txt
--datadir=[...]
--test=test.lst
--emission_dir=''
--lmweight=2.40
--wordscore=-0.82
--beamsize=1500
--beamthreshold=100
--beamsizetoken=50
```
**recipes/mls/decode/Portuguese.cfg** (11 additions)
```
--am=[...]/am.bin
--lm=[...]/5-gram_lm.arpa
--lexicon=[...]/joint_lexicon.txt
--datadir=[...]
--test=test.lst
--emission_dir=''
--lmweight=1.34
--wordscore=-0.81
--beamsize=1500
--beamthreshold=100
--beamsizetoken=50
```
**recipes/mls/decode/Spanish.cfg** (11 additions)
```
--am=[...]/am.bin
--lm=[...]/5-gram_lm.arpa
--lexicon=[...]/joint_lexicon.txt
--datadir=[...]
--test=test.lst
--emission_dir=''
--lmweight=0.91
--wordscore=1.12
--beamsize=1500
--beamthreshold=100
--beamsizetoken=50
```
**recipes/mls/train/Dutch.cfg** (39 additions)
```
--target=ltr
--mfsc
--surround=|
--wordseparator=|
--criterion=ctc
--labelsmooth=0.05
--dataorder=input
--memstepsize=5000000
--pcttraineval=1
--pctteacherforcing=99
--sampletarget=0.01
--netoptim=adagrad
--lr=0.02
--lr_decay=200
--lr_decay_step=100
--adambeta1=0.95
--adambeta2=0.99
--momentum=0.95
--warmup=64001
--maxgradnorm=1
--onorm=target
--sqnorm
--nthread=6
--batchsize=8
--filterbanks=80
--saug_start_update=10000
--saug_fmaskf=30
--saug_fmaskn=2
--saug_tmaskt=50
--saug_tmaskp=0.1
--saug_tmaskn=10
--datadir=[...]
--archdir=[...]
--arch=arch.txt
--tokensdir=[...]
--tokens=tokens.txt
--lexicon=[...]/train_lexicon.txt
--train=train.lst
--valid=dev.lst
```
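The `--lr`, `--lr_decay`, and `--lr_decay_step` flags in the config above define a stepwise learning-rate schedule. Under one common flashlight/wav2letter convention (assumed here; check the flashlight flag documentation for the exact semantics), the rate is held at `lr` until epoch `lr_decay`, then halved every `lr_decay_step` epochs:

```python
# Sketch of the assumed stepwise LR schedule; values match the Dutch
# config above (--lr=0.02 --lr_decay=200 --lr_decay_step=100).
def learning_rate(epoch, base_lr=0.02, lr_decay=200, lr_decay_step=100):
    if epoch < lr_decay:
        return base_lr
    n_halvings = 1 + (epoch - lr_decay) // lr_decay_step
    return base_lr / (2 ** n_halvings)

for e in (0, 199, 200, 299, 300):
    print(e, learning_rate(e))
```

Note how the per-language configs vary only this schedule: English decays much earlier (30/20), Italian much later (400/200), roughly tracking dataset size.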
**recipes/mls/train/English.cfg** (40 additions)
```
--target=ltr
--mfsc
--surround=|
--wordseparator=|
--criterion=ctc
--labelsmooth=0.05
--dataorder=input
--memstepsize=5000000
--pcttraineval=1
--pctteacherforcing=99
--sampletarget=0.01
--netoptim=adagrad
--lr=0.02
--lr_decay=30
--lr_decay_step=20
--adambeta1=0.95
--adambeta2=0.99
--momentum=0.95
--warmup=64001
--maxgradnorm=1
--onorm=target
--sqnorm
--nthread=6
--batchsize=8
--filterbanks=80
--saug_start_update=10000
--saug_fmaskf=30
--saug_fmaskn=2
--saug_tmaskt=50
--saug_tmaskp=0.1
--saug_tmaskn=10
--reportiters=5000
--datadir=[...]
--archdir=[...]
--arch=arch.txt
--tokensdir=[...]
--tokens=tokens.txt
--lexicon=[...]/train_lexicon.txt
--train=train.lst
--valid=dev.lst
```
**recipes/mls/train/French.cfg** (39 additions)
```
--target=ltr
--mfsc
--surround=|
--wordseparator=|
--criterion=ctc
--labelsmooth=0.05
--dataorder=input
--memstepsize=5000000
--pcttraineval=1
--pctteacherforcing=99
--sampletarget=0.01
--netoptim=adagrad
--lr=0.02
--lr_decay=200
--lr_decay_step=100
--adambeta1=0.95
--adambeta2=0.99
--momentum=0.95
--warmup=64001
--maxgradnorm=1
--onorm=target
--sqnorm
--nthread=6
--batchsize=8
--filterbanks=80
--saug_start_update=10000
--saug_fmaskf=30
--saug_fmaskn=2
--saug_tmaskt=50
--saug_tmaskp=0.1
--saug_tmaskn=10
--datadir=[...]
--archdir=[...]
--arch=arch.txt
--tokensdir=[...]
--tokens=tokens.txt
--lexicon=[...]/train_lexicon.txt
--train=train.lst
--valid=dev.lst
```
**recipes/mls/train/German.cfg** (39 additions)
```
--target=ltr
--mfsc
--surround=|
--wordseparator=|
--criterion=ctc
--labelsmooth=0.05
--dataorder=input
--memstepsize=5000000
--pcttraineval=1
--pctteacherforcing=99
--sampletarget=0.01
--netoptim=adagrad
--lr=0.02
--lr_decay=200
--lr_decay_step=100
--adambeta1=0.95
--adambeta2=0.99
--momentum=0.95
--warmup=64001
--maxgradnorm=1
--onorm=target
--sqnorm
--nthread=6
--batchsize=8
--filterbanks=80
--saug_start_update=10000
--saug_fmaskf=30
--saug_fmaskn=2
--saug_tmaskt=50
--saug_tmaskp=0.1
--saug_tmaskn=10
--datadir=[...]
--archdir=[...]
--arch=arch.txt
--tokensdir=[...]
--tokens=tokens.txt
--lexicon=[...]/train_lexicon.txt
--train=train.lst
--valid=dev.lst
```
**recipes/mls/train/Italian.cfg** (39 additions)
```
--target=ltr
--mfsc
--surround=|
--wordseparator=|
--criterion=ctc
--labelsmooth=0.05
--dataorder=input
--memstepsize=5000000
--pcttraineval=1
--pctteacherforcing=99
--sampletarget=0.01
--netoptim=adagrad
--lr=0.02
--lr_decay=400
--lr_decay_step=200
--adambeta1=0.95
--adambeta2=0.99
--momentum=0.95
--warmup=64001
--maxgradnorm=1
--onorm=target
--sqnorm
--nthread=6
--batchsize=8
--filterbanks=80
--saug_start_update=10000
--saug_fmaskf=30
--saug_fmaskn=2
--saug_tmaskt=50
--saug_tmaskp=0.1
--saug_tmaskn=10
--datadir=[...]
--archdir=[...]
--arch=arch.txt
--tokensdir=[...]
--tokens=tokens.txt
--lexicon=[...]/train_lexicon.txt
--train=train.lst
--valid=dev.lst
```
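All of the train configs enable SpecAugment via the `--saug_*` flags: up to `saug_fmaskn` frequency masks of width at most `saug_fmaskf` filterbank bins, and up to `saug_tmaskn` time masks of width at most `saug_tmaskt` frames, each time mask also capped at `saug_tmaskp` times the utterance length. A self-contained sketch of this masking (an illustration, not flashlight's implementation):

```python
import random

# SpecAugment-style masking using the saug_* values from the configs
# above. `spec` is a T x F feature matrix as a list of lists.
def spec_augment(spec, fmaskf=30, fmaskn=2, tmaskt=50, tmaskp=0.1,
                 tmaskn=10, seed=0):
    rng = random.Random(seed)
    T, F = len(spec), len(spec[0])
    spec = [row[:] for row in spec]          # work on a copy
    for _ in range(fmaskn):                  # frequency masks
        w = rng.randint(0, fmaskf)
        f0 = rng.randint(0, max(0, F - w))
        for t in range(T):
            for f in range(f0, min(F, f0 + w)):
                spec[t][f] = 0.0
    max_t = min(tmaskt, int(tmaskp * T))     # cap by tmaskp * T
    for _ in range(tmaskn):                  # time masks
        w = rng.randint(0, max_t)
        t0 = rng.randint(0, max(0, T - w))
        for t in range(t0, min(T, t0 + w)):
            spec[t] = [0.0] * F
    return spec

masked = spec_augment([[1.0] * 80 for _ in range(500)])
print(len(masked), len(masked[0]))  # 500 80
```

`--saug_start_update=10000` delays this augmentation until 10,000 updates have run, so early training sees clean features.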