
Something2019/open_stt

 
 


Russian Open Speech To Text (STT/ASR) Dataset

Arguably the largest public Russian STT dataset to date:

  • (new!) Now in .mp3 to reduce download time 7-8x;
  • (new!) Now whole .wav version also available via torrent;
  • ~4.6m utterances;
  • ~4000 hours;
  • 431 GB (in .wav format in int16);

Prove us wrong! Open issues, collaborate, submit a PR, contribute, share your datasets! Let's make STT in Russian (and more) as open and available as CV models.

Planned releases:

  • 1000+ additional hours of YouTube;
  • 1000-10,000 additional hours of books;
  • Some validation / test sets;
  • Plain benchmarks, "bad files";
  • Mp3 torrent;
  • Wav torrent;
  • ... and more!;

Dataset composition

| Dataset | Utterances | Hours | GB | Av s/chars | Comment | Annotation | Quality/noise |
|---|---|---|---|---|---|---|---|
| public_youtube1500 (*) | | 1,500 * | | | Coming soon | | |
| audiobook_2 | 1,149,404 | 1,511 | 166 | 4.7s / 56 | Books | Alignment (*) | 95% / crisp |
| public_youtube700 | 759,483 | 701 | 75 | 3.3s / 43 | Youtube videos | Subtitles | 95% / ~crisp |
| tts_russian_addresses | 1,741,838 | 754 | 81 | 1.6s / 20 | Russian addresses | TTS 4 voices | 100% / crisp |
| asr_public_phone_calls_2 | 603,797 | 601 | 66 | 3.6s / 37 | Phone calls | ASR | 70% / noisy |
| asr_public_phone_calls_1 | 233,868 | 211 | 23 | 3.3s / 29 | Phone calls | ASR | 70% / noisy |
| asr_public_stories_2 | 78,186 | 78 | 9 | 3.5s / 43 | Books | ASR | 80% / crisp |
| asr_public_stories_1 | 46,142 | 38 | 4 | 3.0s / 30 | Books | ASR | 80% / crisp |
| public_series_1 | 20,243 | 17 | 2 | 3.1s / 38 | Youtube videos | Subtitles | 95% / ~crisp |
| ru_RU | 5,826 | 17 | 2 | 11s / 12 | Public dataset | Alignment | 99% / crisp |
| voxforge_ru | 8,344 | 17 | 2 | 7.5s / 77 | Public dataset | Reading | 100% / crisp |
| russian_single | 3,357 | 9 | 1 | 9.3s / 102 | Public dataset | Alignment | 99% / crisp |
| public_lecture_1 | 6,803 | 6 | 1 | 3.4s / 47 | Lectures | Subtitles | 95% / crisp |
| Total | 4,657,291 | 3,961 | 431 | | | | |

(*) Automatic alignment

This alignment was performed using Yuri's alignment tool. Contact him if you need alignment for your own dataset.
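
As a quick sanity check on the composition table, the average utterance length in the "Av s/chars" column can be recovered from the hours and utterance counts. The sketch below is only an illustration: the helper name is made up, and the figures are copied from three rows of the table.

```python
def avg_utterance_seconds(hours: float, utterances: int) -> float:
    """Average utterance duration in seconds, from total hours and count."""
    return hours * 3600 / utterances

# (hours, utterances) copied from the composition table
rows = {
    "audiobook_2": (1511, 1_149_404),
    "public_youtube700": (701, 759_483),
    "tts_russian_addresses": (754, 1_741_838),
}

averages = {
    name: round(avg_utterance_seconds(hours, count), 1)
    for name, (hours, count) in rows.items()
}

for name, seconds in averages.items():
    print(f"{name}: {seconds}s")
```

The rounded values should agree with the seconds part of the "Av s/chars" column for the same rows.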

Updates

Update 2019-05-19

Also shared a wav version via torrent.

Update 2019-05-13

Added the forgotten txt files to mp3 archives. Updating the torrent.

Update 2019-05-12

Torrent created and uploaded to academictorrents.

Update 2019-05-10

Quickly converted the dataset to MP3 thanks to the community! Waiting for our account for academic torrents to be approved. v0.4 will boast MP3 download links.

Update 2019-05-07 Help needed!

If you want to support the project, you can:

  • Help us with hosting (create a mirror) / provide a reliable node for torrent;
  • Help us with writing some helper functions;
  • Donate (each coffee pays for several full downloads) / use our DO referral link to help;

We are converting the dataset to MP3 now. If you would like to help, please contact us using the contacts below.

Downloads

Via torrent

Save us a couple of bucks, download via torrent:

You can download separate files via torrent. Try several torrent clients if some do not work.

Links

Meta data file.

| Dataset | GB, wav | GB, mp3 | Wav | Mp3 | Source | Manifest |
|---|---|---|---|---|---|---|
| audiobook_2 | 166 | 21.0 | down | part1 | Sources from the Internet + alignment | link |
| asr_public_phone_calls_2 | 66 | 7.5 | down | part1 | Sources from the Internet + ASR | link |
| asr_public_stories_2 | 9 | 1.1 | down | part1 | Sources from the Internet + alignment | link |
| tts_russian_addresses_rhvoice_4voices | 80.9 | 9.9 | down | part1 | TTS | link |
| public_youtube700 | 75.0 | 9.6 | down | part1 | YouTube videos | link |
| asr_public_phone_calls_1 | 22.7 | 2.6 | down | part1 | Sources from the Internet + ASR | link |
| asr_public_stories_1 | 4.1 | 0.5 | down | part1 | Public stories | link |
| public_series_1 | 1.9 | 0.2 | down | part1 | Public series | link |
| ru_RU | 1.9 | 0.2 | down | part1 | Caito.de dataset | link |
| voxforge_ru | 1.9 | 0.2 | down | part1 | Voxforge dataset | link |
| russian_single | 0.9 | 0.1 | down | part1 | Russian single speaker dataset | link |
| public_lecture_1 | 0.7 | 0.1 | down | part1 | Sources from the Internet | link |
| Total | 431 | 52 | | | | |

Download instructions

  1. Download each dataset separately:

Via wget

wget https://ru-open-stt.ams3.digitaloceanspaces.com/some_file

For multi-threaded downloads use aria2 with the -x flag, e.g.:

aria2c -c -x5 https://ru-open-stt.ams3.digitaloceanspaces.com/some_file

If necessary, merge chunks like this:

cat ru_open_stt_v01.tar.gz_* > ru_open_stt_v01.tar.gz

  2. Download the meta data and manifests for each dataset;
  3. Merge files (where applicable), unpack and enjoy!
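
If `cat` is not available (for example on Windows), the merge step can be approximated in Python. This is a minimal sketch, not part of the repository: `merge_chunks` is a hypothetical helper, and it assumes the chunk suffixes sort lexicographically in the right order, as they do for files produced by `split`.

```python
import shutil
from pathlib import Path

def merge_chunks(chunk_dir, prefix, target):
    """Concatenate all files named `<prefix>*` in sorted order into `target`."""
    chunks = sorted(Path(chunk_dir).glob(prefix + "*"))
    if not chunks:
        raise FileNotFoundError(f"no chunks matching {prefix}* in {chunk_dir}")
    with open(target, "wb") as out:
        for chunk in chunks:
            with open(chunk, "rb") as part:
                # stream each chunk so large archives do not have to fit in memory
                shutil.copyfileobj(part, out)
    return chunks

# Example call, mirroring the cat command above:
# merge_chunks(".", "ru_open_stt_v01.tar.gz_", "ru_open_stt_v01.tar.gz")
```

The trailing underscore in the prefix keeps the merged archive itself from matching the pattern on a second run.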

Check md5sum

The list below also includes deprecated files. To compute the checksum of a downloaded file, run:

md5sum /path/to/downloaded/file

| type | md5sum | file |
|---|---|---|
| audio | f24e21c69c03062d667caf0f055244f2 | asr_public_stories_2_mp3.tar.gz |
| audio | a6f888c53d7cbded85ab51627ef57c96 | asr_public_phone_calls_1_mp3.tar.gz |
| audio | f707e34f488c62af2e3142085ff595ad | asr_public_phone_calls_2_mp3.tar.gz |
| audio | baa491ed0b526b2a989b8c4a8897429d | asr_public_stories_1_mp3.tar.gz |
| audio | 42b9c8c2e31100d6c5b972c9ac000167 | private_buriy_audiobooks_2_mp3.tar.gz |
| audio | 7a5704721012fafa115e7316e5f6e058 | public_lecture_1_mp3.tar.gz |
| audio | 16cf820330f9f8b388395d777b2331ac | public_series_1_mp3.tar.gz |
| audio | dd048e7110c0c852c353759dad8fec0f | public_youtube700_mp3.tar.gz |
| audio | 579e9d98bd159a27d3573641edee69b0 | ru_ru_mp3.tar.gz |
| audio | 177b041594684623ec7d038613e1330d | russian_single_mp3.tar.gz |
| audio | d7ce4c4116dcc655be2b466f82c98b6e | tts_russian_addresses_rhvoice_4voices_mp3.tar.gz |
| audio | 25ea6d9e249a242ecc217acc28c8077b | voxforge_ru_mp3.tar.gz |
| manifest | b0ce7564ba90b121aeb13aada73a6e30 | asr_public_phone_calls_1.csv |
| manifest | 6867d14dfdec1f9e9b8ca2f1de9ceda6 | asr_public_phone_calls_2.csv |
| manifest | 0bdd77e15172e654d9a1999a86e92c7f | asr_public_stories_1.csv |
| manifest | f388013039d94dc36970547944db51c7 | asr_public_stories_2.csv |
| manifest | 3b67e27c1429593cccbf7c516c4b582d | private_buriy_audiobooks_2.csv |
| manifest | 04027c20eb3aff05f6067957ecff856b | public_lecture_1.csv |
| manifest | 89da3f1b6afcd4d4936662ceabf3033e | public_series_1.csv |
| manifest | a81dfb018c88d0ecd5194ab3d8ff6c95 | public_youtube700.csv |
| manifest | c858f020729c34ba0ab525bbb8950d0c | ru_RU.csv |
| manifest | 0275525914825dec663fd53390fdc9a0 | russian_single.csv |
| manifest | 52f406f4e30fcc8c634f992befd91beb | tts_russian_addresses_rhvoice_4voices.csv |
| audio | 7533581bb26975212817bcacb25546d0 | asr_public_stories_2.tar.gz |
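
The same check can be done from Python, which may be convenient inside a download script. This is a small sketch using only the standard library; the helper names `md5sum` and `verify` are illustrative and not part of the repository.

```python
import hashlib

def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

def verify(path, expected):
    """Return True if the file's MD5 matches the expected hex digest."""
    return md5sum(path) == expected.lower()

# Example, using a checksum from the list above:
# verify("asr_public_stories_2_mp3.tar.gz",
#        "f24e21c69c03062d667caf0f055244f2")
```

Reading in blocks keeps memory usage flat even for the multi-gigabyte archives.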

End to end download scripts

You can use this script with this config file. Please check the config first. You can also contribute a similar script in Python.

Annotation methodology

The dataset is compiled using open domain sources. Some audio types are annotated automatically and verified statistically / using heuristics.

Audio normalization

All files are normalized for easier / faster runtime augmentations and processing as follows:

  • Converted to mono, if necessary;
  • Converted to 16 kHz sampling rate, if necessary;
  • Stored as 16-bit integers;
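
These three settings fix the raw data rate, so the uncompressed size of the dataset can be estimated from its duration alone. The sketch below is a back-of-the-envelope calculation; it ignores wav headers and archive overhead, and takes the total of 3,961 hours from the composition table.

```python
SAMPLE_RATE = 16_000    # samples per second (16 kHz)
BYTES_PER_SAMPLE = 2    # 16-bit integers
CHANNELS = 1            # mono
TOTAL_HOURS = 3961      # total from the composition table

bytes_per_hour = SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS * 3600
total_bytes = bytes_per_hour * TOTAL_HOURS

print(f"{bytes_per_hour / 1e6:.1f} MB per hour of audio")
print(f"{total_bytes / 1e9:.0f} GB (decimal), {total_bytes / 2**30:.0f} GiB (binary)")
```

The result lands in the same range as the 431 GB quoted for the wav version; the exact figure depends on rounding of the hour counts and on which GB convention is used, which the tables do not specify.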

On disk DB methodology

Each audio file is hashed. Its hash is used to create a folder hierarchy, which keeps the number of files per folder small and makes file system operations more efficient.

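import hashlib
from pathlib import Path

# wav: 1-D int16 numpy array, root_folder: dataset root directory
target_format = 'wav'
wavb = wav.tobytes()

# hash the raw samples; the hex digest determines the storage path
f_hash = hashlib.sha1(wavb).hexdigest()

# layout: <root>/<1 hex char>/<2 hex chars>/<12 hex chars>.wav
store_path = Path(root_folder,
                  f_hash[0],
                  f_hash[1:3],
                  f_hash[3:15]+'.'+target_format)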
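
To see why this layout helps, it is useful to count the folders it produces. The sketch below wraps the path logic in a hypothetical helper, `hash_to_path`, that takes raw bytes instead of a numpy array, and estimates the average number of files per leaf folder, assuming hashes are spread uniformly.

```python
import hashlib
from pathlib import Path

def hash_to_path(root_folder, data: bytes, target_format="wav"):
    """Map raw audio bytes to <root>/<h[0]>/<h[1:3]>/<h[3:15]>.<ext>."""
    f_hash = hashlib.sha1(data).hexdigest()
    return Path(root_folder,
                f_hash[0],
                f_hash[1:3],
                f_hash[3:15] + "." + target_format)

# 16 first-level folders (one hex char), each with 256 subfolders (two hex chars)
leaf_folders = 16 * 16 ** 2
total_files = 4_657_291  # total utterances from the composition table
files_per_folder = total_files / leaf_folders

print(hash_to_path("../data/ru_open_stt", b"example bytes"))
print(f"{leaf_folders} leaf folders, ~{files_per_folder:.0f} files each")
```

With roughly a thousand files per folder, directory listings stay fast, whereas a single flat folder would hold millions of entries.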

Helper functions

Use helper functions from here for easier work with manifest files.

Read manifests

See example

from utils.open_stt_utils import read_manifest

manifest_df = read_manifest('path/to/manifest.csv')

Merge, check and save manifests

See example

from utils.open_stt_utils import (plain_merge_manifests,
                                  check_files,
                                  save_manifest)
train_manifests = [
 'path/to/manifest1.csv',
 'path/to/manifest2.csv',
]
train_manifest = plain_merge_manifests(train_manifests,
                                       MIN_DURATION=0.1,
                                       MAX_DURATION=100)
check_files(train_manifest)
save_manifest(train_manifest,
              'my_manifest.csv')
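
For readers who want to see what a merge with duration limits might involve, here is a standalone sketch that does not use the repository helpers. It rests on an assumption that is not confirmed by this README: that each manifest is a headerless CSV whose rows are `wav_path,text_path,duration`. The function names are hypothetical, and the real `plain_merge_manifests` may behave differently.

```python
import csv

def read_manifest_sketch(path):
    """Read a manifest, ASSUMING headerless rows: wav_path,text_path,duration."""
    with open(path, newline="", encoding="utf-8") as f:
        return [(wav, txt, float(dur)) for wav, txt, dur in csv.reader(f)]

def merge_manifests_sketch(paths, min_duration=0.1, max_duration=100):
    """Concatenate manifests, keeping rows whose duration is within the limits."""
    merged = []
    for path in paths:
        merged.extend(row for row in read_manifest_sketch(path)
                      if min_duration <= row[2] <= max_duration)
    return merged

# Example (paths are placeholders):
# rows = merge_manifests_sketch(['path/to/manifest1.csv',
#                                'path/to/manifest2.csv'])
```

Filtering out very short and very long utterances at merge time is a common way to avoid degenerate batches during training.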

Contacts

Please contact us here or just create a GitHub issue!

Authors in alphabetical order:

  • Anna Slizhikova;
  • Alexander Veysov;
  • Dmitry Voronin;
  • Yuri Baburov;

FAQ

0. Why not MP3? MP3 encoding / decoding

Encoding

Mostly we used pydub (via ffmpeg) to convert to MP3. We omitted blank files (YouTube mostly). We used the following parameters:

  • 16kHz;
  • 32 kbps;
  • Mono;

Usually 128-192 kbps is enough for music at a 44 kHz sampling rate, and 64-96 kbps is enough for speech. Here the audio is mono, sampled at 16 kHz, and usually contains only one speaker, so 32 kbps was a good choice. We did not use other formats like .ogg, because .mp3 is much more popular.

See example

from pydub import AudioSegment

# temp_path: source wav file, store_mp3_path: destination mp3 file
sound = AudioSegment.from_file(temp_path,
                               format="wav")

# resample to 16 kHz, downmix to mono, encode at 32 kbps
file_handle = sound.export(store_mp3_path,
                           format="mp3",
                           parameters=["-ar", "16000", "-ac", "1"],
                           bitrate="32k")
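
The effect of the 32 kbps setting on archive size can be estimated with simple arithmetic. The sketch below assumes a constant bitrate and ignores container overhead; the total of 3,961 hours is taken from the composition table.

```python
BITRATE_KBPS = 32        # mp3 bitrate used for encoding
SAMPLE_RATE = 16_000     # wav sampling rate, Hz
BYTES_PER_SAMPLE = 2     # int16
TOTAL_HOURS = 3961       # total from the composition table

mp3_bytes_per_second = BITRATE_KBPS * 1000 // 8
wav_bytes_per_second = SAMPLE_RATE * BYTES_PER_SAMPLE

compression_ratio = wav_bytes_per_second / mp3_bytes_per_second
mp3_total_gb = mp3_bytes_per_second * 3600 * TOTAL_HOURS / 1e9

print(f"wav is {compression_ratio:.0f}x larger than mp3")
print(f"estimated mp3 size: {mp3_total_gb:.0f} GB")
```

This is in line with the 7-8x reduction in download time mentioned at the top. The estimate comes out slightly above the 52 GB listed for the mp3 archives, which is plausible given that blank files were omitted during conversion.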

Decoding

It is up to you, but to save space and spare CPU during training, I would suggest the following pipeline to extract the files:

See example

# you can also use pydub, torchaudio, sox or whatever
# we ended up using scipy for speed
# this example also includes hashing step which is not necessary
import librosa
import hashlib
import numpy as np
from pathlib import Path
from scipy.io import wavfile

def save_wav_diskdb(wav,
                    root_folder='../data/ru_open_stt/',
                    target_sr=16000):
    assert isinstance(wav, np.ndarray)
    assert wav.dtype == np.dtype('int16')
    assert len(wav.shape)==1

    target_format = 'wav'
    wavb = wav.tobytes()

    # f_path = Path(audio_path)
    f_hash = hashlib.sha1(wavb).hexdigest()

    store_path = Path(root_folder,
                      f_hash[0],
                      f_hash[1:3],
                      f_hash[3:15]+'.'+target_format)

    store_path.parent.mkdir(parents=True,
                            exist_ok=True)

    wavfile.write(filename=str(store_path),
                  rate=target_sr,
                  data=wav)

    return str(store_path)

root_folder = '../data/'
# placeholder: path to one of the extracted mp3 files
source_mp3_path = 'path/to/file.mp3'
# save to int16, mono, 16 kHz to save space
target_dtype = np.dtype('int16')
target_sr = 16000
# librosa reads mp3
wav, sr = librosa.load(source_mp3_path,
                       mono=True,
                       sr=target_sr)

# librosa returns float32 in [-1, 1] by default
wav = (wav * 32767).astype(target_dtype) # scale and cast to int16

wav_path = save_wav_diskdb(wav,
                           root_folder=root_folder,
                           target_sr=target_sr)

1. Issues with reading files

Maybe try this approach:

See example

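import numpy as np
from scipy.io import wavfile

# path: location of the wav file to read
sample_rate, sound = wavfile.read(path)

# cast to float first: np.abs on int16 overflows for -32768
sound = sound.astype('float32')
abs_max = np.abs(sound).max()
if abs_max > 0:
    sound *= 1 / abs_max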

2. Why share such a dataset?

We are not altruists; life is just not a zero-sum game.

Consider the progress in computer vision, which was made possible by:

  • Public datasets;
  • Public pre-trained models;
  • Open source frameworks;
  • Open research;

STT does not enjoy the same attention from the ML community because it is data-hungry and public datasets are lacking, especially for languages other than English. Ultimately, this leaves the general community worse off.

3. Known issues with the dataset to be fixed

  • Blank files in the YouTube dataset. They were removed from the mp3 archive, but the meta data has not been cleaned yet;
  • Some files have low values / crash with torchaudio;
  • It looks like scipy does not always write meta data when saving wavs (or you should save an (N,1) shaped array) - this can be fixed as shown above;

License

Dual license: cc-by-nc, with commercial usage available after agreement with the dataset authors. There are two exceptions: VoxForge is licensed under GNU GPL 3.0, and the Caito.de dataset has its own licence, available here.

Donations

Donate (each coffee pays for several full downloads) / use our DO referral link to help.
