Russian Open Speech To Text (STT/ASR) Dataset

Arguably the largest public Russian STT dataset to date:

(new!) Now in .mp3 to reduce download time 7-8x;
(new!) Now whole .wav version also available via torrent;
~4.6m utterances;
~4000 hours;
431 GB (in .wav format in int16);

Prove us wrong! Open issues, collaborate, submit a PR, contribute, share your datasets! Let's make STT in Russian (and more) as open and available as CV models.

Planned releases:

1000+ additional hours of YouTube;
1000-10,000 additional hours of books;
Some validation / test sets;
~~Plain benchmarks, "bad files"~~;
~~Mp3 torrent~~;
~~Wav torrent~~;
... and more!;

Dataset composition

Dataset	Utterances	Hours	GB	Avg duration / chars	Comment	Annotation	Quality/noise
public_youtube1500		1,500			Coming soon
audiobook_2	1,149,404	1,511	166	4.7s / 56	Books	Alignment (*)	95% / crisp
public_youtube700	759,483	701	75	3.3s / 43	Youtube videos	Subtitles	95% / ~crisp
tts_russian_addresses	1,741,838	754	81	1.6s / 20	Russian addresses	TTS 4 voices	100% / crisp
asr_public_phone_calls_2	603,797	601	66	3.6s / 37	Phone calls	ASR	70% / noisy
asr_public_phone_calls_1	233,868	211	23	3.3s / 29	Phone calls	ASR	70% / noisy
asr_public_stories_2	78,186	78	9	3.5s / 43	Books	ASR	80% / crisp
asr_public_stories_1	46,142	38	4	3.0s / 30	Books	ASR	80% / crisp
public_series_1	20,243	17	2	3.1s / 38	Youtube videos	Subtitles	95% / ~crisp
ru_RU	5,826	17	2	11s / 12	Public dataset	Alignment	99% / crisp
voxforge_ru	8,344	17	2	7.5s / 77	Public dataset	Reading	100% / crisp
russian_single	3,357	9	1	9.3s / 102	Public dataset	Alignment	99% / crisp
public_lecture_1	6,803	6	1	3.4s / 47	Lectures	Subtitles	95% / crisp
Total	4,657,291	3,961	431

(*) Automatic alignment

This alignment was performed using Yuri's alignment tool. Contact him if you need alignment for your own dataset.

Updates

Update 2019-05-19

Also shared a wav version via torrent.

Update 2019-05-13

Added the forgotten txt files to mp3 archives. Updating the torrent.

Update 2019-05-12

Torrent created and uploaded to academictorrents.

Update 2019-05-10

Quickly converted the dataset to MP3 thanks to the community! Waiting for our account for academic torrents to be approved. v0.4 will boast MP3 download links.

Update 2019-05-07 Help needed!

If you want to support the project, you can:

Help us with hosting (create a mirror) / provide a reliable node for torrent;
Help us with writing some helper functions;
Donate (each coffee pays for several full downloads) / use our DO referral link to help;

~~We are converting the dataset to MP3 now.~~ If you would like to help, please contact us using the contacts below.

Downloads

Via torrent

Save us a couple of bucks, download via torrent:

An MP3 version of the dataset;
A WAV version of the dataset;

You can download separate files via torrent. Try several torrent clients if some do not work.

Links

Meta data file.

Dataset	GB, wav	GB, mp3	Wav	Mp3	Source	Manifest
audiobook_2	166	21.0	down	part1	Sources from the Internet + alignment	link
asr_public_phone_calls_2	66	7.5	down	part1	Sources from the Internet + ASR	link
asr_public_stories_2	9	1.1	down	part1	Sources from the Internet + alignment	link
tts_russian_addresses_rhvoice_4voices	80.9	9.9	down	part1	TTS	link
public_youtube700	75.0	9.6	down	part1	YouTube videos	link
asr_public_phone_calls_1	22.7	2.6	down	part1	Sources from the Internet + ASR	link
asr_public_stories_1	4.1	0.5	down	part1	Public stories	link
public_series_1	1.9	0.2	down	part1	Public series	link
ru_RU	1.9	0.2	down	part1	Caito.de dataset	link
voxforge_ru	1.9	0.2	down	part1	Voxforge dataset	link
russian_single	0.9	0.1	down	part1	Russian single speaker dataset	link
public_lecture_1	0.7	0.1	down	part1	Sources from the Internet	link
Total	431	52

Download instructions

Download each dataset separately:

Via wget

wget https://ru-open-stt.ams3.digitaloceanspaces.com/some_file

For faster downloads over several connections, use aria2 with the -x flag, e.g.:

aria2c -c -x5 https://ru-open-stt.ams3.digitaloceanspaces.com/some_file

If necessary, merge chunks like this:

cat ru_open_stt_v01.tar.gz_* > ru_open_stt_v01.tar.gz

Download the metadata and manifests for each dataset;
Merge files (where applicable), unpack and enjoy!
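
If `cat` is not available (for example on Windows), the chunk merge can be sketched in Python. This is a minimal sketch, not part of the repository: the function name `merge_chunks` and its parameters are our own, and it assumes that chunk names sort correctly in lexicographic order, as they do when the shell expands `ru_open_stt_v01.tar.gz_*`.

```python
import shutil
from pathlib import Path


def merge_chunks(chunk_dir, prefix, target_name):
    """Concatenate all files named '<prefix>*' in chunk_dir into target_name.

    Hypothetical helper: chunks are joined in sorted (lexicographic) name order.
    """
    chunk_dir = Path(chunk_dir)
    target = chunk_dir / target_name
    # Exclude the target itself in case its name also matches the prefix.
    chunks = sorted(p for p in chunk_dir.glob(prefix + "*") if p != target)
    if not chunks:
        raise FileNotFoundError(f"no chunks matching {prefix}* in {chunk_dir}")
    with open(target, "wb") as out:
        for chunk in chunks:
            with open(chunk, "rb") as src:
                # copyfileobj streams the data, so large chunks are not
                # loaded into memory at once.
                shutil.copyfileobj(src, out)
    return target
```

For the example above, the call would be `merge_chunks('.', 'ru_open_stt_v01.tar.gz_', 'ru_open_stt_v01.tar.gz')`.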

Check md5sum

The list below also includes checksums for deprecated files. To compute the checksum of a downloaded file:

md5sum /path/to/downloaded/file

type	md5sum	file
audio	f24e21c69c03062d667caf0f055244f2	asr_public_stories_2_mp3.tar.gz
audio	a6f888c53d7cbded85ab51627ef57c96	asr_public_phone_calls_1_mp3.tar.gz
audio	f707e34f488c62af2e3142085ff595ad	asr_public_phone_calls_2_mp3.tar.gz
audio	baa491ed0b526b2a989b8c4a8897429d	asr_public_stories_1_mp3.tar.gz
audio	42b9c8c2e31100d6c5b972c9ac000167	private_buriy_audiobooks_2_mp3.tar.gz
audio	7a5704721012fafa115e7316e5f6e058	public_lecture_1_mp3.tar.gz
audio	16cf820330f9f8b388395d777b2331ac	public_series_1_mp3.tar.gz
audio	dd048e7110c0c852c353759dad8fec0f	public_youtube700_mp3.tar.gz
audio	579e9d98bd159a27d3573641edee69b0	ru_ru_mp3.tar.gz
audio	177b041594684623ec7d038613e1330d	russian_single_mp3.tar.gz
audio	d7ce4c4116dcc655be2b466f82c98b6e	tts_russian_addresses_rhvoice_4voices_mp3.tar.gz
audio	25ea6d9e249a242ecc217acc28c8077b	voxforge_ru_mp3.tar.gz
manifest	b0ce7564ba90b121aeb13aada73a6e30	asr_public_phone_calls_1.csv
manifest	6867d14dfdec1f9e9b8ca2f1de9ceda6	asr_public_phone_calls_2.csv
manifest	0bdd77e15172e654d9a1999a86e92c7f	asr_public_stories_1.csv
manifest	f388013039d94dc36970547944db51c7	asr_public_stories_2.csv
manifest	3b67e27c1429593cccbf7c516c4b582d	private_buriy_audiobooks_2.csv
manifest	04027c20eb3aff05f6067957ecff856b	public_lecture_1.csv
manifest	89da3f1b6afcd4d4936662ceabf3033e	public_series_1.csv
manifest	a81dfb018c88d0ecd5194ab3d8ff6c95	public_youtube700.csv
manifest	c858f020729c34ba0ab525bbb8950d0c	ru_RU.csv
manifest	0275525914825dec663fd53390fdc9a0	russian_single.csv
manifest	52f406f4e30fcc8c634f992befd91beb	tts_russian_addresses_rhvoice_4voices.csv
audio	7533581bb26975212817bcacb25546d0	asr_public_stories_2.tar.gz
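
The checksum comparison can also be automated. The sketch below is not the repository's tooling: the function names are our own, and it assumes the checksum list is available as text with three whitespace-separated columns (type, md5sum, file), as in the table above. The actual layout of md5sum.lst may differ, so check it first.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def parse_checksums(text):
    """Parse 'type md5sum file' rows into {file_name: md5}.

    Assumed layout: rows whose second column is not a 32-character
    digest (e.g. the header row) are skipped.
    """
    expected = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and len(parts[1]) == 32:
            expected[parts[2]] = parts[1].lower()
    return expected


def verify(path, expected):
    """Return True if the file is listed and its MD5 matches the listed one."""
    name = Path(path).name
    return name in expected and md5_of_file(path) == expected[name]
```

Reading in blocks matters here because the archives are tens of gigabytes and should not be loaded into memory in one piece.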

End to end download scripts

You can use this script with this config file. Please check the config first. You can also contribute a similar script in Python.

Annotation methodology

The dataset is compiled using open domain sources. Some audio types are annotated automatically and verified statistically / using heuristics.

Audio normalization

All files are normalized for easier / faster runtime augmentations and processing as follows:

Converted to mono, if necessary;
Converted to 16 kHz sampling rate, if necessary;
Stored as 16-bit integers;
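
To confirm that a .wav file follows the three properties listed above, you can inspect its header with the standard library. This is a minimal sketch with a hypothetical function name; it relies on Python's `wave` module, which only reads PCM-encoded files and raises `wave.Error` on other encodings.

```python
import wave


def is_normalized_wav(path, expected_sr=16000):
    """Return True if the file is mono, 16-bit PCM, at the expected sampling rate."""
    with wave.open(str(path), "rb") as wf:
        return (wf.getnchannels() == 1          # mono
                and wf.getsampwidth() == 2      # 2 bytes per sample = int16
                and wf.getframerate() == expected_sr)
```

Only the header is read, so the check is cheap enough to run over a whole dataset.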

On disk DB methodology

Each audio file is hashed. The hash is used to build a folder hierarchy, which keeps file system operations efficient by limiting the number of files per folder.

target_format = 'wav'
wavb = wav.tobytes()

f_hash = hashlib.sha1(wavb).hexdigest()

store_path = Path(root_folder,
                  f_hash[0],
                  f_hash[1:3],
                  f_hash[3:15]+'.'+target_format)
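
To see why this layout helps, here is a self-contained sketch of the same path scheme using only the standard library. The function name `diskdb_path` is our own, and the files-per-folder estimate assumes the SHA-1 hashes are spread evenly over the folders.

```python
import hashlib
from pathlib import Path


def diskdb_path(raw_bytes, root_folder, target_format="wav"):
    """Map raw audio bytes to root/<1 hex char>/<2 hex chars>/<12 hex chars>.<ext>."""
    f_hash = hashlib.sha1(raw_bytes).hexdigest()
    return Path(root_folder,
                f_hash[0],
                f_hash[1:3],
                f_hash[3:15] + "." + target_format)


# 16 first-level folders, each with 16 * 16 = 256 second-level folders
leaf_folders = 16 * 256
# total number of utterances, taken from the dataset composition table
total_files = 4_657_291
files_per_folder = total_files / leaf_folders
```

With 4,096 leaf folders and about 4.6m files, each folder ends up with roughly 1,100 files. Because the path depends only on the content, identical audio always maps to the same location.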

Helper functions

Use helper functions from here for easier work with manifest files.

Read manifests

See example

from utils.open_stt_utils import read_manifest

manifest_df = read_manifest('path/to/manifest.csv')

Merge, check and save manifests

See example

from utils.open_stt_utils import (plain_merge_manifests,
                                  check_files,
                                  save_manifest)
train_manifests = [
 'path/to/manifest1.csv',
 'path/to/manifest2.csv',
]
train_manifest = plain_merge_manifests(train_manifests,
                                        MIN_DURATION=0.1,
                                        MAX_DURATION=100)
check_files(train_manifest)
save_manifest(train_manifest,
             'my_manifest.csv')

Contacts

Please contact us here or just create a GitHub issue!

Authors in alphabetic order:

Anna Slizhikova;
Alexander Veysov;
Dmitry Voronin;
Yuri Baburov;

FAQ

0. Why not MP3? MP3 encoding / decoding

Encoding

Mostly we used pydub (via ffmpeg) to convert to MP3. We omitted blank files (YouTube mostly). We used the following parameters:

16kHz;
32 kbps;
Mono;

As a rule of thumb, 128-192 kbps is enough for music sampled at 44 kHz, and 64-96 kbps is enough for speech. Here the audio is mono, 16 kHz, and usually has only one speaker, so 32 kbps is a reasonable choice. We did not use other formats like .ogg, because .mp3 is much more popular.

See example

from pydub import AudioSegment

# temp_path: source WAV file; store_mp3_path: destination MP3 file
sound = AudioSegment.from_file(temp_path,
                               format="wav")

file_handle = sound.export(store_mp3_path,
                           format="mp3",
                           bitrate="32k",               # 32 kbps
                           parameters=["-ar", "16000",  # 16 kHz
                                       "-ac", "1"])     # mono
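
The encoding parameters above can be sanity-checked with back-of-envelope arithmetic. The sketch below (our own helper, not repository code) compares the bitrate of uncompressed 16 kHz, 16-bit, mono audio with the 32 kbps MP3 setting, and with the total sizes from the Links table.

```python
def pcm_bitrate_kbps(sample_rate_hz, bits_per_sample, channels):
    """Bitrate of uncompressed PCM audio in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000


wav_kbps = pcm_bitrate_kbps(16000, 16, 1)   # 256.0 kbps for int16 mono at 16 kHz
mp3_kbps = 32
compression_ratio = wav_kbps / mp3_kbps     # 8.0

# Ratio of the totals from the Links table: 431 GB of wav vs 52 GB of mp3
observed_ratio = 431 / 52
```

The theoretical 8x and the observed ratio of slightly above 8x agree with the 7-8x download time reduction quoted at the top. Part of the observed figure comes from the blank files that were dropped from the MP3 archives.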

Decoding

It is up to you, but to save space and spare CPU during training, we suggest the following pipeline to extract the files:

See example

# you can also use pydub, torchaudio, sox or whatever
# we ended up using scipy for speed
# this example also includes hashing step which is not necessary
import librosa
import hashlib
import numpy as np
from pathlib import Path
from scipy.io import wavfile

def save_wav_diskdb(wav,
                    root_folder='../data/ru_open_stt/',
                    target_sr=16000):
    # expect a 1-D (mono) int16 numpy array
    assert isinstance(wav, np.ndarray)
    assert wav.dtype == np.dtype('int16')
    assert wav.ndim == 1

    target_format = 'wav'
    wavb = wav.tobytes()

    # f_path = Path(audio_path)
    f_hash = hashlib.sha1(wavb).hexdigest()

    store_path = Path(root_folder,
                      f_hash[0],
                      f_hash[1:3],
                      f_hash[3:15]+'.'+target_format)

    store_path.parent.mkdir(parents=True,
                            exist_ok=True)

    wavfile.write(filename=str(store_path),
                  rate=target_sr,
                  data=wav)

    return str(store_path)

root_folder = '../data/'
# save to int16, mono, 16 kHz to save space
target_dtype = np.dtype('int16')
target_sr = 16000
# librosa reads mp3
wav, sr = librosa.load(source_mp3_path,
                       mono=True,
                       sr=target_sr)

# librosa returns float32 in [-1, 1]; clip before scaling so that
# resampling overshoot cannot wrap around when cast to int16
wav = (np.clip(wav, -1.0, 1.0) * 32767).astype(target_dtype)

wav_path = save_wav_diskdb(wav,
                           root_folder=root_folder,
                           target_sr=target_sr)

1. Issues with reading files

Maybe try this approach:

See example

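import numpy as np
from scipy.io import wavfile

# path: location of the WAV file to read
sample_rate, sound = wavfile.read(path)

# cast to float first: np.abs on int16 overflows for the value -32768
sound = sound.astype('float32')
abs_max = np.abs(sound).max()
if abs_max > 0:
    sound *= 1 / abs_max  # peak-normalize to [-1, 1]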

2. Why share such dataset?

We are not altruists; life is simply not a zero-sum game.

Consider the progress in computer vision, which was made possible by:

Public datasets;
Public pre-trained models;
Open source frameworks;
Open research;

STT does not enjoy the same attention from the ML community because it is data hungry and public datasets are lacking, especially for languages other than English. Ultimately, this leaves the general community worse off.

3. Known issues with the dataset to be fixed

~~Blank files in the YouTube dataset~~. Removed in the mp3 archives. The metadata has not been cleaned yet;
Some files have low values / crash with torchaudio;
scipy does not always seem to write metadata when saving wavs (or you may need to save an (N, 1)-shaped array); this can be fixed as shown above;

License

Dual license: cc-by-nc, with commercial usage available after agreement with the dataset authors. Exceptions: VoxForge is licensed under GNU GPL 3.0, and the licence of the Caito.de dataset is here.

Donations

Donate (each coffee pays for several full downloads) / use our DO referral link to help.
