Arguably the largest public Russian STT dataset to date:
- (new!) Now in .mp3 to reduce download time 7-8x;
- (new!) The whole .wav version is also available via torrent;
- ~4.6m utterances;
- ~4,000 hours;
- 431 GB (in .wav format, int16);
Prove us wrong! Open issues, collaborate, submit a PR, contribute, share your datasets! Let's make STT in Russian (and more) as open and available as CV models.
Planned releases:
- 1000+ additional hours of YouTube;
- 1000-10,000 additional hours of books;
- Some validation / test sets;
- Plain benchmarks, "bad files";
- Mp3 torrent;
- Wav torrent;
- ... and more!
Table of contents
- Dataset composition
- Downloads
- Annotation methodology
- Audio normalization
- Disk db methodology
- Helper functions
- Contacts
- FAQ
- License
- Donations
Dataset | Utterances | Hours | GB | Av s/chars | Comment | Annotation | Quality/noise |
---|---|---|---|---|---|---|---|
public_youtube1500 (*) | | 1,500 | | | Coming soon | | |
audiobook_2 | 1,149,404 | 1,511 | 166 | 4.7s / 56 | Books | Alignment (*) | 95% / crisp |
public_youtube700 | 759,483 | 701 | 75 | 3.3s / 43 | Youtube videos | Subtitles | 95% / ~crisp |
tts_russian_addresses | 1,741,838 | 754 | 81 | 1.6s / 20 | Russian addresses | TTS 4 voices | 100% / crisp |
asr_public_phone_calls_2 | 603,797 | 601 | 66 | 3.6s / 37 | Phone calls | ASR | 70% / noisy |
asr_public_phone_calls_1 | 233,868 | 211 | 23 | 3.3s / 29 | Phone calls | ASR | 70% / noisy |
asr_public_stories_2 | 78,186 | 78 | 9 | 3.5s / 43 | Books | ASR | 80% / crisp |
asr_public_stories_1 | 46,142 | 38 | 4 | 3.0s / 30 | Books | ASR | 80% / crisp |
public_series_1 | 20,243 | 17 | 2 | 3.1s / 38 | Youtube videos | Subtitles | 95% / ~crisp |
ru_RU | 5,826 | 17 | 2 | 11s / 12 | Public dataset | Alignment | 99% / crisp |
voxforge_ru | 8,344 | 17 | 2 | 7.5s / 77 | Public dataset | Reading | 100% / crisp |
russian_single | 3,357 | 9 | 1 | 9.3s / 102 | Public dataset | Alignment | 99% / crisp |
public_lecture_1 | 6,803 | 6 | 1 | 3.4s / 47 | Lectures | Subtitles | 95% / crisp |
Total | 4,657,291 | 3,961 | 431 |
(*) Automatic alignment
This alignment was performed using Yuri's alignment tool. Contact him if you need alignment for your own dataset.
Updates:
- Shared a wav version via torrent.
- Added the forgotten txt files to the mp3 archives; updating the torrent.
- Created a torrent and uploaded it to academictorrents.
- Quickly converted the dataset to MP3 thanks to the community! Waiting for our academictorrents account to be approved; v0.4 will boast MP3 download links.
If you want to support the project, you can:
- Help us with hosting (create a mirror) / provide a reliable node for torrent;
- Help us with writing some helper functions;
- Donate (each coffee pays for several full downloads) / use our DO referral link to help;
We are converting the dataset to MP3 now.
Please contact us using the contacts below if you would like to help.
Save us a couple of bucks, download via torrent:
You can download separate files via torrent. Try several torrent clients if some do not work.
Metadata file.
Dataset | GB, wav | GB, mp3 | Wav | Mp3 | Source | Manifest |
---|---|---|---|---|---|---|
audiobook_2 | 166 | 21.0 | down | part1 | Sources from the Internet + alignment | link |
asr_public_phone_calls_2 | 66 | 7.5 | down | part1 | Sources from the Internet + ASR | link |
asr_public_stories_2 | 9 | 1.1 | down | part1 | Sources from the Internet + alignment | link |
tts_russian_addresses_rhvoice_4voices | 80.9 | 9.9 | down | part1 | TTS | link |
public_youtube700 | 75.0 | 9.6 | down | part1 | YouTube videos | link |
asr_public_phone_calls_1 | 22.7 | 2.6 | down | part1 | Sources from the Internet + ASR | link |
asr_public_stories_1 | 4.1 | 0.5 | down | part1 | Public stories | link |
public_series_1 | 1.9 | 0.2 | down | part1 | Public series | link |
ru_RU | 1.9 | 0.2 | down | part1 | Caito.de dataset | link |
voxforge_ru | 1.9 | 0.2 | down | part1 | Voxforge dataset | link |
russian_single | 0.9 | 0.1 | down | part1 | Russian single speaker dataset | link |
public_lecture_1 | 0.7 | 0.1 | down | part1 | Sources from the Internet | link |
Total | 431 | 52 |
- Download each dataset separately:
Via wget
wget https://ru-open-stt.ams3.digitaloceanspaces.com/some_file
For multi-threaded downloads, use aria2 with the -x flag, e.g.:
aria2c -c -x5 https://ru-open-stt.ams3.digitaloceanspaces.com/some_file
If necessary, merge chunks like this:
cat ru_open_stt_v01.tar.gz_* > ru_open_stt_v01.tar.gz
- Download the meta data and manifests for each dataset:
- Merge files (where applicable), unpack and enjoy!
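If you prefer doing the unpacking step in Python rather than on the command line, here is a minimal sketch using the standard library (the archive name and target folder are placeholders):

```python
import tarfile

# Assumes the chunks have already been merged into a single archive
# (see the cat command above); paths are placeholders.
with tarfile.open('ru_open_stt_v01.tar.gz', 'r:gz') as archive:
    archive.extractall('../data/ru_open_stt/')
```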
Checksums (including links to deprecated files) are listed below. To verify a downloaded file, run:
md5sum /path/to/downloaded/file
type | md5sum | file |
---|---|---|
audio | f24e21c69c03062d667caf0f055244f2 | asr_public_stories_2_mp3.tar.gz |
audio | a6f888c53d7cbded85ab51627ef57c96 | asr_public_phone_calls_1_mp3.tar.gz |
audio | f707e34f488c62af2e3142085ff595ad | asr_public_phone_calls_2_mp3.tar.gz |
audio | baa491ed0b526b2a989b8c4a8897429d | asr_public_stories_1_mp3.tar.gz |
audio | 42b9c8c2e31100d6c5b972c9ac000167 | private_buriy_audiobooks_2_mp3.tar.gz |
audio | 7a5704721012fafa115e7316e5f6e058 | public_lecture_1_mp3.tar.gz |
audio | 16cf820330f9f8b388395d777b2331ac | public_series_1_mp3.tar.gz |
audio | dd048e7110c0c852c353759dad8fec0f | public_youtube700_mp3.tar.gz |
audio | 579e9d98bd159a27d3573641edee69b0 | ru_ru_mp3.tar.gz |
audio | 177b041594684623ec7d038613e1330d | russian_single_mp3.tar.gz |
audio | d7ce4c4116dcc655be2b466f82c98b6e | tts_russian_addresses_rhvoice_4voices_mp3.tar.gz |
audio | 25ea6d9e249a242ecc217acc28c8077b | voxforge_ru_mp3.tar.gz |
manifest | b0ce7564ba90b121aeb13aada73a6e30 | asr_public_phone_calls_1.csv |
manifest | 6867d14dfdec1f9e9b8ca2f1de9ceda6 | asr_public_phone_calls_2.csv |
manifest | 0bdd77e15172e654d9a1999a86e92c7f | asr_public_stories_1.csv |
manifest | f388013039d94dc36970547944db51c7 | asr_public_stories_2.csv |
manifest | 3b67e27c1429593cccbf7c516c4b582d | private_buriy_audiobooks_2.csv |
manifest | 04027c20eb3aff05f6067957ecff856b | public_lecture_1.csv |
manifest | 89da3f1b6afcd4d4936662ceabf3033e | public_series_1.csv |
manifest | a81dfb018c88d0ecd5194ab3d8ff6c95 | public_youtube700.csv |
manifest | c858f020729c34ba0ab525bbb8950d0c | ru_RU.csv |
manifest | 0275525914825dec663fd53390fdc9a0 | russian_single.csv |
manifest | 52f406f4e30fcc8c634f992befd91beb | tts_russian_addresses_rhvoice_4voices.csv |
audio | 7533581bb26975212817bcacb25546d0 | asr_public_stories_2.tar.gz |
You can use this script with this config file. Please check the config first. You can also contribute a similar script in Python.
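As a small example of the kind of helper that could be contributed, here is a minimal sketch for verifying a downloaded archive against the checksum table above (the file name and expected hash are copied from one row of that table):

```python
import hashlib

def md5_of_file(path, chunk_size=2 ** 20):
    # Hash the file in chunks so large archives do not have to fit in memory.
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
    return md5.hexdigest()

# Example row from the checksum table above.
expected = '25ea6d9e249a242ecc217acc28c8077b'
assert md5_of_file('voxforge_ru_mp3.tar.gz') == expected, 'Checksum mismatch, re-download the file'
```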
The dataset is compiled using open domain sources. Some audio types are annotated automatically and verified statistically / using heuristics.
All files are normalized for easier / faster runtime augmentations and processing as follows:
- Converted to mono, if necessary;
- Converted to 16 kHz sampling rate, if necessary;
- Stored as 16-bit integers;
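For illustration, a minimal sketch of this normalization using pydub (the input/output paths are placeholders; the pipeline we actually suggest for extraction is shown in the disk db section below):

```python
from pydub import AudioSegment

# Placeholder paths: any input file readable by ffmpeg.
sound = AudioSegment.from_file('input_audio.mp3')
sound = (sound
         .set_channels(1)        # mono
         .set_frame_rate(16000)  # 16 kHz
         .set_sample_width(2))   # 16-bit samples
sound.export('normalized.wav', format='wav')
```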
Each audio file is hashed. Its hash is used to create a folder hierarchy for more optimal filesystem operation.
import hashlib
from pathlib import Path

# wav is an int16 numpy array, root_folder is the dataset root
target_format = 'wav'
wavb = wav.tobytes()
# the sha1 hash of the raw audio bytes defines the folder hierarchy,
# e.g. root_folder/a/bc/def012345678.wav
f_hash = hashlib.sha1(wavb).hexdigest()
store_path = Path(root_folder,
                  f_hash[0],
                  f_hash[1:3],
                  f_hash[3:15] + '.' + target_format)
Use helper functions from here for easier work with manifest files.
See example
from utils.open_stt_utils import read_manifest
manifest_df = read_manifest('path/to/manifest.csv')
See example
from utils.open_stt_utils import (plain_merge_manifests,
check_files,
save_manifest)
train_manifests = [
'path/to/manifest1.csv',
'path/to/manifest2.csv',
]
train_manifest = plain_merge_manifests(train_manifests,
MIN_DURATION=0.1,
MAX_DURATION=100)
check_files(train_manifest)
save_manifest(train_manifest,
'my_manifest.csv')
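As a quick sanity check (a sketch, assuming read_manifest returns a pandas DataFrame as in the example above), you can reload the merged manifest and inspect its size:

```python
from utils.open_stt_utils import read_manifest

# Reload the manifest written by save_manifest above.
merged_df = read_manifest('my_manifest.csv')
print(len(merged_df))  # number of utterances kept after duration filtering
```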
Please contact us here or just create a GitHub issue!
Authors (in alphabetical order):
- Anna Slizhikova;
- Alexander Veysov;
- Dmitry Voronin;
- Yuri Baburov;
Mostly we used pydub (via ffmpeg) to convert to MP3.
We omitted blank files (YouTube mostly).
We used the following parameters:
- 16kHz;
- 32 kbps;
- Mono;
Usually 128-192 kbps is enough for music at a 44.1 kHz sampling rate, and 64-96 kbps is enough for speech.
But here we have mono, 16 kHz audio, usually with only one speaker, so 32 kbps was a good choice.
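As a rough sanity check of these numbers, a back-of-envelope sketch using the totals from the dataset table above:

```python
# Rough size estimate for ~3,961 hours of mono 16 kHz audio.
seconds = 3961 * 3600

wav_bytes_per_second = 16000 * 2    # 16-bit (2-byte) samples at 16 kHz
mp3_bytes_per_second = 32_000 / 8   # 32 kbps

wav_gb = seconds * wav_bytes_per_second / 1e9  # ~456 GB, in the ballpark of the ~431 GB wav total
mp3_gb = seconds * mp3_bytes_per_second / 1e9  # ~57 GB, in the ballpark of the ~52 GB mp3 total
print(round(wav_gb), round(mp3_gb))
```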
We did not use other formats like .ogg, because .mp3 is much more popular.
See example
from pydub import AudioSegment

# temp_path is the source wav file, store_mp3_path is the target mp3 path
sound = AudioSegment.from_file(temp_path,
                               format="wav")
# force 16 kHz mono output at 32 kbps via ffmpeg parameters
file_handle = sound.export(store_mp3_path,
                           format="mp3",
                           parameters=["-ar", "16000", "-ac", "1"],
                           bitrate="32k")
It is up to you, but to save space and spare CPU during training, we suggest the following pipeline for extracting the files:
See example
# you can also use pydub, torchaudio, sox or whatever
# we ended up using scipy for speed
# this example also includes hashing step which is not necessary
import librosa
import hashlib
import numpy as np
from pathlib import Path
from scipy.io import wavfile
def save_wav_diskdb(wav,
root_folder='../data/ru_open_stt/',
target_sr=16000):
assert type(wav) == np.ndarray
assert wav.dtype == np.dtype('int16')
assert len(wav.shape)==1
target_format = 'wav'
wavb = wav.tobytes()
# f_path = Path(audio_path)
f_hash = hashlib.sha1(wavb).hexdigest()
store_path = Path(root_folder,
f_hash[0],
f_hash[1:3],
f_hash[3:15]+'.'+target_format)
store_path.parent.mkdir(parents=True,
exist_ok=True)
wavfile.write(filename=str(store_path),
rate=target_sr,
data=wav)
return str(store_path)
root_folder = '../data/'
# save to int16, mono, 16 kHz to save space
target_dtype = np.dtype('int16')
target_sr = 16000
# librosa reads mp3
wav, sr = librosa.load(source_mp3_path,
mono=True,
sr=target_sr)
# librosa converts to float32 by default
wav = (wav * 32767).astype(target_dtype) # cast to int
wav_path = save_wav_diskdb(wav,
root_folder=root_folder,
target_sr=target_sr)
See example
import numpy as np
from scipy.io import wavfile

# path points to one of the int16 wav files stored above;
# read it and scale to float32 in [-1, 1]
sample_rate, sound = wavfile.read(path)
abs_max = np.abs(sound).max()
sound = sound.astype('float32')
if abs_max > 0:
    sound *= 1 / abs_max
We are not altruists; life just is not a zero-sum game.
Consider the progress in computer vision, which was made possible by:
- Public datasets;
- Public pre-trained models;
- Open source frameworks;
- Open research;
STT does not enjoy the same attention from the ML community because it is data-hungry and public datasets are lacking, especially for languages other than English. Ultimately this leads to a worse-off situation for the general community.
Known issues:
- Blank files in the YouTube dataset; removed in the mp3 archive, but the metadata is not cleaned yet;
- Some files have low values / crash with torchaudio;
- It looks like scipy does not always write metadata when saving wavs (or you should save an (N, 1)-shaped file) - this can be fixed as shown above;
Dual license: CC BY-NC, with commercial usage available after agreement with the dataset authors. Exceptions: VoxForge is licensed under GNU GPL 3.0, and the Caito.de dataset's license is here.
Donate (each coffee pays for several full downloads) / use our DO referral link to help.