THCHS30

This is the data part of the THCHS30 2015 acoustic data & scripts dataset.

The dataset is described in more detail in the paper "THCHS-30: A Free Chinese Speech Corpus" by Dong Wang and Xuewei Zhang.

A paper (if it can be called a paper) from 13 years ago regarding the database:

Dong Wang, Dalei Wu, Xiaoyan Zhu, TCMSD: A new Chinese Continuous Speech Database, International Conference on Chinese Computing (ICCC'01), 2001, Singapore.

The layout of this data pack is the following:

``data``
  ``*.wav``
    audio data

  ``*.wav.trn``
    transcriptions

``{train,dev,test}``
  contain symlinks into the ``data`` directory for both audio and transcription files. Contents of these directories define the train/dev/test split of the data.

``{lm_word}``
  ``word.3gram.lm``
    trigram LM based on word

  ``lexicon.txt``
    lexicon based on word

``{lm_phone}``
  ``phone.3gram.lm``
    trigram LM based on phone

  ``lexicon.txt``
    lexicon based on phone

``README.TXT``
  this file
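
The layout above suggests a simple way to enumerate one split: list the ``*.wav`` entries in a split directory and pair each with its ``*.wav.trn`` neighbour, following symlinks back into ``data``. The following is a minimal sketch under that assumption only; the function name ``list_split`` and the example root path are hypothetical, and nothing here depends on the internal format of the transcription files.

```python
from pathlib import Path


def list_split(root, split):
    """Return (wav_path, trn_path) pairs for one split directory.

    Assumes <root>/<split>/ holds entries named X.wav and X.wav.trn,
    which may be symlinks into <root>/data/ (as the layout describes).
    """
    split_dir = Path(root) / split
    pairs = []
    # "*.wav" does not match "X.wav.trn", so only audio entries are listed.
    for wav in sorted(split_dir.glob("*.wav")):
        trn = wav.with_name(wav.name + ".trn")
        # exists() follows symlinks, so dangling links are skipped.
        if trn.exists():
            # resolve() follows symlinks, so the paths point at the real files.
            pairs.append((wav.resolve(), trn.resolve()))
    return pairs


if __name__ == "__main__":
    # Hypothetical location of the unpacked data pack.
    for wav, trn in list_split("data_thchs30", "train")[:3]:
        print(wav.name, trn.name)
```

Resolving the links up front means later code can open the files without caring which split directory they were found through.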

Data statistics

Statistics for the data are as follows:

===========  =================  ==========  ===========
**dataset**  **audio (hours)**  **#sents**  **#words**
===========  =================  ==========  ===========
    train           25            10,000      198,252
    dev            2:14            893         17,743
    test           6:15           2,495        49,085
===========  =================  ==========  ===========
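
As a quick sanity check on the table, the totals and the average sentence length per split can be derived from the figures above. This is a small sketch, assuming the audio column is given in hours or hours:minutes; the names ``STATS``, ``to_minutes`` and ``summarize`` are illustrative only.

```python
# Figures copied from the statistics table; audio is read as "H" or "H:MM".
STATS = {
    "train": {"audio": "25", "sents": 10_000, "words": 198_252},
    "dev": {"audio": "2:14", "sents": 893, "words": 17_743},
    "test": {"audio": "6:15", "sents": 2_495, "words": 49_085},
}


def to_minutes(audio):
    """Convert 'H' or 'H:MM' to whole minutes."""
    hours, _, minutes = audio.partition(":")
    return int(hours) * 60 + int(minutes or 0)


def summarize(stats):
    """Total sentences, words and audio, plus words per sentence by split."""
    total_minutes = sum(to_minutes(row["audio"]) for row in stats.values())
    return {
        "sents": sum(row["sents"] for row in stats.values()),
        "words": sum(row["words"] for row in stats.values()),
        "audio": f"{total_minutes // 60}:{total_minutes % 60:02d}",
        "words_per_sent": {
            name: round(row["words"] / row["sents"], 2)
            for name, row in stats.items()
        },
    }


summary = summarize(STATS)
print(summary)
```

Under that reading of the audio column, the three splits add up to 13,388 sentences, 265,080 words and 33:29 of audio, with a little under 20 words per sentence in each split.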
