
Prepare CSJ #851

Merged: 9 commits from teowenshen:csj into lhotse-speech:master on Oct 17, 2022

Conversation

@teowenshen (Contributor)

Prepare for CSJ

This pull request diverges from the standard prepare of other recipes on a few points. I was hoping to get your input on these points in order to standardise my prepare_csj before the merge.

  1. Train-Valid Splitting
    CSJ does not define a separate validation dataset, so I follow the splitting scheme in espnet2, where 4000 utterances from the training datasets are picked as validation. Because one recording corresponds to multiple supervisions, I first create a concatenated cutset, trim it to its supervisions, then split the resulting cutset. The train cutset is further speed-perturbed here. I then return the cuts in the return value of prepare_csj instead of recordings and supervisions. (A sketch of this flow is at the end of this description.)

    • Is there a recommended way to split the resulting cutset back into its constituent recordingset and supervisionset so that they can be returned?
    • Alternatively, is there a recommended way or an existing recipe to refer to for this train-valid splitting?
  2. Cutset saved to disc
    Since the cutsets are already created in this step, I saved them directly to disc so that compute_fbank_csj.py just loads the cuts_XXX.jsonl.gz. What do you think about this flow? (Once I can extract the recordingset and supervisionset from the cutset, I will save those to disc too.)

  3. Transcript preparation
    Generating the transcript from the raw tsv file required a separate Python program. I start prepare_csj with the assumption that the transcript has been generated using that program, which I plan to include in the icefall recipe. What do you think?

Also, please let me know if I need to change anything else in my code. It's my first time using click, so I am not confident there.

@jtrmal (Collaborator) commented Oct 12, 2022 via email

@desh2608 (Collaborator)

> Prepare for CSJ
>
> This pull request diverges from the standard prepare of other recipes on a few points. I was hoping to get your input on these points in order to standardise my prepare_csj before the merge.
>
> 1. Train-Valid Splitting
>    CSJ does not define a separate validation dataset, so I follow the splitting scheme in espnet2, where 4000 utterances from the training datasets are picked as validation. Because one recording corresponds to multiple supervisions, I first create a concatenated cutset, trim it to its supervisions, then split the resulting cutset. The train cutset is further speed-perturbed here. I then return the cuts in the return value of prepare_csj instead of recordings and supervisions.
>
>    • Is there a recommended way to split the resulting cutset back into its constituent recordingset and supervisionset so that they can be returned?
>    • Alternatively, is there a recommended way or an existing recipe to refer to for this train-valid splitting?

In general, if the corpus does not provide a data split, we just return the consolidated recordings and supervisions instead of creating our own splits, and the user can then split it in the way they want depending on their use case. See the cmu_kids recipe for example.

> 2. Cutset saved to disc
>    Since the cutsets are already created in this step, I saved them directly to disc so that compute_fbank_csj.py just loads the cuts_XXX.jsonl.gz. What do you think about this flow? (Once I can extract the recordingset and supervisionset from the cutset, I will save those to disc too.)

You can use the CutSet.decompose() function to get the constituents of the cut set. But you won't need this if you only return the full recordings and supervisions.

> 3. Transcript preparation
>    Generating the transcript from the raw tsv file required a separate Python program. I start prepare_csj with the assumption that the transcript has been generated using that program, which I plan to include in the icefall recipe. What do you think?

I think this is reasonable. Some corpora require a bit of preprocessing to get them into a suitable format to read with a Python program. CHiME-5, for example, requires array synchronization, and LibriCSS requires running a segmentation script provided with the official data. You can just mention the details about this assumption in the header so that the user can prepare the data in the required format.

> Also, please let me know if I need to change anything else in my code. It's my first time using click, so I am not confident there.

@pzelasko (Collaborator)

Thanks, the recipe looks good overall! Can you also update the table in docs/corpus.rst?

Regarding the transcripts, Lhotse recipes are generally supposed to work with the raw data distribution as-is and be the very first step of the pipeline. I would suggest incorporating your separate transcript-preparation script as a function that's called inside this data prep recipe. I don't know CSJ; maybe the transcript preprocessing is very involved and not suitable for a Lhotse recipe, but I'd like you to at least attempt to convince me first :)

Honestly, I would avoid returning cuts as done here. I agree with Desh that this is something typically done in downstream applications; if there's no dev set, we just don't provide it. The code you wrote for cuts could be moved to the Icefall recipe.

@jtrmal (Collaborator) commented Oct 12, 2022 via email

@desh2608 (Collaborator)

@jtrmal Yeah, in such cases we would save as recordings_<corpus>_all.jsonl.gz instead.

@teowenshen (Contributor, Author)

Added some changes:

  • Returned supervisions and recordings instead of a cutset
  • Parsed dataset parts as-is (the train dataset is actually a concatenation of the core and noncore parts; here, I return them as separate dataset parts)
  • Added gender info in SupervisionSegment
  • Updated docstrings

> In general, if the corpus does not provide a data split, we just return the consolidated recordings and supervisions instead of creating our own splits, and the user can then split it in the way they want depending on their use case. See the cmu_kids recipe for example.

Yes. Probably the best way is to have a separate program for the dataset preparation before compute_fbank_csj.py.

> Can you also update the table in docs/corpus.rst?

Done.

> Regarding the transcripts, Lhotse recipes are generally supposed to work with the raw data distribution as-is and be the very first step of the pipeline. I would suggest incorporating your separate transcript-preparation script as a function that's called inside this data prep recipe. I don't know CSJ; maybe the transcript preprocessing is very involved and not suitable for a Lhotse recipe, but I'd like you to at least attempt to convince me first :)

It's a 700-line program that reads in a config file and optionally a segment file, and my prepare.sh runs it repeatedly for each transcript mode. 😂 However, I do understand the motivation behind consolidating all transcript preparation scripts into lhotse.

I plan to update my Icefall recipe and pull in a working version in Icefall. I will link it here once it's done.

Ideally, I was thinking that maybe a separate command in lhotse could be used for these processes: lhotse parse?

@teowenshen (Contributor, Author)

I have just sent in the pull request to Icefall for the data preparation part of the recipe.

This is the file that generates the transcripts from the raw tsv file: csj_make_transcript.py

The transcript generation is so involved, partly because there are many tags that mark slurring, laughing, clapping, stuttering, orthography variation, etc.

Another Japanese corpus that we plan to work on (not so soon) also requires an external script to parse the transcript from raw subtitle files. That corpus probably will have the same issue.

@desh2608 (Collaborator)

Can you fix the style checks? You can use:

pre-commit install
pre-commit run

@teowenshen (Contributor, Author)

Okay, I think I have cleared the style checks.

@desh2608 (Collaborator)

> Okay, I think I have cleared the style checks.

Black seems to be failing still.

@teowenshen (Contributor, Author)

Somehow, pre-commit shows that there are no files to check. Anyhow, I ran black again. I had been splitting the long lines myself, which was why black complained.

check that executables have shebangs.................(no files to check)Skipped
fix end of files.....................................(no files to check)Skipped
mixed line ending....................................(no files to check)Skipped
trim trailing whitespace.............................(no files to check)Skipped
flake8...............................................(no files to check)Skipped
isort................................................(no files to check)Skipped
black................................................(no files to check)Skipped

@desh2608 previously approved these changes Oct 14, 2022
@pzelasko (Collaborator)

I am almost convinced to merge as-is, but actually I have one issue. This would be the first Lhotse recipe that requires an undocumented, external step to prepare the data in order to generate the manifests. It introduces an implicit dependency on Icefall, where the script would live, but the user doesn't have any way to find out about it. I don't think this will be useful to anybody outside of Icefall recipe users.

I don't have an issue with adding a 700 line script that does text preprocessing. It shouldn't make a difference if it lives in Lhotse or in Icefall, except that when it lives in Lhotse, the data prep recipe works out of the box even for non-Icefall users. I think it should just be a part of prepare_csj as the very first function called in the recipe. This way this recipe will be compatible with all of the other Lhotse recipes.

@jtrmal (Collaborator) commented Oct 14, 2022 via email

@desh2608 (Collaborator)

I second Piotr. If the conf.ini files are needed for the corpus preparation, they can be put on a server (e.g. openslr) and downloaded inside the preparation recipe.

@teowenshen (Contributor, Author)

Yes, I would also like to standardise this corpus with the other corpora.

Is it okay to add more optional arguments behind lhotse prepare? I will include a default config in the code. If the user wants to change the behaviour, they can do so by supplying the path to their own modified config file.

@pzelasko (Collaborator)

Yes, various recipes have their own recipe-specific optional arguments.

@teowenshen (Contributor, Author)

Great! I will update my code and send in a pull request soon.

@teowenshen (Contributor, Author)

I have moved the transcript generation part over to Lhotse, so now this recipe should perform the same functions as the other recipes. I included a default .ini file directly inside the code so that this recipe is as self-sufficient as possible.

@pzelasko changed the title from [WIP] Prepare CSJ to Prepare CSJ on Oct 17, 2022
@pzelasko merged commit c73a22c into lhotse-speech:master on Oct 17, 2022
@pzelasko (Collaborator)

Thanks! Great work.

@teowenshen deleted the csj branch October 18, 2022 00:11