
Prepare CSJ #851

Merged: 9 commits from teowenshen:csj into lhotse-speech:master on Oct 17, 2022

Conversation

@teowenshen (Contributor)

Prepare for CSJ

This pull request diverges from the standard prepare of other recipes on a few points. I was hoping to get your input on these points in order to standardise my prepare_csj before the merge.

  1. Train-Valid Splitting
    CSJ does not define a separate validation dataset, so I follow the splitting scheme in espnet2, where 4000 utterances from the training datasets are picked as validation. Because one recording corresponds to multiple supervisions, I first create a concatenated cutset, trim it to its supervisions, then split the resulting cutset. The train cutset is further speed-perturbed here. I then return the cuts in the return value of prepare_csj instead of recordings and supervisions. (A sketch of this flow is at the end of this description.)

    • Is there a recommended way to split the resulting cutset back into its constituent recordingset and supervisionset so that they can be returned?
    • Alternatively, is there a recommended way or an existing recipe to refer to for this train-valid splitting?
  2. Cutset saved to disc
    Since the cutsets are already created in this step, I saved them directly to disc so that compute_fbank_csj.py just loads the cuts_XXX.jsonl.gz. What do you think about this flow? (Once I can extract the recordingset and supervisionset from the cutset, I will save those to disc too.)

  3. Transcript preparation
    Generating the transcript from the raw tsv file required a separate Python program. I start prepare_csj with the assumption that the transcript has been generated using that program, which I plan to include in the icefall recipe. What do you think?

Also, please let me know if I need to change anything else in my code. It's my first time using click, so I am not confident there.

@jtrmal (Collaborator) commented Oct 12, 2022 via email

@desh2608 (Collaborator)

> Prepare for CSJ
>
> This pull request diverges from the standard prepare of other recipes on a few points. I was hoping to get your input on these points in order to standardise my prepare_csj before the merge.
>
> 1. Train-Valid Splitting
>    CSJ does not define a separate validation dataset, so I follow the splitting scheme in espnet2, where 4000 utterances from the training datasets are picked as validation. Because one recording corresponds to multiple supervisions, I first create a concatenated cutset, trim it to its supervisions, then split the resulting cutset. The train cutset is further speed-perturbed here. I then return the cuts in the return value of prepare_csj instead of recordings and supervisions.
>
>    • Is there a recommended way to split the resulting cutset back into its constituent recordingset and supervisionset so that they can be returned?
>    • Alternatively, is there a recommended way or an existing recipe to refer to for this train-valid splitting?

In general, if the corpus does not provide a data split, we just return the consolidated recordings and supervisions instead of creating our own splits, and the user can then split it in the way they want depending on their use case. See the cmu_kids recipe for example.

> 2. Cutset saved to disc
>    Since the cutsets are already created in this step, I saved them directly to disc so that compute_fbank_csj.py just loads the cuts_XXX.jsonl.gz. What do you think about this flow? (Once I can extract the recordingset and supervisionset from the cutset, I will save those to disc too.)

You can use the CutSet.decompose() function to get the constituents of the cut set. But you won't need this if you only return the full recordings and supervisions.

> 3. Transcript preparation
>    Generating the transcript from the raw tsv file required a separate Python program. I start prepare_csj with the assumption that the transcript has been generated using that program, which I plan to include in the icefall recipe. What do you think?

I think this is reasonable. Some corpora require a bit of preprocessing to get them into a suitable format to read with a Python program. CHiME-5, for example, requires array synchronization, and LibriCSS requires running a segmentation script provided with the official data. You can just mention the details about this assumption in the header so that the user can prepare the data in the required format.

> Also, please let me know if I need to change anything else in my code. It's my first time using click, so I am not confident there.

@pzelasko (Collaborator)

Thanks, the recipe looks good overall! Can you also update the table in docs/corpus.rst?

Regarding the transcripts, Lhotse recipes are generally supposed to work with the raw data distribution as-is and be the very first step of the pipeline. I would suggest incorporating your separate transcript-preparation script as a function that's called inside this data prep recipe. I don't know CSJ; maybe the transcript preprocessing is very involved and not suitable for a Lhotse recipe, but I'd like you to at least attempt to convince me first :)

Honestly, I would avoid returning cuts as done here. I agree with Desh that this is something typically done in downstream applications; if there's no dev set, we just don't provide it. The code you wrote for cuts could be moved to the Icefall recipe.

@jtrmal (Collaborator) commented Oct 12, 2022 via email

@desh2608 (Collaborator)

@jtrmal Yeah, in such cases we would save as recordings_<corpus>_all.jsonl.gz instead.

@teowenshen (Contributor, Author)

Added some changes:

  • Returned supervisions and recordings instead of a cutset
  • Parsed dataset parts as-is (the train dataset is actually a concatenation of the core and noncore parts; here, I return them as separate dataset parts)
  • Added gender info in SupervisionSegment
  • Updated docstrings

> In general, if the corpus does not provide a data split, we just return the consolidated recordings and supervisions instead of creating our own splits, and the user can then split it in the way they want depending on their use case. See the cmu_kids recipe for example.

Yes. Probably the best way is to have a separate program for the dataset preparation before compute_fbank_csj.py.

> Can you also update the table in docs/corpus.rst?

Done.

> Regarding the transcripts, Lhotse recipes are generally supposed to work with the raw data distribution as-is and be the very first step of the pipeline. I would suggest incorporating your separate transcript-preparation script as a function that's called inside this data prep recipe. I don't know CSJ; maybe the transcript preprocessing is very involved and not suitable for a Lhotse recipe, but I'd like you to at least attempt to convince me first :)

It's a 700-line program that reads in a config file and optionally a segment file, and my prepare.sh runs it repeatedly for each transcript mode. 😂 However, I do understand the motivation behind consolidating all transcript preparation scripts into lhotse.

I plan to update my Icefall recipe and pull in a working version in Icefall. I will link it here once it's done.

Ideally, I was thinking that maybe a separate command in lhotse could be used for these processes: lhotse parse?

@teowenshen (Contributor, Author)

I have just sent in the pull request to Icefall for the data preparation part of the recipe.

This is the file that generates the transcripts from the raw tsv file: csj_make_transcript.py

The transcript generation is so involved, partly because there are many tags that mark slurring, laughing, clapping, stuttering, orthography variation, etc.

Another Japanese corpus that we plan to work on (not so soon) also requires an external script to parse the transcript from raw subtitle files. That corpus probably will have the same issue.

@desh2608 (Collaborator)

Can you fix the style checks? You can use:

pre-commit install
pre-commit run

@teowenshen (Contributor, Author)

Okay, I think I have cleared the style checks.

@desh2608 (Collaborator)

> Okay, I think I have cleared the style checks.

Black seems to be failing still.

@teowenshen (Contributor, Author)

Somehow, pre-commit shows that there are no files to check. Anyhow, I ran black again. I had been splitting the long lines myself, which was why black complained.

check that executables have shebangs.................(no files to check)Skipped
fix end of files.....................................(no files to check)Skipped
mixed line ending....................................(no files to check)Skipped
trim trailing whitespace.............................(no files to check)Skipped
flake8...............................................(no files to check)Skipped
isort................................................(no files to check)Skipped
black................................................(no files to check)Skipped

@desh2608 previously approved these changes Oct 14, 2022
@pzelasko (Collaborator)

I am almost convinced to merge as-is, but actually I have one issue. This would be the first Lhotse recipe that requires an undocumented, external step to prepare the data in order to generate the manifests. It introduces an implicit dependency on Icefall, where the script would live, but the user doesn't have any way to find out about it. I don't think this will be useful to anybody outside of Icefall recipe users.

I don't have an issue with adding a 700 line script that does text preprocessing. It shouldn't make a difference if it lives in Lhotse or in Icefall, except that when it lives in Lhotse, the data prep recipe works out of the box even for non-Icefall users. I think it should just be a part of prepare_csj as the very first function called in the recipe. This way this recipe will be compatible with all of the other Lhotse recipes.

@jtrmal (Collaborator) commented Oct 14, 2022 via email

@desh2608 (Collaborator)

I second Piotr. If the conf.ini files are needed for the corpus preparation, they can be put on a server (e.g. openslr) and downloaded inside the preparation recipe.

@teowenshen (Contributor, Author)

Yes, I would also like to standardise this corpus with the other corpora.

Is it okay to add more optional arguments behind lhotse prepare? I will include a default config in the code. If the user wants to change the behaviour, they can do so by supplying the path to their own modified config file.

@pzelasko (Collaborator)

Yes, various recipes have their own recipe-specific optional arguments.

@teowenshen (Contributor, Author)

Great! I will update my code and send in a pull request soon.

@teowenshen (Contributor, Author)

I have moved the transcript generation part over to Lhotse, so now this recipe should perform the same functions as the other recipes. I included a default .ini file directly inside the code so that this recipe is as self-sufficient as possible.

@pzelasko changed the title from [WIP] Prepare CSJ to Prepare CSJ on Oct 17, 2022
@pzelasko merged commit c73a22c into lhotse-speech:master on Oct 17, 2022
@pzelasko (Collaborator)

Thanks! Great work.

@teowenshen deleted the csj branch October 18, 2022 00:11