Prepare CSJ #851
Conversation
I think, FWIW, not making the normalization part of this makes sense. But please document it somewhere (maybe even print it after data preparation) so that it's easy to figure out.
y.
…On Wed, Oct 12, 2022 at 1:22 AM Teo Wen Shen ***@***.***> wrote:
Prepare for CSJ
This pull request diverges from the standard prepare flow of other recipes in a few ways. I was hoping to get your input on these points in order to standardise my prepare_csj before the merge.

1. *Train-Valid Splitting*
CSJ does not define a separate validation dataset, so I follow the splitting scheme in espnet2, where 4000 utterances from the training datasets are picked as validation. Because one recording corresponds to multiple supervisions, I first create a concatenated cutset, trim that cutset to its supervisions, then split the resulting cutset. The train cutset is further speed-perturbed here. I then return these cuts as the return value of prepare_csj (see the sketch after this message).
- Is there a recommended way to split the resulting cutset into its constituent recordingset and supervisionset so that they can be returned?
- Alternatively, is there a recommended way, or an existing recipe to refer to, for this train-valid splitting?

2. *Cutset saved to disc*
Since the cutsets are already created in this step, I saved the cutsets directly to disc so that compute_fbank_csj.py just loads the cuts_XXX.jsonl.gz. What do you think about this flow? (Once I can extract the recordingset and supervisionset from the cutset, I will save those to disc too.)

3. *Transcript preparation*
Generating the transcript from the raw TSV file required a separate Python program. I start prepare_csj with the assumption that the transcript has been generated using that program, which I plan to include in the icefall recipe. What do you think?

Also, please let me know if I need to change anything else in my code. It's my first time using click, so I am not confident there.
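A minimal sketch of the splitting flow described in point 1 above, using lhotse's public CutSet API; the manifest paths and variable names are illustrative, not the recipe's actual ones, and the 4000-utterance validation size follows espnet2:

```python
from lhotse import CutSet, RecordingSet, SupervisionSet

# Illustrative paths, not the recipe's actual output names.
recordings = RecordingSet.from_file("csj_recordings.jsonl.gz")
supervisions = SupervisionSet.from_file("csj_supervisions.jsonl.gz")

# One cut per supervision segment.
cuts = CutSet.from_manifests(
    recordings=recordings, supervisions=supervisions
).trim_to_supervisions()

# Hold out 4000 utterances as validation, following espnet2.
valid_cuts = cuts.subset(first=4000)
valid_ids = frozenset(valid_cuts.ids)
train_cuts = cuts.filter(lambda c: c.id not in valid_ids)

# Speed perturbation (0.9x / 1.1x) applied to the training cuts only.
train_cuts = train_cuts + train_cuts.perturb_speed(0.9) + train_cuts.perturb_speed(1.1)
```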
In general, if the corpus does not provide a data split, we just return the consolidated recordings and supervisions instead of creating our own splits, and the user can then split it in the way they want depending on their use case. See the cmu_kids recipe for example.
You can use the CutSet.decompose() function to get the constituents of the cut set. But you won't need this if you only return the full recordings and supervisions.
I think this is reasonable. Some corpora require a bit of preprocessing to get them into a suitable format to read with a Python program. CHiME-5, for example, requires array synchronization, and LibriCSS requires running a segmentation script provided with the official data. You can just mention the details about the assumption in the header so that the user can prepare it in the format required.
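As a quick illustration of the decompose() suggestion (file paths here are made up):

```python
from lhotse import CutSet

cuts = CutSet.from_file("cuts_train.jsonl.gz")  # made-up path

# decompose() splits a CutSet back into its constituent manifests;
# any element of the tuple may be None if the cuts carry no such data.
recordings, supervisions, features = cuts.decompose()

recordings.to_file("csj_recordings_train.jsonl.gz")
supervisions.to_file("csj_supervisions_train.jsonl.gz")
```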
Thanks, the recipe looks good overall! Can you also update the table in docs/corpus.rst?
Regarding the transcripts, Lhotse recipes are generally supposed to work with the raw data distribution as-is and be the very first step of the pipeline. I would suggest incorporating your separate transcript-preparation script as a function that's called inside this data prep recipe. I don't know CSJ; maybe the transcript preprocessing is very involved and not suitable for a Lhotse recipe, but I'd like you to at least attempt to convince me first :)
Honestly, I would avoid returning cuts as done here. I agree with Desh that this is something typically done in downstream applications; if there's no dev set, we just don't provide one. The code you wrote for cuts could be moved to the Icefall recipe.
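For context, a sketch of the return shape other Lhotse recipes use, which is what this review is asking for; the "all" key follows the naming convention discussed just below, and the abbreviated signature is not the actual one:

```python
from typing import Dict, Union

from lhotse import RecordingSet, SupervisionSet

# Conventional return type of a Lhotse prepare_* function: one entry per
# corpus-defined subset, or a single "all" entry when the corpus has no split.
CsjManifests = Dict[str, Dict[str, Union[RecordingSet, SupervisionSet]]]

def prepare_csj(
    recordings: RecordingSet, supervisions: SupervisionSet
) -> CsjManifests:
    # Signature reduced to the essentials; the real recipe takes
    # corpus_dir / output_dir and builds these manifests itself.
    return {"all": {"recordings": recordings, "supervisions": supervisions}}
```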
I think we do, Desh -- that's even the change you were doing -- saving as "all_supervisions" vs "{dev,test,train}_supervisions"?
@jtrmal Yeah, in such cases we would save as "all_supervisions".
Added some changes:
Yes. Probably the best way is to have a separate program for the dataset preparation before compute_fbank_csj.py.
Done.
It's a 700-line program that reads in a config file and optionally a segment file. I plan to update my Icefall recipe and pull in a working version there; I will link it here once it's done. Ideally, I was thinking that maybe a separate command in lhotse could be used for these processes.
I have just sent in the pull request to Icefall for the data preparation part of the recipe. This is the file that generates the transcripts from the raw TSV file: csj_make_transcript.py. The transcript generation is so involved partly because there are many tags that mark slurring, laughing, clapping, stuttering, orthographic variation, etc. Another Japanese corpus that we plan to work on (not so soon) also requires an external script to parse the transcript from raw subtitle files, so that corpus will probably have the same issue.
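To give a flavour of the tag handling involved -- a toy illustration only, not the actual csj_make_transcript.py logic, and the "(F ...)" filler tag shown is just one assumed example of the CSJ tag inventory:

```python
import re

# CSJ-style transcripts wrap phenomena in tags, e.g. a filler like "(F えー)".
# One simple pass keeps the surface form of the filler and drops the tag.
FILLER_TAG = re.compile(r"\(F\s*([^)]*)\)")

def strip_filler_tags(line: str) -> str:
    return FILLER_TAG.sub(r"\1", line)

print(strip_filler_tags("(F えー)今日はいい天気です"))  # -> えー今日はいい天気です
```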
Can you fix the style checks? You can use pre-commit.
Okay, I think I have cleared the style checks.
Black still seems to be failing.
Somehow pre-commit shows that there are no files to check.
I am almost convinced to merge as-is, but I have one issue. This would be the first Lhotse recipe that requires an undocumented, external step to prepare the data before the manifests can be generated. It introduces an implicit dependency on Icefall, where the script would live, but the user doesn't have any way to find out about it. I don't think this will be useful to anybody outside of Icefall recipe users.

I don't have an issue with adding a 700-line script that does text preprocessing. It shouldn't make a difference whether it lives in Lhotse or in Icefall, except that when it lives in Lhotse, the data prep recipe works out of the box even for non-Icefall users. I think it should just be a part of prepare_csj, as the very first function called in the recipe. This way the recipe will be compatible with all of the other Lhotse recipes.
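A sketch of the structure being suggested here; parse_transcripts is a hypothetical stand-in for the 700-line script, not an existing Lhotse function:

```python
from pathlib import Path
from typing import Dict, Optional

from lhotse.utils import Pathlike

def parse_transcripts(corpus_dir: Path) -> Dict[str, str]:
    # Hypothetical stand-in for the 700-line transcript generator: it would
    # read the raw CSJ TSV files and return cleaned text keyed by utterance id.
    raise NotImplementedError

def prepare_csj(corpus_dir: Pathlike, output_dir: Optional[Pathlike] = None):
    corpus_dir = Path(corpus_dir)
    # Transcript preprocessing runs first, so the recipe works on the raw
    # CSJ distribution with no external, undocumented step.
    transcripts = parse_transcripts(corpus_dir)
    # ... build and return RecordingSet / SupervisionSet from `transcripts` ...
```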
makes sense to me -- I agree
y.
I second Piotr. If the conf.ini files are needed for the corpus preparation, they can be put on a server (e.g. OpenSLR) and downloaded inside the preparation recipe.
Yes, I would also like to standardise this corpus with the other corpora. Is it okay to add more optional arguments behind lhotse prepare? I will include a default config in the code; if the user wants to change the behaviour, they can supply the path to their own modified config file.
Yes, various recipes have their own recipe-specific optional arguments.
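For reference, a sketch of how a recipe-specific option can be wired into the lhotse prepare CLI with click, following the pattern of existing recipe CLI modules; the --config option and the prepare_csj signature are assumptions, not the merged implementation:

```python
import click

from lhotse.bin.modes import prepare
from lhotse.utils import Pathlike

@prepare.command(context_settings=dict(show_default=True))
@click.argument("corpus_dir", type=click.Path(exists=True, dir_okay=True))
@click.argument("output_dir", type=click.Path())
@click.option(
    "-c",
    "--config",
    type=click.Path(exists=True),
    default=None,
    help="Optional path to a user-modified .ini config (hypothetical option).",
)
def csj(corpus_dir: Pathlike, output_dir: Pathlike, config: Pathlike):
    """Corpus of Spontaneous Japanese data preparation."""
    from lhotse.recipes.csj import prepare_csj  # module added by this PR
    prepare_csj(corpus_dir=corpus_dir, output_dir=output_dir, config=config)
```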
Great! I will update my code and send in a pull request soon.
I have moved the transcript generation part over to Lhotse, so now this recipe performs the same functions as other recipes. I included a default .ini config directly inside the code so that this recipe is as self-sufficient as possible.
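A minimal sketch of one way to embed a default .ini config in the module itself (section and key names are invented for illustration):

```python
import configparser
from pathlib import Path
from typing import Optional

# Invented section/keys for illustration; the real recipe's config differs.
_DEFAULT_CONFIG = """
[MODE]
transcript = disfluent
"""

def load_config(path: Optional[Path] = None) -> configparser.ConfigParser:
    config = configparser.ConfigParser()
    if path is None:
        # No user-supplied config: fall back to the embedded default.
        config.read_string(_DEFAULT_CONFIG)
    else:
        config.read(path)
    return config
```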
Thanks! Great work.