
without local/data.sh file in egs2/mixed_v3/s2t1/local #5469

Open
nichongjia-2007 opened this issue Oct 10, 2023 · 19 comments
@nichongjia-2007

Describe the bug

In the egs2/mixed_v3/s2t1/local directory there is no data.sh file, but according to https://github.com/espnet/espnet/blob/master/egs2/TEMPLATE/s2t1/s2t.sh#L549, this file should exist.

@nichongjia-2007 nichongjia-2007 added the Bug bug should be fixed label Oct 10, 2023
@sw005320 (Contributor)

@pyf98, can you answer it for me?

@pyf98 (Collaborator) commented Oct 10, 2023

Hi, we currently do not provide a centralized data.sh for data preparation, because our training dataset is super large and is derived from multiple separate corpora. For v2, we have separate scripts for some corpora: https://github.com/espnet/espnet/tree/master/egs2/mixed_v2/s2t1/local/

@pyf98 (Collaborator) commented Oct 10, 2023

It is suggested to run them separately and combine them later.

@pyf98 (Collaborator) commented Oct 10, 2023

For the new data added in v3: @jctian98 Could you provide the scripts?

@sw005320 (Contributor)

Why not include them in run.sh or data.sh?
This would further improve reproducibility.

@pyf98 (Collaborator) commented Oct 10, 2023

OK, we will try to provide data.sh. However, we suggest creating the dataset in the following order:

  1. Prepare individual data
  2. Dump features for individual data
  3. Combine dumped features into single train/valid

If we instead prepared all datasets into data/ first and then dumped features in one pass, we could not easily resume when errors occur (especially cluster/server failures).
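
The combine step above can be sketched as a small helper (a minimal illustration assuming Kaldi-style data files such as `text` and `wav.scp`; the function name is hypothetical, and ESPnet ships its own combination utilities):

```python
from pathlib import Path

def combine_data_dirs(src_dirs, dst_dir, files=("text", "wav.scp")):
    """Concatenate per-corpus Kaldi-style files and sort by utterance id."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for fname in files:
        lines = []
        for d in src_dirs:
            lines.extend(Path(d, fname).read_text().splitlines())
        # Kaldi-style tools expect files sorted by the first field (utterance id)
        lines.sort(key=lambda ln: ln.split(maxsplit=1)[0])
        Path(dst, fname).write_text("\n".join(lines) + "\n")
```

Because each corpus is dumped independently in step 2, a failed corpus can be re-run alone and only this cheap combine step has to be repeated.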

@pyf98 (Collaborator) commented Oct 10, 2023

Also note that we re-download the raw data for Multilingual LibriSpeech in order to use the fully formatted transcriptions. This download can take multiple days and may fail midway depending on the network. I have included a retry mechanism (see request_with_retry in that script), but there is no guarantee it will work perfectly in another environment.
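
The retry idea can be sketched generically as follows (a hedged illustration, not the actual `request_with_retry` helper in the ESPnet script; the function name and parameters here are assumptions):

```python
import time

def retry_with_backoff(fn, max_retries=5, first_wait=1.0, backoff=2.0):
    """Call fn(); on exception, sleep and retry with exponential backoff."""
    wait = first_wait
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            time.sleep(wait)
            wait *= backoff
```

In practice `fn` would wrap the HTTP download of one archive, so a multi-day download can survive transient network failures and a restarted run does not have to redo completed files.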

@nichongjia-2007 (Author)

Thanks a lot, I will try to prepare the data separately.

@junshipeng

@pyf98 Regarding local/prepare_wenetspeech.py for merging WenetSpeech: if a 10-second audio span in the WenetSpeech corpus is skipped due to low confidence, will merging into 30-second windows produce a situation where there is audio but no corresponding text?

@pyf98 (Collaborator) commented Oct 12, 2023

> @pyf98 Based on the local/prepare_wenetspeech.py for merging wenetspeech, if there is a 10-second audio gap in the wenetspeech corpus due to low confidence and it is skipped, will it cause a situation where there is audio but no text when merging for 30 seconds?

Thanks @junshipeng for the question. I think such a situation can happen in general (it is not limited to a specific dataset), and I do not have a perfect solution right now. I wanted to keep the original timestamps in the long recordings, which is also easier than manually concatenating the segmented utterances.
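
One way to avoid audio-without-text windows would be to break the merge window at large gaps, as in this sketch (a hypothetical helper, not the actual prepare_wenetspeech.py logic; the parameter names are assumptions):

```python
def group_segments(segments, max_dur=30.0, max_gap=1.0):
    """Group consecutive (start, end, text) segments into windows of at most
    max_dur seconds, starting a new window whenever the gap between
    neighbouring segments exceeds max_gap (e.g. a skipped low-confidence
    region), so no window covers speech that has no transcript."""
    groups, current = [], []
    for seg in segments:
        start, end, _ = seg
        if current:
            win_start = current[0][0]
            gap = start - current[-1][1]
            if end - win_start > max_dur or gap > max_gap:
                groups.append(current)
                current = []
        current.append(seg)
    if current:
        groups.append(current)
    return groups
```

A 10-second skipped span would then start a fresh window instead of being silently included in the previous one.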

@junshipeng

> @pyf98 Based on the local/prepare_wenetspeech.py for merging wenetspeech, if there is a 10-second audio gap in the wenetspeech corpus due to low confidence and it is skipped, will it cause a situation where there is audio but no text when merging for 30 seconds?
>
> Thanks @junshipeng for the question. I think such situation can happen in general (not limited to a specific dataset). I do not have a perfect solution now. I wanted to keep the original timestamps in the long recordings which is also easier, instead of concatenating the segmented utterances manually.

@pyf98 Have you tried not concatenating into 30-second windows, e.g. keeping the original annotated audio lengths?

@pyf98 (Collaborator) commented Oct 13, 2023

@junshipeng No, we didn't try that. We tried to mimic Whisper, so we always used long-form inputs. In this way, we can predict timestamps for each utterance in addition to the text transcript; with short segmented utterances we could not achieve this.
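
The long-form target with per-utterance timestamps can be illustrated like this (a simplified sketch of the Whisper-style idea; the exact token format used by OWSM may differ):

```python
def format_with_timestamps(utterances):
    """Render (start, end, text) utterances from one long-form window as a
    single training target with Whisper-style timestamp tokens, so the model
    learns to emit utterance boundaries alongside the transcript."""
    parts = []
    for start, end, text in utterances:
        parts.append(f"<|{start:.2f}|> {text} <|{end:.2f}|>")
    return " ".join(parts)
```

With isolated short segments there is only one utterance per sample and the boundary tokens carry no information, which is why long-form inputs are needed to learn timestamp prediction.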

@nichongjia-2007 (Author)

@pyf98 https://github.com/espnet/espnet/blob/master/egs2/mixed_v1/s2t1/local/prepare_librispeech.py#L26 there are some errors when preparing LibriSpeech.

@pyf98 (Collaborator) commented Oct 14, 2023

@nichongjia-2007 What are the errors? For LibriSpeech, I was using the unsegmented version, which is different from the commonly used ESPnet data.

@junshipeng

@pyf98 When I run Stage 7, I encounter an error:

    File "git/espnet/espnet2/bin/launch.py", line 265, in main
      raise RuntimeError("run.pl doesn't support submitting to the other nodes.")
    RuntimeError: run.pl doesn't support submitting to the other nodes.

After reviewing the code, I noticed that the cmd parameter in launch.py is set to run.pl, which is defined in cmd.sh. However, espnet2.bin.launch in Stage 7 does not support run.pl:

    elif Path(args.cmd[0]).name == "run.pl":
        raise RuntimeError("run.pl doesn't support submitting to the other nodes.")

@pyf98 (Collaborator) commented Oct 20, 2023

@junshipeng I am not familiar with this type of error. But for OWSM, we do not use any LM, so we do not execute Stage 7.

@junshipeng

@pyf98 Hi, I noticed that Stage 10 is taking a long time to complete. Is there any way to speed it up? Also, the GPU parameter is set to 0. Is this normal?

@pyf98 (Collaborator) commented Oct 24, 2023

@junshipeng Stage 10 collects statistics that are later used to normalize the features. It does not require a GPU, so GPU=0 is expected. To speed it up, you can use many parallel jobs, i.e., increase --nj.
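
Stats collection parallelizes well because each job can compute partial sums over its shard of utterances and the partials can be merged afterwards (a sketch of the general idea, not ESPnet's implementation; the function name is an assumption):

```python
import math

def merge_stats(partials):
    """Merge per-job (sum, sum_of_squares, count) partial statistics over
    feature values into a global mean and standard deviation."""
    total = sum(p[0] for p in partials)
    total_sq = sum(p[1] for p in partials)
    n = sum(p[2] for p in partials)
    mean = total / n
    variance = total_sq / n - mean * mean
    return mean, math.sqrt(variance)
```

Because the merge is exact, increasing --nj changes only the wall-clock time, not the resulting normalization statistics.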
