
without local/data.sh file in egs2/mixed_v3/s2t1/local #5469

Open
nichongjia-2007 opened this issue Oct 10, 2023 · 19 comments
@nichongjia-2007

Describe the bug

In the egs2/mixed_v3/s2t1/local directory there is no data.sh file, but according to https://github.com/espnet/espnet/blob/master/egs2/TEMPLATE/s2t1/s2t.sh#L549, this file should exist.

@nichongjia-2007 nichongjia-2007 added the Bug bug should be fixed label Oct 10, 2023
@sw005320 (Contributor)

@pyf98, can you answer it for me?

@pyf98 (Collaborator) commented Oct 10, 2023

Hi, we currently do not provide a centralized data.sh for data preparation, because our training dataset is super large and is derived from multiple separate corpora. For v2, we have separate scripts for some corpora: https://github.com/espnet/espnet/tree/master/egs2/mixed_v2/s2t1/local/

@pyf98 (Collaborator) commented Oct 10, 2023

It is suggested to run them separately and combine them later.

@pyf98 (Collaborator) commented Oct 10, 2023

For the new data added in v3: @jctian98 Could you provide the scripts?

@sw005320 (Contributor)

Why not include them in run.sh or data.sh?
This would further improve reproducibility.

@pyf98 (Collaborator) commented Oct 10, 2023

OK, we will try to provide data.sh. However, we suggest creating the dataset in the following order:

  1. Prepare individual data
  2. Dump features for individual data
  3. Combine dumped features into single train/valid

If we instead prepared all datasets into data/ first and then dumped features in one pass, we could not easily resume when errors occur (especially cluster/server failures).
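
The combine step above can be sketched as a small helper (a minimal illustration assuming Kaldi-style data files such as `text` and `wav.scp`; the function name is hypothetical, and ESPnet ships its own combination utilities):

```python
from pathlib import Path

def combine_data_dirs(src_dirs, dst_dir, files=("text", "wav.scp")):
    """Concatenate per-corpus Kaldi-style files and sort by utterance id."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for fname in files:
        lines = []
        for d in src_dirs:
            lines.extend(Path(d, fname).read_text().splitlines())
        # Kaldi-style tools expect files sorted by the first field (utterance id)
        lines.sort(key=lambda ln: ln.split(maxsplit=1)[0])
        Path(dst, fname).write_text("\n".join(lines) + "\n")
```

Because each corpus is dumped independently in step 2, a failed corpus can be re-run alone and only this cheap combine step has to be repeated.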

@pyf98 (Collaborator) commented Oct 10, 2023

Also note that we re-download the raw data for Multilingual LibriSpeech in order to use the fully formatted transcriptions. This download can take multiple days and may fail midway depending on the network. I have included a retry mechanism (see request_with_retry in that script), but there is no guarantee it will work perfectly in another environment.
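
The retry idea can be sketched generically as follows (a hedged illustration, not the actual `request_with_retry` helper in the ESPnet script; the function name and parameters here are assumptions):

```python
import time

def retry_with_backoff(fn, max_retries=5, first_wait=1.0, backoff=2.0):
    """Call fn(); on exception, sleep and retry with exponential backoff."""
    wait = first_wait
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            time.sleep(wait)
            wait *= backoff
```

In practice `fn` would wrap the HTTP download of one archive, so a multi-day download can survive transient network failures and a restarted run does not have to redo completed files.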

@nichongjia-2007 (Author)

Thanks a lot, I will try to prepare the data separately.

@junshipeng

@pyf98 Regarding local/prepare_wenetspeech.py for merging WenetSpeech: if a 10-second audio span in the WenetSpeech corpus is skipped due to low confidence, will merging into 30-second windows produce a situation where there is audio but no corresponding text?

@pyf98 (Collaborator) commented Oct 12, 2023

> @pyf98 Based on the local/prepare_wenetspeech.py for merging wenetspeech, if there is a 10-second audio gap in the wenetspeech corpus due to low confidence and it is skipped, will it cause a situation where there is audio but no text when merging for 30 seconds?

Thanks @junshipeng for the question. I think such a situation can happen in general (it is not limited to a specific dataset), and I do not have a perfect solution right now. I wanted to keep the original timestamps in the long recordings, which is also easier than manually concatenating the segmented utterances.
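
One way to avoid audio-without-text windows would be to break the merge window at large gaps, as in this sketch (a hypothetical helper, not the actual prepare_wenetspeech.py logic; the parameter names are assumptions):

```python
def group_segments(segments, max_dur=30.0, max_gap=1.0):
    """Group consecutive (start, end, text) segments into windows of at most
    max_dur seconds, starting a new window whenever the gap between
    neighbouring segments exceeds max_gap (e.g. a skipped low-confidence
    region), so no window covers speech that has no transcript."""
    groups, current = [], []
    for seg in segments:
        start, end, _ = seg
        if current:
            win_start = current[0][0]
            gap = start - current[-1][1]
            if end - win_start > max_dur or gap > max_gap:
                groups.append(current)
                current = []
        current.append(seg)
    if current:
        groups.append(current)
    return groups
```

A 10-second skipped span would then start a fresh window instead of being silently included in the previous one.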

@junshipeng

> @pyf98 Based on the local/prepare_wenetspeech.py for merging wenetspeech, if there is a 10-second audio gap in the wenetspeech corpus due to low confidence and it is skipped, will it cause a situation where there is audio but no text when merging for 30 seconds?
>
> Thanks @junshipeng for the question. I think such situation can happen in general (not limited to a specific dataset). I do not have a perfect solution now. I wanted to keep the original timestamps in the long recordings which is also easier, instead of concatenating the segmented utterances manually.

@pyf98 Have you tried not concatenating into 30-second windows, e.g. keeping the original annotated audio lengths?

@pyf98 (Collaborator) commented Oct 13, 2023

@junshipeng No, we didn't try that. We tried to mimic Whisper, so we always used long-form inputs. In this way, we can predict timestamps for each utterance in addition to the text transcript; with short segmented utterances we could not achieve this.
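
The long-form target with per-utterance timestamps can be illustrated like this (a simplified sketch of the Whisper-style idea; the exact token format used by OWSM may differ):

```python
def format_with_timestamps(utterances):
    """Render (start, end, text) utterances from one long-form window as a
    single training target with Whisper-style timestamp tokens, so the model
    learns to emit utterance boundaries alongside the transcript."""
    parts = []
    for start, end, text in utterances:
        parts.append(f"<|{start:.2f}|> {text} <|{end:.2f}|>")
    return " ".join(parts)
```

With isolated short segments there is only one utterance per sample and the boundary tokens carry no information, which is why long-form inputs are needed to learn timestamp prediction.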

@nichongjia-2007 (Author)

@pyf98 https://github.com/espnet/espnet/blob/master/egs2/mixed_v1/s2t1/local/prepare_librispeech.py#L26 there are some errors when preparing LibriSpeech.

@pyf98 (Collaborator) commented Oct 14, 2023

@nichongjia-2007 What are the errors? For LibriSpeech, I was using the unsegmented version, which is different from the commonly used ESPnet data.

@junshipeng

@pyf98 When I run Stage 7, I encounter an error:

    File "git/espnet/espnet2/bin/launch.py", line 265, in main
      raise RuntimeError("run.pl doesn't support submitting to the other nodes.")
    RuntimeError: run.pl doesn't support submitting to the other nodes.

After reviewing the code, I noticed that the cmd parameter in launch.py is set to run.pl, which is defined in cmd.sh. However, espnet2.bin.launch in Stage 7 does not support run.pl:

    elif Path(args.cmd[0]).name == "run.pl":
        raise RuntimeError("run.pl doesn't support submitting to the other nodes.")

@pyf98 (Collaborator) commented Oct 20, 2023

@junshipeng I am not familiar with this type of error. But for OWSM, we do not use any LM, so we do not execute Stage 7.

@junshipeng

@pyf98 Hi, I noticed that Stage 10 is taking a long time to complete. Is there any way to speed it up? Also, the GPU parameter is set to 0. Is this normal?

@pyf98 (Collaborator) commented Oct 24, 2023

@junshipeng Stage 10 collects statistics that are later used to normalize the features. It does not require a GPU, so GPU=0 is expected. To speed it up, you can use many parallel jobs, i.e., increase --nj.
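
Stats collection parallelizes well because each job can compute partial sums over its shard of utterances and the partials can be merged afterwards (a sketch of the general idea, not ESPnet's implementation; the function name is an assumption):

```python
import math

def merge_stats(partials):
    """Merge per-job (sum, sum_of_squares, count) partial statistics over
    feature values into a global mean and standard deviation."""
    total = sum(p[0] for p in partials)
    total_sq = sum(p[1] for p in partials)
    n = sum(p[2] for p in partials)
    mean = total / n
    variance = total_sq / n - mean * mean
    return mean, math.sqrt(variance)
```

Because the merge is exact, increasing --nj changes only the wall-clock time, not the resulting normalization statistics.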
