Add CHiME4 recipe #746

Merged
merged 9 commits into from
Nov 19, 2021

funcwj (Collaborator) commented Oct 29, 2021

Add a CHiME4 recipe with character units. The training data combines WSJ0+1 with the CHiME4 real and simulated utterances. The current best WERs with CTC prefix beam search are 21.17%/19.06%/28.41%/29.16% on dt05 real/simu and et05 real/simu, respectively.

funcwj (Collaborator, Author) commented Oct 29, 2021

I will update the results in README.md in the next few days.

@@ -139,7 +139,7 @@ def Dataset(data_type, data_list_file, symbol_table, conf,
     else:
         dataset = Processor(dataset, processor.parse_raw)

-    dataset = Processor(dataset, processor.tokenize, symbol_table, bpe_model)
+    dataset = Processor(dataset, processor.tokenize, symbol_table, bpe_model, conf.get('char', False))
Collaborator

the line is too long.

robin1001 (Collaborator) left a comment

Please see the inline comments.

@@ -258,7 +258,7 @@ def compute_fbank(data,
         yield dict(key=sample['key'], label=sample['label'], feat=mat)


-def tokenize(data, symbol_table, bpe_model=None):
+def tokenize(data, symbol_table, bpe_model=None, char=False):
Collaborator

tokenize uses char as the default model unit.
Do you mean your model units are separated by whitespace? That is more like phonemes, as in TIMIT. It would be better to choose another name.

Collaborator Author

I'd like to use split_with_space instead.
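To make the naming discussion concrete, here is a minimal sketch of how a `split_with_space` flag could change tokenization. It is not the code merged in this PR: the BPE branch is left out, and the `txt` sample key, the `<unk>` entry, and the `▁` space marker are assumptions made for illustration.

```python
def tokenize(data, symbol_table, split_with_space=False):
    """Illustrative sketch only: turn each transcript into tokens and label ids.

    Assumes every sample is a dict with a 'txt' field and that symbol_table
    contains an '<unk>' entry (both are assumptions, not taken from the PR).
    """
    unk = symbol_table['<unk>']
    for sample in data:
        txt = sample['txt'].strip()
        if split_with_space:
            # Units are already delimited by whitespace in the transcript.
            tokens = txt.split()
        else:
            # Default: one token per character; a space becomes the marker "▁".
            tokens = ['▁' if ch == ' ' else ch for ch in txt]
        # Tokens missing from the table fall back to the '<unk>' id.
        label = [symbol_table.get(tok, unk) for tok in tokens]
        yield dict(sample, tokens=tokens, label=label)


symbol_table = {'<unk>': 0, 'A': 1, 'B': 2, '▁': 3}
samples = [{'key': 'utt1', 'txt': 'A B'}]
char_out = list(tokenize(samples, symbol_table))
space_out = list(tokenize(samples, symbol_table, split_with_space=True))
```

With the default, the space in the transcript is itself a token; with `split_with_space=True` it only acts as a delimiter, which is why a name other than `char` describes the flag better.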

| decoding mode | dt05_real_1ch | dt05_simu_1ch | et05_real_1ch | et05_simu_1ch |
|:---------------:|:-------------:|:-------------:|:-------------:|:-------------:|
| ctc_beam_search | 19.06% | 21.17% | 28.39% | 29.16% |
| att_rescoring | 17.92% | 20.22% | 27.40% | 28.25% |
Collaborator

ctc_beam_search -> ctc_prefix_beam_search
att_rescoring -> attention_rescoring
Please rename these so they are consistent with the other recipes.

@robin1001 robin1001 merged commit d01276a into main Nov 19, 2021
@robin1001 robin1001 deleted the jwu/chime4 branch November 19, 2021 01:37
2 participants