Add CHiME4 recipe #746

funcwj · 2021-10-29T03:39:54Z

Add a CHiME4 recipe with character units. The training data combines WSJ0+1 with the CHiME4 real and simulated utterances. The current best WERs with CTC prefix beam search are 21.17%/19.06%/28.41%/29.16% on dt05 real/simu and et05 real/simu, respectively.

funcwj · 2021-10-29T03:41:27Z

I will update the results in README.md in the coming days.

robin1001 · 2021-11-11T08:13:23Z

wenet/dataset/dataset.py

@@ -139,7 +139,7 @@ def Dataset(data_type, data_list_file, symbol_table, conf,
    else:
        dataset = Processor(dataset, processor.parse_raw)

-    dataset = Processor(dataset, processor.tokenize, symbol_table, bpe_model)
+    dataset = Processor(dataset, processor.tokenize, symbol_table, bpe_model, conf.get('char', False))


The line is too long.

robin1001

Please see the inline comments.

robin1001 · 2021-11-11T08:19:16Z

wenet/dataset/processor.py

@@ -258,7 +258,7 @@ def compute_fbank(data,
        yield dict(key=sample['key'], label=sample['label'], feat=mat)


-def tokenize(data, symbol_table, bpe_model=None):
+def tokenize(data, symbol_table, bpe_model=None, char=False):


tokenize already uses char as the default model unit.
Do you mean that your model units are separated by white space? That is more like phonemes, as in TIMIT. It would be better to pick another name.

funcwj · 2021-11-17

I'd like to use split_with_space instead.
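The distinction discussed in this thread can be sketched as follows. This is a simplified illustration, not the code from the PR: the helper name `tokenize_sketch`, the `txt` field, the `SPACE_SYMBOL` constant, and the toy symbol table are all assumptions made for the example, and BPE handling is left out entirely.

```python
# Hypothetical, simplified sketch of the two tokenization modes discussed above.
SPACE_SYMBOL = '\u2581'  # assumed visible stand-in for a space in char mode


def tokenize_sketch(data, symbol_table, split_with_space=False):
    """Yield copies of each sample with 'tokens' and 'label' added."""
    unk = symbol_table.get('<unk>')
    for sample in data:
        txt = sample['txt'].strip()
        if split_with_space:
            # Units are already delimited by white space (e.g. phonemes).
            tokens = txt.split()
        else:
            # Default: every character is one unit; spaces become SPACE_SYMBOL.
            tokens = [SPACE_SYMBOL if ch == ' ' else ch for ch in txt]
        # Map each token to its id, falling back to <unk> when it is unknown.
        label = [symbol_table.get(tok, unk) for tok in tokens]
        out = dict(sample)
        out['tokens'] = tokens
        out['label'] = label
        yield out


# Toy symbol table and sample, invented for this example only.
symbol_table = {'<unk>': 0, SPACE_SYMBOL: 1, 'a': 2, 'b': 3, 'ab': 4}
samples = [{'key': 'utt1', 'txt': 'ab a'}]

char_out = list(tokenize_sketch(samples, symbol_table))
space_out = list(tokenize_sketch(samples, symbol_table, split_with_space=True))
print(char_out[0]['tokens'], char_out[0]['label'])
print(space_out[0]['tokens'], space_out[0]['label'])
```

With the flag off, the transcript is broken into single characters; with it on, each white-space-delimited unit is looked up as a whole, which is why a name like split_with_space describes the behaviour more accurately than char.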

robin1001 · 2021-11-12T08:06:21Z

examples/chime4/s0/README.md

+|   decoding mode | dt05_real_1ch | dt05_simu_1ch | et05_real_1ch | et05_simu_1ch |
+|:---------------:|:-------------:|:-------------:|:-------------:|:-------------:|
+| ctc_beam_search |   19.06%      |   21.17%      |   28.39%      |    29.16%     |
+|  att_rescoring  |   17.92%      |   20.22%      |   27.40%      |    28.25%     |


ctc_beam_search -> ctc_prefix_beam_search
att_rescoring -> attention_rescoring
Please rename these so they are consistent with the other recipes.

funcwj added 3 commits October 13, 2021 11:32

init chime4 recipe

413aa3d

update decoding recipe in run,sh

098ca05

fix tokenizer issue for char unit

80fd5d5

funcwj added 4 commits November 2, 2021 04:02

add initial results

76956fd

finalize recipe

1c79a0c

fix tab issues in run.sh

20967cf

fix trailing whitespace

368d8e7

robin1001 reviewed Nov 11, 2021

robin1001 requested changes Nov 11, 2021

robin1001 reviewed Nov 12, 2021

funcwj added 2 commits November 17, 2021 12:10

fix comments

31e8de5

commit missing changes

958621c

robin1001 approved these changes Nov 19, 2021

robin1001 merged commit d01276a into main Nov 19, 2021

robin1001 deleted the jwu/chime4 branch November 19, 2021 01:37
