tools/fst/prepare_dict.py cannot handle Chinese and English at the same time #1653
Closed
Description
Describe the bug
tools/fst/prepare_dict.py cannot split a Chinese phrase into single characters when a BPE model is given.
To Reproduce
tools/fst/prepare_dict.py lang_char.txt lexicon_raw.txt lexicon.txt bpe.model
If lexicon_raw.txt contains a phrase "我们", the BPE model tokenizes it as a whole:
sp.EncodeAsPieces("我们")
['▁', '我们']
Expected behavior
The script reports that the lexicon contains the OOV unit "我们"; we actually want the phrase split into ['我', '们'].
Additional context
One way to fix it: route words containing non-ASCII characters to a per-character split instead of the BPE model.

if not word.encode('UTF-8').isalpha():
    pieces = list(word)
else:
    pieces = sp.EncodeAsPieces(word)

We assume that each entry of lexicon_raw.txt contains only one language.
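A minimal, self-contained sketch of the proposed check. Note that `fake_encode_as_pieces` below is only a stand-in for sentencepiece's `sp.EncodeAsPieces`, since the real BPE model is not reproduced here; the point is the branch condition, which relies on `bytes.isalpha()` being True only for pure ASCII letters.

```python
def fake_encode_as_pieces(word):
    # Placeholder for sp.EncodeAsPieces (hypothetical stand-in):
    # a real BPE model would split the word into subword pieces.
    return ['\u2581' + word]

def word_to_pieces(word):
    # UTF-8 bytes of Chinese characters are outside the ASCII alpha
    # range, so bytes.isalpha() returns False and the word is split
    # into single characters; pure-ASCII words go through BPE.
    if not word.encode('UTF-8').isalpha():
        return list(word)
    return fake_encode_as_pieces(word)

print(word_to_pieces("我们"))    # ['我', '们']
print(word_to_pieces("hello"))  # ['▁hello']
```

This keeps the BPE path untouched for English entries while guaranteeing character-level units for Chinese, consistent with the one-language-per-entry assumption above.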