tools/fst/prepare_dict.py can't support Chinese and English at the same time.

**Describe the bug**
tools/fst/prepare_dict.py can't split a Chinese phrase into characters when a BPE model is given.

**To Reproduce**
tools/fst/prepare_dict.py lang_char.txt lexicon_raw.txt lexicon.txt bpe.model

If lexicon_raw.txt contains the phrase "我们":
>>> sp.EncodeAsPieces("我们")
['▁', '我们']

**Expected behavior**
The script currently reports that the entry contains the OOV unit "我们". We expect the phrase to be split into ['我', '们'] instead.

**Screenshots**
![Screenshot_select-area_20230110113128](https://user-images.githubusercontent.com/14941351/211455908-6983af16-bfd2-4ea3-9d8d-cc51f99f4532.png)


**Additional context**
One possible fix:
if not word.encode('UTF-8').isalpha():
    pieces = list(word)
else:
    pieces = sp.EncodeAsPieces(word)
This assumes that each entry in lexicon_raw.txt contains only one language.
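
The idea above can be sketched as a small standalone helper. This is only an illustration, not the code in prepare_dict.py: `contains_cjk`, `word_to_pieces` and `fake_bpe` are hypothetical names, and `fake_bpe` stands in for `sp.EncodeAsPieces` so the example runs without a trained model. It also swaps the `isalpha()` test for an explicit check against the CJK Unified Ideographs range, which is an assumption about what "Chinese" should mean here.

```python
from typing import Callable, List


def contains_cjk(word: str) -> bool:
    """Return True if any character lies in the CJK Unified Ideographs block."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in word)


def word_to_pieces(
    word: str, encode_as_pieces: Callable[[str], List[str]]
) -> List[str]:
    """Split Chinese entries into characters; send everything else to BPE.

    Assumes each lexicon entry is written in a single language.
    """
    if contains_cjk(word):
        return list(word)
    return encode_as_pieces(word)


def fake_bpe(word: str) -> List[str]:
    # Hypothetical stand-in for sp.EncodeAsPieces; a real model would
    # usually return several sub-word pieces.
    return ["▁" + word]


print(word_to_pieces("我们", fake_bpe))
print(word_to_pieces("hello", fake_bpe))
```

With a real model, `sp.EncodeAsPieces` would be passed in place of `fake_bpe`. Note that `bytes.isalpha()` is true only for non-empty ASCII letters, so the original check would also split entries such as "don't" into single characters; the range check avoids that, at the cost of covering only the basic CJK block.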


tools/fst/prepare_dict.py can't support chinese and english at the same time. #1653