
How to Add or Edit [script].unicharset in langdata folder? #99

Open
@sethleech

Description

  • I want to know how to get the 'glyph_metrics' data from [a font or several fonts].

Dear all,

I have been trying Tesseract recently and it is a really good product. Is there a tutorial, or a set of steps, on how to add to or edit a [script].unicharset file, for example han.unicharset?

I want to add missing characters (Unicode code points) from CJK Extensions B, C, D, E and F:
CJK Unified Ideographs Extension B: U+20000–U+2A6D6
CJK Unified Ideographs Extension C: U+2A700–U+2B734
CJK Unified Ideographs Extension D: U+2B740–U+2B81D
CJK Unified Ideographs Extension E: U+2B820–U+2CEA1
CJK Unified Ideographs Extension F: U+2CEB0–U+2EBE0
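
To check which of these ranges a given character falls into, a small lookup like the following can help. This is only a sketch: the ranges are copied from the list above, and the helper name `cjk_extension` is made up for illustration.

```python
# Inclusive code point ranges, copied from the list above.
CJK_EXTENSIONS = {
    "B": (0x20000, 0x2A6D6),
    "C": (0x2A700, 0x2B734),
    "D": (0x2B740, 0x2B81D),
    "E": (0x2B820, 0x2CEA1),
    "F": (0x2CEB0, 0x2EBE0),
}


def cjk_extension(ch):
    """Return the extension letter for a single character, or None."""
    cp = ord(ch)
    for name, (lo, hi) in CJK_EXTENSIONS.items():
        if lo <= cp <= hi:
            return name
    return None


# U+25B97 is the character used in the examples below.
print(cjk_extension("\U00025B97"))  # prints "B"
```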

For reference, this is what I tried when training Tesseract.

1st try :
** unicharset_extractor **
tesseract-ocr/unicharset_extractor -D [lang] [lang]/[lang].[font].exp0.box

output is unicharset :
𥮗 1 0,255,0,255,0,0,0,0,0,0 NULL 4 0 0 # 𥮗 [25b97 ]x

** set_unicharset_properties **
tesseract-ocr/set_unicharset_properties -U unicharset -O [lang].unicharset --script_dir=langdata/[lang] --X langdata/[lang]/han.xheights

Warning: properties incomplete for index 4 = 𥮗

output is [lang].unicharset :
𥮗 1 0,255,0,255,0,0,0,0,0,0 NULL 4 0 0 # 𥮗 [25b97 ]x
=> not changed

2nd try :
I edited the file langdata/han.unicharset:
line 0 (the entry count) : 23514 -> 23515
added a new line at the end of the file : 𥮗 1 61,64,255,255,188,200,6,11,205,224 Han 23514 0 23514 𥮗 # 𥮗 [25b97 ]x
The glyph_metrics values 61,64,255,255,188,200,6,11,205,224 were simply copied from another existing line, e.g. line 67.
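
The manual edit described here (bump the count on the first line, then append one entry whose id equals the old count) could be scripted roughly as follows. This is a sketch under assumptions, not a supported Tesseract tool: the function name `append_entry` is hypothetical, it assumes the first line holds the number of entries, it only handles single-code-point characters, and the sample text is a toy file rather than a real han.unicharset.

```python
def append_entry(unicharset_text, char, props, metrics, script):
    """Return new unicharset text with one entry appended.

    Assumes line 0 is the entry count and that ids are 0-based,
    so the new entry's id equals the old count.
    """
    lines = unicharset_text.rstrip("\n").split("\n")
    old_count = int(lines[0])
    new_id = old_count
    lines[0] = str(old_count + 1)
    # other_case and mirror point back at the entry itself, as in the
    # hand-written line above; direction is left at 0.
    comment = f"{char} [{ord(char):x} ]x"
    lines.append(
        f"{char} {props} {metrics} {script} {new_id} 0 {new_id} {char} # {comment}"
    )
    return "\n".join(lines) + "\n"


# Toy input: a count line plus one placeholder entry.
sample = "1\nNULL 0 NULL 0\n"
updated = append_entry(
    sample,
    char="\U00025B97",
    props="1",
    metrics="61,64,255,255,188,200,6,11,205,224",  # copied, not measured
    script="Han",
)
print(updated)
```

The metrics passed in are still the values copied from another line, so this only automates the bookkeeping, not the question of where correct metrics come from.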

** unicharset_extractor **
tesseract-ocr/unicharset_extractor -D [lang] [lang]/[lang].[font].exp0.box

output is unicharset :
𥮗 1 0,255,0,255,0,0,0,0,0,0 NULL 4 0 0 # 𥮗 [25b97 ]x

** set_unicharset_properties **
tesseract-ocr/set_unicharset_properties -U unicharset -O [lang].unicharset --script_dir=langdata/[lang] --X langdata/[lang]/han.xheights
No warning this time.

output is [lang].unicharset :
𥮗 1 61,64,255,255,188,200,6,11,205,224 Han 4 0 4 𥮗 # 𥮗 [25b97 ]x
=> changed

What I found out:

  1. The [script].unicharset file is officially supported: in the 2nd try, set_unicharset_properties picked up the entry I added to it.
  2. Each entry has these fields : 'character' 'properties' 'glyph_metrics' 'script' 'other_case' 'direction' 'mirror' 'normed_form'
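
To make the field layout concrete, here is a minimal parsing sketch for one full entry line. Several things are assumptions on my part: the names `UnicharEntry` and `parse_entry` are hypothetical, the ten metric names follow my reading of the unicharset(5) man page, 'properties' is treated as a hexadecimal bitmask, and short entries such as a bare NULL line are not handled.

```python
from typing import List, NamedTuple

# Assumed order of the ten glyph_metrics values (per unicharset(5)).
METRIC_NAMES = [
    "min_bottom", "max_bottom", "min_top", "max_top",
    "min_width", "max_width", "min_bearing", "max_bearing",
    "min_advance", "max_advance",
]


class UnicharEntry(NamedTuple):
    character: str
    properties: int
    glyph_metrics: List[int]
    script: str
    other_case: int
    direction: int
    mirror: int
    normed_form: str


def parse_entry(line):
    """Parse one 8-field unicharset entry, ignoring the trailing comment."""
    body = line.split(" #", 1)[0]
    fields = body.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}")
    char, props, metrics, script, other, direction, mirror, normed = fields
    return UnicharEntry(
        character=char,
        properties=int(props, 16),  # assumed to be a hex bitmask
        glyph_metrics=[int(v) for v in metrics.split(",")],
        script=script,
        other_case=int(other),
        direction=int(direction),
        mirror=int(mirror),
        normed_form=normed,
    )


line = ("\U00025B97 1 61,64,255,255,188,200,6,11,205,224 "
        "Han 4 0 4 \U00025B97 # \U00025B97 [25b97 ]x")
entry = parse_entry(line)
for name, value in zip(METRIC_NAMES, entry.glyph_metrics):
    print(name, value)
```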

How can I get the 'glyph_metrics' data from [a font or several fonts]?

Thank you in advance.

Regards,
