Balinese Script OCR #152
Comments
Hi @gindrawan, with jTessBoxEditor you will get a recognition model which uses the old legacy recognizer, but not the LSTM one. For training LSTM, you need a large amount of ground truth data, that is, pairs of line images and text files with the corresponding text. You can use generated images by rendering the text with a Balinese font, and you can also use scans from Balinese publications (books, newspapers, ...) where you have to extract the lines and transcribe the text. Ideally both kinds of images are available. |
Are there any converters from Bali Simbar Dwijendra to Unicode? |
As far as I know, there is no such converter. I found that the Vimala font has glyph shapes quite close to the Bali Simbar Dwijendra font, as I mentioned at #126. |
Hi @Shreeshrii, based on your changes to the tesseract code base in tesseract-ocr/tesseract@b34cf9d#diff-eaafd22a79065f5b8d28318d482e650d, if I insert, say, "Noto Sans Balinese" or "Vimala" at line 608 (after ""Prada" "), what else do I critically need to add or change to make the training work? (Of course, I am also looking at the other 2 files that you changed: tesseract-ocr/tesseract@0eb7be1#diff-eaafd22a79065f5b8d28318d482e650d and tesseract-ocr/tesseract@7957288#diff-eaafd22a79065f5b8d28318d482e650d.) I'm still in "trial and error" mode with the training process based on your https://github.com/Shreeshrii/tessdata_jav_java. Before I settle on that option, some light from you would be useful. Thanks... |
jav_java was done more than a year ago.
Now there is also the possibility of training from line images. See the
tesseract-ocr/tesstrain repo.
Please wait for a day or two. I am in the process of setting up something
for Balinese that you can then extend with your training text.
It is possible that no changes will be required in tesseract codebase.
It will also be useful if you can create ground truth transcription in
Unicode for at least 5 scanned page images from books which can be used for
validating the training.
You can also create a few hundred line images with transcription for
fine-tuning of traineddata created with synthetic images.
|
Thank you @Shreeshrii. Another thing: if the trained data is successfully generated, will it be compatible with Tesseract4Android (https://github.com/adaptech-cz/Tesseract4Android)? They require trained data files from https://github.com/tesseract-ocr/tessdata/tree/4.0.0 |
Just images are not enough. What is needed is the correct (ground truth)
text in unicode format for each of those images.
So, the files should be 001.png and 001.gt.txt . Same basename but .gt.txt
for the unicode text for each.
For a work in progress, see
https://github.com/Shreeshrii/tesstrain-bali/tree/master/test
I need the correct text for the images so that it can be compared with the
OCRed text to verify accuracy on actual images.
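
One common way to compare OCR output against ground truth is a character error rate based on edit distance. The following is a minimal sketch of that idea, not the evaluation used in the tesstrain-bali repo; the function names are illustrative, and it counts Unicode code points rather than visual clusters, which is a simplifying assumption for Balinese text.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def char_error_rate(ground_truth: str, ocr_text: str) -> float:
    """Edit distance divided by ground-truth length, in code points."""
    if not ground_truth:
        return 0.0 if not ocr_text else 1.0
    return levenshtein(ground_truth, ocr_text) / len(ground_truth)
```

A lower value means the OCR text is closer to the transcription; 0.0 means the two strings are identical.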
…On Tue, Mar 24, 2020 at 1:56 PM gindrawan ***@***.***> wrote:
Thank you @Shreeshrii <https://github.com/Shreeshrii>
Here are scanned page images from books (found with a quick Internet
search) in various image types and sizes.
I am still preparing the synthetic images (in Noto Sans/Serif Balinese and
Vimala) and hope to post them today or tomorrow.
Another thing: if the trained data is successfully generated, will it be
compatible with Tesseract4Android (
https://github.com/adaptech-cz/Tesseract4Android)? They require
trained data files from https://github.com/tesseract-ocr/tessdata/tree/4.0.0
balinese-script-images-v1.zip
<https://github.com/tesseract-ocr/langdata/files/4374016/balinese-script-images-v1.zip>
|
Sorry, I forgot about the txt. I may need more time for that. |
This is a small image and text file pair using Noto Serif Balinese; I took them from https://en.wikipedia.org/wiki/Balinese_script. I hope it can be used for now. |
Oh, I forgot. Does the image need its box file, or only the Unicode text? |
Just the unicode text.
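
Since each image needs a transcription with the same basename (001.png and 001.gt.txt), a quick check for missing partners can save a failed training run. This is a small sketch under the assumption that all pairs sit in one directory; `find_unpaired` is a hypothetical helper name, not part of tesstrain.

```python
from pathlib import Path


def find_unpaired(gt_dir: str, image_suffix: str = ".png"):
    """Return (images lacking a .gt.txt, .gt.txt files lacking an image)."""
    root = Path(gt_dir)
    # Strip the suffix to get the shared basename, e.g. "001.png" -> "001".
    images = {p.name[: -len(image_suffix)] for p in root.glob(f"*{image_suffix}")}
    texts = {p.name[: -len(".gt.txt")] for p in root.glob("*.gt.txt")}
    return sorted(images - texts), sorted(texts - images)
```

Both returned lists should be empty before the directory is used for training.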
|
Hi @Shreeshrii, it seems I need more time to prepare the training data (1-2 more days). Meanwhile, I just realized that there are two kinds of training data: page images (https://github.com/topherseance/javanese-aksara-training-text) and line images. Based on your previous answer, it seems you prefer line images? What about page images? Preparing line images takes more effort in my case, because each page image needs to be converted into several line images. But if the training result will be noticeably better, that's OK. In the attachment I have a sample of my page image with its ground truth text. Is that OK before I proceed further to line images? |
Are you preparing synthetic data using fonts or using actual images similar to what needs to be recognised later? |
https://github.com/Shreeshrii/tesstrain-bali/tree/master/langdata I had done a training run with 4-5 fonts. |
I am preparing about 5 thousand words (the remaining 29 thousand or so are still being verified in Unicode) for synthetic data using Noto Serif Balinese. I just downloaded the latest font, updated 3 days ago (https://github.com/googlefonts/noto-fonts/tree/master/phaseIII_only/unhinted/ttf/NotoSerifBalinese); it is somehow more up to date than Noto Sans Balinese. Those 5 thousand words have already been turned into 101 page images, each containing 12 lines of training text with about 5-10 words per line. I need a little more time to finalize them; going to line images will need extra time. After that I will move on to Vimala with the same Unicode text as for Noto Serif Balinese. Vimala is more likely to be needed for recognition of actual images. The font most needed for actual images, Bali Simbar Dwijendra (BSD), is planned for later: it does not use the Balinese Unicode encoding, so preparing its training data takes more time and effort. Actually, if BSD is involved, the Balinese script recognition app would have 2 options for post-processing, Unicode and non-Unicode (I imagine a radio button to select before recognition). |
Generation of synthetic data is not an issue. It is actually quite easy to generate page images or line images given a training text and set of fonts. See https://github.com/Shreeshrii/tesstrain-bali/tree/master/gt/bali-Vimala LSTM training works on line images, so it is better to do line images. But this can be done easily by a computer. It seems to me that you are just taking a word list and generating text lines and images from that. Instead you should actually be using sentences and paragraphs and phrases along with punctuation similar to the pages that need to be recognized.
If there were a script that maps from BSD to Unicode, it could probably be handled programmatically. Otherwise you should take a page in BSD and transcribe it in Unicode. When I asked for page images for testing, I meant some sample actual images (in BSD). I am generating images in five fonts. However, if only Vimala is required, it will probably be faster to get convergence. |
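
To illustrate the point that line-level ground truth can be produced by a computer, here is a rough sketch of the text side only: it splits a training text into lines and writes one `.gt.txt` file per line. The file naming scheme and the helper name are assumptions for illustration; rendering the matching line images (for example with Tesseract's `text2image` tool) is not shown.

```python
from pathlib import Path


def write_line_ground_truth(training_text: str, out_dir: str, prefix: str = "line"):
    """Write one .gt.txt per non-empty line of training_text; return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    # Blank lines carry no text to recognize, so they are skipped.
    lines = [ln.strip() for ln in training_text.splitlines() if ln.strip()]
    for index, line in enumerate(lines, start=1):
        path = out / f"{prefix}-{index:04d}.gt.txt"
        path.write_text(line + "\n", encoding="utf-8")
        written.append(path)
    return written
```

Feeding it whole sentences and phrases with punctuation, rather than a word list, keeps the lines closer to the pages that need to be recognized.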
I think it's OK to include all of those fonts. Kadiri, Pustaka, and Vimala seem to mimic different styles of ancient glyphs. Moreover, Vimala was developed with the BSD style as a reference. Noto Sans Balinese and Noto Serif Balinese do not seem to differ much from each other. I don't know why Google released both of them. |
@Shreeshrii, I just made a mapping from BSD to Balinese Unicode; perhaps it is useful. |
Is http://www.unicode.org/udhr/d/udhr_ban.html in BSD? I did a simple substitution using sed to convert the text from there to Unicode using the mapping you suggested. I don't think it is correct. B is not converted, also some signs don't seem right. I don't know the language to verify. |
It is in Balinese Latin (like Javanese Latin, which uses the conventional name "java", while Javanese script uses "java-jav"). From there we can convert it to many Balinese script fonts (BSD, Vimala, Noto Serif Balinese, etc.), but this needs some rule-based text preprocessing first.
For the reverse process (Balinese script to Balinese Latin), I actually don't know how to make this work in Tesseract, as I illustrated in the attachment. |
Oh, for Balinese Script to Balinese Latin at the illustration file |
The way tesseract (lstm version) works, the image will be recognised as Unicode text which will render correctly with Unicode Balinese fonts. So, both Vimala and Noto fonts should be able to render the same output. |
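
A quick sanity check on recognized text is to see how much of it falls inside the Unicode Balinese block (U+1B00 to U+1B7F). The sketch below is only a rough heuristic with an illustrative function name; it ignores whitespace and treats Latin digits or punctuation as "outside", which may or may not be what you want.

```python
BALINESE_START, BALINESE_END = 0x1B00, 0x1B7F  # Unicode "Balinese" block


def balinese_ratio(text: str) -> float:
    """Fraction of non-whitespace code points that fall in the Balinese block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    inside = sum(BALINESE_START <= ord(c) <= BALINESE_END for c in chars)
    return inside / len(chars)
```

A ratio near 1.0 suggests the output is Balinese Unicode that Vimala or the Noto fonts should render; a low ratio on BSD-encoded text is expected, since that text uses other code points.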
Hi @Shreeshrii, I just improved the BSD code to Unicode mapping at https://github.com/gindrawan/balinese-bsdcode-2-unicode based on your sed file (in the attachment I marked entries with status OK, REV, and ADDED; not all of the added mappings are in it, see the link). I have tested bali1.traineddata from https://github.com/Shreeshrii/tesstrain-bali/tree/master/data using a simple BSD word image, but the result is still not right (the file is in the attachment, with ground truth text files in BSD code and Unicode for checking). Perhaps that is because the model has not yet been trained on BSD. Regarding udhr.latn.txt, if you want to transliterate it to BSD-style Balinese script, you can try this Android app (still not perfect, though): https://play.google.com/store/apps/details?id=id.ac.undiksha.aksarabalisd&hl=en |
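
One pitfall with plain sequential substitution is that a short rule can fire before a longer one that starts with the same characters. A possible alternative is a longest-match-first lookup, sketched below. This is not the code from the balinese-bsdcode-2-unicode repo or the sed file; the function name is hypothetical, and the mapping in the usage note is a made-up ASCII placeholder, not real BSD codes.

```python
def convert_with_mapping(text: str, mapping: dict) -> str:
    """Replace source sequences using longest-match-first lookup."""
    if not mapping:
        return text
    max_len = max(len(key) for key in mapping)
    out, i = [], 0
    while i < len(text):
        # Try the longest candidate at this position first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in mapping:
                out.append(mapping[chunk])
                i += size
                break
        else:
            out.append(text[i])  # unmapped character is passed through unchanged
            i += 1
    return "".join(out)
```

With the placeholder mapping `{"a": "X", "ab": "Y"}`, `convert_with_mapping("abac", ...)` returns `"YXc"`: the two-character rule wins where it applies, and the unmapped `c` is left as-is, which makes unconverted characters easy to spot.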
You have given link to apps that convert from Latn to BSD as well as Noto
(Unicode) for Balinese.
What will be helpful, if you want to train for BSD, is you can send me two
text files, one in BSD and one in Noto, for the same Balinese text. Similar
to file you sent earlier, but that was just one word.
…On Mon, Mar 30, 2020 at 10:54 AM gindrawan ***@***.***> wrote:
I did a simple substitution using sed to convert the text from there to
Unicode using the mapping you suggested. I don't think it is correct. B
is not converted, also some signs don't seem right. I don't know the
language to verify.
bsd2unicode.sed.txt
<https://github.com/tesseract-ocr/langdata/files/4400924/bsd2unicode.sed.txt>
bakta.zip
<https://github.com/tesseract-ocr/langdata/files/4400927/bakta.zip>
|
I just made them, but they are still small since generating them is quite manual. How will you train Tesseract with such data? |
Hi,
I want to develop OCR for Balinese script (https://en.wikipedia.org/wiki/Balinese_script) using Tesseract 4.0 and the tool jTessBoxEditor 2.2.1 (which still does not support LSTM?).
There are two fonts involved (in the attachment)
I want to accommodate both types of fonts, with priority given to Bali Simbar Dwijendra. Sorry, I am new to Tesseract; how do I start?
Thank you very much for your kind attention.
Best regards, Indra
bali-simbar-dj-noto-serif-balinese.zip