Suggest 'deva' for Devanagari #41

Closed
Shreeshrii opened this issue Jan 13, 2017 · 11 comments

Comments

@Shreeshrii
Contributor

With LSTM training, the dictionary dawg files have become optional. In light of this, I want to suggest an additional traineddata file for the Devanagari script, which can cater to all the main languages written in it.

The reason for suggesting this is that when I tested OCR on a Marathi text, a lot of words with rakaara were not recognised correctly. The same page OCRed with Sanskrit recognised those words correctly, but got some others wrong.

So, in addition to the multiple traineddata for various languages written in Devanagari…

@Shreeshrii
Contributor Author

A Deva.traineddata could be added, trained on the training text for all these languages taken together.
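To illustrate what "training text for all these languages taken together" would mean for the character set, here is a minimal sketch. It is not part of the Tesseract training tools; the function name `devanagari_charset` and the two tiny sample strings are hypothetical, chosen only to show that a combined set is the union of the per-language sets.

```python
def devanagari_charset(texts):
    """Return the sorted set of Devanagari code points seen across all texts.

    `texts` maps a language code to a string of training text.
    """
    chars = set()
    for text in texts.values():
        for ch in text:
            # Basic Devanagari block: U+0900..U+097F
            if "\u0900" <= ch <= "\u097f":
                chars.add(ch)
    return sorted(chars)


# Hypothetical one-word samples, NOT real training text.
samples = {
    "mar": "प्रकाश",
    "san": "धर्म",
}

# Character set per language, and for the combined "Deva" text.
per_lang = {lang: devanagari_charset({lang: text}) for lang, text in samples.items()}
combined = devanagari_charset(samples)

for lang, chars in per_lang.items():
    print(lang, len(chars))
print("combined", len(combined))
```

With real training text, the combined set would cover conjunct-forming sequences that appear in one language's text but are rare in another's, which is the situation described above for rakaara in Marathi versus Sanskrit.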

@amitdo

amitdo commented Jan 14, 2017

Related papers:

A Segmentation-Free Approach for Printed Devanagari Script Recognition (2015)
Tushar Karayil, Adnan Ul-Hasan, Thomas M. Breuel

Can we build language-independent OCR using LSTM networks? (2013)
Adnan Ul-Hasan, Thomas M. Breuel

More interesting papers about LSTM for OCR:
https://github.com/tmbdev/ocropy/wiki/Publications

@Shreeshrii
Contributor Author

Shreeshrii commented Jan 15, 2017

List of Unicode Devanagari fonts that could be used for training, if not already being used:

tesseract-ocr/tesseract#561 (comment)

Samples of glyphs in different fonts:

tesseract-ocr/tesseract#654

@amitdo

amitdo commented Jan 15, 2017

Similarly, it would be nice to have a generic traineddata for multiple Latin-script-based languages, as described in the paper I mentioned above.

Likewise, you could provide a generic Cyrillic traineddata.

@amitdo

amitdo commented Jan 15, 2017

And maybe one based on the Arabic script.

@amitdo

amitdo commented Mar 14, 2017

#41 (comment)
@stweil
I think it's related to your message here:
https://groups.google.com/forum/#!topic/tesseract-dev/8H_4K3vPRJE

@stweil
Member

stweil commented Mar 14, 2017

> Likewise, you could provide a generic Cyrillic traineddata.

I assume the same would be needed for Greek. Or would it be better to include Greek characters in the Latin training set? Several sciences (especially Physics and Mathematics) use single Greek characters in texts which are mostly written with Latin letters.
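As a rough way to see how much Greek actually occurs in a mostly Latin text, the sketch below counts letters per script. It uses the first word of each character's Unicode name as a stand-in for its script, which is a heuristic assumption rather than a real script property lookup; the function name `script_counts` and the sample sentence are hypothetical.

```python
import unicodedata
from collections import Counter


def script_counts(text):
    """Count alphabetic characters by the first word of their Unicode name.

    The first word (e.g. LATIN, GREEK, CYRILLIC) only approximates the script.
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            counts[name.split(" ")[0] if name else "UNKNOWN"] += 1
    return counts


# Hypothetical sample: Latin text with single Greek symbols, as in physics.
sample = "The angle α and the wavelength λ are related."
counts = script_counts(sample)

for script, n in counts.most_common():
    print(script, n)
```

A count like this over a real corpus could help decide whether a handful of Greek letters belongs in the Latin training set or in a separate Greek one.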

@Shreeshrii
Contributor Author

#59 (comment)

@theraysmith commented 2 days ago

I've also added an experiment to throw all the Latin languages together
into a single engine. (Actually a separate model for each of 36 scripts).
If that works it will solve the problem of reading Citroen in German and
picking up the e umlaut.
The downside is that this model has almost 400 characters in it, despite
carefully keeping out the long-tail graphics characters. Even if it does
work, it will be slower, but possibly not much slower than running 2
languages. It will have about 56 languages in it. I have some optimism that
this may work, ever since I discovered that the vie LSTM model gets the
phototest.tif image 100% correct.

@amitdo

amitdo commented Aug 1, 2017

This request was implemented by Ray:

tesseract-ocr/tessdata#62 (comment)

@Shreeshrii
Contributor Author

Thanks!
