Q&A: Training Wiki Updates and Request for Info #659
Comments
Also helpful will be info on:
e.g. For Sanskrit, I want to train by adding a layer using a list of most frequent orthographic syllables so that the unicharset is expanded to include all possible aksharas. Will this work?
|
My own question - the answer can also be added to the wiki. Is it OK to mix b/w images, produced by text2image, with gray and/or color images from book scan? |
Also, is there a way for tesseract to create line boxes for a scanned image? It would make it easier to add the truth text if the box dimensions are pre-made.
|
This feature is not implemented. I will try to implement it sometime in the next few days and send a PR. |
Another question: what effect does the add-a-layer type of training have on the unicharset in the new traineddata? For add a layer, a unicharset is required, e.g.
Meaning, if we just want to add a few characters to the unicharset, is it enough to have a good sampling of those, or do the characters from the lstm unicharset (which are unknown at this point) need to be there too? |
Traineddata files in tessdata for 4.0 were trained with The training command examples do not specify this. What are the recommended values to use for finetuning and adding a layer? |
Please see 'core dumped' error in these cases seems to be related to using --eval_listfile as part of the lstmtraining command eg. Please update the wiki, if you can confirm this, so that people are able to run the tutorial. Thanks. |
@amitdo Question to you, let me explain as briefly as I can:
I found certain groups of ocr failures in my scan case, two examples which were always wrongly detected
QuestionIs there an easy way - I guess, it could be possible and would be very userfriendly -
|
Hi @Wikinaut! Believe it or not, I haven't yet started playing with training the LSTM engine, so I don't know enough to answer your question. Hopefully, this serious 'bug' will be fixed sometime in the next month :-) Some observations: 'ë' does not appear in the training text, 'é' appears 4 times.
'für' appears 10 times in the training text.
Did you try |
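A quick way to reproduce this kind of observation on any training text is sketched below. It is only an illustration: the `sample` string stands in for a real training text file, and the helper names are made up for this example.

```python
import re
import unicodedata
from collections import Counter


def count_chars(text, chars):
    """Return how often each character in `chars` occurs in `text`."""
    # Normalize to NFC so that 'e' + combining diaeresis counts as 'ë'.
    counts = Counter(unicodedata.normalize("NFC", text))
    return {ch: counts[ch] for ch in unicodedata.normalize("NFC", chars)}


def count_word(text, word):
    """Count whole-word, case-sensitive occurrences of `word` in `text`."""
    pattern = r"(?<!\w)" + re.escape(word) + r"(?!\w)"
    return len(re.findall(pattern, text))


# Hypothetical stand-in for the contents of a training text file.
sample = "Das ist für dich. Ein Café für alle. Gummiisolation."
print(count_chars(sample, "ëéü"))
print(count_word(sample, "für"))
```

For a real check, read the training text with `open(path, encoding="utf-8").read()` and pass it in place of `sample`.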
@amitdo my original text uses a very "bad" font, where the characters overlap very often, and the characters often look, but are not, "ligatures". This explains the "fiir" in many cases (in my case). I also tried |
@amitdo and regarding my question above, can I "quickly" retrain my "deu" training data (or a copy of it) with a corrected text, this would be really great? Promise: some mBitcoins for this today! |
Whoever coded the LSTM: Big APPLAUSE for him or her! |
LSTM - New OCR engine based on neural networks. |
@amitdo yes, but what if one selects |
Both engines run and the results are combined in some way. |
👍 |
As said, I have zero experience training the LSTM engine. What you want is described here: |
Please provide a sample image for testing. |
"für" vs. Tesseract: "fiir"
"Citroën" vs. Tesseract: "Citroén" |
ë is not in the training_text. It needs to be added; I hope that @theraysmith will include it in the next training. für is being recognized (see the attached output files), though ö was not recognized in one image. |
https://en.wikipedia.org/wiki/German_language#Orthography |
It does look like 'ii' (two 'i's), doesn't it? Maybe the training text needs some examples of 'ii' so it can learn to distinguish it from 'ü'. |
@Shreeshrii in my conversion, these words "für" were recognized as "fiir". This may be due to my use of "unpaper" as a preprocessor, and/or my use of "-l deu+eng --oem 2" for the conversion. There were many more occurrences of the false detection "fiir" in my roughly 700 pages of text. This was the most frequent conversion error and prompted me to ask you how I could retrain tessdata using my corrected text file. A simple command line would be very helpful for such cases. @amitdo regarding "ii": in my text, tesseract correctly OCR-ed "ii" in the words "Gummiisolation" and "Daiichi" (a name). |
@theraysmith You appear to be the expert to answer my question of whether such a re-training procedure (tesseract + LSTM) is easily possible. (I already described it above.) Can I "quickly" retrain my "deu" (or "deu+eng") training data (or a copy of it) with a corrected text?
- in.pdf -> tesseract -> out.txt
- out.txt -> manually corrected -> corrected.txt
- retraining tesseract (to get tesseract') with these inputs: in.pdf + corrected.txt
Re-running with the re-trained tesseract' should in the best case result in
- in.pdf -> tesseract' -> corrected.txt
I found the present training explanations in the Wiki but do not (yet) understand them, and perhaps my idea is not yet covered. |
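Independent of whether such retraining is possible, a pair of files like out.txt and corrected.txt can at least be used to tally which misrecognitions occur most often. The following is a minimal sketch using Python's standard `difflib`; it does not retrain anything, and the two short strings are invented stand-ins for the real files.

```python
import difflib
from collections import Counter


def substitution_counts(ocr_text, corrected_text):
    """Count (ocr_word, corrected_word) pairs where the two texts differ."""
    ocr_words = ocr_text.split()
    good_words = corrected_text.split()
    matcher = difflib.SequenceMatcher(a=ocr_words, b=good_words, autojunk=False)
    pairs = Counter()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only one-to-one replacements are counted; insertions, deletions and
        # uneven replacements are skipped to keep the sketch simple.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            pairs.update(zip(ocr_words[i1:i2], good_words[j1:j2]))
    return pairs


# Hypothetical stand-ins for the contents of out.txt and corrected.txt.
ocr = "Das ist fiir dich und fiir mich von Citroén"
fixed = "Das ist für dich und für mich von Citroën"
for (wrong, right), n in substitution_counts(ocr, fixed).most_common():
    print(f"{wrong!r} -> {right!r}: {n}")
```

Word-level alignment is a simplification; on long texts with many line-break differences, aligning page by page is likely to give cleaner counts.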
This kind of retraining would be desirable, but is not available.
In your case you don't need it though, as 4.00 works for all the examples of "für" that you provided.
You just need to make sure you are using the latest code and data.
As Amit points out, e-diaeresis (ë) is not in the German alphabet. I correctly got "Citroën" by using fra+deu as the language. Unfortunately, it doesn't work with deu+fra, and neither works for the 2nd example.
BTW this needed a bug fix for multi-language, which I will check in soon.
…On Mon, Jan 16, 2017 at 9:26 AM, Wikinaut ***@***.***> wrote:
@theraysmith <https://github.com/theraysmith> You appear to be the expert
for answering my question, if such a procedure for re-training (tesseract +
LSTM) is easily possible, or not:
(I described it already above:)
Can I "quickly" retrain my "deu" (or "deu+eng") training data (or a copy
of it) with a corrected text ?
- in.pdf -> tesseract -> out.txt
- out.txt -> manually corrected -> *corrected.txt*
- retraining tesseract (to get tesseract' )with these inputs: in.pdf +
*corrected.txt*
re-running re-trained tesseract should in the best case result in
- in-pdf -> tesseract' -> corrected.txt
I found but do not (yet) understand the present training explanations in
the Wiki, and perhaps is my idea not yet covered.
--
Ray.
|
@theraysmith Thank you for your swift answer. In my case, many "für" were detected as "fiir", when or when not using I will retry - and report here - with only |
@theraysmith to be more precise: I tried tesseract with |
New box renderer |
Some of the problems with German texts were addressed in tesseract-ocr/langdata#54, tesseract-ocr/langdata#56 and tesseract-ocr/langdata#57. I don't know whether those fixes are sufficient to improve future trainings. |
@stweil @amitdo Stefan, please can you also make sure that common words with a https://en.wikipedia.org/wiki/Diaeresis_(diacritic) (Deutsch: Trema) like |
I addressed the more general question whether all European languages should support all typical diacritical characters in the tesseract-dev forum and need information from @theraysmith to proceed. |
I expect that using additional languages has more side effects than recognizing additional characters, because they also add word lists, unigram frequencies, word bigrams and so on for those languages, which might have a negative effect on OCR results for texts which are mainly written in a single language but make sparing use of additional languages. Examples of such texts are German texts with foreign personal or trademark names, but also English scientific texts with additional Greek characters (a combination often used in mathematics and physics). |
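Since the thread reports different results for fra+deu and deu+fra, one way to check the effect of language order on a given page is to run every ordering and compare the outputs. The sketch below only builds the command lines; it does not call tesseract. `page.png` and the output base names are hypothetical, and `-l` and `--oem` are the only CLI options used.

```python
import itertools


def tesseract_cmd(image, output_base, languages, oem=None):
    """Build an argv list for the tesseract CLI (nothing is executed here)."""
    cmd = ["tesseract", image, output_base, "-l", "+".join(languages)]
    if oem is not None:
        cmd += ["--oem", str(oem)]
    return cmd


def language_order_variants(image, languages, oem=None):
    """Return one command per ordering of `languages`, each with its own output base."""
    variants = []
    for order in itertools.permutations(languages):
        output_base = "out_" + "_".join(order)
        variants.append(tesseract_cmd(image, output_base, order, oem))
    return variants


for cmd in language_order_variants("page.png", ["deu", "fra"], oem=1):
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a machine where tesseract and the language data are installed, and the resulting text files can then be diffed.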
@stweil Thanks for your swift answers. Let me know, if I can help. |
Wikinaut, you can try the new best/Latin.traineddata |
Does it work now with best traineddata? Can I close this issue? |
I have not tried the latest version. Please leave this open; I will close it if it's solved. |
The best/eng.traineddata doesn't have the marks you want. Try the new best/Latin.traineddata. |
The problem with "fiir" instead of "für" is a typical example of the ii / ü confusion which still exists in the current best traineddata. The wordlist for |
@stweil I now use the new https://github.com/tesseract-ocr/tessdata_best data, and found that a problem with lowercase vs. uppercase "s" exists. In a 1000-page text, typical incorrectly detected word patterns are:
|
Try to correct the mistakes in the wordlist and see if it helps to recognize these words. |
... or run Tesseract without a wordlist. I recently removed the wordlists from the best traineddata to see and compare the real quality of the trained LSTM data. This is impossible when Tesseract uses a wordlist. With wordlists, Tesseract also invents words which don't occur in the original text ("computer" and "Google" in historical documents). PS. Is there a parameter which disables the post OCR steps (like wordlist evaluation) in Tesseract without the need to remove the wordlists from the traineddata files? |
Yes, there is a parameter which disables the wordlist evaluation. I don't remember its name right now... |
Please see #960. I guess you can set the following two config variables to false so that the wordlist dawg files are not loaded. load_system_dawg T |
The parameter is lstm_use_matrix. |
load_system_dawg should work. load_freq_dawg seems to have no impact on the lstm recognizer. |
Those config variables relate to the legacy engine. New traineddata files have a different lstm-word-dawg and have no freq-dawg files. So I am not sure whether they will work. I haven't tried it yet. |
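For anyone who wants to try it, the two variables named above can be overridden per run with tesseract's `-c name=value` option instead of editing the traineddata. This is a sketch that only assembles the command; whether the overrides affect the LSTM recognizer's own dawgs is exactly the open question in this thread, and the file names are hypothetical.

```python
def dawg_free_cmd(image, output_base, lang):
    """Build a tesseract argv list that sets both dawg-loading variables to false."""
    overrides = {"load_system_dawg": "0", "load_freq_dawg": "0"}
    cmd = ["tesseract", image, output_base, "-l", lang]
    for name, value in overrides.items():
        # Each override is passed as its own "-c name=value" pair.
        cmd += ["-c", f"{name}={value}"]
    return cmd


print(" ".join(dawg_free_cmd("page.png", "out_nodawg", "deu")))
```

Running the same page with and without the overrides and diffing the two text files would show whether the variables change anything for a given traineddata file.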
I wonder why LSTM needs its own word list. I'd expect that a word list is different for different languages, and it is also reasonable to use different word lists for different kinds of text (topic, date) of the same language, but it should not depend on the OCR algorithm. |
It is not that the wordlist is different, but the fact that the legacy engine and LSTM models might be using different unicharsets. The creation and unpacking of dawgs requires unicharsets, that's why there are two sets of dawg files, even for numbers and punctuation, in addition to the wordlist. |
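To illustrate why the same wordlist can end up as different dawgs, here is a simplified sketch: a word can only be encoded if every character in it is in the unicharset. The two character sets below are invented for the example and do not reflect any real traineddata; real unicharsets are files and may contain multi-character entries, which this sketch ignores.

```python
def encodable_words(words, unichars):
    """Split `words` into those spelled only with `unichars` and the rest."""
    ok, rejected = [], []
    for word in words:
        if all(ch in unichars for ch in word):
            ok.append(word)
        else:
            rejected.append(word)
    return ok, rejected


# Two hypothetical character sets; the second one adds 'ë' and 'é'.
charset_a = set("abcdefghijklmnopqrstuvwxyzäöüßCG")
charset_b = charset_a | set("ëé")

words = ["für", "Citroën", "Gummiisolation"]
print(encodable_words(words, charset_a))
print(encodable_words(words, charset_b))
```

With the first set "Citroën" is rejected, with the second it is kept, so the resulting word lists differ even though the input wordlist is identical.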
@theraysmith
Ray, Thanks for updating the Wiki page for LSTM training. A few more changes in the following may be required in light of the updates:
Please also provide a command for building traineddata with just the .lstm file or with just .lstm and lstm-dawgs (so as to minimize traineddata filesize, if only LSTM is going to be used).
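Until there is an official answer, one possible approach is to unpack a traineddata file with `combine_tessdata -u`, keep only the LSTM-related components, and recombine them. The sketch below is untested against real data. It assumes that the LSTM components are the unpacked files whose extension starts with `lstm`; whether those alone are sufficient may depend on the Tesseract version. The file names and the `tmp/deu.` prefix are hypothetical.

```python
def lstm_components(filenames, lang):
    """Return the unpacked component files whose extension starts with 'lstm'."""
    prefix = f"{lang}."
    keep = []
    for name in filenames:
        if name.startswith(prefix) and name[len(prefix):].startswith("lstm"):
            keep.append(name)
    return sorted(keep)


def unpack_cmd(traineddata, out_prefix):
    """argv for unpacking all components, e.g. into files named tmp/deu.*"""
    return ["combine_tessdata", "-u", traineddata, out_prefix]


def repack_cmd(out_prefix):
    """argv for combining whatever component files remain under `out_prefix`."""
    return ["combine_tessdata", out_prefix]


# Hypothetical listing of an unpacked directory.
files = ["deu.lstm", "deu.lstm-word-dawg", "deu.unicharset", "deu.word-dawg", "deu.inttemp"]
print(lstm_components(files, "deu"))
print(" ".join(unpack_cmd("deu.traineddata", "tmp/deu.")))
print(" ".join(repack_cmd("tmp/deu.")))
```

The intended flow is: unpack, delete every file not returned by `lstm_components`, then run the repack command, which writes a new traineddata file next to the components.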