LSTM: Training - Eval not run from trainer #644
@theraysmith Any update on this?
@stweil Have you come across this problem while training? Any solution?
No, I did not see that assertion up to now. How can it be reproduced?
Please try the following, making appropriate changes for your fonts directory and tessdata directory.
On Windows, using the binaries from the AppVeyor artifacts, I am getting …
I will check with tesseract built under WSL and report separately.
I think the crash happens in builds with debug. Non-debug builds get some error messages but continue.
The non-debug build continues, but uses a bad index internally, so the results are invalid. Can you get a stack trace for the assertion?
@stweil Unrelated to this issue, one of the recent commits seems to have caused the program to slow down a lot - while training it seems to hang for a while, and even OCRing images seems to take longer. I am not sure how to verify/confirm this.
@Shreeshrii, please open a new issue about that regression. |
@stweil I don't see any patch here. |
Should the model_data here be pointing to --eval_listfile or lstmf files within it? |
@Shreeshrii, I am sorry, but somehow my last comment got lost. So once again: The assertions are caused by an index of 0 used for an empty vector. Since commit 907de59 the constructor of GenericVector no longer allocates memory for an empty vector.
I don't know whether returning … is the right fix.
I tried to analyze the problem ("why is there a data size 0?") and noticed that the behavior of the program is totally erratic when I run it in a debugger. There are several threads involved, and depending on my breakpoints the problem with the data size occurs or not. That looks like a synchronization issue, and so I decided to use Valgrind. The result is horrible:
Here is one example (total log file is about 28000 lines):
So some more work is needed. The patch shown above does not fix the real problems.
@stweil Thank you for looking into this.
may also be related. |
The new test in LSTMTrainer::UpdateErrorGraph fixes an assertion (see issues tesseract-ocr#644, tesseract-ocr#792). The new test in LSTMTrainer::ReadTrainingDump was added to improve the robustness of the code. Signed-off-by: Stefan Weil <stefan.weil@bib.uni-mannheim.de>
It looks like the pix which was created by pixScaleGeneral contains uninitialized pixel data.
Using valgrind, I am not able to find uninitialized pixels starting from pixScale (or pixScaleGeneral). I did something simple: I tried to replace line 1705 of pix2.c, which is pixCreateTemplateNoInit(), with …
and, to make sure that valgrind was catching it, I used a conditional in the inner loop. If any pixel were uninitialized, we'd get a message like …
My short test program …
I compiled it on Debian Stretch using …
Could that be the reason for the problem? Debian has Leptonica version 1.74.1-1, so maybe that fix is missing? Then Debian needs a newer version (and Tesseract should require Leptonica 1.74.2). CC @jbreiden.
Stefan, can you run your test program from our GitHub head, or replacing the 1.74.1 scale.c with the most recent one? That would determine if the normalization change fixed this problem. I ran your exact program with the current pixScaleGeneral(), on valgrind, and got no error.
Under what conditions should eval be run from trainer? Will training work if it is done without eval? |
After failing to find the problem with an uninitialized value from pixUnsharpMasking(), I will do the simplest thing, which is to make sure that the pixel values are initialized. Use of pixCreateTemplateNoInit(), instead of pixCreateTemplate(), is clearly a poor optimization. I will also remove other uses of the NoInit version in places where it's not obvious by inspection that all pixels are set.
Committed (#512) to leptonica. Hoping that this solves any uninitialized value problems. |
https://github.com/DanBloomberg/leptonica/releases/tag/1.74.3 Should we be using this for 4.00? |
As far as I see Leptonica 1.74.2 is sufficient (it solved the uninitialized value problems for Tesseract), but of course you can use the newer version, too. |
1.74.3 has a fix for the uninitialized issue (the topic of this thread), and some other bug fixes on Windows. It has far fewer Coverity scan 'bugs'. And I also made it configure-ready, as advertised in the README.
Dan, the uninitialized issue was already fixed with 1.74.2, and so were the Coverity scan issues.
Somehow I didn't realize that 1.74.2 fixed the uninitialized issue in unsharp masking. So now it's double-fixed. And 1.74.3 has even more fixed Coverity scan issues :-)
For 1.74.3, I wanted to make a configure-ready version available (and also, I hadn't done a tarball release before on GitHub). I plan to make all future releases that way.
So it seems that both 1.74.2 and 1.74.3 can be used with tesseract.
For the future, I'd like to remove the pixWriteDisplay*() functions from the library, which are only there to support some older versions of tesseract.
Hi there! I had exactly the same issue after exactly 100 iterations. I was using Leptonica 1.74.1 provided by MacPorts, then built 1.74.4 manually (just one hour after Dan put it out :) ), but I still get the same problem. What baffles me is … I'm using a 2015 MBP running OS X El Capitan (10.11.6) with Xcode 8.
Thanks!
@stweil First, thanks for the truly swift reply! It has been training for the past 2 hours for about 3,000 iterations and is still going. It's just enjoyable to see it continuing after days of struggling :) So has …
That fixed at least the currently known problems and should produce more stable results, as the trained data no longer depends on undefined values (which could produce random results).
When training with --eval_listfile in addition to --train_listfile, I have noticed that while all files from --train_listfile are loaded at the beginning of the training process, only the first two from --eval_listfile are loaded. However, I have not found any images from the --eval_listfile being used in the training (I am going by the detailed log per iteration which is displayed with --debug_interval -1). Does this mean that eval is still not being run? Or does the eval process not write a log message? If it is a question of not logging, is it possible to add a log message during evaluation?
Answered in https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00. With --debug_interval -1, the trainer outputs verbose text debug for every training iteration. @stweil Eval seems to be running now (though there are no debug messages to verify it). Thanks for the fix. Should I close this issue or does it need any other verification?
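For reference, a training invocation that exercises eval alongside training might look like the following sketch; all paths, file names, and the iteration count are placeholders for your own setup, not values taken from this thread:

```shell
# Hypothetical example; adjust paths to your own training setup.
lstmtraining \
  --model_output ~/tesstutorial/engoutput/base \
  --train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \
  --debug_interval -1 \
  --max_iterations 5000
```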
So, is this issue fixed?
@xlight Thanks to patches by @stweil and Leptonica 1.74.2, the assertion and the problem with uninitialized data related to this issue are fixed when using the latest code from GitHub. @theraysmith will have to verify that his original problem has been fixed, since I do not know what exact test he was referring to in his comment …
```
2 Percent improvement time=12184, best error was 8.073 @ 23307
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 35491/53200/53202, Mean rms=0.165%, delta=1.673%, char train=6.063%, word train=19.746%, skip ratio=0%, New best char error = 6.063
At iteration 33032, stage 1, Eval Char error rate=19.697385, Word error rate=42.197884 wrote checkpoint.
At iteration 35535/53300/53302, Mean rms=0.166%, delta=1.706%, char train=6.108%, word train=19.751%, skip ratio=0%, New worst char error = 6.108
At iteration 34063, stage 1, Eval Char error rate=20.580924, Word error rate=41.515177 wrote checkpoint.
```
Eval is being run from trainer. Closing the issue. Thanks, @stweil.
The latest code change has reverted the fix.
Fixed by 45fb7dd |
See comment from Ray at
#542 (comment)