Fix and enable lstm related unittests #2180

Merged
merged 5 commits into from
Jan 24, 2019

Conversation

Shreeshrii
Collaborator

I will upload required testdata files to the test repo.

@Shreeshrii changed the title from "Fix and build lstm related unittests" to "Fix and enable lstm related unittests" on Jan 22, 2019
@Shreeshrii
Collaborator Author

lstm_test.log

@stweil Please see the attached log file. The error rates for many tests are much lower than the expected values. I wonder whether this is related to using the Batch/Mean error as the Best error. Is that how error rates are calculated in Tesseract?

Of course the difference could just be because the 'testdata' is different.

@ghost assigned stweil on Jan 23, 2019
@ghost added the review label on Jan 23, 2019
Fixed a merge conflict.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
@stweil
Member

stweil commented Jan 23, 2019

I added a commit to fix a merge conflict with Git master.

unittest/log.h Outdated
       break;
     case ERROR:
-      std::cout << "[ERROR] ";
+      std::cout << "\n[ERROR] ";
Member

@Shreeshrii, did you find the implementation which is used by Google, and does that implementation add line feeds like that?

Collaborator Author

Google might be using the implementation in glog - https://github.com/google/glog/blob/master/src/windows/glog/logging.h

I added the linefeed because I thought it might increase readability. It could probably be replaced by a space.

Member

I see. The Google implementation adds a linefeed at the end if the log string does not already end with one. As Tesseract only has a few users of LOG, I think the linefeed characters can be added locally when calling LOG if needed. I suggest removing the third commit, at least for now.
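
For illustration, the behaviour described above (append a linefeed only when the message does not already end with one) could be sketched as follows. This is a minimal sketch, not the code in unittest/log.h or in glog; the class name `LogLine` and its interface are hypothetical.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical helper (not the actual unittest/log.h or glog code):
// buffers one log message and writes it when the object is destroyed,
// appending '\n' only if the message does not already end with one.
class LogLine {
 public:
  LogLine(std::ostream& out, const char* prefix) : out_(out) {
    buffer_ << prefix;
  }
  ~LogLine() {
    std::string text = buffer_.str();
    if (text.empty() || text.back() != '\n') text += '\n';
    out_ << text;
  }
  // Forward anything streamable into the internal buffer.
  template <typename T>
  LogLine& operator<<(const T& value) {
    buffer_ << value;
    return *this;
  }

 private:
  std::ostream& out_;
  std::ostringstream buffer_;
};
```

A call such as `LogLine(std::cout, "[ERROR] ") << "message";` would then end the line at the end of the statement, so callers would not need a leading "\n" in the prefix.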

Collaborator Author

@stweil I don't know how to remove the commit. I have added another commit reverting the change.

@stweil merged commit bbd23bb into tesseract-ocr:master on Jan 24, 2019
@ghost removed the review label on Jan 24, 2019
@@ -17,7 +17,7 @@
#include "fileio.h" // for tesseract::File
#include "gtest/gtest.h"

-const char* FLAGS_test_tmpdir = ".";
+const char* FLAGS_test_tmpdir = "./tmp";
Member

That directory is missing for builds which are not started in the root directory, so a lot of tests fail or crash currently. Do we need this change?

Collaborator Author

When I ran make check in the tesseract root directory, all files generated by the unit tests were created in the unittest root directory, and make clean did not remove them. So I thought it would be helpful to have a separate directory for the generated files.

There may be a better way to accomplish this. Please change as you see fit. Thanks.
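
One possible way to keep a separate directory without breaking builds started elsewhere would be to create the directory before any test writes to it. The following is only a sketch under assumptions: it requires C++17 `std::filesystem`, and the helper name `EnsureTestTmpDir` is hypothetical and not part of the Tesseract unit tests.

```cpp
#include <filesystem>
#include <string>
#include <system_error>

// Hypothetical helper (assumes C++17): make sure the directory used for
// generated test files exists, regardless of where the test was started.
// Returns true if the directory exists afterwards.
bool EnsureTestTmpDir(const std::string& dir) {
  std::error_code ec;
  // Creates missing parent directories too; an existing directory is not an error.
  std::filesystem::create_directories(dir, ec);
  if (ec) return false;
  return std::filesystem::is_directory(dir, ec) && !ec;
}
```

A test fixture could call `EnsureTestTmpDir(FLAGS_test_tmpdir)` in its setup step before opening any output file.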

@Shreeshrii
Collaborator Author

Shreeshrii commented Jan 24, 2019

When built with --enable-openmp

[       OK ] LSTMTrainerTest.TestSquashed (121074 ms)
[----------] 1 test from LSTMTrainerTest (121074 ms total)

With --disable-openmp

[       OK ] LSTMTrainerTest.TestSquashed (250335 ms)
[----------] 1 test from LSTMTrainerTest (250335 ms total)
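
For reference, the two timings above differ by a factor of roughly 2.07 (250335 ms / 121074 ms). A trivial sketch of that arithmetic, with a hypothetical function name:

```cpp
// Hypothetical helper: ratio of two wall-clock durations in milliseconds.
// A value above 1 means the second configuration was faster.
double Speedup(double baseline_ms, double improved_ms) {
  return baseline_ms / improved_ms;
}
```

Here `Speedup(250335, 121074)` compares the --disable-openmp run against the --enable-openmp run for LSTMTrainerTest.TestSquashed.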

@Shreeshrii
Collaborator Author

@stweil Ref: #2180 (comment)

Did you have a chance to look into this?

I reran tesstutorial today. According to Ray in https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#training-from-scratch:

The character error rate falls below 50% just after 3700 iterations, and by 5000 to about 13%, where it will terminate. (In about 20 minutes on a current high-end machine with AVX.)

In my test run today:

At iteration 3684/3700/3700, Mean rms=1.506%, delta=3.148%, char train=10.694%, word train=25.578%, skip ratio=0%,  New best char error = 10.694 wrote best model:/home/ubuntu/tesstutorial/engoutput/base10.694_3684.checkpoint wrote checkpoint.

At iteration 4764/5000/5000, Mean rms=0.902%, delta=1.113%, char train=3.729%, word train=9.967%, skip ratio=0%,  New best char error = 3.729 wrote best model:/home/ubuntu/tesstutorial/engoutput/base3.729_4764.checkpoint wrote checkpoint.
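
To compare such runs against the figures quoted from the wiki (below 50% just after 3700 iterations, about 13% by 5000), the "char train" value can be pulled out of each log line. The following is a minimal sketch, assuming C++17; the function name is hypothetical and the pattern is inferred only from the two log lines quoted above, not from the trainer's source.

```cpp
#include <optional>
#include <regex>
#include <string>

// Hypothetical helper: extract the "char train" percentage from one line
// of training output. Returns std::nullopt if the line has no such field.
// The pattern is an assumption based on the log lines quoted above.
std::optional<double> ParseCharTrainError(const std::string& line) {
  static const std::regex pattern(R"(char train=([0-9]+(?:\.[0-9]+)?)%)");
  std::smatch match;
  if (!std::regex_search(line, match, pattern)) return std::nullopt;
  return std::stod(match[1].str());
}
```

Applied to the two lines above, this yields 10.694 at iteration 3700 and 3.729 at iteration 5000, both well below the values quoted from the wiki.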

2 participants