Closed
Description
Hello.
I have some questions about the situation with tests in the Tesseract repo.
- I think we lack unit tests in the Tesseract repo. I see some tests here, but in my opinion there are too few. Should we add more unit tests? Should we write them for the old engine too, or only for the newer LSTM engine? I also think we could move the unit tests to CMake and integrate running them into Travis CI/AppVeyor.
- I am a little confused about how we test recognition quality. How do we do it? I found this, but it seems outdated. Could we collect different images (only images with a suitable license, of course), prepare ground truth, and check Tesseract against that set for regressions? I think this is very important for an OCR engine. We could also integrate the regression tests into Travis CI/AppVeyor.
- We should test Tesseract with the Google sanitizers. For this we need some tests, a Tesseract build compiled with the sanitizers enabled, and a way to run the tests. I expect we would find some errors, and it would also help prevent some mistakes in the future.
- (Hint) I suggest collecting the images from issues and adding them to our test set.
- As mentioned in "Use OSS-Fuzz for improved code quality" #1351, we should try to use OSS-Fuzz with Tesseract. At my last job we found a lot of problems with different Tesseract options (some of them led to crashes).

I am trying to understand whether I can work on this for Tesseract. Would this work be welcome? I suggest we discuss testing here.
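For the sanitizer point, a possible workflow (just a sketch, assuming the current autotools build; the flags and paths are illustrative and may need adjusting) is to rebuild with AddressSanitizer and UBSan and then run whatever test targets exist:

```shell
# Rebuild Tesseract with ASan + UBSan instrumentation (autotools build assumed).
./autogen.sh
./configure CXXFLAGS="-g -O1 -fno-omit-frame-pointer -fsanitize=address,undefined"
make -j"$(nproc)"

# Run the test suite, if/once one is wired into the build;
# sanitizer reports are printed to stderr on any detected error.
make check

# A simple smoke test: run the CLI on a sample image
# (binary and image paths depend on the build layout).
ASAN_OPTIONS=detect_leaks=1 ./tesseract testing/phototest.tif out
```

The same instrumented build could then be driven by Travis CI/AppVeyor so that new sanitizer findings fail the pipeline.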
Activity
jbreiden commented on Jun 4, 2018
Google has a bunch of tests that we should add to the repo. They will need some effort to get them working there. Here's an example of one of them; note how we'll have to change things like ABSL_ARRAYSIZE.
Shreeshrii commented on Jun 4, 2018
jbreiden commented on Jun 4, 2018
Google has 56 files of tests for Tesseract. None of them will work as-is with the GitHub repo, but at least some could be adapted without too much effort. That might be a good starting point, especially for someone like @zamazan4ik who sounds excited about writing or improving tests.
Shreeshrii commented on Jun 4, 2018
@jbreiden It would be great if you could add them to GitHub for @zamazan4ik to update. Thanks!
stweil commented on Jun 4, 2018
@zamazan4ik, could you please add "RFC:" to the title of this issue ("RFC: Situation with tests in Tesseract")? That makes it clear that it is not a bug report.
@jbreiden, I also think that the available test code should be added to git, even if it is currently not integrated into the build process. Please add only text files (test code) to tesseract git. If there are also binaries (images, tessdata), they can be added to https://github.com/tesseract-ocr/test.
zamazan4ik commented on Jun 4, 2018
@jbreiden Thank you for the information. It seems Google has unit tests, so we can wait for those. But what about regression tests? Does Google have anything for this, or should we prepare images and ground truth ourselves?
zamazan4ik commented on Jun 4, 2018
@stweil If I have some images, with or without ground truth, should I add them to https://github.com/tesseract-ocr/test? Are there any special requirements for the test data?
zamazan4ik commented on Jun 4, 2018
I also want to clarify the situation with unit tests: should we wait for the unit tests from Google, or start implementing our own?
Shreeshrii commented on Jun 4, 2018
If you have any specific unit test in mind, please go ahead and implement it. We can add the ones from Google as and when they are added to the repo and modified to work with the code on GitHub.
There was a discussion in another thread about putting all binaries related to testing in a separate repo (test), which can be included as a submodule so that the tesseract repo does not become very large. @zdenop has already created the new repo, and the images used by the current unit tests should also be moved there.
Additionally, it might be possible to reduce the sizes of some of the image test files.
zamazan4ik commented on Jun 4, 2018
@Shreeshrii Has there been any discussion about measuring recognition quality across different Tesseract runs?
stweil commented on Jun 4, 2018
I don't remember such discussions, but I think that measuring quality (not only for text recognition, but also for layout recognition) should be part of the regression tests.
Shreeshrii commented on Jun 4, 2018
No, that was not covered. Google may have such tests internally, since Ray puts statistics in his presentations, but nothing was mentioned in the context of the open-source code.
However, I think it is important to check for regressions, at least with some sample images to begin with.
The UNLV datasets cover only a limited set of languages. I would like us to be able to test each language and script, even if only with a single one-page image.
That dataset might take some time to build, but if a framework for it can be set up, new language tests can be added as and when an image and matching ground truth become available.
Shreeshrii commented on Jun 4, 2018
For example, the tests should catch cases like:
#682
LSTM: khmer is not working with --oem 1
stweil commented on Jun 4, 2018
I would not use the tesseract-ocr repositories to collect all kinds of ground truth, but of course some examples are needed for the regression tests. We need them to measure the recognition error rate, and we also need them if we have tests for training.
Maybe we could also start a wiki page, https://github.com/tesseract-ocr/tesseract/wiki/Ground-Truth, to collect good sources of ground truth, just as we collect information on fonts at https://github.com/tesseract-ocr/tesseract/wiki/Fonts.
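As a concrete starting point for measuring the recognition error rate, the character error rate (CER) is just the edit distance between the recognized text and the ground truth, normalized by the ground-truth length. A minimal sketch (the function names are illustrative, not an existing Tesseract tool):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(recognized, ground_truth):
    # Character error rate: edit distance normalized by ground-truth length.
    if not ground_truth:
        return 0.0 if not recognized else 1.0
    return levenshtein(recognized, ground_truth) / len(ground_truth)

print(cer("Tesseract", "Tesseract"))               # 0.0 (exact match)
print(round(cer("Tesseraot", "Tesseract"), 3))     # 0.111 (1 substitution / 9 chars)
```

A regression harness could run tesseract over each ground-truth image, compute the CER per language, and fail if any score worsens compared to a stored baseline.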