RFC: Situation with tests in Tesseract #1627

Closed
@zamazan4ik

Description

Hello.
I have some questions about the state of testing in the Tesseract repo.

  1. I think the Tesseract repo lacks unit tests. Here I see some tests, but I think there are too few. Should we add more? Should we write unit tests for the old engine too, or only for the newer LSTM engine? I also think we could move the unit tests to CMake and integrate running them into TravisCI/AppVeyor.
  2. I am a little confused about how we test recognition quality. How do we do it? I found this, but it seems outdated. Can we collect different images (only ones with a suitable license, of course), prepare ground truth, and check Tesseract against that set for regressions? I think this is very important for an OCR engine. We could also integrate the regression tests into TravisCI/AppVeyor.
  3. We should test Tesseract with the Google sanitizers. For that we need some tests, then we compile Tesseract with the sanitizers and run them. I expect we would find some errors, and this would also help prevent some mistakes in the future.
  4. (Hint) I suggest collecting all images from issues and adding them to our test set.
    I am trying to find out whether I can work on this for Tesseract. Would this work be welcome?
  5. As mentioned in Use OSS-Fuzz for improved code quality #1351, we should try to use OSS-Fuzz with Tesseract. At my previous job we found a lot of problems with various Tesseract options (some of them led to crashes).

I suggest we discuss testing here.
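
To make points 3 and 5 more concrete, here is a minimal sketch of a libFuzzer-style fuzz target of the kind OSS-Fuzz runs. `LLVMFuzzerTestOneInput` is the standard libFuzzer entry point; everything else is an assumption for illustration. In particular, `ParseOptionLine` is a hypothetical stand-in for option handling and is not a Tesseract function.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical function under test (NOT part of Tesseract): splits a
// "name value" option line at the first space. Returns false if the line
// has no space, an empty name, or an empty value.
bool ParseOptionLine(const std::string& line, std::string* name,
                     std::string* value) {
  const std::size_t space = line.find(' ');
  if (space == std::string::npos || space == 0) return false;
  *name = line.substr(0, space);
  *value = line.substr(space + 1);
  return !value->empty();
}

// Standard libFuzzer entry point. The fuzzer calls it repeatedly with
// generated inputs; crashes and sanitizer reports are the findings.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, std::size_t size) {
  if (size == 0) return 0;  // Nothing to parse.
  const std::string line(reinterpret_cast<const char*>(data), size);
  std::string name;
  std::string value;
  ParseOptionLine(line, &name, &value);  // The result itself is not checked.
  return 0;  // libFuzzer expects 0 from the target.
}
```

With clang, a target like this is typically built with `-fsanitize=fuzzer,address`, so the same binary covers both the fuzzing and the sanitizer idea. A real target would call into Tesseract instead of the stand-in parser.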

Activity

jbreiden

jbreiden commented on Jun 4, 2018

@jbreiden
Contributor

Google has a bunch of tests that we should add to the repo. They will need some effort to get them to work there. Here's an example of one of them; note how we'll have to change things like ABSL_ARRAYSIZE:

#include "tesseract/ccstruct/statistc.h"
#include "tesseract/ccutil/genericvector.h"
#include "tesseract/ccutil/kdpair.h"

namespace {

const int kTestData[] = { 2, 0, 12, 1, 1, 2, 10, 1, 0, 0, 0, 2, 0, 4, 1, 1 };

class STATSTest : public testing::Test {
 public:
  void SetUp() {
    stats_.set_range(0, 16);
    for (int i = 0; i < ABSL_ARRAYSIZE(kTestData); ++i)
      stats_.add(i, kTestData[i]);
  }

  void TearDown() {
  }

  STATS stats_;
};

// Tests some basic numbers from the stats_.
TEST_F(STATSTest, BasicStats) {
  EXPECT_EQ(37, stats_.get_total());
  EXPECT_EQ(2, stats_.mode());
  EXPECT_EQ(12, stats_.pile_count(2));
}

// Tests the top_n_modes function.
TEST_F(STATSTest, TopNModes) {
  GenericVector<tesseract::KDPairInc<float, int> > modes;
  int num_modes = stats_.top_n_modes(3, &modes);
  EXPECT_EQ(3, num_modes);
  // Mode 0 is 12 1 1 = 14 total count with a mean of 2 3/14.
  EXPECT_FLOAT_EQ(2.0f + 3.0f / 14, modes[0].key);
  EXPECT_EQ(14, modes[0].data);
  // Mode 1 is 2 10 1 = 13 total count with a mean of 5 12/13.
  EXPECT_FLOAT_EQ(5.0f + 12.0f / 13, modes[1].key);
  EXPECT_EQ(13, modes[1].data);
  // Mode 2 is 4 1 1 = 6 total count with a mean of 13.5.
  EXPECT_FLOAT_EQ(13.5f, modes[2].key);
  EXPECT_EQ(6, modes[2].data);
}

}  // namespace
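
One way the ABSL_ARRAYSIZE dependency could be removed is sketched below. This is only an illustration under stated assumptions: `ArraySize` is a hypothetical helper (C++17 code could use `std::size` instead), and `SimpleStats` is a heavily simplified stand-in written for this example, not Tesseract's `STATS` class. It treats the range passed to `set_range` as half-open, which is also an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical portable replacement for ABSL_ARRAYSIZE: the element count
// is deduced at compile time from the array type.
template <typename T, std::size_t N>
constexpr std::size_t ArraySize(const T (&)[N]) {
  return N;
}

const int kTestData[] = {2, 0, 12, 1, 1, 2, 10, 1, 0, 0, 0, 2, 0, 4, 1, 1};

// Simplified stand-in for a histogram class (NOT Tesseract's STATS).
// Assumes the range [min_bucket, max_bucket) is half-open.
class SimpleStats {
 public:
  void set_range(int min_bucket, int max_bucket) {
    min_ = min_bucket;
    buckets_.assign(max_bucket - min_bucket, 0);
  }
  void add(int value, int count) { buckets_[value - min_] += count; }
  int pile_count(int value) const { return buckets_[value - min_]; }
  int get_total() const {
    return std::accumulate(buckets_.begin(), buckets_.end(), 0);
  }
  // Returns the value with the largest count (the first one on ties).
  int mode() const {
    const auto it = std::max_element(buckets_.begin(), buckets_.end());
    return static_cast<int>(it - buckets_.begin()) + min_;
  }

 private:
  int min_ = 0;
  std::vector<int> buckets_;
};

// Mirrors the SetUp() loop above, with ArraySize in place of ABSL_ARRAYSIZE.
SimpleStats BuildStats() {
  SimpleStats stats;
  stats.set_range(0, 16);
  for (std::size_t i = 0; i < ArraySize(kTestData); ++i)
    stats.add(static_cast<int>(i), kTestData[i]);
  return stats;
}
```

With the stand-in, `BuildStats()` gives a total of 37, a mode of 2, and a pile count of 12 at index 2, matching the expectations in `BasicStats` above.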
Shreeshrii

Shreeshrii commented on Jun 4, 2018

@Shreeshrii
Collaborator
jbreiden

jbreiden commented on Jun 4, 2018

@jbreiden
Contributor

Google has 56 files of tests for Tesseract. None of them will work as-is with the GitHub repo, but at least some could be adapted without too much effort. Might be a good starting point, especially for someone like @zamazan4ik who sounds excited about writing or improving tests.

Shreeshrii

Shreeshrii commented on Jun 4, 2018

@Shreeshrii
Collaborator

@jbreiden It will be great if you can add them to GitHub for @zamazan4ik to update. Thanks!

stweil

stweil commented on Jun 4, 2018

@stweil
Member

@zamazan4ik, could you please add "RFC:" to the title of this issue ("RFC: Situation with tests in Tesseract")? That makes it clear that it is not a bug report.

@jbreiden, I also think that the available test code should be added to git, even if it is currently not integrated in the build process. Please add only text files (test code) to tesseract git. If there are also binaries (images, tessdata), they can be added to https://github.com/tesseract-ocr/test.

changed the title from "Situation with tests in Tesseract" to "RFC: Situation with tests in Tesseract" on Jun 4, 2018
zamazan4ik

zamazan4ik commented on Jun 4, 2018

@zamazan4ik
Contributor, Author

@jbreiden Thank you for the information. It seems Google has unit tests. Okay, we can wait for them. But what about regression tests? Does Google have anything for this, or should we prepare the images and ground truth ourselves?

zamazan4ik

zamazan4ik commented on Jun 4, 2018

@zamazan4ik
Contributor, Author

@stweil If I have some images with/without ground truth for them, should I add them to https://github.com/tesseract-ocr/test? Are there any special requirements for test data?

zamazan4ik

zamazan4ik commented on Jun 4, 2018

@zamazan4ik
Contributor, Author

I also want to clarify the situation with unit tests. Do we want to wait for the unit tests from Google, or start implementing our own?

Shreeshrii

Shreeshrii commented on Jun 4, 2018

@Shreeshrii
Collaborator

If you have any specific unit test in mind, please go ahead and implement it.

We can add the ones from Google as and when they are added to the repo and modified to work with the code in GitHub.

There was a discussion in another thread about putting all test-related binaries in a separate repo (test), which can be pulled in as a submodule so that the tesseract repo does not become very large.

@zdenop has already created a new repo and the images used by current unittests should also be moved there.

Additionally, it might be possible to reduce some test file sizes for image files.

zamazan4ik

zamazan4ik commented on Jun 4, 2018

@zamazan4ik
Contributor, Author

@Shreeshrii Was there any discussion about measuring recognition quality across different Tesseract runs?

stweil

stweil commented on Jun 4, 2018

@stweil
Member

I don't remember such discussions, but I think that measuring the quality (not only for text recognition, but also for layout recognition) should be part of the regression tests.

Shreeshrii

Shreeshrii commented on Jun 4, 2018

@Shreeshrii
Collaborator

No, that was not covered. Google may have such internal tests, since Ray puts statistics in his presentations, but nothing was mentioned in the context of the open source code.

However, I think it is important to check for regressions, at least with some sample images to begin with.

The UNLV datasets cover only a limited set of languages. I would like us to be able to test each language and script, even if it is with just a single one-page image.

That dataset might take some time to build, but if a framework for it can be set up, new language tests can be added as and when the image and matching ground truth become available.

Shreeshrii

Shreeshrii commented on Jun 4, 2018

@Shreeshrii
Collaborator

For example the tests should catch cases like:

#682
LSTM: khmer is not working with --oem 1

stweil

stweil commented on Jun 4, 2018

@stweil
Member

> If I have some images with/without ground truth for them, should I add them to https://github.com/tesseract-ocr/test? Are there any special requirements for test data?

I would not use the tesseract-ocr repositories to collect all kinds of ground truth, but of course some examples are needed for the regression tests. We need them to measure the recognition error rate, and we need them if we have tests for training, too.
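
As a rough illustration of what measuring the recognition error rate could look like, here is a minimal sketch that computes a character error rate (CER) as the Levenshtein distance between OCR output and ground truth, divided by the ground-truth length. All names (`EditDistance`, `CharErrorRate`, `IsRegression`) and the tolerance-based check are assumptions made for this example, not existing Tesseract code. The sketch compares bytes, so for non-Latin scripts the UTF-8 text would first have to be decoded into code points.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Levenshtein distance between two byte strings, computed with two rolling
// rows of the dynamic-programming table.
std::size_t EditDistance(const std::string& a, const std::string& b) {
  std::vector<std::size_t> prev(b.size() + 1);
  std::vector<std::size_t> cur(b.size() + 1);
  for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
  for (std::size_t i = 1; i <= a.size(); ++i) {
    cur[0] = i;
    for (std::size_t j = 1; j <= b.size(); ++j) {
      const std::size_t subst = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
      cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, subst});
    }
    std::swap(prev, cur);
  }
  return prev[b.size()];
}

// Character error rate: edit distance divided by ground-truth length.
// For empty ground truth, any OCR output counts as fully wrong.
double CharErrorRate(const std::string& truth, const std::string& ocr) {
  if (truth.empty()) return ocr.empty() ? 0.0 : 1.0;
  return static_cast<double>(EditDistance(truth, ocr)) / truth.size();
}

// Hypothetical regression rule: flag a run whose CER exceeds the stored
// baseline by more than the given tolerance.
bool IsRegression(double baseline_cer, double new_cer, double tolerance) {
  return new_cer > baseline_cer + tolerance;
}
```

A regression test could store one baseline CER per image and language, run Tesseract on the image, and fail when `IsRegression` returns true. The tolerance leaves room for small, expected fluctuations between versions.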

Maybe we could also start a Wiki page https://github.com/tesseract-ocr/tesseract/wiki/Ground-Truth to collect good sources of ground truth, like we collect information on fonts at https://github.com/tesseract-ocr/tesseract/wiki/Fonts.
