Closed
Description
Hello.
I have some questions about the situation with tests in the Tesseract repo.
- I think we lack unit tests in the Tesseract repo. I see some tests here, but in my opinion there are too few. Should we add more unit tests? Should we write them for the old engine too, or only for the newer LSTM engine? I also think we could move the unit tests to CMake and integrate running them into Travis CI/AppVeyor.
- I am a little confused about how we test recognition quality. How do we do it? I found this, but it seems outdated. Could we collect different images (only images with a suitable license, of course), prepare ground truth, and check Tesseract against that set for regressions? I think this is very important for an OCR engine. We could also integrate the regression tests into Travis CI/AppVeyor.
- We should test Tesseract with the Google sanitizers. For this we need some tests, a Tesseract build compiled with the sanitizers enabled, and a way to run the tests. I expect we would find some errors, and it would also help prevent some mistakes in the future.
- (Hint) I suggest collecting the images from issues and adding them to our test set.
- As mentioned in "Use OSS-Fuzz for improved code quality" #1351, we should try to use OSS-Fuzz with Tesseract. At my last job we found a lot of problems with different Tesseract options (some of them led to crashes).

I am trying to understand whether I can work on this for Tesseract. Would this work be welcome? I suggest we discuss testing here.
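For the sanitizer point, a possible workflow (just a sketch, assuming the current autotools build; the flags and paths are illustrative and may need adjusting) is to rebuild with AddressSanitizer and UBSan and then run whatever test targets exist:

```shell
# Rebuild Tesseract with ASan + UBSan instrumentation (autotools build assumed).
./autogen.sh
./configure CXXFLAGS="-g -O1 -fno-omit-frame-pointer -fsanitize=address,undefined"
make -j"$(nproc)"

# Run the test suite, if/once one is wired into the build;
# sanitizer reports are printed to stderr on any detected error.
make check

# A simple smoke test: run the CLI on a sample image
# (binary and image paths depend on the build layout).
ASAN_OPTIONS=detect_leaks=1 ./tesseract testing/phototest.tif out
```

The same instrumented build could then be driven by Travis CI/AppVeyor so that new sanitizer findings fail the pipeline.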
Activity
jbreiden commented on Jun 4, 2018
Google has a bunch of tests that we should add to the repo. They will need some effort to get them working there. Here's an example of one of them; note how we'll have to change things like ABSL_ARRAYSIZE.
Shreeshrii commented on Jun 4, 2018
jbreiden commented on Jun 4, 2018
Google has 56 files of tests for Tesseract. None of them will work as-is with the GitHub repo, but at least some could be adapted without too much effort. That might be a good starting point, especially for someone like @zamazan4ik who sounds excited about writing or improving tests.
Shreeshrii commented on Jun 4, 2018
@jbreiden It would be great if you could add them to GitHub for @zamazan4ik to update. Thanks!
stweil commented on Jun 4, 2018
@zamazan4ik, could you please add "RFC:" to the title of this issue ("RFC: Situation with tests in Tesseract")? That makes it clear that it is not a bug report.
@jbreiden, I also think that the available test code should be added to git, even if it is currently not integrated into the build process. Please add only text files (test code) to tesseract git. If there are also binaries (images, tessdata), they can be added to https://github.com/tesseract-ocr/test.
zamazan4ik commented on Jun 4, 2018
@jbreiden Thank you for the information. It seems Google has unit tests, so we can wait for those. But what about regression tests? Does Google have anything for this, or should we prepare images and ground truth ourselves?
zamazan4ik commented on Jun 4, 2018
@stweil If I have some images, with or without ground truth, should I add them to https://github.com/tesseract-ocr/test? Are there any special requirements for the test data?
zamazan4ik commented on Jun 4, 2018
I also want to clarify the situation with unit tests: should we wait for the unit tests from Google, or start implementing our own?
Shreeshrii commented on Jun 4, 2018
If you have any specific unit test in mind, please go ahead and implement it. We can add the ones from Google as and when they are added to the repo and modified to work with the code on GitHub.
There was a discussion in another thread about putting all binaries related to testing in a separate repo (test), which can be included as a submodule so that the tesseract repo does not become very large. @zdenop has already created the new repo, and the images used by the current unit tests should also be moved there.
Additionally, it might be possible to reduce the sizes of some of the image test files.
zamazan4ik commented on Jun 4, 2018
@Shreeshrii Has there been any discussion about measuring recognition quality across different Tesseract runs?
stweil commented on Jun 4, 2018
I don't remember such discussions, but I think that measuring quality (not only for text recognition, but also for layout recognition) should be part of the regression tests.
Shreeshrii commented on Jun 4, 2018
No, that was not covered. Google may have such tests internally, since Ray puts statistics in his presentations, but nothing was mentioned in the context of the open-source code.
However, I think it is important to check for regressions, at least with some sample images to begin with.
The UNLV datasets cover only a limited set of languages. I would like us to be able to test each language and script, even if only with a single one-page image.
That dataset might take some time to build, but if a framework for it can be set up, new language tests can be added as and when an image and matching ground truth become available.
Shreeshrii commented on Jun 4, 2018
For example, the tests should catch cases like:
#682
LSTM: khmer is not working with --oem 1
stweil commented on Jun 4, 2018
I would not use the tesseract-ocr repositories to collect all kinds of ground truth, but of course some examples are needed for the regression tests. We need them to measure the recognition error rate, and we also need them if we have tests for training.
Maybe we could also start a wiki page, https://github.com/tesseract-ocr/tesseract/wiki/Ground-Truth, to collect good sources of ground truth, just as we collect information on fonts at https://github.com/tesseract-ocr/tesseract/wiki/Fonts.
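As a concrete starting point for measuring the recognition error rate, the character error rate (CER) is just the edit distance between the recognized text and the ground truth, normalized by the ground-truth length. A minimal sketch (the function names are illustrative, not an existing Tesseract tool):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(recognized, ground_truth):
    # Character error rate: edit distance normalized by ground-truth length.
    if not ground_truth:
        return 0.0 if not recognized else 1.0
    return levenshtein(recognized, ground_truth) / len(ground_truth)

print(cer("Tesseract", "Tesseract"))               # 0.0 (exact match)
print(round(cer("Tesseraot", "Tesseract"), 3))     # 0.111 (1 substitution / 9 chars)
```

A regression harness could run tesseract over each ground-truth image, compute the CER per language, and fail if any score worsens compared to a stored baseline.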