Recognize with monitor crashes on specific file #1449

Closed
oleg-st opened this issue Apr 3, 2018 · 15 comments

Comments

@oleg-st

oleg-st commented Apr 3, 2018

Environment

  • Tesseract Version: tesseract 4.00.00alpha
  • Commit Number: f8e26ee
  • Platform: Linux osboxes 3.10.0-693.17.1.el7.x86_64 #1 SMP Thu Jan 25 20:13:58 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Current Behavior:

api->Recognize(&monitor) crashes on a specific image file.

start >= 0 && start + num <= length_:Error:Assert failed:in file ratngs.cpp, line 324
Segmentation fault (core dumped)

Expected Behavior:

No crash.

Code to reproduce:

#include "tesseract/baseapi.h"
#include "tesseract/genericvector.h"
#include "tesseract/renderer.h"
#include "tesseract/ocrclass.h"
#include <leptonica/allheaders.h>
#include <cstdio>

int main()
{
    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    ETEXT_DESC monitor;

    // Initialize tesseract-ocr with Russian, without specifying tessdata path
    if (api->Init(NULL, "rus")) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        return 1;
    }

    // Open input image with leptonica library
    Pix *image = pixRead("test.jpg");
    api->SetImage(image);
    // Recognize (should crash here)
    api->Recognize(&monitor);

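    // Release the engine and the image (not reached while the crash occurs)
    api->End();
    delete api;
    pixDestroy(&image);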

    return 0;
}

test.jpg:
https://user-images.githubusercontent.com/7984580/38233906-d99b50b8-3724-11e8-95a2-86b18d88b034.jpg

Suggested Fix:

Backtrace:

#0  ERRCODE::error (this=this@entry=0x7ffff7d8a640 <_ZL13ASSERT_FAILED>, 
    caller=caller@entry=0x7ffff789e8f0 "start >= 0 && start + num <= length_", action=action@entry=ABORT, 
    format=format@entry=0x7ffff787eeb4 "in file %s, line %d") at errcode.cpp:86
#1  0x00007ffff7855581 in WERD_CHOICE::remove_unichar_ids (this=0x135d800, start=start@entry=1, num=num@entry=1) at ratngs.cpp:324
#2  0x00007ffff784850d in remove_unichar_id (index=1, this=<optimized out>) at ratngs.h:481
#3  WERD_RES::MergeAdjacentBlobs (this=this@entry=0x20979f0, index=index@entry=0) at pageres.cpp:972
#4  0x00007ffff76db80a in tesseract::Tesseract::write_results (this=this@entry=0x7ffff7fb0010, page_res_it=..., 
    newline_type=<optimized out>, force_eol=force_eol@entry=0 '\000') at output.cpp:192
#5  0x00007ffff76dbdb3 in tesseract::Tesseract::output_pass (this=this@entry=0x7ffff7fb0010, page_res_it=..., 
    target_word_box=target_word_box@entry=0x0) at output.cpp:112
#6  0x00007ffff76c4968 in tesseract::Tesseract::recog_all_words (this=0x7ffff7fb0010, page_res=0x20a7e40, 
    monitor=monitor@entry=0x7fffffffe280, target_word_box=target_word_box@entry=0x0, word_config=word_config@entry=0x0, 
    dopasses=dopasses@entry=0) at control.cpp:425
#7  0x00007ffff76af41c in tesseract::TessBaseAPI::Recognize (this=0x8059d0, monitor=0x7fffffffe280) at baseapi.cpp:869
#8  0x0000000000400b40 in main ()

Tesseract crashes only if Recognize is called with a non-null monitor parameter.

The monitor parameter matters because of this condition in control.cpp:

  // changed by jetsoft
  // needed for dll to output memory structure
  if ((dopasses == 0 || dopasses == 2) && (monitor || tessedit_write_unlv))
    output_pass(page_res_it, target_word_box);
  // end jetsoft
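
To make the effect of the monitor argument concrete, the guard can be modeled in isolation. This is only a sketch, not Tesseract code: `should_run_output_pass` is a hypothetical helper that reproduces the boolean condition, and `write_unlv` stands in for the tessedit_write_unlv parameter.

```cpp
// Hypothetical stand-alone model of the guard quoted from control.cpp.
// It mirrors only the boolean condition and calls no Tesseract code.
bool should_run_output_pass(int dopasses, const void* monitor, bool write_unlv) {
    return (dopasses == 0 || dopasses == 2) &&
           (monitor != nullptr || write_unlv);
}
```

With dopasses == 0 (the value shown in the backtrace) and assuming tessedit_write_unlv is off, the model returns true for a non-null monitor and false for a null one, which is consistent with the crash appearing only when a monitor is passed.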

My suggestion is to move this code block a few lines down, after the "Remove empty words, as these mess up the result iterators." loop.

I think page_res_it contains empty words, and they cause the crash in output_pass.

With this change, the crash no longer occurs:

  // Write results pass.
  set_global_loc_code(LOC_WRITE_RESULTS);
  // This is now redundant, but retained commented so show how to obtain
  // bounding boxes and style information.

  PageSegMode pageseg_mode = static_cast<PageSegMode>(
      static_cast<int>(tessedit_pageseg_mode));
  textord_.CleanupSingleRowResult(pageseg_mode, page_res);

  // Remove empty words, as these mess up the result iterators.
  for (page_res_it.restart_page(); page_res_it.word() != NULL;
       page_res_it.forward()) {
    WERD_RES* word = page_res_it.word();
    POLY_BLOCK* pb = page_res_it.block()->block != NULL
                         ? page_res_it.block()->block->poly_block()
                         : NULL;
    if (word->best_choice == NULL || word->best_choice->length() == 0 ||
        (word->best_choice->IsAllSpaces() && (pb == NULL || pb->IsText()))) {
      page_res_it.DeleteCurrentWord();
    }
  }

  // changed by jetsoft
  // needed for dll to output memory structure
  if ((dopasses == 0 || dopasses == 2) && (monitor || tessedit_write_unlv))
    output_pass(page_res_it, target_word_box);
  // end jetsoft

  if (monitor != NULL) {
    monitor->progress = 100;
  }
  return true;
}
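
The reasoning behind the reordering can be illustrated with a toy model. This is a sketch of the hypothesis only and does not show that empty words are the actual cause: `ToyWord`, `can_remove`, and `drop_empty` are hypothetical names, and the only part taken from Tesseract is the asserted condition `start >= 0 && start + num <= length_`.

```cpp
#include <vector>

// Hypothetical stand-in for a recognized word: just a list of character ids.
struct ToyWord {
    std::vector<int> ids;
};

// Mirrors the precondition asserted in ratngs.cpp:
// start >= 0 && start + num <= length_.
bool can_remove(const ToyWord& w, int start, int num) {
    return start >= 0 && start + num <= static_cast<int>(w.ids.size());
}

// Drops empty words first, in the spirit of the "Remove empty words" loop.
std::vector<ToyWord> drop_empty(const std::vector<ToyWord>& words) {
    std::vector<ToyWord> kept;
    for (const ToyWord& w : words) {
        if (!w.ids.empty()) kept.push_back(w);
    }
    return kept;
}
```

The backtrace shows the call remove_unichar_ids(start=1, num=1), which satisfies the precondition only for a word of length 2 or more. In this model an empty word always fails the check, so filtering before the output stage means that stage never sees one.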
@Shreeshrii
Collaborator

Shreeshrii commented Apr 3, 2018 via email

@oleg-st
Author

oleg-st commented Apr 3, 2018

The same behavior occurs with the latest code from GitHub.
Commit: 10f4998

@amitdo
Collaborator

amitdo commented Apr 3, 2018

Duplicate of #989.

@oleg-st
Author

oleg-st commented Apr 3, 2018

However, it's not fixed in #989.

@amitdo
Collaborator

amitdo commented Apr 3, 2018

@oleg-st
Author

oleg-st commented Apr 3, 2018

Yes, it doesn't help.

The crash occurs with any non-null monitor.

@oleg-st
Author

oleg-st commented May 30, 2018

Any news?
Checked with 8d3f811.
The crash still occurs under the same conditions.
There is no crash with the suggested Tesseract code fix above.

@zdenop
Contributor

zdenop commented Oct 10, 2018

@oleg-st : I am not able to reproduce it with the current code on openSUSE 15 64-bit. I slightly modified your code and it works for me:

    Pix *image = pixRead("test.jpg");
    api->SetImage(image);
    api->Recognize(&monitor);
    char* outText = api->GetUTF8Text();
    if (outText) {
        printf("OCR output:\n%s", outText);
        delete [] outText;
    }

I also tried another Russian image from issue 1912, and it works for me.
What is the source of your rus language data: tessdata_fast, tessdata_best, tessdata, or custom?

@amitdo
Collaborator

amitdo commented Oct 10, 2018

start >= 0 && start + num <= length_:Error:Assert failed:in file ratngs.cpp, line 324
Segmentation fault (core dumped)

Probably a duplicate of #948.

@oleg-st
Author

oleg-st commented Oct 11, 2018

@zdenop

What is source of your rus language? tessdata_fast, tessdata_best or tessdata or custom?

It's tessdata_best

I will try to reproduce with latest code.

@oleg-st
Author

oleg-st commented Oct 11, 2018

@zdenop
Reproduced both with tessdata_best and tessdata_fast (Tesseract version: 9d84968).

Code (slightly modified):

#include "tesseract/baseapi.h"
#include "tesseract/ocrclass.h"
#include <leptonica/allheaders.h>
#include <cstdio>

int main()
{
    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    ETEXT_DESC* monitor = new ETEXT_DESC();

    // Initialize tesseract-ocr with Russian, without specifying tessdata path
    if (api->Init(NULL, "rus")) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        return 1;
    }

    Pix *image = pixRead("test.jpg");
    api->SetImage(image);
    api->Recognize(monitor);
    char* outText = api->GetUTF8Text();
    if (outText) {
        printf("OCR output:\n%s", outText);
        delete [] outText;
    }

    // Release the engine, the image, and the monitor
    api->End();
    delete api;
    pixDestroy(&image);
    delete monitor;

    return 0;
}

Output:

[oleg@osboxes tesseract]$ g++ -I /opt/tesseract4/include/ -L /opt/tesseract4/lib/ -std=c++11 -ltesseract -llept test.cpp -o testcpp
[oleg@osboxes tesseract]$ LD_LIBRARY_PATH=/opt/tesseract4/lib/ ./testcpp                                                           
Warning: Invalid resolution 0 dpi. Using 70 instead.
Image too small to scale!! (3x36 vs min width of 3)
Line cannot be recognized!!
Image too small to scale!! (3x36 vs min width of 3)
Line cannot be recognized!!
Image too small to scale!! (3x36 vs min width of 3)
Line cannot be recognized!!
start >= 0 && start + num <= length_:Error:Assert failed:in file ratngs.cpp, line 325
Segmentation fault (core dumped)

@amitdo
Collaborator

amitdo commented Oct 11, 2018

Try this advice:
#948 (comment)

@oleg-st
Author

oleg-st commented Oct 11, 2018

@amitdo It helped.
No crash with api->SetVariable("unlv_tilde_crunching", "false");
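
For anyone applying the same workaround, here is a minimal sketch of where the call could sit. `disable_tilde_crunching` and `FakeApi` are hypothetical names introduced for illustration. The helper is a template so it compiles without Tesseract installed, and it assumes the API object exposes SetVariable(const char*, const char*) returning bool, as tesseract::TessBaseAPI does.

```cpp
#include <string>

// Applies the workaround reported above. Api is any type with a
// SetVariable(const char*, const char*) member that returns bool.
// Intended to be called after Init() and before Recognize().
template <typename Api>
bool disable_tilde_crunching(Api& api) {
    return api.SetVariable("unlv_tilde_crunching", "false");
}

// Hypothetical stand-in used only to exercise the helper without Tesseract.
struct FakeApi {
    std::string name, value;
    bool SetVariable(const char* n, const char* v) {
        name = n;
        value = v;
        return true;
    }
};
```

In the reproduction program, the call would go between api->Init(NULL, "rus") and api->Recognize(monitor), for example as disable_tilde_crunching(*api). A false return value would suggest that the variable name was not recognized.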

@zdenop
Contributor

zdenop commented Oct 22, 2018

@amitdo : What about changing the default value of unlv_tilde_crunching to false (and maybe setting it only for unlv)?

I looked for the reason why my test did not crash and found that I had used an old rus.traineddata (file date 2016-11-28). When I used the tessdata_best version of rus.traineddata, it crashed.

@amitdo
Collaborator

amitdo commented Oct 22, 2018

The problem is that the unlv renderer needs it.

Changing unlv_tilde_crunching to true inside the unlv renderer will be too late.

Setting unlv_tilde_crunching to false by default and putting this variable in the unlv config with a true value will probably be fine.

@zdenop zdenop closed this as completed in 3d508a6 Oct 23, 2018
scubess added a commit to scubess/tesseract that referenced this issue Oct 16, 2019