LSTM: Training - Eval not run from trainer #644

Closed

Shreeshrii opened this issue Jan 9, 2017 · 42 comments

@Shreeshrii
Collaborator

See the comment from Ray at #542 (comment):

it doesn't run the eval from the trainer.
It doesn't harm the tutorial, but will be required before people start serious training.

@xlight

xlight commented Apr 11, 2017

With --eval_listfile, it always core dumps. 😭

lstmtraining: ../ccutil/genericvector.h:696: T& GenericVector<T>::operator[](int) const [with T = char]: Assertion `index >= 0 && index < size_used_' failed.

@Shreeshrii
Collaborator Author

@theraysmith Any update on this?

@Shreeshrii
Collaborator Author

@stweil Have you come across this problem while training? Any solution?

@stweil
Member

stweil commented May 29, 2017

No, I have not seen that assertion up to now. How can it be reproduced?

@Shreeshrii
Collaborator Author

Please try the following, making appropriate changes for your fonts directory and tessdata directory:

training/tesstrain.sh \
  --fonts_dir /c/Windows/Fonts \
  --tessdata_dir ./tessdata \
  --training_text ../langdata/ara/ara.training_text \
  --langdata_dir ../langdata \
  --lang ara  \
  --linedata_only \
  --noextract_font_properties \
  --exposures "0"    \
  --fontlist "Arial" \
  --output_dir ~/tesstutorial/aratest
  
training/tesstrain.sh \
  --fonts_dir /c/Windows/Fonts \
  --tessdata_dir ./tessdata \
  --training_text ../langdata/ara/ara.training_text \
  --langdata_dir ../langdata \
  --lang ara  \
  --linedata_only \
  --noextract_font_properties \
  --exposures "0"    \
  --fontlist "Arial" \
  "Arial Unicode MS" \
  "Calibri" \
  "Courier New" \
  --output_dir ~/tesstutorial/araeval

mkdir -p ~/tesstutorial/aratuned_from_aratest 

combine_tessdata -e ../tessdata/ara.traineddata \
  ~/tesstutorial/aratuned_from_aratest/ara.lstm
  
lstmtraining \
  --continue_from ~/tesstutorial/aratuned_from_aratest/ara.lstm \
  --train_listfile ~/tesstutorial/aratest/ara.training_files.txt \
  --eval_listfile ~/tesstutorial/araeval/ara.training_files.txt \
  --model_output ~/tesstutorial/aratuned_from_aratest/aratuned \
  --target_error_rate 0.01 

On Windows, using the binaries from the AppVeyor artifacts, I am getting:

$ lstmtraining \
>   --continue_from ~/tesstutorial/aratuned_from_aratest/ara.lstm \
>   --train_listfile ~/tesstutorial/aratest/ara.training_files.txt \
>   --eval_listfile ~/tesstutorial/araeval/ara.training_files.txt \
>   --model_output ~/tesstutorial/aratuned_from_aratest/aratuned \
>   --target_error_rate 0.01
Segmentation fault

I will check with tesseract built under WSL and report separately.

@Shreeshrii
Collaborator Author

I think the crash happens in debug builds. Non-debug builds print some error messages but continue.

@stweil
Member

stweil commented May 30, 2017

The non-debug build continues, but uses a bad index internally, so the results are invalid.

Replace assert (which is only active in debug builds) with ASSERT_HOST in ccutil/genericvector.h; then the non-debug build should show an error, too.
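A minimal sketch of that suggestion; the member names and surrounding code are assumed for illustration, not quoted from the repository (ASSERT_HOST is Tesseract's always-on assertion macro from ccutil/errcode.h):

// ccutil/genericvector.h (sketch)
template <typename T>
T& GenericVector<T>::operator[](int index) const {
  // assert() compiles to nothing when NDEBUG is defined, so release
  // builds silently continue with the bad index. ASSERT_HOST aborts
  // with an error message in every build type.
  ASSERT_HOST(index >= 0 && index < size_used_);
  return data_[index];
}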

Can you get a stack trace for the assertion?

@Shreeshrii
Collaborator Author

Shreeshrii commented May 30, 2017

(gdb) run
Starting program: /usr/local/bin/lstmtraining --continue_from /home/shree/tesstutorial/aratuned_from_aratest/ara.lstm --train_listfile /home/shree/tesstutorial/aratest/a
ra.training_files.txt --eval_listfile /home/shree/tesstutorial/araeval/ara.training_files.txt --model_output /home/shree/tesstutorial/aratuned_from_aratest/aratuned --ta
rget_error_rate 0.01
warning: Error disabling address space randomization: Success
warning: linux_ptrace_test_ret_to_nx: PTRACE_KILL waitpid returned -1: Interrupted system call
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Loaded file /home/shree/tesstutorial/aratuned_from_aratest/aratuned_checkpoint, unpacking...
Successfully restored trainer from /home/shree/tesstutorial/aratuned_from_aratest/aratuned_checkpoint
[New Thread 0x7f4e34110700 (LWP 24)]
Loaded 81/81 pages (1-81) of document /home/shree/tesstutorial/aratest/ara.Arial.exp0.lstmf
[Thread 0x7f4e34110700 (LWP 24) exited]
[New Thread 0x7f4e34110700 (LWP 25)]
Loaded 81/81 pages (1-81) of document /home/shree/tesstutorial/araeval/ara.Arial.exp0.lstmf
[Thread 0x7f4e34110700 (LWP 25) exited]
[New Thread 0x7f4e33900700 (LWP 26)]
Loaded 81/81 pages (1-81) of document /home/shree/tesstutorial/araeval/ara.Arial_Unicode_MS.exp0.lstmf
[Thread 0x7f4e33900700 (LWP 26) exited]
[New Thread 0x7f4e33900700 (LWP 27)]
[New Thread 0x7f4e34110700 (LWP 28)]
[New Thread 0x7f4e325c0700 (LWP 29)]
[New Thread 0x7f4e315a0700 (LWP 30)]
lstmtraining: ../ccutil/genericvector.h:713: T& GenericVector<T>::operator[](int) const [with T = char]: Assertion `index >= 0 && index < size_used_' failed.

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7f4e315a0700 (LWP 30)]
0x00007f4e36ff6c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56      ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) backtrace
#0  0x00007f4e36ff6c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007f4e36ffa028 in __GI_abort () at abort.c:89
#2  0x00007f4e36fefbf6 in __assert_fail_base (fmt=0x7f4e371403b8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
    assertion=assertion@entry=0x4f48b8 "index >= 0 && index < size_used_", file=file@entry=0x4f489a "../ccutil/genericvector.h", line=line@entry=713,
    function=function@entry=0x4f55a0 <GenericVector<char>::operator[](int) const::__PRETTY_FUNCTION__> "T& GenericVector<T>::operator[](int) const [with T = char]")
    at assert.c:92
#3  0x00007f4e36fefca2 in __GI___assert_fail (assertion=0x4f48b8 "index >= 0 && index < size_used_", file=0x4f489a "../ccutil/genericvector.h", line=713,
    function=0x4f55a0 <GenericVector<char>::operator[](int) const::__PRETTY_FUNCTION__> "T& GenericVector<T>::operator[](int) const [with T = char]") at assert.c:101
#4  0x0000000000406bef in GenericVector<char>::operator[] (this=0x7ffffd882960, index=0) at ../ccutil/genericvector.h:713
#5  0x000000000043d308 in tesseract::LSTMTrainer::ReadTrainingDump (this=0x7f4e3159f650, data=..., trainer=0x7f4e3159f650) at lstmtrainer.cpp:921
#6  0x000000000040b8cb in tesseract::LSTMTester::RunEvalSync (this=0x7ffffd8828f0, iteration=24, training_errors=0x7ffffd882fb8, model_data=..., training_stage=1)
    at lstmtester.cpp:86
#7  0x000000000040bc2f in tesseract::LSTMTester::ThreadFunc (lstmtester_void=0x7ffffd8828f0) at lstmtester.cpp:123
#8  0x00007f4e37de8184 in start_thread (arg=0x7f4e315a0700) at pthread_create.c:312
#9  0x00007f4e370ba37d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb)

@Shreeshrii
Collaborator Author

@stweil Unrelated to this issue, one of the recent commits seems to have caused the program to slow down a lot: while training it seems to hang for a while, and even OCRing images seems to take longer. I am not sure how to verify/confirm this.

@amitdo
Collaborator

amitdo commented May 30, 2017

@Shreeshrii, please open a new issue about that regression.

@Shreeshrii
Collaborator Author

@stweil I don't see any patch here.

@Shreeshrii
Collaborator Author

#6 0x000000000040b8cb in tesseract::LSTMTester::RunEvalSync (this=0x7ffff40f2d00, iteration=0, training_errors=0x7ffff40f33c8, model_data=..., training_stage=0) at lstmtester.cpp:86

Should the model_data here point to --eval_listfile or to the lstmf files within it?

@stweil
Member

stweil commented May 31, 2017

@Shreeshrii, I am sorry, but somehow my last comment got lost. So once again:

The assertions are caused by an index of 0 used for an empty vector. Since commit 907de59 the constructor of GenericVector no longer allocates memory for an empty vector. tesseract::LSTMTrainer::ReadTrainingDump tries to read 0 bytes. Issue #561 is identical. Here is a patch which fixes the assertion / core dump:

diff --git a/lstm/lstmtrainer.cpp b/lstm/lstmtrainer.cpp
index 03619969..e6d8e4a3 100644
--- a/lstm/lstmtrainer.cpp
+++ b/lstm/lstmtrainer.cpp
@@ -918,6 +918,7 @@ bool LSTMTrainer::SaveTrainingDump(SerializeAmount serialize_amount,
 // Reads previously saved trainer from memory.
 bool LSTMTrainer::ReadTrainingDump(const GenericVector<char>& data,
                                    LSTMTrainer* trainer) {
+  if (data.size() == 0) return false;
   return trainer->ReadSizedTrainingDump(&data[0], data.size());
 }

I don't know whether returning true would be better.
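For illustration, the failure mode can be reproduced in isolation. This is a hypothetical snippet based only on the assertion semantics shown in the backtrace (size_used_ is the element count checked by operator[]):

#include "genericvector.h"

int main() {
  GenericVector<char> data;   // empty: size_used_ == 0, and since
                              // commit 907de59 no memory is allocated
  // ReadTrainingDump effectively does this when the dump is empty:
  const char* p = &data[0];   // operator[](0) fails `index < size_used_'
  return p != nullptr;
}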

@stweil
Member

stweil commented May 31, 2017

I tried to analyze the problem ("why is there a data size 0?") and noticed that the behavior of the program is totally erratic when I run it in a debugger. There are several threads involved, and depending on my breakpoints the problem with the data size occurs or not. That looks like a synchronization issue, and so I decided to use Valgrind. The result is horrible:

==2035== ERROR SUMMARY: 44846954 errors from 905 contexts (suppressed: 0 from 0)

Here is one example (the total log file is about 28,000 lines):

==2035== Conditional jump or move depends on uninitialised value(s)
==2035==    at 0x1497ED: ComputeBlackWhite (networkio.cpp:140)
==2035==    by 0x1497ED: tesseract::NetworkIO::FromPixes(tesseract::StaticShape const&, std::vector<Pix const*, std::allocator<Pix const*> > const&, tesseract::TRand*) (networkio.cpp:190)
==2035==    by 0x1498FF: tesseract::NetworkIO::FromPix(tesseract::StaticShape const&, Pix const*, tesseract::TRand*) (networkio.cpp:164)
==2035==    by 0x1BBB41: tesseract::Input::PreparePixInput(tesseract::StaticShape const&, Pix const*, tesseract::TRand*, tesseract::NetworkIO*) (input.cpp:149)
==2035==    by 0x1C9842: tesseract::LSTMRecognizer::RecognizeLine(tesseract::ImageData const&, bool, bool, bool, float, float*, tesseract::NetworkIO*, tesseract::NetworkIO*) (lstmrecognizer.cpp:281)
==2035==    by 0x13DE72: tesseract::LSTMTrainer::PrepareForBackward(tesseract::ImageData const*, tesseract::NetworkIO*, tesseract::NetworkIO*) (lstmtrainer.cpp:855)
==2035==    by 0x13E5E3: tesseract::LSTMTrainer::TrainOnLine(tesseract::ImageData const*, bool) (lstmtrainer.cpp:798)
==2035==    by 0x11290E: TrainOnLine (lstmtrainer.h:273)
==2035==    by 0x11290E: main (lstmtraining.cpp:194)
==2035==  Uninitialised value was created by a heap allocation
==2035==    at 0x4C2BBAF: malloc (vg_replace_malloc.c:299)
==2035==    by 0x72052A0: pixCreateNoInit (in /usr/lib/x86_64-linux-gnu/liblept.so.5.0.1)
==2035==    by 0x72053F6: pixCreateTemplateNoInit (in /usr/lib/x86_64-linux-gnu/liblept.so.5.0.1)
==2035==    by 0x7208917: pixCopyBorder (in /usr/lib/x86_64-linux-gnu/liblept.so.5.0.1)
==2035==    by 0x71921EF: pixUnsharpMaskingGray2D (in /usr/lib/x86_64-linux-gnu/liblept.so.5.0.1)
==2035==    by 0x7192D99: pixUnsharpMaskingFast (in /usr/lib/x86_64-linux-gnu/liblept.so.5.0.1)
==2035==    by 0x7193014: pixUnsharpMasking (in /usr/lib/x86_64-linux-gnu/liblept.so.5.0.1)
==2035==    by 0x728C454: pixScaleGeneral (in /usr/lib/x86_64-linux-gnu/liblept.so.5.0.1)
==2035==    by 0x1663DF: tesseract::ImageData::PreScale(int, int, float*, int*, int*, GenericVector<TBOX>*) const (imagedata.cpp:245)
==2035==    by 0x1BBA33: tesseract::Input::PrepareLSTMInputs(tesseract::ImageData const&, tesseract::Network const*, int, tesseract::TRand*, float*) (input.cpp:95)
==2035==    by 0x1C97A1: tesseract::LSTMRecognizer::RecognizeLine(tesseract::ImageData const&, bool, bool, bool, float, float*, tesseract::NetworkIO*, tesseract::NetworkIO*) (lstmrecognizer.cpp:265)
==2035==    by 0x13DE72: tesseract::LSTMTrainer::PrepareForBackward(tesseract::ImageData const*, tesseract::NetworkIO*, tesseract::NetworkIO*) (lstmtrainer.cpp:855)

So some more work is needed. The patch shown above does not fix the real problems.

@Shreeshrii
Collaborator Author

@stweil Thank you for looking into this.

@Shreeshrii
Collaborator Author

#956 may also be related.

stweil added a commit to stweil/tesseract that referenced this issue Jun 4, 2017
The new test in LSTMTrainer::UpdateErrorGraph fixes an assertion
(see issues tesseract-ocr#644, tesseract-ocr#792).

The new test in LSTMTrainer::ReadTrainingDump was added to improve
the robustness of the code.

Signed-off-by: Stefan Weil <stefan.weil@bib.uni-mannheim.de>
@stweil
Member

stweil commented Jun 4, 2017

It looks like the pix which was created by pixScaleGeneral contains uninitialized pixel data.
The Tesseract function ComputeBlackWhite operates on that data with random results.
@DanBloomberg, do you have any idea how to fix that?

==14279==  Uninitialised value was created by a heap allocation
==14279==    at 0x4C2BBAF: malloc (vg_replace_malloc.c:299)
==14279==    by 0x71F8BD1: pix_malloc (pix1.c:234)
==14279==    by 0x71F8BD1: pixCreateNoInit (pix1.c:343)
==14279==    by 0x71F8D06: pixCreateTemplateNoInit (pix1.c:408)
==14279==    by 0x71FBF47: pixCopyBorder (pix2.c:1705)
==14279==    by 0x718DE46: pixUnsharpMaskingGray2D (enhance.c:1346)
==14279==    by 0x718EA29: pixUnsharpMaskingFast (enhance.c:1095)
==14279==    by 0x718EC8C: pixUnsharpMasking (enhance.c:917)
==14279==    by 0x727A4B4: pixScaleGeneral (scale.c:368)
==14279==    by 0x1804D9: tesseract::ImageData::PreScale(int, int, float*, int*, int*, GenericVector<TBOX>*) const (imagedata.cpp:245)

@DanBloomberg

DanBloomberg commented Jun 4, 2017

Using valgrind, I am not able to find uninitialized pixels starting from pixScale (or pixScaleGeneral). I did something simple like:
pix3 = pixScale(pix2, 0.6, 0.6);  // pix3 is 8 bpp
pixGetDimensions(pix3, &w, &h, NULL);
for (i = 0; i < h; i++) {
    for (j = 0; j < w; j++) {
        pixGetPixel(pix3, j, i, &val);
        tot += val;
    }
}
where every pixel was read and used.

Try to replace line 1705 of pix2.c, which is pixCreateTemplateNoInit(), with pixCreateTemplate(). If this solves the problem, then it may be originating in pixUnsharpMaskingGray1D() or in pixUnsharpMaskingGray2D().
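In diff form, the suggested experiment would look roughly like this; the exact code at pix2.c line 1705 is assumed, not quoted from Leptonica:

--- a/src/pix2.c
+++ b/src/pix2.c
@@ pixCopyBorder, around line 1705
-    pixd = pixCreateTemplateNoInit(pixs);
+    pixd = pixCreateTemplate(pixs);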

@DanBloomberg

And to make sure that valgrind was catching it, I used a conditional in the inner loop:

pix3 = pixScaleGeneral(pix1, 0.8, 0.8, 0.4, 2);
pixGetDimensions(pix3, &w, &h, NULL);
for (i = 0; i < h; i++) {
    for (j = 0; j < w; j++) {
        pixGetPixel(pix3, j, i, &val);
        if (val == 0x12345678)
            break;
    }
}

If any pixel were uninitialized, we'd have a message like:
Conditional jump or move depends on uninitialised value(s)
==65877== at 0x40481C: main (...)
There were none.
(We did have an issue with normalization in these pixUnsharpMaskingGray*() functions back in April, that gave rise to some artifacts.)

@stweil
Member

stweil commented Jun 5, 2017

My short test program pixtest.cpp successfully reproduces the problem:

#include <leptonica/allheaders.h>

int main(void)
{
  PIX *pix1 = pixCreate(2047, 71, 8);
  PIX *pix2 = pixScaleGeneral(pix1, 0.676056325, 0.676056325, 0.200000003, 1);
  int width = pixGetWidth(pix2);
  int height = pixGetHeight(pix2);
  int y = height / 2;
  l_uint32* line = pixGetData(pix2) + pixGetWpl(pix2) * y;
  int prev = GET_DATA_BYTE(line, 0);
  if (prev != 0) {
    return 1;
  }
  return 0;
}

I compiled it on Debian Stretch using g++ -g pixtest.cpp -llept.
Valgrind reports the same problem as in Tesseract:

==22626== Conditional jump or move depends on uninitialised value(s)
==22626==    at 0x400832: main (pixtest.cpp:12)

@stweil
Member

stweil commented Jun 5, 2017

We did have an issue with normalization in these pixUnsharpMaskingGray*() functions back in April, that gave rise to some artifacts.

Could that be the reason for the problem? Debian has Leptonica 1.74.1-1, so maybe that fix is missing? Then Debian needs a newer version (and Tesseract should require Leptonica 1.74.2). CC @jbreiden.

@DanBloomberg

Stefan, can you run your test program against our GitHub head, or replace the 1.74.1 scale.c with the most recent one? That would determine if the normalization change fixed this problem.

I ran your exact program with the current pixScaleGeneral(), on valgrind, and got no error.

@Shreeshrii
Collaborator Author

@theraysmith

Under what conditions should eval be run from the trainer?

Will training work if it is done without eval?

@DanBloomberg

After failing to find the problem with an uninitialized value from pixUnsharpMasking(), I will do the simplest thing, which is to make sure that the pixel values are initialized. Use of pixCreateTemplateNoInit() instead of pixCreateTemplate() is clearly a poor optimization. I will also remove other uses of the NoInit version in places where it is not obvious by inspection that all pixels are set.
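For context, a small self-contained example of the trade-off between the two constructors (both are public Leptonica API; compiled as C++ like the pixtest.cpp above):

#include <leptonica/allheaders.h>

int main(void)
{
  PIX *pixs = pixCreate(100, 100, 8);
  // pixCreateTemplate() zeroes the new pixel data.
  PIX *safe = pixCreateTemplate(pixs);
  // pixCreateTemplateNoInit() skips that memset as an optimization,
  // which is only safe when the caller overwrites every pixel.
  PIX *fast = pixCreateTemplateNoInit(pixs);
  pixDestroy(&pixs);
  pixDestroy(&safe);
  pixDestroy(&fast);
  return 0;
}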

@DanBloomberg

Committed (#512) to leptonica. Hoping that this solves any uninitialized value problems.

@Shreeshrii
Collaborator Author

https://github.com/DanBloomberg/leptonica/releases/tag/1.74.3

Should we be using this for 4.00?

@stweil
Member

stweil commented Jun 10, 2017

As far as I can see, Leptonica 1.74.2 is sufficient (it solved the uninitialized value problems for Tesseract), but of course you can use the newer version, too.

@DanBloomberg

DanBloomberg commented Jun 10, 2017 via email

@stweil
Member

stweil commented Jun 10, 2017

Dan, the uninitialized issue was already fixed with 1.74.2, and so were the Coverity scan issues.

@DanBloomberg

DanBloomberg commented Jun 10, 2017 via email

@dudullz

dudullz commented Jun 12, 2017

Hi there!
The problem still seems to be there, at least on my MBP :(

I had exactly the same issue after exactly 100 iterations:
Assertion failed: (index >= 0 && index < size_used_), function operator[], file ~/tesseract/ccutil/genericvector.h, line 713.

I was using Leptonica 1.74.1 provided by MacPorts, then built 1.74.4 manually (just an hour after Dan put it out :) ), but still the same problem.
I made sure lstmtraining is linking to the correct version of the Leptonica library (using otool -L on Mac).

What baffles me is that the scale.c file - which I presume includes the fix to the problem - is identical for 1.74.[1 | 2 | 3 | 4]... is that correct?

I'm using a 2015 MBP running OS X El Capitan (10.11.6) with Xcode 8.
The command I'm using to train:

./tesseract/build-mac/bin/Debug/lstmtraining \
  -U tesstutorial/engtrain/eng.unicharset \
  --script_dir langdata/ \
  --debug_interval 100 \
  --net_spec '[1,36,0,1 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]' \
  --model_output tesstutorial/engoutput/base \
  --train_listfile tesstutorial/engtrain/eng.training_files.txt \
  --eval_listfile tesstutorial/engeval/eng.training_files.txt \
  --max_iterations 5000

Thanks!

@stweil
Member

stweil commented Jun 12, 2017

@dudullz, there were several problems, and updating Leptonica to 1.74.2 or newer fixes one of them. Pull request #978 fixes the assertion. It is still unmerged; that's why you got that assertion.

@dudullz

dudullz commented Jun 12, 2017

@stweil first, thanks for the truly swift reply!

It has been training for the past 2 hours, about 3,000 iterations, and is still going - it's just enjoyable to see it continuing after days of struggling :)

So has updating Leptonica to 1.74.4 plus pull request #978 fixed the real problem? Does that mean the trained results are valid to use now (unlike the non-debug builds that continue but produce invalid results)?

@stweil
Member

stweil commented Jun 12, 2017

That fixed at least the currently known problems and should produce more stable results, as the trained data no longer depends on undefined values (which could produce random results).

@Shreeshrii
Collaborator Author

@stweil

When training with --eval_listfile in addition to --train_listfile, I have noticed that while all files from --train_listfile are loaded at the beginning of the training process, only the first two from --eval_listfile are loaded. However, I have not found any images from --eval_listfile being used in the training (I am going by the detailed per-iteration log displayed with --debug_interval -1).

Does this mean that eval is still not being run?

Or does the eval process simply not write a log message?

If it is only a question of logging, would it be possible to add a log message during evaluation?
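One hypothetical way to make the eval visible, sketched as a diff against lstmtester.cpp; the function name, its parameters, and the file location are taken from the backtrace above, but the added tprintf line is an assumption, not an existing patch:

diff --git a/lstm/lstmtester.cpp b/lstm/lstmtester.cpp
--- a/lstm/lstmtester.cpp
+++ b/lstm/lstmtester.cpp
@@ LSTMTester::RunEvalSync (near line 86, per the backtrace)
+  // Log that the eval actually runs, so --eval_listfile activity is
+  // visible even without --debug_interval -1.
+  tprintf("Eval started at iteration %d, stage %d\n", iteration,
+          training_stage);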

@Shreeshrii
Collaborator Author

does the eval process not write the log message?

Answered in https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

With --debug_interval -1, the trainer outputs verbose text debug for every training iteration.

@stweil Eval seems to be running now (though there are no debug messages to verify it). Thanks for the fix.

Should I close this issue or does it need any other verification?

@xlight

xlight commented Jun 17, 2017

So, is this issue fixed?

@Shreeshrii
Collaborator Author

Shreeshrii commented Jun 17, 2017

@xlight Thanks to patches by @stweil and Leptonica 1.74.2, the assertion and the problem with uninitialized data related to this issue are fixed when using the latest code from GitHub.

@theraysmith will have to verify that his original problem has been fixed, since I do not know what exact test he was referring to in his comment ...

it doesn't run the eval from the trainer.

@Shreeshrii
Collaborator Author

2 Percent improvement time=12184, best error was 8.073 @ 23307

Warning: LSTMTrainer deserialized an LSTMRecognizer!

At iteration 35491/53200/53202, Mean rms=0.165%, delta=1.673%, char train=6.063%, word train=19.746%, skip ratio=0%, New best char error = 6.063

At iteration 33032, stage 1, Eval Char error rate=19.697385, Word error rate=42.197884

wrote checkpoint.

At iteration 35535/53300/53302, Mean rms=0.166%, delta=1.706%, char train=6.108%, word train=19.751%, skip ratio=0%, New worst char error = 6.108

At iteration 34063, stage 1, Eval Char error rate=20.580924, Word error rate=41.515177

wrote checkpoint.


Eval is being run from the trainer. Closing the issue. Thanks, @stweil.

@Shreeshrii
Collaborator Author

@theraysmith

The latest code change has reverted the fix:

Iteration 99: ALIGNED TRUTH : च छ ज झ ञ उ ऊ ट ठ ड ढ ण ऋ ॠ त थ
Iteration 99: BEST OCR TEXT :
File /tmp/tmp.m82dWGBYZW/mar/mar.Aparajita.exp0.lstmf page 3 :
Mean rms=5.427%, delta=53.011%, train=110.2%(100%), skip ratio=0%
lstmtraining: genericvector.h:713: T& GenericVector<T>::operator[](int) const [with T = char]: Assertion `index >= 0 && index < size_used_' failed.
Aborted (core dumped)

theraysmith pushed a commit that referenced this issue Jul 16, 2017
@theraysmith
Contributor

Yes, a new bug. I was mystified for a while, but it is very simple.
Fixed in 4b6f0b9..f4f66f8.

theraysmith pushed a commit that referenced this issue Jul 16, 2017
@Shreeshrii
Collaborator Author

Fixed by 45fb7dd
