RFC: Pythonize tesstrain.sh and friends #2076

Merged: 3 commits, Nov 25, 2018


@jbarlow83 jbarlow83 commented Nov 24, 2018

After attempting to chase down an error in tesstrain.sh, I ended up converting tesstrain.sh to Python. I also see that others expressed interest in this conversion in #1503.

Only LSTM is converted right now. I have not done this work for the Tesseract 3 portions of training because IIRC, the old 3.x engine might get scrapped. If there's a need I can convert that too; the process is pretty mechanical.

Python 3.6+ is required; it ships with Ubuntu 18.04 and is always installed there. I need Python 3.6 for its improved string substitution, which lets the Python code follow the bash version more closely.
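
The "improved string substitution" is presumably f-strings (PEP 498), which arrived in Python 3.6. A minimal sketch of how they let a translated line stay close to its bash original follows; the variable names and the path layout are hypothetical and are not taken from tesstrain.sh.

```python
# Hypothetical values. In bash the same line might read:
#   outbase="${training_dir}/${lang_code}.${fontname// /_}.exp${exposure}"
training_dir = "/tmp/tesstrain"
lang_code = "eng"
fontname = "Arial Bold"
exposure = 0

# f-string (Python 3.6+): names are interpolated in place, much like ${var} in bash.
outbase = f"{training_dir}/{lang_code}.{fontname.replace(' ', '_')}.exp{exposure}"

# Pre-3.6 equivalent: the values are listed separately from the template,
# which is harder to compare line by line against the shell script.
outbase_old = "{}/{}.{}.exp{}".format(
    training_dir, lang_code, fontname.replace(" ", "_"), exposure
)

assert outbase == outbase_old
print(outbase)  # /tmp/tesstrain/eng.Arial_Bold.exp0
```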

I attempted to keep source changes minimal so it would be easier to compare bash to Python in code review and confirm equivalence. The first commit is structured so it's somewhat easier to compare to the originals in bash, e.g. keeping Python and bash the same for the rare cases where they are the same. The second is cleaned up to follow PEP 8, so it deviates further. I definitely appreciate anyone checking whether I've mistranslated something. (#1503 suggests language_specific.py being presented in YAML, and I agree that would be better, but for now I want to avoid major restructuring.)

There are minor improvements already but it's more or less functionally equivalent for LSTM training. Python's logging library is used and the log file is handled more consistently. Console output is more terse when things are going well. There are progress bars. Temporary files are retained on failure, for inspection. It's generally easier to use.
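
The logging and temporary-file behaviour described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in this PR: `setup_logging`, `run_training`, the logger name, and the `step` callback are all hypothetical, and progress bars (which need the third-party tqdm package) are left out.

```python
import logging
import shutil
import tempfile
from pathlib import Path


def setup_logging(log_file):
    """Write DEBUG and above to the log file, but only INFO and above to the console."""
    log = logging.getLogger("tesstrain_sketch")  # hypothetical logger name
    log.setLevel(logging.DEBUG)
    log.handlers.clear()  # avoid duplicate handlers if called more than once

    console = logging.StreamHandler()
    console.setLevel(logging.INFO)  # terse console output

    logfile = logging.FileHandler(log_file)
    logfile.setLevel(logging.DEBUG)  # verbose log file
    logfile.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

    log.addHandler(console)
    log.addHandler(logfile)
    return log


def run_training(step, log_file):
    """Run step(workdir, log) and delete workdir only if the step succeeds."""
    log = setup_logging(log_file)
    workdir = Path(tempfile.mkdtemp(prefix="tesstrain-"))
    try:
        step(workdir, log)
    except Exception:
        # log.exception records the traceback, so the point of failure is visible.
        log.exception("Training failed; temporary files kept in %s", workdir)
        return workdir  # left on disk for inspection
    shutil.rmtree(workdir)
    log.info("Training finished; temporary files removed")
    return None
```

A caller would pass its own training function as `step`; the return value is the retained directory on failure and `None` on success.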

The .cpp utilities generally depend on portions of the Tesseract code base. I don't plan to modify those.

James R. Barlow added 2 commits November 24, 2018 00:45
This is a lightweight, semi-Pythonic conversion of tesstrain.sh that currently
supports only LSTM and not the Tesseract 3 training mode.

I attempted to keep source changes minimal so it would be easy to compare
bash to Python in code review and confirm equivalence.

Python 3.6+ is required.  Ubuntu 18.04 ships Python 3.6 and it is a mandatory
package (the package manager is also written in Python), so it is available
in the baseline Tesseract 4.0 system.

There are minor output and behavioral changes, and advantages.  Python's logging
is used.  Temporary files are only deleted on success, so they can be inspected
if training fails.  Console output is more terse and the log file is more
verbose.  And there are progress bars!  (The python3-tqdm package is required.)
Where tesstrain.sh would sometimes fail without explanation and return an error
code of 1, it is much easier to find the point of failure in this version.
That was also the main motivation for this work.

Argument checking is also more comprehensive.
This increases the deviation from the bash scripts so is done separately.
@amitdo
Collaborator

amitdo commented Nov 24, 2018

Nice work!

Maybe we should move this python code to a separate repo, tesstrain-python, hosted under the tesseract-ocr organization.

@egorpugin
Contributor

egorpugin commented Nov 24, 2018

Maybe we should move this python code to a separate repo, tesstrain-python

I cannot agree. Too many repos.

@jbarlow83
Author

Codacy's remaining complaints are ridiculous.

@egorpugin egorpugin merged commit 267b799 into tesseract-ocr:master Nov 25, 2018
@amitdo
Collaborator

amitdo commented Nov 25, 2018

I still prefer to separate it from the c++ code.

Can you move the python code to a new directory?

One option is python under the root dir, similar to the java dir.

You can make it python/training.

@amitdo
Collaborator

amitdo commented Nov 25, 2018

Other options:

/tools/training/python
/src/training/tools/python
/src/training/python

@zdenop
Contributor

zdenop commented Nov 25, 2018

IMO it is nothing more than a replacement for the bash script. If the bash script was not a problem next to the C++ code, why is Python?
For me the bigger problem is that we mix library and program code (src/api/tesseractmain.cpp + whole src/training/)

@amitdo
Collaborator

amitdo commented Nov 25, 2018

If the bash script was not a problem next to the C++ code, why is Python?

With new code we can change old ways of doing things.

For me the bigger problem is that we mix library and program code (src/api/tesseractmain.cpp + whole src/training/)

This is related to my suggestion.

One possible solution:
tools/tesseract.cpp

@jbarlow83
Author

jbarlow83 commented Nov 25, 2018 via email

@zdenop
Contributor

zdenop commented Nov 25, 2018

Moving the core training code will increase the dependency on external libs for all users, and most of them do not need them. (If you wanna have fun, try to use pango in Python 3 on Windows ;-))
Separating the training tools is the right approach.

@jbarlow83
Author

Python has all of the functionality of libicu baked in, and can render fonts so you cut out both dependencies by shifting to it. For example Pillow can render fonts. But if you specifically need pango and can't substitute another font library, yes, that would be painful.
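
As one illustration of the built-in Unicode support referred to above, the standard-library `unicodedata` module handles the normalization forms that are often delegated to ICU. This sketch covers that single feature only; whether it matches everything the training tools use ICU for is an assumption that this thread does not settle.

```python
import unicodedata

# 'e' followed by COMBINING ACUTE ACCENT composes to a single code point under NFC.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
assert composed == "\u00e9"
assert len(decomposed) == 2 and len(composed) == 1

# Compatibility normalization (NFKC) expands the LATIN SMALL LIGATURE FI.
ligature = "\ufb01"
assert unicodedata.normalize("NFKC", ligature) == "fi"

# Character properties are available too, e.g. the general category of a combining mark.
assert unicodedata.category("\u0301") == "Mn"
```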

I'm saying this more to put what could be a viable alternative on your radar - maybe it's a lower maintenance option long term. I won't be trying it.

@amitdo amitdo added the RFC label Mar 21, 2021