RFC: Pythonize tesstrain.sh and friends #2076
This is a lightweight, semi-Pythonic conversion of tesstrain.sh that currently supports only LSTM and not the Tesseract 3 training mode. I attempted to keep source changes minimal so it would be easy to compare bash to Python in code review and confirm equivalence. Python 3.6+ is required. Ubuntu 18.04 ships Python 3.6 and it is a mandatory package (the package manager is also written in Python), so it is available in the baseline Tesseract 4.0 system. There are minor output and behavioral changes, and some advantages. Python's logging is used. Temporary files are only deleted on success, so they can be inspected if training fails. Console output is more terse and the log file is more verbose. And there are progress bars! (The python3-tqdm package is required.) Where tesstrain.sh would sometimes fail without explanation and return an error code of 1, it is much easier to find the point of failure in this version. That was also the main motivation for this work. Argument checking is also more comprehensive.
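To illustrate the "temporary files are only deleted on success" behavior, here is a minimal sketch. It is not the code from this PR; `run_with_scratch_dir`, the `tesstrain_sketch` logger name, and the directory prefix are hypothetical names chosen for the example.

```python
import logging
import shutil
import tempfile
from pathlib import Path

log = logging.getLogger("tesstrain_sketch")  # hypothetical logger name


def run_with_scratch_dir(step):
    """Run step(workdir) and delete workdir only if the step succeeds.

    Hypothetical helper for illustration; `step` is any callable that
    takes the scratch directory as a pathlib.Path.
    """
    workdir = Path(tempfile.mkdtemp(prefix="tesstrain-sketch-"))
    try:
        result = step(workdir)
    except Exception:
        # Keep the directory so intermediate files can be inspected.
        log.error("Step failed; scratch files kept in %s", workdir)
        raise
    # Only reached when the step did not raise.
    shutil.rmtree(workdir)
    return result, workdir
```

The cleanup is deliberately not placed in a `finally` clause, since that would remove the files on failure as well.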
This increases the deviation from the bash scripts so is done separately.
Nice work! Maybe we should move this Python code to a separate repo,
I cannot agree. Too many repos.
Codacy's remaining complaints are ridiculous.
I still prefer to separate it from the C++ code. Can you move the Python code to a new directory? One option is You can make it
Other options:
IMO it is nothing more than a replacement for the bash script. If the bash script was not a problem next to the C++ code, why would Python be?
With new code we can change old ways of doing things.
This is related to my suggestion. One possible solution:
Why don't we install tesstrain.py as tesstrain? The language used to write an installed program like training tools should be an implementation detail. That would facilitate moving other training tools to higher level languages if desired. (It might be cleaner to move core training code into libtesseract and have a Python wrapper that binds the APIs, some time in the future.)
At the same time it is nice for source code files to have the proper language suffix, hence the rename.
…On Sun, Nov 25, 2018, 06:05 zdenop ***@***.*** wrote:
IMO it is nothing else than replacing bash script. if bash script was not
problem with c++ why python?
For me bigger problem is that we mix library and program code
(src/api/tesseractmain.cpp + whole src/training/)
Moving the core training code will increase the dependency on external libs for all users, and most of them do not need them. (If you wanna have fun, try to use pango in Python 3 on Windows ;-))
Python has all of the functionality of libicu baked in, and can render fonts so you cut out both dependencies by shifting to it. For example, Pillow can render fonts. But if you specifically need pango and can't substitute another font library, yes, that would be painful. I'm saying this more to put what could be a viable alternative on your radar - maybe it's a lower maintenance option long term. I won't be trying it.
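As one small illustration of the standard-library side of this, here is a sketch of Unicode normalization with `unicodedata`. It assumes normalization is among the ICU features the training tools rely on, which this thread does not state, and it covers only that one feature. `normalize_text` is a hypothetical helper name.

```python
import unicodedata


def normalize_text(text, form="NFKC"):
    """Normalize text to the given Unicode normalization form.

    Hypothetical helper; `form` is one of "NFC", "NFD", "NFKC", "NFKD".
    """
    return unicodedata.normalize(form, text)


# "e" followed by a combining acute accent composes to one code point.
composed = normalize_text("e\u0301", form="NFC")

# The "fi" ligature (U+FB01) expands to two letters under NFKC.
expanded = normalize_text("\ufb01")

# Character categories are also available, e.g. "Mn" for nonspacing marks.
mark_category = unicodedata.category("\u0301")
```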
After attempting to chase down an error in tesstrain.sh I ended up converting tesstrain.sh to Python. I also see that interest in this conversion was expressed by others in #1503.
Only LSTM is converted right now. I have not done this work for the Tesseract 3 portions of training because IIRC, the old 3.x engine might get scrapped. If there's a need I can convert that too; the process is pretty mechanical.
Python 3.6+ is required; it ships with Ubuntu 18.04 and is always installed. I need Python 3.6 for its improved string substitution, which lets the Python follow the bash version more closely.
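For readers unfamiliar with the feature, the 3.6 string substitution is presumably formatted string literals (f-strings), which read much like bash's `${VAR}` expansion. A minimal sketch follows; the function name, variable names, and message text are hypothetical and not taken from the scripts.

```python
# Bash style, for comparison (hypothetical line):
#   echo "Training ${LANG_CODE} in ${OUTPUT_DIR}"


def describe_run(lang_code, output_dir):
    """Build a status message with an f-string (Python 3.6+)."""
    return f"Training {lang_code} in {output_dir}"


def describe_run_pre36(lang_code, output_dir):
    """Equivalent message using str.format, as needed before Python 3.6."""
    return "Training {} in {}".format(lang_code, output_dir)


message = describe_run("eng", "/tmp/out")
```

With f-strings the variable names sit inline in the string, as they do in bash, which makes a side-by-side comparison of the two versions easier.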
I attempted to keep source changes minimal so it would be easier to compare bash to Python in code review and confirm equivalence. The first commit is structured so it's somewhat easier to compare to the originals in bash, e.g. keeping Python and bash the same for the rare cases where they are the same. The second is cleaned up to PEP 8, so it deviates further. I definitely appreciate anyone checking whether I've mistranslated something. (#1503 suggests language_specific.py being presented in YAML, and I agree that would be better, but for now I want to avoid major restructuring.)
There are minor improvements already but it's more or less functionally equivalent for LSTM training. Python's logging library is used and the log file is handled more consistently. Console output is more terse when things are going well. There are progress bars. Temporary files are retained on failure, for inspection. It's generally easier to use.
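The terse-console, verbose-log-file split can be sketched with two handlers on one logger. This is an illustration under assumptions, not the PR's code: `setup_logging`, the logger name, and the format strings are hypothetical.

```python
import logging


def setup_logging(logfile):
    """Return a logger with a terse console and a verbose log file.

    Hypothetical helper for illustration only.
    """
    log = logging.getLogger("tesstrain_sketch")
    log.setLevel(logging.DEBUG)   # let handlers decide what to show
    log.handlers.clear()          # avoid duplicate handlers on repeat calls
    log.propagate = False

    # Console: INFO and above, message text only.
    console = logging.StreamHandler()
    console.setLevel(logging.INFO)
    console.setFormatter(logging.Formatter("%(message)s"))

    # Log file: everything, with timestamp and level.
    file_handler = logging.FileHandler(logfile, encoding="utf-8")
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

    log.addHandler(console)
    log.addHandler(file_handler)
    return log
```

A call such as `log.debug("...")` then lands only in the file, while `log.info("...")` reaches both the file and the console.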
The .cpp utilities generally depend on portions of the Tesseract code base. I don't plan to modify those.