diff --git a/README.md b/README.md index 1b32b9dc40..8b518759dd 100644 --- a/README.md +++ b/README.md @@ -33,7 +33,7 @@ In 2005 Tesseract was open sourced by HP. Since 2006 it is developed by Google. The latest stable version is **[3.05.01](https://github.com/tesseract-ocr/tesseract/releases/tag/3.05.01)**, released on June 1, 2017. Latest source code for 3.05 is available from [3.05 branch on GitHub](https://github.com/tesseract-ocr/tesseract/tree/3.05). -Source code for the new **[LSTM based 4.00.00alpha version](https://github.com/tesseract-ocr/tesseract)** is available from the master branch on GitHub. Please note this branch is under active development. +Source code for the new **[LSTM based 4.0 version](https://github.com/tesseract-ocr/tesseract)** is available from the master branch on GitHub. Please note this branch is under active development. See **[Release Notes](https://github.com/tesseract-ocr/tesseract/wiki/ReleaseNotes)** and **[Change Log](https://github.com/tesseract-ocr/tesseract/blob/master/ChangeLog)** for more details of the releases. diff --git a/doc/combine_tessdata.1.asc b/doc/combine_tessdata.1.asc index e91675ee7b..04d2487f1e 100644 --- a/doc/combine_tessdata.1.asc +++ b/doc/combine_tessdata.1.asc @@ -11,7 +11,7 @@ SYNOPSIS DESCRIPTION ----------- -combine_tessdata(1) is the main program to combine/extract/overwrite/list/compact +combine_tessdata(1) is the main program to combine/extract/overwrite/list/compact tessdata components in [lang].traineddata files. To combine all the individual tessdata components (unicharset, DAWGs, @@ -59,10 +59,10 @@ OPTIONS *-c* '.traineddata' 'FILE'...: Compacts the LSTM component in the .traineddata file to int. - + *-d* '.traineddata' 'FILE'...: Lists directory of components from the .traineddata file. - + *-e* '.traineddata' 'FILE'...: Extracts the specified components from the .traineddata file @@ -81,7 +81,7 @@ CAVEATS COMPONENTS ---------- The components in a Tesseract lang.traineddata file as of -Tesseract 4.00alpha are briefly described below; For more information on +Tesseract 4.0 are briefly described below; For more information on many of these files, see and @@ -89,7 +89,7 @@ and lang.config:: (Optional) Language-specific overrides to default config variables. - For 4.00alpha traineddata files, lang.config provides control parameters which + For 4.0 traineddata files, lang.config provides control parameters which can affect layout analysis, and sub-languages. lang.unicharset:: @@ -148,36 +148,36 @@ lang.params-model:: (Optional - 3.0x legacy tesseract) . lang.lstm:: - (Required - 4.00alpha LSTM) Neural net trained recognition model generated by lstmtraining. + (Required - 4.0 LSTM) Neural net trained recognition model generated by lstmtraining. lang.lstm-punc-dawg:: - (Optional - 4.00alpha LSTM) A dawg made from punctuation patterns found around words. + (Optional - 4.0 LSTM) A dawg made from punctuation patterns found around words. The "word" part is replaced by a single space. Uses lang.lstm-unicharset. - + lang.lstm-word-dawg:: - (Optional - 4.00alpha LSTM) A dawg made from dictionary words from the language. + (Optional - 4.0 LSTM) A dawg made from dictionary words from the language. Uses lang.lstm-unicharset. lang.lstm-number-dawg:: - (Optional - 4.00alpha LSTM) A dawg made from tokens which originally contained digits. + (Optional - 4.0 LSTM) A dawg made from tokens which originally contained digits. Each digit is replaced by a space character. Uses lang.lstm-unicharset. - + lang.lstm-unicharset:: - (Required - 4.00alpha LSTM) The unicode character set that Tesseract recognizes, with properties. + (Required - 4.0 LSTM) The unicode character set that Tesseract recognizes, with properties. Same unicharset must be used to train the LSTM and build the lstm-*-dawgs files. lang.lstm-recoder:: - (Required - 4.00alpha LSTM) Unicharcompress, aka the recoder, which maps the unicharset + (Required - 4.0 LSTM) Unicharcompress, aka the recoder, which maps the unicharset further to the codes actually used by the neural network recognizer. This is created as part of the starter traineddata by combine_lang_model. - + lang.version:: - (Optional) Version string for the traineddata file. - First appeared in version 4.00alpha of Tesseract. - Old version of traineddata files will report Version string:Pre-4.0.0. - 4.00alpha version of traineddata files may include the network spec + (Optional) Version string for the traineddata file. + First appeared in version 4.0 of Tesseract. + Old version of traineddata files will report Version string:Pre-4.0.0. + 4.0 version of traineddata files may include the network spec used for LSTM training as part of version string. - + HISTORY ------- combine_tessdata(1) first appeared in version 3.00 of Tesseract diff --git a/doc/tesseract.1.asc b/doc/tesseract.1.asc index c298e45e2a..c18917f23e 100644 --- a/doc/tesseract.1.asc +++ b/doc/tesseract.1.asc @@ -115,7 +115,7 @@ SINGLE OPTIONS LANGUAGES --------- -The currently available traineddata files for tesseract 4.00 +The currently available traineddata files for tesseract 4.0 for the following languages are in (in https://github.com/tesseract-ocr/tessdata_fast): @@ -244,7 +244,7 @@ argument '-l foo'. SCRIPTS ------- -The traineddata files for the following scripts for tesseract 4.00 +The traineddata files for the following scripts for tesseract 4.0 are also in https://github.com/tesseract-ocr/tessdata_fast. In most cases, each of these contains all the languages that use that script PLUS English.