AMBIGUOUS_WORDS(1)

+.\" Date: 06/12/2015 .\" Manual: \ \& .\" Source: \ \& .\" Language: English .\" -.TH "AMBIGUOUS_WORDS" "1" "02/09/2012" "\ \&" "\ \&" +.TH "AMBIGUOUS_WORDS" "1" "06/12/2015" "\ \&" "\ \&" .\" ----------------------------------------------------------------- .\" * Define some portability stuff .\" ----------------------------------------------------------------- diff --git a/doc/ambiguous_words.1.html b/doc/ambiguous_words.1.html index ae1e201015..3fd5f7f1f6 100644 --- a/doc/ambiguous_words.1.html +++ b/doc/ambiguous_words.1.html @@ -2,15 +2,25 @@ "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"> - - + + AMBIGUOUS_WORDS(1) - +

SYNOPSIS

ambiguous_words [-l lang] TESSDATADIR WORDLIST AMBIGUOUSFILE

DESCRIPTION

ambiguous_words(1) runs Tesseract in a special mode, and for each word @@ -591,25 +758,32 @@

DESCRIPTION

ambiguous with it. TESSDATADIR must be set to the absolute path of a directory containing tessdata/lang.traineddata.

COPYING

AUTHOR

The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985-1995) and Google (2006-present).

diff --git a/doc/ambiguous_words.1.xml b/doc/ambiguous_words.1.xml index f46dce4252..6293866ceb 100644 --- a/doc/ambiguous_words.1.xml +++ b/doc/ambiguous_words.1.xml @@ -3,11 +3,14 @@ + + AMBIGUOUS_WORDS(1) + ambiguous_words 1 - - + + ambiguous_words diff --git a/doc/cntraining.1 b/doc/cntraining.1 index 1acc8f812f..332655e513 100644 --- a/doc/cntraining.1 +++ b/doc/cntraining.1 @@ -1,13 +1,13 @@ '\" t .\" Title: cntraining .\" Author: [see the "AUTHOR" section] -.\" Generator: DocBook XSL Stylesheets v1.75.2 -.\" Date: 02/09/2012 +.\" Generator: DocBook XSL Stylesheets v1.78.1 +.\" Date: 06/12/2015 .\" Manual: \ \& .\" Source: \ \& .\" Language: English .\" -.TH "CNTRAINING" "1" "02/09/2012" "\ \&" "\ \&" +.TH "CNTRAINING" "1" "06/12/2015" "\ \&" "\ \&" .\" ----------------------------------------------------------------- .\" * Define some portability stuff .\" ----------------------------------------------------------------- @@ -45,7 +45,7 @@ Directory to write output files to\&. .sp tesseract(1), shapeclustering(1), mftraining(1) .sp -\m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/wiki/TrainingTesseract3\fR\m[] +\m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TrainingTesseract\fR\m[] .SH "COPYING" .sp Copyright (c) Hewlett\-Packard Company, 1988 Licensed under the Apache License, Version 2\&.0 diff --git a/doc/cntraining.1.asc b/doc/cntraining.1.asc index 808134740b..ef98112e06 100644 --- a/doc/cntraining.1.asc +++ b/doc/cntraining.1.asc @@ -24,7 +24,7 @@ SEE ALSO -------- tesseract(1), shapeclustering(1), mftraining(1) - + COPYING ------- diff --git a/doc/cntraining.1.html b/doc/cntraining.1.html index 085db73365..706d3bd0f4 100644 --- a/doc/cntraining.1.html +++ b/doc/cntraining.1.html @@ -2,15 +2,25 @@ "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"> - - + + CNTRAINING(1) - +

SYNOPSIS

cntraining [-D dir] FILE…

DESCRIPTION

cntraining takes a list of .tr files, from which it generates the normproto data file (the character normalization sensitivity prototypes).

OPTIONS

@@ -603,26 +772,33 @@

OPTIONS

COPYING

AUTHOR

The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985-1995) and Google (2006-present).

diff --git a/doc/cntraining.1.xml b/doc/cntraining.1.xml index d4d4161805..6795f12f2c 100644 --- a/doc/cntraining.1.xml +++ b/doc/cntraining.1.xml @@ -3,11 +3,14 @@ + + CNTRAINING(1) + cntraining 1 - - + + cntraining @@ -40,7 +43,7 @@ prototypes). SEE ALSO tesseract(1), shapeclustering(1), mftraining(1) -http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 +https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract COPYING diff --git a/doc/combine_tessdata.1 b/doc/combine_tessdata.1 index 926d183381..d876d1b8ee 100644 --- a/doc/combine_tessdata.1 +++ b/doc/combine_tessdata.1 @@ -1,13 +1,13 @@ '\" t .\" Title: combine_tessdata .\" Author: [see the "AUTHOR" section] -.\" Generator: DocBook XSL Stylesheets v1.75.2 -.\" Date: 02/09/2012 +.\" Generator: DocBook XSL Stylesheets v1.78.1 +.\" Date: 06/12/2015 .\" Manual: \ \& .\" Source: \ \& .\" Language: English .\" -.TH "COMBINE_TESSDATA" "1" "02/09/2012" "\ \&" "\ \&" +.TH "COMBINE_TESSDATA" "1" "06/12/2015" "\ \&" "\ \&" .\" ----------------------------------------------------------------- .\" * Define some portability stuff .\" ----------------------------------------------------------------- @@ -107,7 +107,7 @@ This will create /home/$USER/temp/eng\&.* files with individual tessdata compone \fIPrefix\fR refers to the full file prefix, including period (\&.) 
.SH "COMPONENTS" .sp -The components in a Tesseract lang\&.traineddata file as of Tesseract 3\&.02 are briefly described below; For more information on many of these files, see \m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/wiki/TrainingTesseract3\fR\m[] +The components in a Tesseract lang\&.traineddata file as of Tesseract 3\&.02 are briefly described below; For more information on many of these files, see \m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TrainingTesseract\fR\m[] .PP lang\&.config .RS 4 diff --git a/doc/combine_tessdata.1.asc b/doc/combine_tessdata.1.asc index 3632a98d42..d93de7ea0f 100644 --- a/doc/combine_tessdata.1.asc +++ b/doc/combine_tessdata.1.asc @@ -76,7 +76,7 @@ COMPONENTS The components in a Tesseract lang.traineddata file as of Tesseract 3.02 are briefly described below; For more information on many of these files, see - + lang.config:: (Optional) Language-specific overrides to default config variables. diff --git a/doc/combine_tessdata.1.html b/doc/combine_tessdata.1.html index a05044dfc2..8de474b33b 100644 --- a/doc/combine_tessdata.1.html +++ b/doc/combine_tessdata.1.html @@ -2,15 +2,25 @@ "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"> - - + + COMBINE_TESSDATA(1) - +

SYNOPSIS

combine_tessdata [OPTION] FILE…

DESCRIPTION

combine_tessdata(1) is the main program to combine/extract/overwrite @@ -593,7 +760,7 @@

DESCRIPTION

/home/$USER/temp/eng.* run:

combine_tessdata /home/$USER/temp/eng.

combine_tessdata /home/$USER/temp/eng.

The result will be a combined tessdata file /home/$USER/temp/eng.traineddata

Specify option -e if you would like to extract individual components @@ -601,8 +768,8 @@

DESCRIPTION

file and the unicharset from tessdata/eng.traineddata run:

combine_tessdata -e tessdata/eng.traineddata \
-  /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset

combine_tessdata -e tessdata/eng.traineddata \
+  /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset

The desired config file and unicharset will be written to /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset

@@ -611,8 +778,8 @@

DESCRIPTION

and unichar ambiguities files in tessdata/eng.traineddata use:

combine_tessdata -o tessdata/eng.traineddata \
-  /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharambigs

combine_tessdata -o tessdata/eng.traineddata \
+  /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharambigs

As a result, tessdata/eng.traineddata will contain the new language config and unichar ambigs, plus all the original DAWGs, classifier templates, etc.

@@ -623,11 +790,13 @@

DESCRIPTION

Specify option -u to unpack all the components to the specified path:

combine_tessdata -u tessdata/eng.traineddata /home/$USER/temp/eng.

combine_tessdata -u tessdata/eng.traineddata /home/$USER/temp/eng.

This will create /home/$USER/temp/eng.* files with individual tessdata components from tessdata/eng.traineddata.
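The four invocation forms described above (combine, `-e`, `-o`, `-u`) can be sketched as argument lists. This is only an illustration built from the command lines shown in this page: the helper names and the `/tmp/eng.` path are hypothetical, and nothing is executed.

```python
import shlex


def check_prefix(prefix):
    """The examples pass a prefix that ends in a period, e.g. '/tmp/eng.'."""
    if not prefix.endswith("."):
        raise ValueError("prefix should include the trailing period: %r" % prefix)
    return prefix


def combine_cmd(prefix):
    # Combine the component files found under the prefix into one traineddata file.
    return ["combine_tessdata", check_prefix(prefix)]


def extract_cmd(traineddata, *component_paths):
    # -e: extract the named components from the traineddata file.
    return ["combine_tessdata", "-e", traineddata, *component_paths]


def overwrite_cmd(traineddata, *component_paths):
    # -o: overwrite the named components inside the traineddata file.
    return ["combine_tessdata", "-o", traineddata, *component_paths]


def unpack_cmd(traineddata, prefix):
    # -u: unpack every component to files that start with the prefix.
    return ["combine_tessdata", "-u", traineddata, check_prefix(prefix)]


# Hypothetical paths, printed as a shell command line for inspection only.
print(shlex.join(unpack_cmd("tessdata/eng.traineddata", "/tmp/eng.")))
```

Passing such a list to `subprocess.run` would invoke the tool, assuming combine_tessdata is installed and on the PATH.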

OPTIONS

-e .traineddata FILE…: @@ -638,16 +807,20 @@

OPTIONS

-u .traineddata PATHPREFIX Unpacks the .traineddata using the provided prefix.

CAVEATS

Prefix refers to the full file prefix, including the trailing period (.).

COMPONENTS

The components in a Tesseract lang.traineddata file as of Tesseract 3.02 are briefly described below; for more information on many of these files, see

-http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

+https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract

lang.config @@ -802,30 +975,39 @@ COMPONENTS

HISTORY

combine_tessdata(1) first appeared in version 3.00 of Tesseract.

COPYING

AUTHOR

The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985-1995) and Google (2006-present).

diff --git a/doc/combine_tessdata.1.xml b/doc/combine_tessdata.1.xml index 0cb023cad0..1a43995fb5 100644 --- a/doc/combine_tessdata.1.xml +++ b/doc/combine_tessdata.1.xml @@ -3,11 +3,14 @@ + + COMBINE_TESSDATA(1) + combine_tessdata 1 - - + + combine_tessdata @@ -67,7 +70,7 @@ components from tessdata/eng.traineddata. The components in a Tesseract lang.traineddata file as of Tesseract 3.02 are briefly described below; For more information on many of these files, see -http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 +https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract diff --git a/doc/dawg2wordlist.1 b/doc/dawg2wordlist.1 index 2d73da370b..5fb50b522b 100644 --- a/doc/dawg2wordlist.1 +++ b/doc/dawg2wordlist.1 @@ -1,13 +1,13 @@ '\" t .\" Title: dawg2wordlist .\" Author: [see the "AUTHOR" section] -.\" Generator: DocBook XSL Stylesheets v1.75.2 -.\" Date: 02/09/2012 +.\" Generator: DocBook XSL Stylesheets v1.78.1 +.\" Date: 06/12/2015 .\" Manual: \ \& .\" Source: \ \& .\" Language: English .\" -.TH "DAWG2WORDLIST" "1" "02/09/2012" "\ \&" "\ \&" +.TH "DAWG2WORDLIST" "1" "06/12/2015" "\ \&" "\ \&" .\" ----------------------------------------------------------------- .\" * Define some portability stuff .\" ----------------------------------------------------------------- @@ -46,7 +46,7 @@ dawg2wordlist(1) converts a Tesseract Directed Acyclic Word Graph (DAWG) to a li .sp tesseract(1), mftraining(1), wordlist2dawg(1), unicharset(5), combine_tessdata(1) .sp -\m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/wiki/TrainingTesseract3\fR\m[] +\m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TrainingTesseract\fR\m[] .SH "COPYING" .sp Copyright (C) 2012 Google, Inc\&. 
Licensed under the Apache License, Version 2\&.0 diff --git a/doc/dawg2wordlist.1.asc b/doc/dawg2wordlist.1.asc index cd644a01bf..93594d61ae 100644 --- a/doc/dawg2wordlist.1.asc +++ b/doc/dawg2wordlist.1.asc @@ -32,7 +32,7 @@ SEE ALSO tesseract(1), mftraining(1), wordlist2dawg(1), unicharset(5), combine_tessdata(1) - + COPYING ------- diff --git a/doc/dawg2wordlist.1.html b/doc/dawg2wordlist.1.html index 9d926f9e8a..b700fe186d 100644 --- a/doc/dawg2wordlist.1.html +++ b/doc/dawg2wordlist.1.html @@ -2,15 +2,25 @@ "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"> - - + + DAWG2WORDLIST(1) - +

SYNOPSIS

dawg2wordlist UNICHARSET DAWG WORDLIST

DESCRIPTION

dawg2wordlist(1) converts a Tesseract Directed Acyclic Word Graph (DAWG) to a list of words using a unicharset as key.

OPTIONS

UNICHARSET @@ -599,27 +768,34 @@

OPTIONS

WORDLIST Plain text (output) file in UTF-8, one word per line

COPYING

AUTHOR

The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985-1995) and Google (2006-present).

diff --git a/doc/dawg2wordlist.1.xml b/doc/dawg2wordlist.1.xml index 5a9a224b95..c73113191c 100644 --- a/doc/dawg2wordlist.1.xml +++ b/doc/dawg2wordlist.1.xml @@ -3,11 +3,14 @@ + + DAWG2WORDLIST(1) + dawg2wordlist 1 - - + + dawg2wordlist @@ -35,7 +38,7 @@ Graph (DAWG) to a list of words using a unicharset as key. SEE ALSO tesseract(1), mftraining(1), wordlist2dawg(1), unicharset(5), combine_tessdata(1) -http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 +https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract COPYING diff --git a/doc/mftraining.1 b/doc/mftraining.1 index 441e03b258..1901850ada 100644 --- a/doc/mftraining.1 +++ b/doc/mftraining.1 @@ -1,13 +1,13 @@ '\" t .\" Title: mftraining .\" Author: [see the "AUTHOR" section] -.\" Generator: DocBook XSL Stylesheets v1.75.2 -.\" Date: 02/09/2012 +.\" Generator: DocBook XSL Stylesheets v1.78.1 +.\" Date: 06/12/2015 .\" Manual: \ \& .\" Source: \ \& .\" Language: English .\" -.TH "MFTRAINING" "1" "02/09/2012" "\ \&" "\ \&" +.TH "MFTRAINING" "1" "06/12/2015" "\ \&" "\ \&" .\" ----------------------------------------------------------------- .\" * Define some portability stuff .\" ----------------------------------------------------------------- @@ -85,7 +85,7 @@ Directory to write output files to\&. 
.sp tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1), shapeclustering(1), unicharset(5) .sp -\m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/wiki/TrainingTesseract3\fR\m[] +\m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TrainingTesseract\fR\m[] .SH "COPYING" .sp Copyright (C) Hewlett\-Packard Company, 1988 Licensed under the Apache License, Version 2\&.0 diff --git a/doc/mftraining.1.asc b/doc/mftraining.1.asc index 1a57d1e3c0..85e1263ade 100644 --- a/doc/mftraining.1.asc +++ b/doc/mftraining.1.asc @@ -43,7 +43,7 @@ SEE ALSO tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1), shapeclustering(1), unicharset(5) - + COPYING ------- diff --git a/doc/mftraining.1.html b/doc/mftraining.1.html index 4d5e54bb82..4abdfd6a6c 100644 --- a/doc/mftraining.1.html +++ b/doc/mftraining.1.html @@ -2,15 +2,25 @@ "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"> - - + + MFTRAINING(1) - +

SYNOPSIS

mftraining -U unicharset -O lang.unicharset FILE…

DESCRIPTION

mftraining takes a list of .tr files, from which it generates the @@ -591,6 +758,8 @@

DESCRIPTION

(the number of expected features for each character). (A fourth file called Microfeat is also written by this program, but it is not used.)

OPTIONS

@@ -623,7 +792,7 @@ OPTIONS - *font_name* *xheight* + *font_name* *xheight*
@@ -644,27 +813,34 @@ OPTIONS

COPYING

AUTHOR

The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985-1995) and Google (2006-present).

diff --git a/doc/mftraining.1.xml b/doc/mftraining.1.xml index 0f85e4f9d2..239178a5c1 100644 --- a/doc/mftraining.1.xml +++ b/doc/mftraining.1.xml @@ -3,11 +3,14 @@ + + MFTRAINING(1) + mftraining 1 - - + + mftraining @@ -84,7 +87,7 @@ called Microfeat is also written by this program, but it is not used.) SEE ALSO tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1), shapeclustering(1), unicharset(5) -http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 +https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract COPYING diff --git a/doc/shapeclustering.1 b/doc/shapeclustering.1 index d59783f0d8..f1f9fbdea6 100644 --- a/doc/shapeclustering.1 +++ b/doc/shapeclustering.1 @@ -1,13 +1,13 @@ '\" t .\" Title: shapeclustering .\" Author: [see the "AUTHOR" section] -.\" Generator: DocBook XSL Stylesheets v1.75.2 -.\" Date: 02/09/2012 +.\" Generator: DocBook XSL Stylesheets v1.78.1 +.\" Date: 06/12/2015 .\" Manual: \ \& .\" Source: \ \& .\" Language: English .\" -.TH "SHAPECLUSTERING" "1" "02/09/2012" "\ \&" "\ \&" +.TH "SHAPECLUSTERING" "1" "06/12/2015" "\ \&" "\ \&" .\" ----------------------------------------------------------------- .\" * Define some portability stuff .\" ----------------------------------------------------------------- @@ -85,7 +85,7 @@ The output unicharset that will be given to combine_tessdata(1)\&. 
.sp tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1), unicharset(5) .sp -\m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/wiki/TrainingTesseract3\fR\m[] +\m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TrainingTesseract\fR\m[] .SH "COPYING" .sp Copyright (C) Google, 2011 Licensed under the Apache License, Version 2\&.0 diff --git a/doc/shapeclustering.1.asc b/doc/shapeclustering.1.asc index cab0dc43dc..81ca0dbc09 100644 --- a/doc/shapeclustering.1.asc +++ b/doc/shapeclustering.1.asc @@ -46,7 +46,7 @@ SEE ALSO tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1), unicharset(5) - + COPYING ------- diff --git a/doc/shapeclustering.1.html b/doc/shapeclustering.1.html index a1f42cca99..845d49a815 100644 --- a/doc/shapeclustering.1.html +++ b/doc/shapeclustering.1.html @@ -2,15 +2,25 @@ "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"> - - + + SHAPECLUSTERING(1) - +

SYNOPSIS

shapeclustering -D output_dir @@ -587,6 +752,8 @@

SYNOPSIS

-F font_props -X xheights FILE…

DESCRIPTION

shapeclustering(1) takes extracted feature .tr files (generated by @@ -594,6 +761,8 @@

DESCRIPTION

file shapetable and an enhanced unicharset. This program is still experimental, and is not required (yet) for training Tesseract.

OPTIONS

@@ -634,7 +803,7 @@ OPTIONS - 'font_name' 'xheight' + 'font_name' 'xheight'
@@ -647,27 +816,34 @@ OPTIONS

COPYING

AUTHOR

The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985-1995) and Google (2006-present).

diff --git a/doc/shapeclustering.1.xml b/doc/shapeclustering.1.xml index 8000d27ea1..d02bcf8db9 100644 --- a/doc/shapeclustering.1.xml +++ b/doc/shapeclustering.1.xml @@ -3,11 +3,14 @@ + + SHAPECLUSTERING(1) + shapeclustering 1 - - + + shapeclustering @@ -87,7 +90,7 @@ experimental, and is not required (yet) for training Tesseract. SEE ALSO tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1), unicharset(5) -http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 +https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract COPYING diff --git a/doc/tesseract.1 b/doc/tesseract.1 index 7acdb90de1..d509a03430 100644 --- a/doc/tesseract.1 +++ b/doc/tesseract.1 @@ -2,12 +2,12 @@ .\" Title: tesseract .\" Author: [see the "AUTHOR" section] .\" Generator: DocBook XSL Stylesheets v1.78.1 -.\" Date: 08/02/2014 +.\" Date: 06/12/2015 .\" Manual: \ \& .\" Source: \ \& .\" Language: English .\" -.TH "TESSERACT" "1" "08/02/2014" "\ \&" "\ \&" +.TH "TESSERACT" "1" "06/12/2015" "\ \&" "\ \&" .\" ----------------------------------------------------------------- .\" * Define some portability stuff .\" ----------------------------------------------------------------- @@ -224,7 +224,7 @@ The engine was developed at Hewlett Packard Laboratories Bristol and at Hewlett .sp Version 2\&.00 brought Unicode (UTF\-8) support, six languages, and the ability to train Tesseract\&. .sp -Tesseract was included in UNLV\(cqs Fourth Annual Test of OCR Accuracy\&. See \m[blue]\fBhttp://www\&.isri\&.unlv\&.edu/downloads/AT\-1995\&.pdf\fR\m[]\&. With Tesseract 2\&.00, scripts are now included to allow anyone to reproduce some of these tests\&. See \m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/wiki/TestingTesseract\fR\m[] for more details\&. +Tesseract was included in UNLV\(cqs Fourth Annual Test of OCR Accuracy\&. See \m[blue]\fBhttp://www\&.isri\&.unlv\&.edu/downloads/AT\-1995\&.pdf\fR\m[]\&. 
With Tesseract 2\&.00, scripts are now included to allow anyone to reproduce some of these tests\&. See \m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TestingTesseract\fR\m[] for more details\&. .sp Tesseract 3\&.00 adds a number of new languages, including Chinese, Japanese, and Korean\&. It also introduces a new, single\-file based system of managing language data\&. .sp @@ -233,7 +233,7 @@ Tesseract 3\&.02 adds BiDirectional text support, the ability to recognize multi For further details, see the file ReleaseNotes included with the distribution\&. .SH "RESOURCES" .sp -Main web site: \m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/\fR\m[] Information on training: \m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/wiki/TrainingTesseract3\fR\m[] +Main web site: \m[blue]\fBhttps://github\&.com/tesseract\-ocr\fR\m[] Information on training: \m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TrainingTesseract\fR\m[] .SH "SEE ALSO" .sp ambiguous_words(1), cntraining(1), combine_tessdata(1), dawg2wordlist(1), shape_training(1), mftraining(1), unicharambigs(5), unicharset(5), unicharset_extractor(1), wordlist2dawg(1) diff --git a/doc/tesseract.1.asc b/doc/tesseract.1.asc index bcb3fccbb9..94048bb676 100644 --- a/doc/tesseract.1.asc +++ b/doc/tesseract.1.asc @@ -218,9 +218,9 @@ Version 2.00 brought Unicode (UTF-8) support, six languages, and the ability to train Tesseract. Tesseract was included in UNLV's Fourth Annual Test of OCR Accuracy. -See . With Tesseract 2.00, +See . With Tesseract 2.00, scripts are now included to allow anyone to reproduce some of these tests. -See for more +See for more details. Tesseract 3.00 adds a number of new languages, including Chinese, Japanese, @@ -234,8 +234,8 @@ For further details, see the file ReleaseNotes included with the distribution. 
RESOURCES --------- -Main web site: + -Information on training: +Main web site: + +Information on training: SEE ALSO -------- diff --git a/doc/tesseract.1.html b/doc/tesseract.1.html index 3e6d0e5f28..8619987e10 100644 --- a/doc/tesseract.1.html +++ b/doc/tesseract.1.html @@ -3,7 +3,7 @@ - + TESSERACT(1) - +

DESCRIPTION

The unicharambigs file (a component of traineddata, see combine_tessdata(1) ) @@ -588,7 +753,7 @@

DESCRIPTION

The file contains a number of lines, laid out as follows:

[num] <TAB> [char(s)] <TAB> [num] <TAB> [char(s)] <TAB> [num]

[num] <TAB> [char(s)] <TAB> [num] <TAB> [char(s)] <TAB> [num]

@@ -652,13 +817,15 @@

DESCRIPTION

unicharset. The numbers in fields one and three refer to the number of unichars (not bytes).
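A minimal sketch of reading one line in the five-field layout above. It assumes that the unichars inside fields two and four are separated by single spaces (as the EXAMPLE section suggests) and keeps field five as a raw integer; the names `AmbigLine` and `parse_ambig_line` are hypothetical and not part of Tesseract.

```python
from typing import NamedTuple, Tuple


class AmbigLine(NamedTuple):
    source: Tuple[str, ...]   # unichars from field two
    target: Tuple[str, ...]   # unichars from field four
    type_flag: int            # field five, kept as a raw number


def parse_ambig_line(line: str) -> AmbigLine:
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError("expected 5 tab-separated fields, got %d" % len(fields))
    n_source, source_text, n_target, target_text, type_text = fields
    source = tuple(source_text.split(" "))
    target = tuple(target_text.split(" "))
    # Fields one and three count unichars, not bytes.
    if len(source) != int(n_source) or len(target) != int(n_target):
        raise ValueError("unichar count does not match the declared number")
    return AmbigLine(source, target, int(type_text))


print(parse_ambig_line("3\ti i i\t1\tm\t0"))
```

Checking the declared counts against the split result catches lines where a tab was replaced by spaces during editing.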


EXAMPLE

2       ' '     1       "     1
+2       ' '     1       "     1
 1       m       2       r n   0
-3       i i i   1       m     0

+3       i i i   1       m     0

In this example, all instances of the 2 character sequence '' will always be replaced by the 1 character sequence "; a 1 character @@ -666,6 +833,8 @@

EXAMPLE

the 3 character sequence may be replaced by the 1 character sequence m.
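To illustrate the difference between "always replaced" and "may be replaced", the sketch below applies only the rules whose last field is 1 to a sequence of unichars and leaves the rules marked 0 alone. This is a simplified illustration, not how Tesseract applies ambiguities internally; `apply_mandatory` and the rule tuples are hypothetical names.

```python
# Rules copied from the example: (source unichars, target unichars, field five).
RULES = [
    (["'", "'"], ['"'], 1),
    (["m"], ["r", "n"], 0),
    (["i", "i", "i"], ["m"], 0),
]


def apply_mandatory(unichars, rules):
    """Replace every occurrence of a type-1 source sequence with its target."""
    unichars = list(unichars)
    out = []
    i = 0
    while i < len(unichars):
        for source, target, type_flag in rules:
            if type_flag == 1 and source and unichars[i:i + len(source)] == list(source):
                out.extend(target)
                i += len(source)
                break
        else:
            # No mandatory rule matched here; keep the unichar unchanged.
            out.append(unichars[i])
            i += 1
    return out


print("".join(apply_mandatory("''hi''", RULES)))
```

The rules marked 0 are skipped because they only describe alternatives that may be considered, not substitutions that must happen.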

HISTORY

The unicharambigs file first appeared in Tesseract 3.00; prior to that, a @@ -673,26 +842,33 @@

HISTORY

format was almost identical, except only mandatory replacements could be specified, and field 5 was absent.

BUGS

This is a documentation "bug": it’s not currently clear what should be done in the case of ligatures (such as fi) which may also appear as regular letters in the unicharset.

AUTHOR

The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985-1995) and Google (2006-present).

diff --git a/doc/unicharambigs.5.xml b/doc/unicharambigs.5.xml index 12ecb1fc29..75b3c66431 100644 --- a/doc/unicharambigs.5.xml +++ b/doc/unicharambigs.5.xml @@ -3,11 +3,14 @@ + + UNICHARAMBIGS(5) + unicharambigs 5 - - + + unicharambigs diff --git a/doc/unicharset.5 b/doc/unicharset.5 index fd9cccd642..a5924db6e8 100644 --- a/doc/unicharset.5 +++ b/doc/unicharset.5 @@ -1,13 +1,13 @@ '\" t .\" Title: unicharset .\" Author: [see the "AUTHOR" section] -.\" Generator: DocBook XSL Stylesheets v1.75.2 -.\" Date: 02/09/2012 +.\" Generator: DocBook XSL Stylesheets v1.78.1 +.\" Date: 06/12/2015 .\" Manual: \ \& .\" Source: \ \& .\" Language: English .\" -.TH "UNICHARSET" "5" "02/09/2012" "\ \&" "\ \&" +.TH "UNICHARSET" "5" "06/12/2015" "\ \&" "\ \&" .\" ----------------------------------------------------------------- .\" * Define some portability stuff .\" ----------------------------------------------------------------- @@ -214,7 +214,7 @@ The unicharset format first appeared with Tesseract 2\&.00, which was the first .sp tesseract(1), combine_tessdata(1), unicharset_extractor(1) .sp -\m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/wiki/TrainingTesseract3\fR\m[] +\m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TrainingTesseract\fR\m[] .SH "AUTHOR" .sp The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985\-1995) and Google (2006\-present)\&. diff --git a/doc/unicharset.5.asc b/doc/unicharset.5.asc index ed8c602ad3..5b859daa1e 100644 --- a/doc/unicharset.5.asc +++ b/doc/unicharset.5.asc @@ -124,7 +124,7 @@ SEE ALSO -------- tesseract(1), combine_tessdata(1), unicharset_extractor(1) - + AUTHOR diff --git a/doc/unicharset.5.html b/doc/unicharset.5.html index f76bafaffb..0f16c9e5e5 100644 --- a/doc/unicharset.5.html +++ b/doc/unicharset.5.html @@ -2,15 +2,25 @@ "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"> - - + + UNICHARSET(5) - +

DESCRIPTION

Tesseract’s unicharset file contains information on each symbol @@ -596,12 +761,12 @@

DESCRIPTION

Each unichar line in the unicharset file (v2+) may have four space-separated fields:

'character' 'properties' 'script' 'id'

'character' 'properties' 'script' 'id'

Starting with Tesseract v3.02, more information may be given for each unichar:

'character' 'properties' 'glyph_metrics' 'script' 'other_case' 'direction' 'mirror' 'normed_form'

'character' 'properties' 'glyph_metrics' 'script' 'other_case' 'direction' 'mirror' 'normed_form'

Entries:

@@ -712,15 +877,17 @@

DESCRIPTION

EXAMPLE (v2)

; 10 Common 46
+; 10 Common 46
 b 3 Latin 59
 W 5 Latin 40
 7 8 Common 66
-= 0 Common 93

+= 0 Common 93

";" is a punctuation character. Its properties are thus represented by the binary number 10000 (10 in hexadecimal).

@@ -736,20 +903,24 @@

EXAMPLE (v2)

binary number 00001 (1 in hexadecimal): they are alphabetic, but neither upper nor lower case.
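The properties field in the v2 example can be decoded as a hexadecimal bit mask. The text confirms the punctuation bit (binary 10000) and the alphabetic bit (binary 00001); the positions used below for lower case, upper case and digit are inferred from the example rows (`b 3`, `W 5`, `7 8`) and should be read as an assumption. The name `decode_properties` is hypothetical.

```python
# (bit value, meaning); middle three positions inferred from the example rows.
PROPERTY_BITS = (
    (0x01, "alpha"),
    (0x02, "lower"),
    (0x04, "upper"),
    (0x08, "digit"),
    (0x10, "punctuation"),
)


def decode_properties(hex_field):
    """Turn the hexadecimal properties field into a list of property names."""
    value = int(hex_field, 16)
    return [name for bit, name in PROPERTY_BITS if value & bit]


for char, props in ((";", "10"), ("b", "3"), ("W", "5"), ("7", "8"), ("=", "0")):
    print(char, decode_properties(props))
```

Note that the field is parsed in base 16, so the `10` on the `;` line is the value sixteen, not ten.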

EXAMPLE (v3.02)

110
+110
 NULL 0 NULL 0
 N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
 Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y
 1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
 9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9
 a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a
-. . .

+. . .
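A rough sketch of reading the v3.02 layout shown in this example. It assumes the first line is the number of entries and that eight-field lines follow the field order listed in the DESCRIPTION; shorter lines such as `NULL 0 NULL 0` are kept unparsed because their field meaning is not spelled out here. `parse_unicharset` is a hypothetical helper, not the reader Tesseract uses.

```python
FIELDS = ("character", "properties", "glyph_metrics", "script",
          "other_case", "direction", "mirror", "normed_form")


def parse_unicharset(text):
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    declared_count = int(lines[0])          # assumed: number of entries
    entries = []
    for line in lines[1:]:
        parts = line.split()
        if len(parts) == len(FIELDS):
            entry = dict(zip(FIELDS, parts))
            entry["properties"] = int(entry["properties"], 16)
            entry["glyph_metrics"] = [int(v) for v in entry["glyph_metrics"].split(",")]
            entries.append(entry)
        else:
            # Short or unexpected line: keep the raw fields untouched.
            entries.append({"raw": parts})
    return declared_count, entries


SAMPLE = """110
NULL 0 NULL 0
N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
"""

count, entries = parse_unicharset(SAMPLE)
print(count, entries[1]["character"], entries[1]["script"])
```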

CAVEATS

Although the unicharset reader maintains the ability to read unicharsets @@ -759,6 +930,8 @@

CAVEATS

so changing it without re-generating the others is likely to have dire consequences.

HISTORY

The unicharset format first appeared with Tesseract 2.00, which was the @@ -766,21 +939,26 @@

HISTORY

contained only the first two fields, and the "ispunctuation" property was absent (punctuation was regarded as "0", as "=" is in the above example).

AUTHOR

The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985-1995) and Google (2006-present).

diff --git a/doc/unicharset.5.xml b/doc/unicharset.5.xml index e14f2c1ce1..9ae6257e60 100644 --- a/doc/unicharset.5.xml +++ b/doc/unicharset.5.xml @@ -3,11 +3,14 @@ + + UNICHARSET(5) + unicharset 5 - - + + unicharset @@ -206,7 +209,7 @@ absent (punctuation was regarded as "0", as "=" is in the above example. SEE ALSO tesseract(1), combine_tessdata(1), unicharset_extractor(1) -http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 +https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract AUTHOR diff --git a/doc/unicharset_extractor.1 b/doc/unicharset_extractor.1 index c3bdf2fce3..ed2040dbfc 100644 --- a/doc/unicharset_extractor.1 +++ b/doc/unicharset_extractor.1 @@ -1,13 +1,13 @@ '\" t .\" Title: unicharset_extractor .\" Author: [see the "AUTHOR" section] -.\" Generator: DocBook XSL Stylesheets v1.75.2 -.\" Date: 02/09/2012 +.\" Generator: DocBook XSL Stylesheets v1.78.1 +.\" Date: 06/12/2015 .\" Manual: \ \& .\" Source: \ \& .\" Language: English .\" -.TH "UNICHARSET_EXTRACTOR" "1" "02/09/2012" "\ \&" "\ \&" +.TH "UNICHARSET_EXTRACTOR" "1" "06/12/2015" "\ \&" "\ \&" .\" ----------------------------------------------------------------- .\" * Define some portability stuff .\" ----------------------------------------------------------------- @@ -57,7 +57,7 @@ If your system supports the wctype functions, these values will be set automatic .sp tesseract(1), unicharset(5) .sp -\m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/wiki/TrainingTesseract3\fR\m[] +\m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TrainingTesseract\fR\m[] .SH "HISTORY" .sp unicharset_extractor first appeared in Tesseract 2\&.00\&. 
diff --git a/doc/unicharset_extractor.1.asc b/doc/unicharset_extractor.1.asc index a331d597e7..c972783a8e 100644 --- a/doc/unicharset_extractor.1.asc +++ b/doc/unicharset_extractor.1.asc @@ -40,7 +40,7 @@ SEE ALSO -------- tesseract(1), unicharset(5) - + HISTORY ------- diff --git a/doc/unicharset_extractor.1.html b/doc/unicharset_extractor.1.html index 8ab1a3a73e..a6ac9e898b 100644 --- a/doc/unicharset_extractor.1.html +++ b/doc/unicharset_extractor.1.html @@ -2,15 +2,25 @@ "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"> - - + + UNICHARSET_EXTRACTOR(1) - +

SYNOPSIS

unicharset_extractor [-D dir] FILE…

DESCRIPTION

Tesseract needs to know the set of possible characters it can output. @@ -592,7 +759,7 @@

DESCRIPTION

clustering:

unicharset_extractor fontfile_1.box fontfile_2.box ...

unicharset_extractor fontfile_1.box fontfile_2.box ...

The unicharset will be put into the file dir/unicharset, or simply ./unicharset if no output directory is provided.

@@ -609,30 +776,39 @@

DESCRIPTION

previous versions by running unicharset_extractor before mftraining and cntraining, and giving the unicharset to mftraining.
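The ordering described above can be sketched as a list of command lines built from the SYNOPSIS lines of unicharset_extractor(1), mftraining(1) and cntraining(1). Only the flags that appear in those synopses are used, so a real training run may need more options; the function name and the file names are hypothetical, and nothing is executed.

```python
def training_commands(box_files, tr_files, lang):
    """Return the three command lines in the order they would be run."""
    return [
        # 1. Extract the unicharset from the box files first.
        ["unicharset_extractor", *box_files],
        # 2. Give that unicharset to mftraining.
        ["mftraining", "-U", "unicharset", "-O", lang + ".unicharset", *tr_files],
        # 3. Run cntraining on the same .tr files.
        ["cntraining", *tr_files],
    ]


for cmd in training_commands(["font_1.box", "font_2.box"],
                             ["font_1.tr", "font_2.tr"], "eng"):
    print(" ".join(cmd))
```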

HISTORY

unicharset_extractor first appeared in Tesseract 2.00.

COPYING

AUTHOR

The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985-1995) and Google (2006-present).

diff --git a/doc/unicharset_extractor.1.xml b/doc/unicharset_extractor.1.xml index d4a5f766a7..bea4d1e16e 100644 --- a/doc/unicharset_extractor.1.xml +++ b/doc/unicharset_extractor.1.xml @@ -3,11 +3,14 @@ + + UNICHARSET_EXTRACTOR(1) + unicharset_extractor 1 - - + + unicharset_extractor @@ -41,7 +44,7 @@ cntraining, and giving the unicharset to mftraining. SEE ALSO tesseract(1), unicharset(5) -http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 +https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract HISTORY diff --git a/doc/wordlist2dawg.1 b/doc/wordlist2dawg.1 index 930a782652..4c8cd19e04 100644 --- a/doc/wordlist2dawg.1 +++ b/doc/wordlist2dawg.1 @@ -1,13 +1,13 @@ '\" t .\" Title: wordlist2dawg .\" Author: [see the "AUTHOR" section] -.\" Generator: DocBook XSL Stylesheets v1.75.2 -.\" Date: 02/09/2012 +.\" Generator: DocBook XSL Stylesheets v1.78.1 +.\" Date: 06/12/2015 .\" Manual: \ \& .\" Source: \ \& .\" Language: English .\" -.TH "WORDLIST2DAWG" "1" "02/09/2012" "\ \&" "\ \&" +.TH "WORDLIST2DAWG" "1" "06/12/2015" "\ \&" "\ \&" .\" ----------------------------------------------------------------- .\" * Define some portability stuff .\" ----------------------------------------------------------------- @@ -63,7 +63,7 @@ wordlist2dawg(1) converts a wordlist to a Directed Acyclic Word Graph (DAWG) for .sp tesseract(1), combine_tessdata(1), dawg2wordlist(1) .sp -\m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/wiki/TrainingTesseract3\fR\m[] +\m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TrainingTesseract\fR\m[] .SH "COPYING" .sp Copyright (C) 2006 Google, Inc\&. 
Licensed under the Apache License, Version 2\&.0 diff --git a/doc/wordlist2dawg.1.asc b/doc/wordlist2dawg.1.asc index f0193f14bf..b4f84ad59e 100644 --- a/doc/wordlist2dawg.1.asc +++ b/doc/wordlist2dawg.1.asc @@ -56,7 +56,7 @@ SEE ALSO -------- tesseract(1), combine_tessdata(1), dawg2wordlist(1) - + COPYING ------- diff --git a/doc/wordlist2dawg.1.html b/doc/wordlist2dawg.1.html index a1f72443a2..58e5cab4fa 100644 --- a/doc/wordlist2dawg.1.html +++ b/doc/wordlist2dawg.1.html @@ -2,15 +2,25 @@ "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"> - - + + WORDLIST2DAWG(1) - +

SYNOPSIS

wordlist2dawg WORDLIST DAWG lang.unicharset

@@ -588,12 +753,16 @@

SYNOPSIS

wordlist2dawg -r 2 WORDLIST DAWG lang.unicharset

wordlist2dawg -l <short> <long> WORDLIST DAWG lang.unicharset

DESCRIPTION

wordlist2dawg(1) converts a wordlist to a Directed Acyclic Word Graph (DAWG) for use with Tesseract. A DAWG is a compressed, space and time efficient representation of a word list.
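To show why a DAWG is more compact than a plain word list or trie, the sketch below builds a trie and then counts how many distinct nodes remain once identical subtrees are shared. It is a conceptual illustration only and makes no claim about the data structure or file format that wordlist2dawg actually produces; all names are hypothetical.

```python
class Node:
    """One trie node: outgoing edges keyed by character, plus an end-of-word flag."""

    def __init__(self):
        self.edges = {}
        self.final = False


def build_trie(words):
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.edges.setdefault(ch, Node())
        node.final = True
    return root


def count_trie_nodes(node):
    return 1 + sum(count_trie_nodes(child) for child in node.edges.values())


def count_dawg_nodes(root):
    """Count nodes after merging subtrees that accept the same set of suffixes."""
    registry = {}

    def canonical_id(node):
        signature = (node.final,
                     tuple(sorted((ch, canonical_id(child))
                                  for ch, child in node.edges.items())))
        # Equal signatures mean equivalent subtrees, so they share one id.
        return registry.setdefault(signature, len(registry))

    canonical_id(root)
    return len(registry)


trie = build_trie(["tap", "taps", "top", "tops"])
print(count_trie_nodes(trie), count_dawg_nodes(trie))
```

The four words share the endings "p" and "ps", so the merged graph needs fewer nodes than the trie, which is the space saving the description refers to.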

OPTIONS

-t @@ -606,6 +775,8 @@

OPTIONS

Produce a file with several dawgs in it, one each for words of length <short>, <short+1>,… <long>

ARGUMENTS

WORDLIST @@ -616,26 +787,33 @@

ARGUMENTS

The unicharset of the language. This is the unicharset generated by mftraining(1).

COPYING

AUTHOR

The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985-1995) and Google (2006-present).

diff --git a/doc/wordlist2dawg.1.xml b/doc/wordlist2dawg.1.xml index cc05a0d155..907d3a574d 100644 --- a/doc/wordlist2dawg.1.xml +++ b/doc/wordlist2dawg.1.xml @@ -3,11 +3,14 @@ + + WORDLIST2DAWG(1) + wordlist2dawg 1 - - + + wordlist2dawg @@ -51,7 +54,7 @@ efficient representation of a word list. SEE ALSO tesseract(1), combine_tessdata(1), dawg2wordlist(1) -http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 +https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract COPYING
