diff --git a/README.md b/README similarity index 100% rename from README.md rename to README diff --git a/doc/ambiguous_words.1 b/doc/ambiguous_words.1 index ce32f4cd77..1a1761ca3d 100644 --- a/doc/ambiguous_words.1 +++ b/doc/ambiguous_words.1 @@ -1,13 +1,13 @@ '\" t .\" Title: ambiguous_words .\" Author: [see the "AUTHOR" section] -.\" Generator: DocBook XSL Stylesheets v1.75.2 -.\" Date: 02/09/2012 +.\" Generator: DocBook XSL Stylesheets v1.78.1 +.\" Date: 06/12/2015 .\" Manual: \ \& .\" Source: \ \& .\" Language: English .\" -.TH "AMBIGUOUS_WORDS" "1" "02/09/2012" "\ \&" "\ \&" +.TH "AMBIGUOUS_WORDS" "1" "06/12/2015" "\ \&" "\ \&" .\" ----------------------------------------------------------------- .\" * Define some portability stuff .\" ----------------------------------------------------------------- diff --git a/doc/ambiguous_words.1.html b/doc/ambiguous_words.1.html index ae1e201015..3fd5f7f1f6 100644 --- a/doc/ambiguous_words.1.html +++ b/doc/ambiguous_words.1.html @@ -2,15 +2,25 @@ "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"> - - + + AMBIGUOUS_WORDS(1) - +
+

SYNOPSIS

ambiguous_words [-l lang] TESSDATADIR WORDLIST AMBIGUOUSFILE

+
+

DESCRIPTION

ambiguous_words(1) runs Tesseract in a special mode, and for each word @@ -591,25 +758,32 @@

DESCRIPTION

ambiguous with it. TESSDATADIR must be set to the absolute path of a directory containing tessdata/lang.traineddata.

+
+

SEE ALSO

tesseract(1)

+
+

COPYING

Copyright (C) 2012 Google, Inc. Licensed under the Apache License, Version 2.0

+
+

AUTHOR

The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985-1995) and Google (2006-present).

+

diff --git a/doc/ambiguous_words.1.xml b/doc/ambiguous_words.1.xml index f46dce4252..6293866ceb 100644 --- a/doc/ambiguous_words.1.xml +++ b/doc/ambiguous_words.1.xml @@ -3,11 +3,14 @@ + + AMBIGUOUS_WORDS(1) + ambiguous_words 1 -  -  +  +  ambiguous_words diff --git a/doc/cntraining.1 b/doc/cntraining.1 index 1acc8f812f..332655e513 100644 --- a/doc/cntraining.1 +++ b/doc/cntraining.1 @@ -1,13 +1,13 @@ '\" t .\" Title: cntraining .\" Author: [see the "AUTHOR" section] -.\" Generator: DocBook XSL Stylesheets v1.75.2 -.\" Date: 02/09/2012 +.\" Generator: DocBook XSL Stylesheets v1.78.1 +.\" Date: 06/12/2015 .\" Manual: \ \& .\" Source: \ \& .\" Language: English .\" -.TH "CNTRAINING" "1" "02/09/2012" "\ \&" "\ \&" +.TH "CNTRAINING" "1" "06/12/2015" "\ \&" "\ \&" .\" ----------------------------------------------------------------- .\" * Define some portability stuff .\" ----------------------------------------------------------------- @@ -45,7 +45,7 @@ Directory to write output files to\&. .sp tesseract(1), shapeclustering(1), mftraining(1) .sp -\m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/wiki/TrainingTesseract3\fR\m[] +\m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TrainingTesseract\fR\m[] .SH "COPYING" .sp Copyright (c) Hewlett\-Packard Company, 1988 Licensed under the Apache License, Version 2\&.0 diff --git a/doc/cntraining.1.asc b/doc/cntraining.1.asc index 808134740b..ef98112e06 100644 --- a/doc/cntraining.1.asc +++ b/doc/cntraining.1.asc @@ -24,7 +24,7 @@ SEE ALSO -------- tesseract(1), shapeclustering(1), mftraining(1) - + COPYING ------- diff --git a/doc/cntraining.1.html b/doc/cntraining.1.html index 085db73365..706d3bd0f4 100644 --- a/doc/cntraining.1.html +++ b/doc/cntraining.1.html @@ -2,15 +2,25 @@ "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"> - - + + CNTRAINING(1) - +
+

SYNOPSIS

cntraining [-D dir] FILE

+
+

DESCRIPTION

cntraining takes a list of .tr files, from which it generates the normproto data file (the character normalization sensitivity prototypes).

+
+

OPTIONS

@@ -603,26 +772,33 @@

OPTIONS

+
+ +

COPYING

Copyright (c) Hewlett-Packard Company, 1988 Licensed under the Apache License, Version 2.0

+
+

AUTHOR

The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985-1995) and Google (2006-present).

+

diff --git a/doc/cntraining.1.xml b/doc/cntraining.1.xml index d4d4161805..6795f12f2c 100644 --- a/doc/cntraining.1.xml +++ b/doc/cntraining.1.xml @@ -3,11 +3,14 @@ + + CNTRAINING(1) + cntraining 1 -  -  +  +  cntraining @@ -40,7 +43,7 @@ prototypes). SEE ALSO tesseract(1), shapeclustering(1), mftraining(1) -http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 +https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract COPYING diff --git a/doc/combine_tessdata.1 b/doc/combine_tessdata.1 index 926d183381..d876d1b8ee 100644 --- a/doc/combine_tessdata.1 +++ b/doc/combine_tessdata.1 @@ -1,13 +1,13 @@ '\" t .\" Title: combine_tessdata .\" Author: [see the "AUTHOR" section] -.\" Generator: DocBook XSL Stylesheets v1.75.2 -.\" Date: 02/09/2012 +.\" Generator: DocBook XSL Stylesheets v1.78.1 +.\" Date: 06/12/2015 .\" Manual: \ \& .\" Source: \ \& .\" Language: English .\" -.TH "COMBINE_TESSDATA" "1" "02/09/2012" "\ \&" "\ \&" +.TH "COMBINE_TESSDATA" "1" "06/12/2015" "\ \&" "\ \&" .\" ----------------------------------------------------------------- .\" * Define some portability stuff .\" ----------------------------------------------------------------- @@ -107,7 +107,7 @@ This will create /home/$USER/temp/eng\&.* files with individual tessdata compone \fIPrefix\fR refers to the full file prefix, including period (\&.) 
.SH "COMPONENTS" .sp -The components in a Tesseract lang\&.traineddata file as of Tesseract 3\&.02 are briefly described below; For more information on many of these files, see \m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/wiki/TrainingTesseract3\fR\m[] +The components in a Tesseract lang\&.traineddata file as of Tesseract 3\&.02 are briefly described below; For more information on many of these files, see \m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TrainingTesseract\fR\m[] .PP lang\&.config .RS 4 diff --git a/doc/combine_tessdata.1.asc b/doc/combine_tessdata.1.asc index 3632a98d42..d93de7ea0f 100644 --- a/doc/combine_tessdata.1.asc +++ b/doc/combine_tessdata.1.asc @@ -76,7 +76,7 @@ COMPONENTS The components in a Tesseract lang.traineddata file as of Tesseract 3.02 are briefly described below; For more information on many of these files, see - + lang.config:: (Optional) Language-specific overrides to default config variables. diff --git a/doc/combine_tessdata.1.html b/doc/combine_tessdata.1.html index a05044dfc2..8de474b33b 100644 --- a/doc/combine_tessdata.1.html +++ b/doc/combine_tessdata.1.html @@ -2,15 +2,25 @@ "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"> - - + + COMBINE_TESSDATA(1) - +
+

SYNOPSIS

combine_tessdata [OPTION] FILE

+
+

DESCRIPTION

combine_tessdata(1) is the main program to combine/extract/overwrite @@ -593,7 +760,7 @@

DESCRIPTION

/home/$USER/temp/eng.* run:

-
combine_tessdata /home/$USER/temp/eng.
+
combine_tessdata /home/$USER/temp/eng.

The result will be a combined tessdata file /home/$USER/temp/eng.traineddata

Specify option -e if you would like to extract individual components @@ -601,8 +768,8 @@

DESCRIPTION

file and the unicharset from tessdata/eng.traineddata run:

-
combine_tessdata -e tessdata/eng.traineddata \
-  /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset
+
combine_tessdata -e tessdata/eng.traineddata \
+  /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset

The desired config file and unicharset will be written to /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset

@@ -611,8 +778,8 @@

DESCRIPTION

and unichar ambiguities files in tessdata/eng.traineddata use:

-
combine_tessdata -o tessdata/eng.traineddata \
-  /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharambigs
+
combine_tessdata -o tessdata/eng.traineddata \
+  /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharambigs

As a result, tessdata/eng.traineddata will contain the new language config and unichar ambigs, plus all the original DAWGs, classifier templates, etc.

@@ -623,11 +790,13 @@

DESCRIPTION

Specify option -u to unpack all the components to the specified path:

-
combine_tessdata -u tessdata/eng.traineddata /home/$USER/temp/eng.
+
combine_tessdata -u tessdata/eng.traineddata /home/$USER/temp/eng.

This will create /home/$USER/temp/eng.* files with individual tessdata components from tessdata/eng.traineddata.

+
+

OPTIONS

-e .traineddata FILE…: @@ -638,16 +807,20 @@

OPTIONS

-u .traineddata PATHPREFIX Unpacks the .traineddata using the provided prefix.

+
+

CAVEATS

Prefix refers to the full file prefix, including period (.)

+
+

COMPONENTS

The components in a Tesseract lang.traineddata file as of Tesseract 3.02 are briefly described below; For more information on many of these files, see -http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

+https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract

lang.config @@ -802,30 +975,39 @@

COMPONENTS

+
+

HISTORY

combine_tessdata(1) first appeared in version 3.00 of Tesseract

+
+

SEE ALSO

tesseract(1), wordlist2dawg(1), cntraining(1), mftraining(1), unicharset(5), unicharambigs(5)

+
+

COPYING

Copyright (C) 2009, Google Inc. Licensed under the Apache License, Version 2.0

+
+

AUTHOR

The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985-1995) and Google (2006-present).

+

diff --git a/doc/combine_tessdata.1.xml b/doc/combine_tessdata.1.xml index 0cb023cad0..1a43995fb5 100644 --- a/doc/combine_tessdata.1.xml +++ b/doc/combine_tessdata.1.xml @@ -3,11 +3,14 @@ + + COMBINE_TESSDATA(1) + combine_tessdata 1 -  -  +  +  combine_tessdata @@ -67,7 +70,7 @@ components from tessdata/eng.traineddata. The components in a Tesseract lang.traineddata file as of Tesseract 3.02 are briefly described below; For more information on many of these files, see -http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 +https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract diff --git a/doc/dawg2wordlist.1 b/doc/dawg2wordlist.1 index 2d73da370b..5fb50b522b 100644 --- a/doc/dawg2wordlist.1 +++ b/doc/dawg2wordlist.1 @@ -1,13 +1,13 @@ '\" t .\" Title: dawg2wordlist .\" Author: [see the "AUTHOR" section] -.\" Generator: DocBook XSL Stylesheets v1.75.2 -.\" Date: 02/09/2012 +.\" Generator: DocBook XSL Stylesheets v1.78.1 +.\" Date: 06/12/2015 .\" Manual: \ \& .\" Source: \ \& .\" Language: English .\" -.TH "DAWG2WORDLIST" "1" "02/09/2012" "\ \&" "\ \&" +.TH "DAWG2WORDLIST" "1" "06/12/2015" "\ \&" "\ \&" .\" ----------------------------------------------------------------- .\" * Define some portability stuff .\" ----------------------------------------------------------------- @@ -46,7 +46,7 @@ dawg2wordlist(1) converts a Tesseract Directed Acyclic Word Graph (DAWG) to a li .sp tesseract(1), mftraining(1), wordlist2dawg(1), unicharset(5), combine_tessdata(1) .sp -\m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/wiki/TrainingTesseract3\fR\m[] +\m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TrainingTesseract\fR\m[] .SH "COPYING" .sp Copyright (C) 2012 Google, Inc\&. 
Licensed under the Apache License, Version 2\&.0 diff --git a/doc/dawg2wordlist.1.asc b/doc/dawg2wordlist.1.asc index cd644a01bf..93594d61ae 100644 --- a/doc/dawg2wordlist.1.asc +++ b/doc/dawg2wordlist.1.asc @@ -32,7 +32,7 @@ SEE ALSO tesseract(1), mftraining(1), wordlist2dawg(1), unicharset(5), combine_tessdata(1) - + COPYING ------- diff --git a/doc/dawg2wordlist.1.html b/doc/dawg2wordlist.1.html index 9d926f9e8a..b700fe186d 100644 --- a/doc/dawg2wordlist.1.html +++ b/doc/dawg2wordlist.1.html @@ -2,15 +2,25 @@ "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"> - - + + DAWG2WORDLIST(1) - +
+

SYNOPSIS

dawg2wordlist UNICHARSET DAWG WORDLIST

+
+

DESCRIPTION

dawg2wordlist(1) converts a Tesseract Directed Acyclic Word Graph (DAWG) to a list of words using a unicharset as key.

+
+

OPTIONS

UNICHARSET @@ -599,27 +768,34 @@

OPTIONS

WORDLIST Plain text (output) file in UTF-8, one word per line

+
+

SEE ALSO

tesseract(1), mftraining(1), wordlist2dawg(1), unicharset(5), combine_tessdata(1)

- + +
+

COPYING

Copyright (C) 2012 Google, Inc. Licensed under the Apache License, Version 2.0

+
+

AUTHOR

The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985-1995) and Google (2006-present).

+

diff --git a/doc/dawg2wordlist.1.xml b/doc/dawg2wordlist.1.xml index 5a9a224b95..c73113191c 100644 --- a/doc/dawg2wordlist.1.xml +++ b/doc/dawg2wordlist.1.xml @@ -3,11 +3,14 @@ + + DAWG2WORDLIST(1) + dawg2wordlist 1 -  -  +  +  dawg2wordlist @@ -35,7 +38,7 @@ Graph (DAWG) to a list of words using a unicharset as key. SEE ALSO tesseract(1), mftraining(1), wordlist2dawg(1), unicharset(5), combine_tessdata(1) -http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 +https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract COPYING diff --git a/doc/mftraining.1 b/doc/mftraining.1 index 441e03b258..1901850ada 100644 --- a/doc/mftraining.1 +++ b/doc/mftraining.1 @@ -1,13 +1,13 @@ '\" t .\" Title: mftraining .\" Author: [see the "AUTHOR" section] -.\" Generator: DocBook XSL Stylesheets v1.75.2 -.\" Date: 02/09/2012 +.\" Generator: DocBook XSL Stylesheets v1.78.1 +.\" Date: 06/12/2015 .\" Manual: \ \& .\" Source: \ \& .\" Language: English .\" -.TH "MFTRAINING" "1" "02/09/2012" "\ \&" "\ \&" +.TH "MFTRAINING" "1" "06/12/2015" "\ \&" "\ \&" .\" ----------------------------------------------------------------- .\" * Define some portability stuff .\" ----------------------------------------------------------------- @@ -85,7 +85,7 @@ Directory to write output files to\&. 
.sp tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1), shapeclustering(1), unicharset(5) .sp -\m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/wiki/TrainingTesseract3\fR\m[] +\m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TrainingTesseract\fR\m[] .SH "COPYING" .sp Copyright (C) Hewlett\-Packard Company, 1988 Licensed under the Apache License, Version 2\&.0 diff --git a/doc/mftraining.1.asc b/doc/mftraining.1.asc index 1a57d1e3c0..85e1263ade 100644 --- a/doc/mftraining.1.asc +++ b/doc/mftraining.1.asc @@ -43,7 +43,7 @@ SEE ALSO tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1), shapeclustering(1), unicharset(5) - + COPYING ------- diff --git a/doc/mftraining.1.html b/doc/mftraining.1.html index 4d5e54bb82..4abdfd6a6c 100644 --- a/doc/mftraining.1.html +++ b/doc/mftraining.1.html @@ -2,15 +2,25 @@ "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"> - - + + MFTRAINING(1) - +
+

SYNOPSIS

mftraining -U unicharset -O lang.unicharset FILE

+
+

DESCRIPTION

mftraining takes a list of .tr files, from which it generates the @@ -591,6 +758,8 @@

DESCRIPTION

(the number of expected features for each character). (A fourth file called Microfeat is also written by this program, but it is not used.)

+
+

OPTIONS

@@ -611,7 +780,7 @@

OPTIONS

-
*font_name* *italic* *bold* *fixed_pitch* *serif* *fraktur*
+
*font_name* *italic* *bold* *fixed_pitch* *serif* *fraktur*
@@ -623,7 +792,7 @@

OPTIONS

-
*font_name* *xheight*
+
*font_name* *xheight*
@@ -644,27 +813,34 @@

OPTIONS

+
+

SEE ALSO

tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1), shapeclustering(1), unicharset(5)

- + +
+

COPYING

Copyright (C) Hewlett-Packard Company, 1988 Licensed under the Apache License, Version 2.0

+
+

AUTHOR

The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985-1995) and Google (2006-present).

+

diff --git a/doc/mftraining.1.xml b/doc/mftraining.1.xml index 0f85e4f9d2..239178a5c1 100644 --- a/doc/mftraining.1.xml +++ b/doc/mftraining.1.xml @@ -3,11 +3,14 @@ + + MFTRAINING(1) + mftraining 1 -  -  +  +  mftraining @@ -84,7 +87,7 @@ called Microfeat is also written by this program, but it is not used.) SEE ALSO tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1), shapeclustering(1), unicharset(5) -http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 +https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract
COPYING diff --git a/doc/shapeclustering.1 b/doc/shapeclustering.1 index d59783f0d8..f1f9fbdea6 100644 --- a/doc/shapeclustering.1 +++ b/doc/shapeclustering.1 @@ -1,13 +1,13 @@ '\" t .\" Title: shapeclustering .\" Author: [see the "AUTHOR" section] -.\" Generator: DocBook XSL Stylesheets v1.75.2 -.\" Date: 02/09/2012 +.\" Generator: DocBook XSL Stylesheets v1.78.1 +.\" Date: 06/12/2015 .\" Manual: \ \& .\" Source: \ \& .\" Language: English .\" -.TH "SHAPECLUSTERING" "1" "02/09/2012" "\ \&" "\ \&" +.TH "SHAPECLUSTERING" "1" "06/12/2015" "\ \&" "\ \&" .\" ----------------------------------------------------------------- .\" * Define some portability stuff .\" ----------------------------------------------------------------- @@ -85,7 +85,7 @@ The output unicharset that will be given to combine_tessdata(1)\&. .sp tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1), unicharset(5) .sp -\m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/wiki/TrainingTesseract3\fR\m[] +\m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TrainingTesseract\fR\m[] .SH "COPYING" .sp Copyright (C) Google, 2011 Licensed under the Apache License, Version 2\&.0 diff --git a/doc/shapeclustering.1.asc b/doc/shapeclustering.1.asc index cab0dc43dc..81ca0dbc09 100644 --- a/doc/shapeclustering.1.asc +++ b/doc/shapeclustering.1.asc @@ -46,7 +46,7 @@ SEE ALSO tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1), unicharset(5) - + COPYING ------- diff --git a/doc/shapeclustering.1.html b/doc/shapeclustering.1.html index a1f42cca99..845d49a815 100644 --- a/doc/shapeclustering.1.html +++ b/doc/shapeclustering.1.html @@ -2,15 +2,25 @@ "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"> - - + + SHAPECLUSTERING(1) - +
+

SYNOPSIS

shapeclustering -D output_dir @@ -587,6 +752,8 @@

SYNOPSIS

-F font_props -X xheights FILE

+
+

DESCRIPTION

shapeclustering(1) takes extracted feature .tr files (generated by @@ -594,6 +761,8 @@

DESCRIPTION

file shapetable and an enhanced unicharset. This program is still experimental, and is not required (yet) for training Tesseract.

+
+

OPTIONS

@@ -622,7 +791,7 @@

OPTIONS

-
'font_name' 'italic' 'bold' 'fixed_pitch' 'serif' 'fraktur'
+
'font_name' 'italic' 'bold' 'fixed_pitch' 'serif' 'fraktur'
@@ -634,7 +803,7 @@

OPTIONS

-
'font_name' 'xheight'
+
'font_name' 'xheight'
@@ -647,27 +816,34 @@

OPTIONS

+
+

SEE ALSO

tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1), unicharset(5)

- + +
+

COPYING

Copyright (C) Google, 2011 Licensed under the Apache License, Version 2.0

+
+

AUTHOR

The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985-1995) and Google (2006-present).

+

diff --git a/doc/shapeclustering.1.xml b/doc/shapeclustering.1.xml index 8000d27ea1..d02bcf8db9 100644 --- a/doc/shapeclustering.1.xml +++ b/doc/shapeclustering.1.xml @@ -3,11 +3,14 @@ + + SHAPECLUSTERING(1) + shapeclustering 1 -  -  +  +  shapeclustering @@ -87,7 +90,7 @@ experimental, and is not required (yet) for training Tesseract. SEE ALSO tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1), unicharset(5) -http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 +https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract
COPYING diff --git a/doc/tesseract.1 b/doc/tesseract.1 index 7acdb90de1..d509a03430 100644 --- a/doc/tesseract.1 +++ b/doc/tesseract.1 @@ -2,12 +2,12 @@ .\" Title: tesseract .\" Author: [see the "AUTHOR" section] .\" Generator: DocBook XSL Stylesheets v1.78.1 -.\" Date: 08/02/2014 +.\" Date: 06/12/2015 .\" Manual: \ \& .\" Source: \ \& .\" Language: English .\" -.TH "TESSERACT" "1" "08/02/2014" "\ \&" "\ \&" +.TH "TESSERACT" "1" "06/12/2015" "\ \&" "\ \&" .\" ----------------------------------------------------------------- .\" * Define some portability stuff .\" ----------------------------------------------------------------- @@ -224,7 +224,7 @@ The engine was developed at Hewlett Packard Laboratories Bristol and at Hewlett .sp Version 2\&.00 brought Unicode (UTF\-8) support, six languages, and the ability to train Tesseract\&. .sp -Tesseract was included in UNLV\(cqs Fourth Annual Test of OCR Accuracy\&. See \m[blue]\fBhttp://www\&.isri\&.unlv\&.edu/downloads/AT\-1995\&.pdf\fR\m[]\&. With Tesseract 2\&.00, scripts are now included to allow anyone to reproduce some of these tests\&. See \m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/wiki/TestingTesseract\fR\m[] for more details\&. +Tesseract was included in UNLV\(cqs Fourth Annual Test of OCR Accuracy\&. See \m[blue]\fBhttp://www\&.isri\&.unlv\&.edu/downloads/AT\-1995\&.pdf\fR\m[]\&. With Tesseract 2\&.00, scripts are now included to allow anyone to reproduce some of these tests\&. See \m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TestingTesseract\fR\m[] for more details\&. .sp Tesseract 3\&.00 adds a number of new languages, including Chinese, Japanese, and Korean\&. It also introduces a new, single\-file based system of managing language data\&. .sp @@ -233,7 +233,7 @@ Tesseract 3\&.02 adds BiDirectional text support, the ability to recognize multi For further details, see the file ReleaseNotes included with the distribution\&. 
.SH "RESOURCES" .sp -Main web site: \m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/\fR\m[] Information on training: \m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/wiki/TrainingTesseract3\fR\m[] +Main web site: \m[blue]\fBhttps://github\&.com/tesseract\-ocr\fR\m[] Information on training: \m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TrainingTesseract\fR\m[] .SH "SEE ALSO" .sp ambiguous_words(1), cntraining(1), combine_tessdata(1), dawg2wordlist(1), shape_training(1), mftraining(1), unicharambigs(5), unicharset(5), unicharset_extractor(1), wordlist2dawg(1) diff --git a/doc/tesseract.1.asc b/doc/tesseract.1.asc index bcb3fccbb9..94048bb676 100644 --- a/doc/tesseract.1.asc +++ b/doc/tesseract.1.asc @@ -218,9 +218,9 @@ Version 2.00 brought Unicode (UTF-8) support, six languages, and the ability to train Tesseract. Tesseract was included in UNLV's Fourth Annual Test of OCR Accuracy. -See . With Tesseract 2.00, +See . With Tesseract 2.00, scripts are now included to allow anyone to reproduce some of these tests. -See for more +See for more details. Tesseract 3.00 adds a number of new languages, including Chinese, Japanese, @@ -234,8 +234,8 @@ For further details, see the file ReleaseNotes included with the distribution. RESOURCES --------- -Main web site: + -Information on training: +Main web site: + +Information on training: SEE ALSO -------- diff --git a/doc/tesseract.1.html b/doc/tesseract.1.html index 3e6d0e5f28..8619987e10 100644 --- a/doc/tesseract.1.html +++ b/doc/tesseract.1.html @@ -3,7 +3,7 @@ - + TESSERACT(1) - +
+

DESCRIPTION

The unicharambigs file (a component of traineddata, see combine_tessdata(1) ) @@ -588,7 +753,7 @@

DESCRIPTION

The file contains a number of lines, laid out as follows:

-
[num] <TAB> [char(s)] <TAB> [num] <TAB> [char(s)] <TAB> [num]
+
[num] <TAB> [char(s)] <TAB> [num] <TAB> [char(s)] <TAB> [num]
@@ -652,13 +817,15 @@

DESCRIPTION

unicharset. The numbers in fields one and three refer to the number of unichars (not bytes).

+ +

EXAMPLE

-
2       ' '     1       "     1
+
2       ' '     1       "     1
 1       m       2       r n   0
-3       i i i   1       m     0
+3 i i i 1 m 0

In this example, all instances of the 2 character sequence '' will always be replaced by the 1 character sequence "; a 1 character @@ -666,6 +833,8 @@

EXAMPLE

the 3 character sequence may be replaced by the 1 character sequence m.
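
The five-field layout described above can be read with a few lines of Python. This is only a sketch, not Tesseract's own loader: the function name `parse_ambig_line` and the returned dictionary keys are hypothetical, and the reading of field 5 (1 for a mandatory replacement, 0 for an optional one) is inferred from the example lines and their explanation.

```python
def parse_ambig_line(line):
    """Split one unicharambigs line into its five TAB-separated fields.

    Hypothetical helper for illustration; not part of Tesseract.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError("expected 5 TAB-separated fields, got %d" % len(fields))
    n_source, source, n_target, target, kind = fields

    # Unichars inside a field are separated by single spaces.
    source_chars = source.split(" ")
    target_chars = target.split(" ")

    # Fields 1 and 3 count unichars (not bytes), so they must match.
    if len(source_chars) != int(n_source):
        raise ValueError("field 1 does not match the number of source unichars")
    if len(target_chars) != int(n_target):
        raise ValueError("field 3 does not match the number of target unichars")

    return {
        "source": source_chars,
        "target": target_chars,
        # Assumption drawn from the example: 1 = always replace, 0 = may replace.
        "mandatory": kind == "1",
    }


if __name__ == "__main__":
    print(parse_ambig_line("3\ti i i\t1\tm\t0"))
```

Checking the counts against the split unichars catches lines where a space was lost or a TAB was replaced by spaces, which is an easy mistake to make when editing the file by hand.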

+
+

HISTORY

The unicharambigs file first appeared in Tesseract 3.00; prior to that, a @@ -673,26 +842,33 @@

HISTORY

format was almost identical, except only mandatory replacements could be specified, and field 5 was absent.

+
+

BUGS

This is a documentation "bug": it’s not currently clear what should be done in the case of ligatures (such as fi) which may also appear as regular letters in the unicharset.

+
+

SEE ALSO

tesseract(1), unicharset(5)

+
+

AUTHOR

The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985-1995) and Google (2006-present).

+

diff --git a/doc/unicharambigs.5.xml b/doc/unicharambigs.5.xml index 12ecb1fc29..75b3c66431 100644 --- a/doc/unicharambigs.5.xml +++ b/doc/unicharambigs.5.xml @@ -3,11 +3,14 @@ + + UNICHARAMBIGS(5) + unicharambigs 5 -  -  +  +  unicharambigs diff --git a/doc/unicharset.5 b/doc/unicharset.5 index fd9cccd642..a5924db6e8 100644 --- a/doc/unicharset.5 +++ b/doc/unicharset.5 @@ -1,13 +1,13 @@ '\" t .\" Title: unicharset .\" Author: [see the "AUTHOR" section] -.\" Generator: DocBook XSL Stylesheets v1.75.2 -.\" Date: 02/09/2012 +.\" Generator: DocBook XSL Stylesheets v1.78.1 +.\" Date: 06/12/2015 .\" Manual: \ \& .\" Source: \ \& .\" Language: English .\" -.TH "UNICHARSET" "5" "02/09/2012" "\ \&" "\ \&" +.TH "UNICHARSET" "5" "06/12/2015" "\ \&" "\ \&" .\" ----------------------------------------------------------------- .\" * Define some portability stuff .\" ----------------------------------------------------------------- @@ -214,7 +214,7 @@ The unicharset format first appeared with Tesseract 2\&.00, which was the first .sp tesseract(1), combine_tessdata(1), unicharset_extractor(1) .sp -\m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/wiki/TrainingTesseract3\fR\m[] +\m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TrainingTesseract\fR\m[] .SH "AUTHOR" .sp The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985\-1995) and Google (2006\-present)\&. diff --git a/doc/unicharset.5.asc b/doc/unicharset.5.asc index ed8c602ad3..5b859daa1e 100644 --- a/doc/unicharset.5.asc +++ b/doc/unicharset.5.asc @@ -124,7 +124,7 @@ SEE ALSO -------- tesseract(1), combine_tessdata(1), unicharset_extractor(1) - + AUTHOR diff --git a/doc/unicharset.5.html b/doc/unicharset.5.html index f76bafaffb..0f16c9e5e5 100644 --- a/doc/unicharset.5.html +++ b/doc/unicharset.5.html @@ -2,15 +2,25 @@ "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"> - - + + UNICHARSET(5) - +
+

DESCRIPTION

Tesseract’s unicharset file contains information on each symbol @@ -596,12 +761,12 @@

DESCRIPTION

Each unichar line in the unicharset file (v2+) may have four space-separated fields:

-
'character' 'properties' 'script' 'id'
+
'character' 'properties' 'script' 'id'

Starting with Tesseract v3.02, more information may be given for each unichar:

-
'character' 'properties' 'glyph_metrics' 'script' 'other_case' 'direction' 'mirror' 'normed_form'
+
'character' 'properties' 'glyph_metrics' 'script' 'other_case' 'direction' 'mirror' 'normed_form'

Entries:

@@ -712,15 +877,17 @@

DESCRIPTION

+
+

EXAMPLE (v2)

-
; 10 Common 46
+
; 10 Common 46
 b 3 Latin 59
 W 5 Latin 40
 7 8 Common 66
-= 0 Common 93
+= 0 Common 93

";" is a punctuation character. Its properties are thus represented by the binary number 10000 (10 in hexadecimal).

@@ -736,20 +903,24 @@

EXAMPLE (v2)

binary number 00001 (1 in hexadecimal): they are alphabetic, but neither upper nor lower case.
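
The property values in the v2 example can be decoded as a bit mask. The sketch below assumes the bit order given in unicharset(5), from least to most significant bit: isalpha, islower, isupper, isdigit, ispunctuation. That order is consistent with the five example lines above. The names `PROPERTY_BITS`, `decode_properties` and `parse_v2_line` are hypothetical and chosen only for illustration.

```python
# Assumed bit order (least significant bit first), matching the examples above.
PROPERTY_BITS = ("isalpha", "islower", "isupper", "isdigit", "ispunctuation")


def decode_properties(hex_field):
    """Return the names of the properties set in a hexadecimal mask."""
    mask = int(hex_field, 16)
    return [name for bit, name in enumerate(PROPERTY_BITS) if mask & (1 << bit)]


def parse_v2_line(line):
    """Parse a four-field line: 'character' 'properties' 'script' 'id'."""
    character, properties, script, unichar_id = line.split()
    return {
        "character": character,
        "properties": decode_properties(properties),
        "script": script,
        "id": int(unichar_id),
    }


if __name__ == "__main__":
    for sample in ("; 10 Common 46", "b 3 Latin 59", "W 5 Latin 40",
                   "7 8 Common 66", "= 0 Common 93"):
        print(parse_v2_line(sample))
```

Note that the mask is written in hexadecimal, so the `10` on the ";" line is 16 in decimal (bit 4 only), not ten.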

+
+

EXAMPLE (v3.02)

-
110
+
110
 NULL 0 NULL 0
 N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
 Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y
 1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
 9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9
 a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a
-. . .
+. . .
+
+
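
A full v3.02 line can be split into the eight fields named in the DESCRIPTION section. This is a rough sketch under stated assumptions: the function name `parse_v302_line` is hypothetical, only lines with all eight fields are handled (the short `NULL 0 NULL 0` line is rejected), and the last four fields are kept as strings because their exact interpretation is not shown in this hunk.

```python
FIELDS = ("character", "properties", "glyph_metrics", "script",
          "other_case", "direction", "mirror", "normed_form")


def parse_v302_line(line):
    """Parse one full eight-field unicharset line (v3.02 layout)."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(parts)))
    record = dict(zip(FIELDS, parts))
    # Properties are a hexadecimal bit mask, as in the v2 format.
    record["properties"] = int(record["properties"], 16)
    # Glyph metrics are a comma-separated list of integers.
    record["glyph_metrics"] = [int(v) for v in record["glyph_metrics"].split(",")]
    return record


if __name__ == "__main__":
    print(parse_v302_line("N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N"))
```

The first line of the file (`110` in the example) appears to be the number of entries and would need to be read separately before applying this function to the lines that follow.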

CAVEATS

Although the unicharset reader maintains the ability to read unicharsets @@ -759,6 +930,8 @@

CAVEATS

so changing it without re-generating the others is likely to have dire consequences.

+
+

HISTORY

The unicharset format first appeared with Tesseract 2.00, which was the @@ -766,21 +939,26 @@

HISTORY

contained only the first two fields, and the "ispunctuation" property was absent (punctuation was regarded as "0", as "=" is in the above example).

+
+ +

AUTHOR

The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985-1995) and Google (2006-present).

+

diff --git a/doc/unicharset.5.xml b/doc/unicharset.5.xml index e14f2c1ce1..9ae6257e60 100644 --- a/doc/unicharset.5.xml +++ b/doc/unicharset.5.xml @@ -3,11 +3,14 @@ + + UNICHARSET(5) + unicharset 5 -  -  +  +  unicharset @@ -206,7 +209,7 @@ absent (punctuation was regarded as "0", as "=" is in the above example. SEE ALSO tesseract(1), combine_tessdata(1), unicharset_extractor(1) -http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 +https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract AUTHOR diff --git a/doc/unicharset_extractor.1 b/doc/unicharset_extractor.1 index c3bdf2fce3..ed2040dbfc 100644 --- a/doc/unicharset_extractor.1 +++ b/doc/unicharset_extractor.1 @@ -1,13 +1,13 @@ '\" t .\" Title: unicharset_extractor .\" Author: [see the "AUTHOR" section] -.\" Generator: DocBook XSL Stylesheets v1.75.2 -.\" Date: 02/09/2012 +.\" Generator: DocBook XSL Stylesheets v1.78.1 +.\" Date: 06/12/2015 .\" Manual: \ \& .\" Source: \ \& .\" Language: English .\" -.TH "UNICHARSET_EXTRACTOR" "1" "02/09/2012" "\ \&" "\ \&" +.TH "UNICHARSET_EXTRACTOR" "1" "06/12/2015" "\ \&" "\ \&" .\" ----------------------------------------------------------------- .\" * Define some portability stuff .\" ----------------------------------------------------------------- @@ -57,7 +57,7 @@ If your system supports the wctype functions, these values will be set automatic .sp tesseract(1), unicharset(5) .sp -\m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/wiki/TrainingTesseract3\fR\m[] +\m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TrainingTesseract\fR\m[] .SH "HISTORY" .sp unicharset_extractor first appeared in Tesseract 2\&.00\&. 
diff --git a/doc/unicharset_extractor.1.asc b/doc/unicharset_extractor.1.asc index a331d597e7..c972783a8e 100644 --- a/doc/unicharset_extractor.1.asc +++ b/doc/unicharset_extractor.1.asc @@ -40,7 +40,7 @@ SEE ALSO -------- tesseract(1), unicharset(5) - + HISTORY ------- diff --git a/doc/unicharset_extractor.1.html b/doc/unicharset_extractor.1.html index 8ab1a3a73e..a6ac9e898b 100644 --- a/doc/unicharset_extractor.1.html +++ b/doc/unicharset_extractor.1.html @@ -2,15 +2,25 @@ "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"> - - + + UNICHARSET_EXTRACTOR(1) - +
+

SYNOPSIS

unicharset_extractor [-D dir] FILE

+
+

DESCRIPTION

Tesseract needs to know the set of possible characters it can output. @@ -592,7 +759,7 @@

DESCRIPTION

clustering:

-
unicharset_extractor fontfile_1.box fontfile_2.box ...
+
unicharset_extractor fontfile_1.box fontfile_2.box ...

The unicharset will be put into the file dir/unicharset, or simply ./unicharset if no output directory is provided.

@@ -609,30 +776,39 @@

DESCRIPTION

previous versions by running unicharset_extractor before mftraining and cntraining, and giving the unicharset to mftraining.

+
+ +

HISTORY

unicharset_extractor first appeared in Tesseract 2.00.

+
+

COPYING

Copyright (C) 2006, Google Inc. Licensed under the Apache License, Version 2.0

+
+

AUTHOR

The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985-1995) and Google (2006-present).

+

diff --git a/doc/unicharset_extractor.1.xml b/doc/unicharset_extractor.1.xml index d4a5f766a7..bea4d1e16e 100644 --- a/doc/unicharset_extractor.1.xml +++ b/doc/unicharset_extractor.1.xml @@ -3,11 +3,14 @@ + + UNICHARSET_EXTRACTOR(1) + unicharset_extractor 1 -  -  +  +  unicharset_extractor @@ -41,7 +44,7 @@ cntraining, and giving the unicharset to mftraining. SEE ALSO tesseract(1), unicharset(5) -http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 +https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract HISTORY diff --git a/doc/wordlist2dawg.1 b/doc/wordlist2dawg.1 index 930a782652..4c8cd19e04 100644 --- a/doc/wordlist2dawg.1 +++ b/doc/wordlist2dawg.1 @@ -1,13 +1,13 @@ '\" t .\" Title: wordlist2dawg .\" Author: [see the "AUTHOR" section] -.\" Generator: DocBook XSL Stylesheets v1.75.2 -.\" Date: 02/09/2012 +.\" Generator: DocBook XSL Stylesheets v1.78.1 +.\" Date: 06/12/2015 .\" Manual: \ \& .\" Source: \ \& .\" Language: English .\" -.TH "WORDLIST2DAWG" "1" "02/09/2012" "\ \&" "\ \&" +.TH "WORDLIST2DAWG" "1" "06/12/2015" "\ \&" "\ \&" .\" ----------------------------------------------------------------- .\" * Define some portability stuff .\" ----------------------------------------------------------------- @@ -63,7 +63,7 @@ wordlist2dawg(1) converts a wordlist to a Directed Acyclic Word Graph (DAWG) for .sp tesseract(1), combine_tessdata(1), dawg2wordlist(1) .sp -\m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/wiki/TrainingTesseract3\fR\m[] +\m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TrainingTesseract\fR\m[] .SH "COPYING" .sp Copyright (C) 2006 Google, Inc\&. 
Licensed under the Apache License, Version 2\&.0 diff --git a/doc/wordlist2dawg.1.asc b/doc/wordlist2dawg.1.asc index f0193f14bf..b4f84ad59e 100644 --- a/doc/wordlist2dawg.1.asc +++ b/doc/wordlist2dawg.1.asc @@ -56,7 +56,7 @@ SEE ALSO -------- tesseract(1), combine_tessdata(1), dawg2wordlist(1) - + COPYING ------- diff --git a/doc/wordlist2dawg.1.html b/doc/wordlist2dawg.1.html index a1f72443a2..58e5cab4fa 100644 --- a/doc/wordlist2dawg.1.html +++ b/doc/wordlist2dawg.1.html @@ -2,15 +2,25 @@ "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"> - - + + WORDLIST2DAWG(1) - +
+

SYNOPSIS

wordlist2dawg WORDLIST DAWG lang.unicharset

@@ -588,12 +753,16 @@

SYNOPSIS

wordlist2dawg -r 2 WORDLIST DAWG lang.unicharset

wordlist2dawg -l <short> <long> WORDLIST DAWG lang.unicharset

+
+

DESCRIPTION

wordlist2dawg(1) converts a wordlist to a Directed Acyclic Word Graph (DAWG) for use with Tesseract. A DAWG is a compressed, space and time efficient representation of a word list.

+
+

OPTIONS

-t @@ -606,6 +775,8 @@

OPTIONS

Produce a file with several dawgs in it, one each for words of length <short>, <short+1>,… <long>

+
+

ARGUMENTS

WORDLIST @@ -616,26 +787,33 @@

ARGUMENTS

The unicharset of the language. This is the unicharset generated by mftraining(1).

+
+ +

COPYING

Copyright (C) 2006 Google, Inc. Licensed under the Apache License, Version 2.0

+
+

AUTHOR

The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985-1995) and Google (2006-present).

+

diff --git a/doc/wordlist2dawg.1.xml b/doc/wordlist2dawg.1.xml index cc05a0d155..907d3a574d 100644 --- a/doc/wordlist2dawg.1.xml +++ b/doc/wordlist2dawg.1.xml @@ -3,11 +3,14 @@ + + WORDLIST2DAWG(1) + wordlist2dawg 1 -  -  +  +  wordlist2dawg @@ -51,7 +54,7 @@ efficient representation of a word list. SEE ALSO tesseract(1), combine_tessdata(1), dawg2wordlist(1) -http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 +https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract COPYING