diff --git a/.gitattributes b/.gitattributes index 8d013618abfdd3bda36b5cd7deda43494205c0e2..d02cfa0fd4bf09b0446f3f1df183c5a3eccf2e5a 100644 --- a/.gitattributes +++ b/.gitattributes @@ -34,3 +34,33 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text *.zst filter=lfs diff=lfs merge=lfs -text *tfevents* filter=lfs diff=lfs merge=lfs -text tesseract.exe filter=lfs diff=lfs merge=lfs -text +Tesseract-OCR/ambiguous_words.exe filter=lfs diff=lfs merge=lfs -text +Tesseract-OCR/classifier_tester.exe filter=lfs diff=lfs merge=lfs -text +Tesseract-OCR/cntraining.exe filter=lfs diff=lfs merge=lfs -text +Tesseract-OCR/combine_lang_model.exe filter=lfs diff=lfs merge=lfs -text +Tesseract-OCR/combine_tessdata.exe filter=lfs diff=lfs merge=lfs -text +Tesseract-OCR/libcairo-2.dll filter=lfs diff=lfs merge=lfs -text +Tesseract-OCR/libcrypto-3-x64.dll filter=lfs diff=lfs merge=lfs -text +Tesseract-OCR/libgio-2.0-0.dll filter=lfs diff=lfs merge=lfs -text +Tesseract-OCR/libglib-2.0-0.dll filter=lfs diff=lfs merge=lfs -text +Tesseract-OCR/libharfbuzz-0.dll filter=lfs diff=lfs merge=lfs -text +Tesseract-OCR/libiconv-2.dll filter=lfs diff=lfs merge=lfs -text +Tesseract-OCR/libicudt73.dll filter=lfs diff=lfs merge=lfs -text +Tesseract-OCR/libicuin73.dll filter=lfs diff=lfs merge=lfs -text +Tesseract-OCR/libicuuc73.dll filter=lfs diff=lfs merge=lfs -text +Tesseract-OCR/libjpeg-8.dll filter=lfs diff=lfs merge=lfs -text +Tesseract-OCR/libleptonica-6.dll filter=lfs diff=lfs merge=lfs -text +Tesseract-OCR/libstdc++-6.dll filter=lfs diff=lfs merge=lfs -text +Tesseract-OCR/libtesseract-5.dll filter=lfs diff=lfs merge=lfs -text +Tesseract-OCR/libunistring-5.dll filter=lfs diff=lfs merge=lfs -text +Tesseract-OCR/libzstd.dll filter=lfs diff=lfs merge=lfs -text +Tesseract-OCR/lstmeval.exe filter=lfs diff=lfs merge=lfs -text +Tesseract-OCR/lstmtraining.exe filter=lfs diff=lfs merge=lfs -text +Tesseract-OCR/mftraining.exe filter=lfs diff=lfs merge=lfs -text 
+Tesseract-OCR/set_unicharset_properties.exe filter=lfs diff=lfs merge=lfs -text +Tesseract-OCR/shapeclustering.exe filter=lfs diff=lfs merge=lfs -text +Tesseract-OCR/tessdata/eng.traineddata filter=lfs diff=lfs merge=lfs -text +Tesseract-OCR/tessdata/osd.traineddata filter=lfs diff=lfs merge=lfs -text +Tesseract-OCR/text2image.exe filter=lfs diff=lfs merge=lfs -text +Tesseract-OCR/unicharset_extractor.exe filter=lfs diff=lfs merge=lfs -text +Tesseract-OCR/wordlist2dawg.exe filter=lfs diff=lfs merge=lfs -text diff --git a/Tesseract-OCR/ambiguous_words.1.html b/Tesseract-OCR/ambiguous_words.1.html new file mode 100644 index 0000000000000000000000000000000000000000..0e0c32c9713c669ac40f8973d1937071b896ab5d --- /dev/null +++ b/Tesseract-OCR/ambiguous_words.1.html @@ -0,0 +1,790 @@ + + + + + +AMBIGUOUS_WORDS(1) + + + + + +
+
+

SYNOPSIS

+
+

ambiguous_words [-l lang] TESSDATADIR WORDLIST AMBIGUOUSFILE

+
+
+
+

DESCRIPTION

+
+

ambiguous_words(1) runs Tesseract in a special mode and, for each word +in the word list, produces a set of words which Tesseract thinks might be +ambiguous with it. TESSDATADIR must be set to the absolute path of +a directory containing tessdata/lang.traineddata.

+
+
+
+

SEE ALSO

+
+

tesseract(1)

+
+
+
+

COPYING

+
+

Copyright (C) 2012 Google, Inc. +Licensed under the Apache License, Version 2.0

+
+
+
+

AUTHOR

+
+

The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present).

+
+
+
+

+ + + diff --git a/Tesseract-OCR/ambiguous_words.exe b/Tesseract-OCR/ambiguous_words.exe new file mode 100644 index 0000000000000000000000000000000000000000..f5a3c7c69db338f632c258d27248a42da54fce82 --- /dev/null +++ b/Tesseract-OCR/ambiguous_words.exe @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:3632e3965e30a66e85cc1ca9f77dbc9fa25a23bd255c8f9f28a82f816704b87a +size 1066496 diff --git a/Tesseract-OCR/classifier_tester.1.html b/Tesseract-OCR/classifier_tester.1.html new file mode 100644 index 0000000000000000000000000000000000000000..a00f961fe8e1f23dab0a28b72d4bb404da86f608 --- /dev/null +++ b/Tesseract-OCR/classifier_tester.1.html @@ -0,0 +1,857 @@ + + + + + +CLASSIFIER_TESTER(1) + + + + + +
+
+

NAME

+
+

classifier_tester - tests a character classifier for the legacy Tesseract engine.

+
+
+
+

SYNOPSIS

+
+

classifier_tester -U unicharset_file -F font_properties_file -X xheights_file -classifier x -lang lang [-output_trainer trainer] *.tr

+
+
+
+

DESCRIPTION

+
+

classifier_tester(1) runs Tesseract in a special mode. +It takes a list of .tr files and tests a character classifier +on data as formatted for training, +but it doesn’t have to be the same as the training data.

+
+
+
+

IN/OUT ARGUMENTS

+
+

a list of .tr files

+
+
+
+

OPTIONS

+
+
+
+-l lang +
+
+

+ (Input) three character language code; default value eng. +

+
+
+-classifier x +
+
+

+ (Input) One of "pruner", "full". +

+
+
+-U unicharset_file +
+
+

+ (Input) The unicharset for the language. +

+
+
+-F font_properties_file +
+
+

+ (Input) font properties file, each line is of the following form, where each field other than the font name is 0 or 1: +

+
+
+
*font_name* *italic* *bold* *fixed_pitch* *serif* *fraktur*
+
+
+
+-X xheights_file +
+
+

+ (Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ] +

+
+
+
*font_name* *xheight*
+
+
+
+-output_trainer trainer +
+
+

+ (Output, Optional) Filename for output trainer. +

+
+
+
+
+
+
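As a rough illustration of the two input file formats described under -F and -X, the sketch below parses one font_properties line and reproduces the 32pt at 300 dpi arithmetic behind the figure 133. The names `FontProperties`, `parse_font_properties_line` and `em_height_pixels` are hypothetical helpers, not part of classifier_tester. The sketch assumes the usual 72 points per inch and a font name without embedded whitespace.

```python
from typing import NamedTuple


class FontProperties(NamedTuple):
    """One font_properties entry: a font name followed by five 0/1 flags."""
    font_name: str
    italic: bool
    bold: bool
    fixed_pitch: bool
    serif: bool
    fraktur: bool


def parse_font_properties_line(line: str) -> FontProperties:
    """Parse 'font_name italic bold fixed_pitch serif fraktur' (hypothetical helper)."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    name, *flags = fields
    if any(flag not in ("0", "1") for flag in flags):
        raise ValueError(f"flags must be 0 or 1: {line!r}")
    return FontProperties(name, *(flag == "1" for flag in flags))


def em_height_pixels(point_size: float = 32, dpi: float = 300) -> float:
    """Pixel height of the full em box, assuming 72 points per inch."""
    return point_size * dpi / 72
```

With the defaults, `em_height_pixels()` evaluates to 32 * 300 / 72, roughly 133.3, which is where the 133 in the -X note comes from. The x height itself is the font-dependent share of that box taken up by lowercase letters such as x.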

SEE ALSO

+
+

tesseract(1)

+
+
+
+

COPYING

+
+

Copyright (C) 2012 Google, Inc. +Licensed under the Apache License, Version 2.0

+
+
+
+

AUTHOR

+
+

The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present).

+
+
+
+

+ + + diff --git a/Tesseract-OCR/classifier_tester.exe b/Tesseract-OCR/classifier_tester.exe new file mode 100644 index 0000000000000000000000000000000000000000..bf383d0cdd3eec1f0bcae4ef33a3c928fe8f7491 --- /dev/null +++ b/Tesseract-OCR/classifier_tester.exe @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:0741c8d4a6f3a8187ee24755b0d8893ee13c5da152a5efd77e6585ddb0002918 +size 4987040 diff --git a/Tesseract-OCR/cntraining.1.html b/Tesseract-OCR/cntraining.1.html new file mode 100644 index 0000000000000000000000000000000000000000..69a13f7fd5ed28903269cb62d8c8a0c1ccfc67c5 --- /dev/null +++ b/Tesseract-OCR/cntraining.1.html @@ -0,0 +1,803 @@ + + + + + +CNTRAINING(1) + + + + + +
+
+

NAME

+
+

cntraining - character normalization training for Tesseract

+
+
+
+

SYNOPSIS

+
+

cntraining [-D dir] FILE

+
+
+
+

DESCRIPTION

+
+

cntraining takes a list of .tr files, from which it generates the +normproto data file (the character normalization sensitivity +prototypes).

+
+
+
+

OPTIONS

+
+
+
+-D dir +
+
+

+ Directory to write output files to. +

+
+
+
+
+
+

SEE ALSO

+
+

tesseract(1), shapeclustering(1), mftraining(1)

+ +
+
+
+

COPYING

+
+

Copyright (c) Hewlett-Packard Company, 1988 +Licensed under the Apache License, Version 2.0

+
+
+
+

AUTHOR

+
+

The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present).

+
+
+
+

+ + + diff --git a/Tesseract-OCR/cntraining.exe b/Tesseract-OCR/cntraining.exe new file mode 100644 index 0000000000000000000000000000000000000000..cac934f356f1f1909ffd1929cb68b73fae6254f3 --- /dev/null +++ b/Tesseract-OCR/cntraining.exe @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:15f8b5d76aa6d2dcf40a8c50150d81888f2e3d6b40a601a7d6e5a32ee4bc138f +size 4709776 diff --git a/Tesseract-OCR/combine_lang_model.1.html b/Tesseract-OCR/combine_lang_model.1.html new file mode 100644 index 0000000000000000000000000000000000000000..0f791366017d67b65278c13bff34d8177da72555 --- /dev/null +++ b/Tesseract-OCR/combine_lang_model.1.html @@ -0,0 +1,888 @@ + + + + + +COMBINE_LANG_MODEL(1) + + + + + +
+
+

SYNOPSIS

+
+

combine_lang_model --input_unicharset filename --script_dir dirname --output_dir rootdir --lang lang [--lang_is_rtl] [--pass_through_recoder] [--words file --puncs file --numbers file]

+
+
+
+

DESCRIPTION

+
+

combine_lang_model(1) generates a starter traineddata file that can be used to train an LSTM-based neural network model. It takes as input a unicharset and an optional set of wordlists. It eliminates the need to run set_unicharset_properties(1), wordlist2dawg(1), some non-existent binary to generate the recoder (unicode compressor), and finally combine_tessdata(1).

+
+
+
+

OPTIONS

+
+
+
+--lang lang +
+
+

+ The language to use. + Tesseract uses 3-character ISO 639-2 language codes. (See tesseract(1) for the list of languages.) +

+
+
+--script_dir PATH +
+
+

+ Directory name for input script unicharsets. It should point to the location of langdata (github repo) directory. (type:string default:) +

+
+
+--input_unicharset FILE +
+
+

+ Unicharset to complete and use in encoding. It can be a hand-created file with incomplete fields. Its basic and script properties will be set before it is used. (type:string default:) +

+
+
+--lang_is_rtl BOOL +
+
+

+ True if the language being processed is written right-to-left (e.g. Arabic/Hebrew). (type:bool default:false) +

+
+
+--pass_through_recoder BOOL +
+
+

+ If true, the recoder is a simple pass-through of the unicharset. Otherwise, the recoder may compress the unicharset by encoding Hangul in Jamos, decomposing multi-unicode symbols into sequences of unicodes, and encoding Han using the data in radical_table_data, which must be the content of the file langdata/radical-stroke.txt. (type:bool default:false) +

+
+
+--version_str STRING +
+
+

+ An arbitrary version label to add to traineddata file (type:string default:) +

+
+
+--words FILE +
+
+

+ (Optional) File listing words to use for the system dictionary (type:string default:) +

+
+
+--numbers FILE +
+
+

+ (Optional) File listing number patterns (type:string default:) +

+
+
+--puncs FILE +
+
+

+ (Optional) File listing punctuation patterns. The words/puncs/numbers lists may be all empty. If any are non-empty then puncs must be non-empty. (type:string default:) +

+
+
+--output_dir PATH +
+
+

+ Root directory for output files. Output files will be written to <output_dir>/<lang>/<lang>.* (type:string default:) +

+
+
+
+
+
+
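The constraint on --words/--puncs/--numbers and the output layout described above can be sketched as follows. This is only an illustration under stated assumptions: `check_wordlist_args` and `output_prefix` are hypothetical helpers, not part of combine_lang_model, and an omitted file name is treated here as an empty list.

```python
from pathlib import PurePosixPath
from typing import Optional


def check_wordlist_args(words: Optional[str], puncs: Optional[str],
                        numbers: Optional[str]) -> None:
    """Raise if a word or number list is given without a punctuation list."""
    if (words or numbers) and not puncs:
        raise ValueError(
            "--puncs must be non-empty when --words or --numbers is given")


def output_prefix(output_dir: str, lang: str) -> str:
    """Return the <output_dir>/<lang>/<lang> prefix shared by the output files."""
    return str(PurePosixPath(output_dir) / lang / lang)
```

For example, `output_prefix("out", "eng")` gives `out/eng/eng`, so the output files would be named `out/eng/eng.*`.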

HISTORY

+
+

combine_lang_model(1) was first made available in Tesseract 4.00.00alpha.

+
+
+
+

RESOURCES

+ +
+
+

SEE ALSO

+
+

tesseract(1)

+
+
+
+

COPYING

+
+

Copyright (C) 2012 Google, Inc. +Licensed under the Apache License, Version 2.0

+
+
+
+

AUTHOR

+
+

The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present).

+
+
+
+

+ + + diff --git a/Tesseract-OCR/combine_lang_model.exe b/Tesseract-OCR/combine_lang_model.exe new file mode 100644 index 0000000000000000000000000000000000000000..08d32d4f817f3ea9ed964ee1a5488564f30fa869 --- /dev/null +++ b/Tesseract-OCR/combine_lang_model.exe @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:a48e09839e8df98dd14d8bf34047f0472e5f7d82adbcb6967efea004f3b1fa47 +size 3503232 diff --git a/Tesseract-OCR/combine_tessdata.1.html b/Tesseract-OCR/combine_tessdata.1.html new file mode 100644 index 0000000000000000000000000000000000000000..e132f86a663f0a99fee15b985793eebf44dd9764 --- /dev/null +++ b/Tesseract-OCR/combine_tessdata.1.html @@ -0,0 +1,1070 @@ + + + + + +COMBINE_TESSDATA(1) + + + + + +
+
+

NAME

+
+

combine_tessdata - combine/extract/overwrite/list/compact Tesseract data

+
+
+
+

SYNOPSIS

+
+

combine_tessdata [OPTION] FILE

+
+
+
+

DESCRIPTION

+
+

combine_tessdata(1) is the main program to combine/extract/overwrite/list/compact +tessdata components in [lang].traineddata files.

+

To combine all the individual tessdata components (unicharset, DAWGs, +classifier templates, ambiguities, language configs) located at, say, +/home/$USER/temp/eng.* run:

+
+
+
combine_tessdata /home/$USER/temp/eng.
+
+

The result will be a combined tessdata file /home/$USER/temp/eng.traineddata

+

Specify option -e if you would like to extract individual components +from a combined traineddata file. For example, to extract language config +file and the unicharset from tessdata/eng.traineddata run:

+
+
+
combine_tessdata -e tessdata/eng.traineddata \
+  /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset
+
+

The desired config file and unicharset will be written to +/home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset

+

Specify option -o to overwrite individual components of the given +[lang].traineddata file. For example, to overwrite language config +and unichar ambiguities files in tessdata/eng.traineddata use:

+
+
+
combine_tessdata -o tessdata/eng.traineddata \
+  /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharambigs
+
+

As a result, tessdata/eng.traineddata will contain the new language config +and unichar ambigs, plus all the original DAWGs, classifier templates, etc.

+

Note: the file names of the files to extract to and to overwrite from should +have the appropriate file suffixes (extensions) indicating their tessdata +component type (.unicharset for the unicharset, .unicharambigs for unichar +ambigs, etc). See k*FileSuffix variable in ccutil/tessdatamanager.h.
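To make the suffix rule concrete, here is a minimal sketch that splits a component file name into its prefix and component type. The helper `split_component_path` is hypothetical. The set of names is copied from the COMPONENTS section of this page and may not match the k*FileSuffix table in ccutil/tessdatamanager.h exactly, which remains the authoritative source.

```python
from typing import Tuple

# Component names copied from the COMPONENTS section of this page; the
# authoritative list is the k*FileSuffix table in ccutil/tessdatamanager.h.
KNOWN_COMPONENTS = frozenset({
    "config", "unicharset", "unicharambigs", "inttemp", "pffmtable",
    "normproto", "punc-dawg", "word-dawg", "number-dawg", "freq-dawg",
    "fixed-length-dawgs", "shapetable", "bigram-dawg", "unambig-dawg",
    "params-model", "lstm", "lstm-punc-dawg", "lstm-word-dawg",
    "lstm-number-dawg", "lstm-unicharset", "lstm-recoder", "version",
})


def split_component_path(path: str) -> Tuple[str, str]:
    """Split '/dir/eng.unicharset' into ('/dir/eng.', 'unicharset')."""
    prefix, dot, component = path.rpartition(".")
    if not dot or component not in KNOWN_COMPONENTS:
        raise ValueError(f"no known tessdata component suffix in {path!r}")
    return prefix + dot, component
```

A file such as `eng.txt` would be rejected by this check, because `txt` does not name a tessdata component.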

+

Specify option -u to unpack all the components to the specified path:

+
+
+
combine_tessdata -u tessdata/eng.traineddata /home/$USER/temp/eng.
+
+

This will create /home/$USER/temp/eng.* files with individual tessdata +components from tessdata/eng.traineddata.

+
+
+
+

OPTIONS

+
+

-c .traineddata FILE…: + Compacts the LSTM component in the .traineddata file to int.

+

-d .traineddata FILE…: + Lists directory of components from the .traineddata file.

+

-e .traineddata FILE…: + Extracts the specified components from the .traineddata file.

+

-l .traineddata FILE…: + Lists the network information.

+

-o .traineddata FILE…: + Overwrites the specified components of the .traineddata file + with those provided on the command line.

+

-u .traineddata PATHPREFIX + Unpacks the .traineddata using the provided prefix.

+
+
+
+

CAVEATS

+
+

Prefix refers to the full file prefix, including period (.)
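The caveat is easiest to see by building the output names by hand. The sketch below assumes, as the caveat suggests, that each component name is simply appended to the prefix. `unpacked_paths` is a hypothetical helper and does not call combine_tessdata.

```python
from typing import Iterable, List


def unpacked_paths(prefix: str, components: Iterable[str]) -> List[str]:
    """Name each output file by appending the component name to the prefix."""
    return [prefix + component for component in components]
```

With the prefix `/tmp/eng.` the result for `config` is `/tmp/eng.config`. Dropping the trailing period yields `/tmp/engconfig`, which is rarely what is wanted.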

+
+
+
+

COMPONENTS

+
+

The components in a Tesseract lang.traineddata file as of +Tesseract 4.0 are briefly described below. For more information on +many of these files, see +https://tesseract-ocr.github.io/tessdoc/Training-Tesseract.html +and +https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html

+
+
+lang.config +
+
+

+ (Optional) Language-specific overrides to default config variables. + For 4.0 traineddata files, lang.config provides control parameters which + can affect layout analysis, and sub-languages. +

+
+
+lang.unicharset +
+
+

+ (Required - 3.0x legacy tesseract) The list of symbols that Tesseract recognizes, with properties. + See unicharset(5). +

+
+
+lang.unicharambigs +
+
+

+ (Optional - 3.0x legacy tesseract) This file contains information on pairs of recognized symbols + which are often confused. For example, rn and m. +

+
+
+lang.inttemp +
+
+

+ (Required - 3.0x legacy tesseract) Character shape templates for each unichar. Produced by + mftraining(1). +

+
+
+lang.pffmtable +
+
+

+ (Required - 3.0x legacy tesseract) The number of features expected for each unichar. + Produced by mftraining(1) from .tr files. +

+
+
+lang.normproto +
+
+

+ (Required - 3.0x legacy tesseract) Character normalization prototypes generated by cntraining(1) + from .tr files. +

+
+
+lang.punc-dawg +
+
+

+ (Optional - 3.0x legacy tesseract) A dawg made from punctuation patterns found around words. + The "word" part is replaced by a single space. +

+
+
+lang.word-dawg +
+
+

+ (Optional - 3.0x legacy tesseract) A dawg made from dictionary words from the language. +

+
+
+lang.number-dawg +
+
+

+ (Optional - 3.0x legacy tesseract) A dawg made from tokens which originally contained digits. + Each digit is replaced by a space character. +

+
+
+lang.freq-dawg +
+
+

+ (Optional - 3.0x legacy tesseract) A dawg made from the most frequent words which would have + gone into word-dawg. +

+
+
+lang.fixed-length-dawgs +
+
+

+ (Optional - 3.0x legacy tesseract) Several dawgs of different fixed lengths — useful for + languages like Chinese. +

+
+
+lang.shapetable +
+
+

+ (Optional - 3.0x legacy tesseract) When present, a shapetable is an extra layer between the character + classifier and the word recognizer that allows the character classifier to + return a collection of unichar ids and fonts instead of a single unichar-id + and font. +

+
+
+lang.bigram-dawg +
+
+

+ (Optional - 3.0x legacy tesseract) A dawg of word bigrams where the words are separated by a space + and each digit is replaced by a ?. +

+
+
+lang.unambig-dawg +
+
+

+ (Optional - 3.0x legacy tesseract) . +

+
+
+lang.params-model +
+
+

+ (Optional - 3.0x legacy tesseract) . +

+
+
+lang.lstm +
+
+

+ (Required - 4.0 LSTM) Neural net trained recognition model generated by lstmtraining. +

+
+
+lang.lstm-punc-dawg +
+
+

+ (Optional - 4.0 LSTM) A dawg made from punctuation patterns found around words. + The "word" part is replaced by a single space. Uses lang.lstm-unicharset. +

+
+
+lang.lstm-word-dawg +
+
+

+ (Optional - 4.0 LSTM) A dawg made from dictionary words from the language. + Uses lang.lstm-unicharset. +

+
+
+lang.lstm-number-dawg +
+
+

+ (Optional - 4.0 LSTM) A dawg made from tokens which originally contained digits. + Each digit is replaced by a space character. Uses lang.lstm-unicharset. +

+
+
+lang.lstm-unicharset +
+
+

+ (Required - 4.0 LSTM) The unicode character set that Tesseract recognizes, with properties. + The same unicharset must be used to train the LSTM and build the lstm-*-dawgs files. +

+
+
+lang.lstm-recoder +
+
+

+ (Required - 4.0 LSTM) Unicharcompress, aka the recoder, which maps the unicharset + further to the codes actually used by the neural network recognizer. This is created as + part of the starter traineddata by combine_lang_model. +

+
+
+lang.version +
+
+

+ (Optional) Version string for the traineddata file. + First appeared in version 4.0 of Tesseract. + Older traineddata files report Version:Pre-4.0.0. + Version 4.0 traineddata files may include the network spec + used for LSTM training as part of the version string. +

+
+
+
+
+
+

HISTORY

+
+

combine_tessdata(1) first appeared in version 3.00 of Tesseract.

+
+
+
+

SEE ALSO

+
+

tesseract(1), wordlist2dawg(1), cntraining(1), mftraining(1), unicharset(5), +unicharambigs(5)

+
+
+
+

COPYING

+
+

Copyright (C) 2009, Google Inc. +Licensed under the Apache License, Version 2.0

+
+
+
+

AUTHOR

+
+

The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present).

+
+
+
+

+ + + diff --git a/Tesseract-OCR/combine_tessdata.exe b/Tesseract-OCR/combine_tessdata.exe new file mode 100644 index 0000000000000000000000000000000000000000..a8c329849b17c9568ef7446df171597d10013fde --- /dev/null +++ b/Tesseract-OCR/combine_tessdata.exe @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:8e287a0238aff1dc9f9ecb9f2fde2dba89803a6c37dfdee932cfc0ef3448a075 +size 1280584 diff --git a/Tesseract-OCR/dawg2wordlist.1.html b/Tesseract-OCR/dawg2wordlist.1.html new file mode 100644 index 0000000000000000000000000000000000000000..bac0b5e809f2b5313ea1f5a40f54749180103b87 --- /dev/null +++ b/Tesseract-OCR/dawg2wordlist.1.html @@ -0,0 +1,802 @@ + + + + + +DAWG2WORDLIST(1) + + + + + +
+
+

SYNOPSIS

+
+

dawg2wordlist UNICHARSET DAWG WORDLIST

+
+
+
+

DESCRIPTION

+
+

dawg2wordlist(1) converts a Tesseract Directed Acyclic Word +Graph (DAWG) to a list of words using a unicharset as key.

+
+
+
+

OPTIONS

+
+

UNICHARSET + The unicharset of the language. This is the unicharset + generated by mftraining(1).

+

DAWG + The input DAWG, created by wordlist2dawg(1)

+

WORDLIST + Plain text (output) file in UTF-8, one word per line

+
+
+
+

SEE ALSO

+
+

tesseract(1), mftraining(1), wordlist2dawg(1), unicharset(5), +combine_tessdata(1)

+ +
+
+
+

COPYING

+
+

Copyright (C) 2012 Google, Inc. +Licensed under the Apache License, Version 2.0

+
+
+
+

AUTHOR

+
+

The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present).

+
+
+
+

+ + + diff --git a/Tesseract-OCR/dawg2wordlist.exe b/Tesseract-OCR/dawg2wordlist.exe new file mode 100644 index 0000000000000000000000000000000000000000..61b4a5a3ddbf415cc7a5b8daf0081b5aad18fa44 Binary files /dev/null and b/Tesseract-OCR/dawg2wordlist.exe differ diff --git a/Tesseract-OCR/doc/AUTHORS b/Tesseract-OCR/doc/AUTHORS new file mode 100644 index 0000000000000000000000000000000000000000..9d1e73c3083260ae9485e7d07f6ac1fa271737fb --- /dev/null +++ b/Tesseract-OCR/doc/AUTHORS @@ -0,0 +1,51 @@ +Ray Smith (lead developer) +Ahmad Abdulkader +Rika Antonova +Nicholas Beato +Jeff Breidenbach +Samuel Charron +Phil Cheatle +Simon Crouch +David Eger +Sheelagh Huddleston +Dan Johnson +Rajesh Katikam +Thomas Kielbus +Dar-Shyang Lee +Zongyi (Joe) Liu +Robert Moss +Chris Newton +Michael Reimer +Marius Renn +Raquel Romano +Christy Russon +Shobhit Saxena +Mark Seaman +Faisal Shafait +Hiroshi Takenaka +Ranjith Unnikrishnan +Joern Wanke +Ping Ping Xiu +Andrew Ziem +Oscar Zuniga + +Community Contributors: +Zdenko Podobný (Maintainer) +Jim Regan (Maintainer) +James R Barlow +Stefan Brechtken +Thomas Breuel +Amit Dovev +Martin Ettl +Shree Devi Kumar +Noah Metzger +Tom Morris +Tobias Müller +Egor Pugin +Robert Sachunsky +Raf Schietekat +Sundar M. Vaidya +Robin Watts +Stefan Weil +Nick White +Alexander Zaitsev diff --git a/Tesseract-OCR/doc/LICENSE b/Tesseract-OCR/doc/LICENSE new file mode 100644 index 0000000000000000000000000000000000000000..d645695673349e3947e8e5ae42332d0ac3164cd7 --- /dev/null +++ b/Tesseract-OCR/doc/LICENSE @@ -0,0 +1,202 @@ + + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. 
+ + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. 
+ + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. 
Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative Works, in at least one + of 
the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. + + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. 
Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. Limitation of Liability. In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. 
+ + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. + + Copyright [yyyy] [name of copyright owner] + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. 
diff --git a/Tesseract-OCR/doc/README.md b/Tesseract-OCR/doc/README.md new file mode 100644 index 0000000000000000000000000000000000000000..cbd9af46ccf112e3f40509322022fc8427d1a562 --- /dev/null +++ b/Tesseract-OCR/doc/README.md @@ -0,0 +1,134 @@ +# Tesseract OCR + +[![Build status](https://ci.appveyor.com/api/projects/status/miah0ikfsf0j3819/branch/master?svg=true)](https://ci.appveyor.com/project/zdenop/tesseract/) +[![Build status](https://github.com/tesseract-ocr/tesseract/workflows/sw/badge.svg)](https://github.com/tesseract-ocr/tesseract/actions/workflows/sw.yml)\ +[![Coverity Scan Build Status](https://scan.coverity.com/projects/tesseract-ocr/badge.svg)](https://scan.coverity.com/projects/tesseract-ocr) +[![CodeQL](https://github.com/tesseract-ocr/tesseract/workflows/CodeQL/badge.svg)](https://github.com/tesseract-ocr/tesseract/security/code-scanning) +[![OSS-Fuzz](https://img.shields.io/badge/oss--fuzz-fuzzing-brightgreen)](https://bugs.chromium.org/p/oss-fuzz/issues/list?sort=-opened&can=2&q=proj:tesseract-ocr) +\ +[![GitHub license](https://img.shields.io/badge/license-Apache--2.0-blue.svg)](https://raw.githubusercontent.com/tesseract-ocr/tesseract/main/LICENSE) +[![Downloads](https://img.shields.io/badge/download-all%20releases-brightgreen.svg)](https://github.com/tesseract-ocr/tesseract/releases/) + +## Table of Contents + +* [Tesseract OCR](#tesseract-ocr) + * [About](#about) + * [Brief history](#brief-history) + * [Installing Tesseract](#installing-tesseract) + * [Running Tesseract](#running-tesseract) + * [For developers](#for-developers) + * [Support](#support) + * [License](#license) + * [Dependencies](#dependencies) + * [Latest Version of README](#latest-version-of-readme) + +## About + +This package contains an **OCR engine** - `libtesseract` and a **command line program** - `tesseract`. 
+ +Tesseract 4 adds a new neural net (LSTM) based [OCR engine](https://en.wikipedia.org/wiki/Optical_character_recognition) which is focused on line recognition, but also still supports the legacy Tesseract OCR engine of Tesseract 3 which works by recognizing character patterns. Compatibility with Tesseract 3 is enabled by using the Legacy OCR Engine mode (--oem 0). +It also needs [traineddata](https://tesseract-ocr.github.io/tessdoc/Data-Files.html) files which support the legacy engine, for example those from the [tessdata](https://github.com/tesseract-ocr/tessdata) repository. + +Stefan Weil is the current lead developer. Ray Smith was the lead developer until 2018. The maintainer is Zdenko Podobny. For a list of contributors see [AUTHORS](https://github.com/tesseract-ocr/tesseract/blob/main/AUTHORS) +and GitHub's log of [contributors](https://github.com/tesseract-ocr/tesseract/graphs/contributors). + +Tesseract has **unicode (UTF-8) support**, and can **recognize [more than 100 languages](https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html)** "out of the box". + +Tesseract supports **[various image formats](https://tesseract-ocr.github.io/tessdoc/InputFormats)** including PNG, JPEG and TIFF. + +Tesseract supports **various output formats**: plain text, hOCR (HTML), PDF, invisible-text-only PDF, TSV and ALTO (the last one - since version 4.1.0). + +You should note that in many cases, in order to get better OCR results, you'll need to **[improve the quality](https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html) of the image** you are giving Tesseract. + +This project **does not include a GUI application**. If you need one, please see the [3rdParty](https://tesseract-ocr.github.io/tessdoc/User-Projects-%E2%80%93-3rdParty.html) documentation. + +Tesseract **can be trained to recognize other languages**. +See [Tesseract Training](https://tesseract-ocr.github.io/tessdoc/Training-Tesseract.html) for more information. 
+ +## Brief history + +Tesseract was originally developed at Hewlett-Packard Laboratories Bristol UK and at Hewlett-Packard Co, Greeley Colorado USA between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998. In 2005 Tesseract was open sourced by HP. From 2006 until November 2018 it was developed by Google. + +Major version 5 is the current stable version and started with release +[5.0.0](https://github.com/tesseract-ocr/tesseract/releases/tag/5.0.0) on November 30, 2021. Newer minor versions and bugfix versions are available from +[GitHub](https://github.com/tesseract-ocr/tesseract/releases/). + +Latest source code is available from [main branch on GitHub](https://github.com/tesseract-ocr/tesseract/tree/main). +Open issues can be found in [issue tracker](https://github.com/tesseract-ocr/tesseract/issues), +and [planning documentation](https://tesseract-ocr.github.io/tessdoc/Planning.html). + +See **[Release Notes](https://tesseract-ocr.github.io/tessdoc/ReleaseNotes.html)** +and **[Change Log](https://github.com/tesseract-ocr/tesseract/blob/main/ChangeLog)** for more details of the releases. + +## Installing Tesseract + +You can either [Install Tesseract via pre-built binary package](https://tesseract-ocr.github.io/tessdoc/Installation.html) +or [build it from source](https://tesseract-ocr.github.io/tessdoc/Compiling.html). + +A C++ compiler with good C++17 support is required for building Tesseract from source. + +## Running Tesseract + +Basic **[command line usage](https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html)**: + + tesseract imagename outputbase [-l lang] [--oem ocrenginemode] [--psm pagesegmode] [configfiles...] + +For more information about the various command line options use `tesseract --help` or `man tesseract`. + +Examples can be found in the [documentation](https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html#simplest-invocation-to-ocr-an-image). 
+ +## For developers + +Developers can use `libtesseract` [C](https://github.com/tesseract-ocr/tesseract/blob/main/include/tesseract/capi.h) or +[C++](https://github.com/tesseract-ocr/tesseract/blob/main/include/tesseract/baseapi.h) API to build their own application. If you need bindings to `libtesseract` for other programming languages, please see the +[wrapper](https://tesseract-ocr.github.io/tessdoc/AddOns.html#tesseract-wrappers) section in the AddOns documentation. + +Documentation of Tesseract generated from source code by doxygen can be found on [tesseract-ocr.github.io](https://tesseract-ocr.github.io/). + +## Support + +Before you submit an issue, please review **[the guidelines for this repository](https://github.com/tesseract-ocr/tesseract/blob/main/CONTRIBUTING.md)**. + +For support, first read the [documentation](https://tesseract-ocr.github.io/tessdoc/), +particularly the [FAQ](https://tesseract-ocr.github.io/tessdoc/FAQ.html) to see if your problem is addressed there. +If not, search the [Tesseract user forum](https://groups.google.com/g/tesseract-ocr), the [Tesseract developer forum](https://groups.google.com/g/tesseract-dev) and [past issues](https://github.com/tesseract-ocr/tesseract/issues), and if you still can't find what you need, ask for support in the mailing-lists. + +Mailing-lists: + +* [tesseract-ocr](https://groups.google.com/g/tesseract-ocr) - For tesseract users. +* [tesseract-dev](https://groups.google.com/g/tesseract-dev) - For tesseract developers. + +Please report an issue only for a **bug**, not for asking questions. + +## License + + The code in this repository is licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. 
+ You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. + +**NOTE**: This software depends on other packages that may be licensed under different open source licenses. + +Tesseract uses [Leptonica library](http://leptonica.com/) which essentially +uses a [BSD 2-clause license](http://leptonica.com/about-the-license.html). + +## Dependencies + +Tesseract uses [Leptonica library](https://github.com/DanBloomberg/leptonica) +for opening input images (e.g. not documents like pdf). +It is suggested to use leptonica with built-in support for [zlib](https://zlib.net), +[png](https://sourceforge.net/projects/libpng) and +[tiff](http://www.simplesystems.org/libtiff) (for multipage tiff). 
+ +## Latest Version of README + +For the latest online version of the README.md see: + + diff --git a/Tesseract-OCR/libLerc.dll b/Tesseract-OCR/libLerc.dll new file mode 100644 index 0000000000000000000000000000000000000000..1c089e20ba3aa0a64a9af9fc49db564eab85a52d Binary files /dev/null and b/Tesseract-OCR/libLerc.dll differ diff --git a/Tesseract-OCR/libarchive-13.dll b/Tesseract-OCR/libarchive-13.dll new file mode 100644 index 0000000000000000000000000000000000000000..aac9601980571e9a542cbb2e0696f198974a1ace Binary files /dev/null and b/Tesseract-OCR/libarchive-13.dll differ diff --git a/Tesseract-OCR/libb2-1.dll b/Tesseract-OCR/libb2-1.dll new file mode 100644 index 0000000000000000000000000000000000000000..1c1166a758d115949698f7ca4f48fb1287d37f4f Binary files /dev/null and b/Tesseract-OCR/libb2-1.dll differ diff --git a/Tesseract-OCR/libbrotlicommon.dll b/Tesseract-OCR/libbrotlicommon.dll new file mode 100644 index 0000000000000000000000000000000000000000..6e49a79e514c0758a4d0a845f70924c6c2170299 Binary files /dev/null and b/Tesseract-OCR/libbrotlicommon.dll differ diff --git a/Tesseract-OCR/libbrotlidec.dll b/Tesseract-OCR/libbrotlidec.dll new file mode 100644 index 0000000000000000000000000000000000000000..c67e14b5053dd81bd35747db1e90b2f7798de3ae Binary files /dev/null and b/Tesseract-OCR/libbrotlidec.dll differ diff --git a/Tesseract-OCR/libbz2-1.dll b/Tesseract-OCR/libbz2-1.dll new file mode 100644 index 0000000000000000000000000000000000000000..6f7db4827a41fef8a3aaa03efa31e72be3b7d366 Binary files /dev/null and b/Tesseract-OCR/libbz2-1.dll differ diff --git a/Tesseract-OCR/libcairo-2.dll b/Tesseract-OCR/libcairo-2.dll new file mode 100644 index 0000000000000000000000000000000000000000..8ce472e6fb219127c528ed5a93122101ff8951c0 --- /dev/null +++ b/Tesseract-OCR/libcairo-2.dll @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:0ff56b86aa29a01c55bfe2b2bb4647914bcb567e62a8c32c863b92bb8d504718 +size 1206005 diff --git 
a/Tesseract-OCR/libcrypto-3-x64.dll b/Tesseract-OCR/libcrypto-3-x64.dll new file mode 100644 index 0000000000000000000000000000000000000000..10b391a648448612585b10a656402181fcb06ced --- /dev/null +++ b/Tesseract-OCR/libcrypto-3-x64.dll @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:6243b25e268aacba8ae0d0cbee2fb3d10f61a53ab9e2b0ca5116b3f1f3e2cd3a +size 4901263 diff --git a/Tesseract-OCR/libcurl-4.dll b/Tesseract-OCR/libcurl-4.dll new file mode 100644 index 0000000000000000000000000000000000000000..0a035dc59251669873da70410bee893b51f55e60 Binary files /dev/null and b/Tesseract-OCR/libcurl-4.dll differ diff --git a/Tesseract-OCR/libdatrie-1.dll b/Tesseract-OCR/libdatrie-1.dll new file mode 100644 index 0000000000000000000000000000000000000000..672e29208adc991593fcef2efbafc2728157cbc3 Binary files /dev/null and b/Tesseract-OCR/libdatrie-1.dll differ diff --git a/Tesseract-OCR/libdeflate.dll b/Tesseract-OCR/libdeflate.dll new file mode 100644 index 0000000000000000000000000000000000000000..77b7131dc1f49c342276b3034652d83f7ab129ef Binary files /dev/null and b/Tesseract-OCR/libdeflate.dll differ diff --git a/Tesseract-OCR/libexpat-1.dll b/Tesseract-OCR/libexpat-1.dll new file mode 100644 index 0000000000000000000000000000000000000000..f3ea25d8ede42f8d856ddbce4ed2f4a271d37a38 Binary files /dev/null and b/Tesseract-OCR/libexpat-1.dll differ diff --git a/Tesseract-OCR/libffi-8.dll b/Tesseract-OCR/libffi-8.dll new file mode 100644 index 0000000000000000000000000000000000000000..ab5b22547dcacc8f1f938ab7201c6b12e8c8090f Binary files /dev/null and b/Tesseract-OCR/libffi-8.dll differ diff --git a/Tesseract-OCR/libfontconfig-1.dll b/Tesseract-OCR/libfontconfig-1.dll new file mode 100644 index 0000000000000000000000000000000000000000..fcd9d04cd2d5643be48bcc8415f0f53318740500 Binary files /dev/null and b/Tesseract-OCR/libfontconfig-1.dll differ diff --git a/Tesseract-OCR/libfreetype-6.dll b/Tesseract-OCR/libfreetype-6.dll new file mode 100644 index 
0000000000000000000000000000000000000000..0b4250e04bb19668d9d5723f7d8bb11184904458 Binary files /dev/null and b/Tesseract-OCR/libfreetype-6.dll differ diff --git a/Tesseract-OCR/libfribidi-0.dll b/Tesseract-OCR/libfribidi-0.dll new file mode 100644 index 0000000000000000000000000000000000000000..19b30d590a03b49b091c48a4f619a8fd5d80a20e Binary files /dev/null and b/Tesseract-OCR/libfribidi-0.dll differ diff --git a/Tesseract-OCR/libgcc_s_seh-1.dll b/Tesseract-OCR/libgcc_s_seh-1.dll new file mode 100644 index 0000000000000000000000000000000000000000..b587b37c3335d0438417a6df1921f017933ed5a8 Binary files /dev/null and b/Tesseract-OCR/libgcc_s_seh-1.dll differ diff --git a/Tesseract-OCR/libgif-7.dll b/Tesseract-OCR/libgif-7.dll new file mode 100644 index 0000000000000000000000000000000000000000..6831ad0a2c215a1823aa74521bf86b2b490eecda Binary files /dev/null and b/Tesseract-OCR/libgif-7.dll differ diff --git a/Tesseract-OCR/libgio-2.0-0.dll b/Tesseract-OCR/libgio-2.0-0.dll new file mode 100644 index 0000000000000000000000000000000000000000..f464ee0d833af8fe50c40d312fa3ecc93462ab45 --- /dev/null +++ b/Tesseract-OCR/libgio-2.0-0.dll @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:62d86d157e9f4d51404f01793e2680685e8ecb9e360df0572be25a3229786b5c +size 1813053 diff --git a/Tesseract-OCR/libglib-2.0-0.dll b/Tesseract-OCR/libglib-2.0-0.dll new file mode 100644 index 0000000000000000000000000000000000000000..610910561374cb89ed9d235c0180e4a76422564a --- /dev/null +++ b/Tesseract-OCR/libglib-2.0-0.dll @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:b2ea54b809da30786c7c5718ef71b8b53a9bf141cf0143ef24955599031d0a38 +size 1437504 diff --git a/Tesseract-OCR/libgmodule-2.0-0.dll b/Tesseract-OCR/libgmodule-2.0-0.dll new file mode 100644 index 0000000000000000000000000000000000000000..16e7ea3554613b6ee232c13c6dcdc5e3fa9471f2 Binary files /dev/null and b/Tesseract-OCR/libgmodule-2.0-0.dll differ diff --git 
a/Tesseract-OCR/libgobject-2.0-0.dll b/Tesseract-OCR/libgobject-2.0-0.dll new file mode 100644 index 0000000000000000000000000000000000000000..ed8e5f7cb7eb71941676761a8d1581d4c15b0aa3 Binary files /dev/null and b/Tesseract-OCR/libgobject-2.0-0.dll differ diff --git a/Tesseract-OCR/libgraphite2.dll b/Tesseract-OCR/libgraphite2.dll new file mode 100644 index 0000000000000000000000000000000000000000..9b5178ac8ab3cc3656cad59e95d56f41f888310d Binary files /dev/null and b/Tesseract-OCR/libgraphite2.dll differ diff --git a/Tesseract-OCR/libharfbuzz-0.dll b/Tesseract-OCR/libharfbuzz-0.dll new file mode 100644 index 0000000000000000000000000000000000000000..f6e1709f221b0764d0b7e79d5f46c7f5137f0001 --- /dev/null +++ b/Tesseract-OCR/libharfbuzz-0.dll @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:22e440b207b741feae2fe02059ff5bf1d392b2880095fe095113a5efc39b0a58 +size 1265121 diff --git a/Tesseract-OCR/libiconv-2.dll b/Tesseract-OCR/libiconv-2.dll new file mode 100644 index 0000000000000000000000000000000000000000..65460fc9625621432874c2d672f6b4576995a0ba --- /dev/null +++ b/Tesseract-OCR/libiconv-2.dll @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:55de11531dc0e566cb91f26e48d1301a161a4b8b24abed42304d711412368760 +size 1117178 diff --git a/Tesseract-OCR/libicudt73.dll b/Tesseract-OCR/libicudt73.dll new file mode 100644 index 0000000000000000000000000000000000000000..e0fb37918c8d84e63a1915c128a3e23b83b2e4b4 --- /dev/null +++ b/Tesseract-OCR/libicudt73.dll @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:a81f05d574cba8eecd07f41e0b2cfba0286d46478cd9d5b9f1335c603ad27770 +size 32043526 diff --git a/Tesseract-OCR/libicuin73.dll b/Tesseract-OCR/libicuin73.dll new file mode 100644 index 0000000000000000000000000000000000000000..85de98a36805cf3f1b418955298d2c6fc89cf23f --- /dev/null +++ b/Tesseract-OCR/libicuin73.dll @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid 
sha256:9752124678a9881d17dc79db725f7dff487bb440326a744c94690808371d427f +size 2945634 diff --git a/Tesseract-OCR/libicuuc73.dll b/Tesseract-OCR/libicuuc73.dll new file mode 100644 index 0000000000000000000000000000000000000000..4873e3ddf1d290e6569e0c9f963981348e10391c --- /dev/null +++ b/Tesseract-OCR/libicuuc73.dll @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:aa2392540795ee6b06c36a071f80881e59f0e186b22130a16f2c741805bace88 +size 1820576 diff --git a/Tesseract-OCR/libidn2-0.dll b/Tesseract-OCR/libidn2-0.dll new file mode 100644 index 0000000000000000000000000000000000000000..f9b07cba9e59e75f73ea684830fae6ca8e7cb37f Binary files /dev/null and b/Tesseract-OCR/libidn2-0.dll differ diff --git a/Tesseract-OCR/libintl-8.dll b/Tesseract-OCR/libintl-8.dll new file mode 100644 index 0000000000000000000000000000000000000000..b7ca4f22fefde01fb4a24db89f31fe1be17d09b1 Binary files /dev/null and b/Tesseract-OCR/libintl-8.dll differ diff --git a/Tesseract-OCR/libjbig-0.dll b/Tesseract-OCR/libjbig-0.dll new file mode 100644 index 0000000000000000000000000000000000000000..9bee42fd5a922d62660c5d71a53c2ad830c7c019 Binary files /dev/null and b/Tesseract-OCR/libjbig-0.dll differ diff --git a/Tesseract-OCR/libjpeg-8.dll b/Tesseract-OCR/libjpeg-8.dll new file mode 100644 index 0000000000000000000000000000000000000000..2a414937cac2f39e7c8ee5c418c243ed688a8ffe --- /dev/null +++ b/Tesseract-OCR/libjpeg-8.dll @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:74255432d7cb982178a28a54f8efc56e675163f608da388fd52cf7ccb5bd4071 +size 1071435 diff --git a/Tesseract-OCR/libleptonica-6.dll b/Tesseract-OCR/libleptonica-6.dll new file mode 100644 index 0000000000000000000000000000000000000000..53f790a8670df0097c0bbc878161f013600f4800 --- /dev/null +++ b/Tesseract-OCR/libleptonica-6.dll @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:97e00d7858e1f8a07ad098a951ef6569268a2234a0b76033ce3ef201e81b0645 +size 2698960 diff --git 
a/Tesseract-OCR/liblz4.dll b/Tesseract-OCR/liblz4.dll new file mode 100644 index 0000000000000000000000000000000000000000..df85958c15533b23ab75ce58bd42c89ff26ce0e2 Binary files /dev/null and b/Tesseract-OCR/liblz4.dll differ diff --git a/Tesseract-OCR/liblzma-5.dll b/Tesseract-OCR/liblzma-5.dll new file mode 100644 index 0000000000000000000000000000000000000000..187428a944422a10b5119739abe0758bf08de4cd Binary files /dev/null and b/Tesseract-OCR/liblzma-5.dll differ diff --git a/Tesseract-OCR/libopenjp2-7.dll b/Tesseract-OCR/libopenjp2-7.dll new file mode 100644 index 0000000000000000000000000000000000000000..c34b1ebfc22d2a8278e981d236ac8ab82c2e727e Binary files /dev/null and b/Tesseract-OCR/libopenjp2-7.dll differ diff --git a/Tesseract-OCR/libpango-1.0-0.dll b/Tesseract-OCR/libpango-1.0-0.dll new file mode 100644 index 0000000000000000000000000000000000000000..2a3fd2bb6f06ea5e8780a69f135b417a4481503f Binary files /dev/null and b/Tesseract-OCR/libpango-1.0-0.dll differ diff --git a/Tesseract-OCR/libpangocairo-1.0-0.dll b/Tesseract-OCR/libpangocairo-1.0-0.dll new file mode 100644 index 0000000000000000000000000000000000000000..dd3cdca925be11d502399b20718a6bb4071a1c8e Binary files /dev/null and b/Tesseract-OCR/libpangocairo-1.0-0.dll differ diff --git a/Tesseract-OCR/libpangoft2-1.0-0.dll b/Tesseract-OCR/libpangoft2-1.0-0.dll new file mode 100644 index 0000000000000000000000000000000000000000..2517300351ec3c895a4e5b50dcacb46477145d7d Binary files /dev/null and b/Tesseract-OCR/libpangoft2-1.0-0.dll differ diff --git a/Tesseract-OCR/libpangowin32-1.0-0.dll b/Tesseract-OCR/libpangowin32-1.0-0.dll new file mode 100644 index 0000000000000000000000000000000000000000..149ece90e111642f5ea3ecd46c00ae067569d587 Binary files /dev/null and b/Tesseract-OCR/libpangowin32-1.0-0.dll differ diff --git a/Tesseract-OCR/libpcre2-8-0.dll b/Tesseract-OCR/libpcre2-8-0.dll new file mode 100644 index 0000000000000000000000000000000000000000..4f87a28e21da495e779c15af7d272401fa91782f Binary 
files /dev/null and b/Tesseract-OCR/libpcre2-8-0.dll differ diff --git a/Tesseract-OCR/libpixman-1-0.dll b/Tesseract-OCR/libpixman-1-0.dll new file mode 100644 index 0000000000000000000000000000000000000000..491bde2fe8f7f2cb3ac807ea16a40100949e31eb Binary files /dev/null and b/Tesseract-OCR/libpixman-1-0.dll differ diff --git a/Tesseract-OCR/libpng16-16.dll b/Tesseract-OCR/libpng16-16.dll new file mode 100644 index 0000000000000000000000000000000000000000..392f62a194032dbb5bf4d6fa6fcaf6b250e49e11 Binary files /dev/null and b/Tesseract-OCR/libpng16-16.dll differ diff --git a/Tesseract-OCR/libpsl-5.dll b/Tesseract-OCR/libpsl-5.dll new file mode 100644 index 0000000000000000000000000000000000000000..d0e0242459271cf8701fb288e9365a0a535e70f8 Binary files /dev/null and b/Tesseract-OCR/libpsl-5.dll differ diff --git a/Tesseract-OCR/libsharpyuv-0.dll b/Tesseract-OCR/libsharpyuv-0.dll new file mode 100644 index 0000000000000000000000000000000000000000..2bf0fa5edf1f3c32bf8d7ee05c4c038d7ab867dd Binary files /dev/null and b/Tesseract-OCR/libsharpyuv-0.dll differ diff --git a/Tesseract-OCR/libssh2-1.dll b/Tesseract-OCR/libssh2-1.dll new file mode 100644 index 0000000000000000000000000000000000000000..2860272c6336c51479779327a40e5014f35f5455 Binary files /dev/null and b/Tesseract-OCR/libssh2-1.dll differ diff --git a/Tesseract-OCR/libstdc++-6.dll b/Tesseract-OCR/libstdc++-6.dll new file mode 100644 index 0000000000000000000000000000000000000000..23a8a15aa37b0b9430fca8577341eb737b382953 --- /dev/null +++ b/Tesseract-OCR/libstdc++-6.dll @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:515c1cc0163eb5a5294bf18b4980019cb1c7e7279830048b4cc81db8a2736770 +size 2266606 diff --git a/Tesseract-OCR/libtesseract-5.dll b/Tesseract-OCR/libtesseract-5.dll new file mode 100644 index 0000000000000000000000000000000000000000..d825be398f5ba9e35f84ac6ab4ac13132b3ecc74 --- /dev/null +++ b/Tesseract-OCR/libtesseract-5.dll @@ -0,0 +1,3 @@ +version 
https://git-lfs.github.com/spec/v1 +oid sha256:a4c9104ec0d74371b34c91a7131a43baaf23ad2fbeb964253ca12bc88a43f62a +size 99256240 diff --git a/Tesseract-OCR/libthai-0.dll b/Tesseract-OCR/libthai-0.dll new file mode 100644 index 0000000000000000000000000000000000000000..b931aad412604d3f4ae88b369212af7af8c3156a Binary files /dev/null and b/Tesseract-OCR/libthai-0.dll differ diff --git a/Tesseract-OCR/libtiff-6.dll b/Tesseract-OCR/libtiff-6.dll new file mode 100644 index 0000000000000000000000000000000000000000..ffbfe6d3cf9b17d9a41073943d3324f4719e665c Binary files /dev/null and b/Tesseract-OCR/libtiff-6.dll differ diff --git a/Tesseract-OCR/libunistring-5.dll b/Tesseract-OCR/libunistring-5.dll new file mode 100644 index 0000000000000000000000000000000000000000..eb931731b3f7b744fb193a4666193e4082419c06 --- /dev/null +++ b/Tesseract-OCR/libunistring-5.dll @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:cd08541db31221a2e26238cd6a2bcd6a9ebb9a4c4205c09b1b2050fde409f284 +size 1989559 diff --git a/Tesseract-OCR/libwebp-7.dll b/Tesseract-OCR/libwebp-7.dll new file mode 100644 index 0000000000000000000000000000000000000000..6257b8c4ecad12e2c1a965125a8ba09897e18dbc Binary files /dev/null and b/Tesseract-OCR/libwebp-7.dll differ diff --git a/Tesseract-OCR/libwebpmux-3.dll b/Tesseract-OCR/libwebpmux-3.dll new file mode 100644 index 0000000000000000000000000000000000000000..08ce8d9dfde45e9087c10a35f3063e01d4d4004a Binary files /dev/null and b/Tesseract-OCR/libwebpmux-3.dll differ diff --git a/Tesseract-OCR/libwinpthread-1.dll b/Tesseract-OCR/libwinpthread-1.dll new file mode 100644 index 0000000000000000000000000000000000000000..5ebc4e66fdbc37bef5671ec05ca06997ad8549f0 Binary files /dev/null and b/Tesseract-OCR/libwinpthread-1.dll differ diff --git a/Tesseract-OCR/libzstd.dll b/Tesseract-OCR/libzstd.dll new file mode 100644 index 0000000000000000000000000000000000000000..591427c8145246a04348cb40e095e8551918c02a --- /dev/null +++ b/Tesseract-OCR/libzstd.dll @@ 
-0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:03fbbf1f2cb5e6be9a9b57dae4d9af403f831a13618eefd6ca8b1271c28a5b27 +size 1165512 diff --git a/Tesseract-OCR/lstmeval.1.html b/Tesseract-OCR/lstmeval.1.html new file mode 100644 index 0000000000000000000000000000000000000000..b982bd5b0257d281a51bc4c7085a2e90870b2cf5 --- /dev/null +++ b/Tesseract-OCR/lstmeval.1.html @@ -0,0 +1,847 @@ + + + + + +LSTMEVAL(1) + + + + + +
+
+

SYNOPSIS

+
+

lstmeval --model lang.lstm|modelname_checkpoint|modelname_N.NN_NN_NN.checkpoint [--traineddata lang/lang.traineddata] --eval_listfile lang.eval_files.txt [--verbosity N] [--max_image_MB NNNN]

+
+
+
+

DESCRIPTION

+
+

lstmeval(1) evaluates LSTM-based networks. Either a recognition model or a training checkpoint can be given as input for evaluation along with a list of lstmf files. If evaluating a training checkpoint, --traineddata should also be specified. Intermediate training checkpoints can also be used.
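
As a minimal sketch of how the options in the synopsis fit together, the following Python helper assembles an lstmeval argument list for evaluating a training checkpoint. The helper name `build_lstmeval_args` and all file paths are hypothetical placeholders; only the flag names and the 0-2 verbosity range come from this page. The command is built and printed, not executed.

```python
from typing import List, Optional


def build_lstmeval_args(
    model: str,
    eval_listfile: str,
    traineddata: Optional[str] = None,
    verbosity: Optional[int] = None,
    max_image_mb: Optional[int] = None,
) -> List[str]:
    """Assemble an lstmeval command line (hypothetical helper, not part of Tesseract)."""
    # The OPTIONS section documents verbosity as a value in the range 0-2.
    if verbosity is not None and not 0 <= verbosity <= 2:
        raise ValueError("verbosity must be in the range 0-2")
    args = ["lstmeval", "--model", model, "--eval_listfile", eval_listfile]
    # --traineddata is only needed when the model is a training checkpoint.
    if traineddata is not None:
        args += ["--traineddata", traineddata]
    if verbosity is not None:
        args += ["--verbosity", str(verbosity)]
    if max_image_mb is not None:
        args += ["--max_image_MB", str(max_image_mb)]
    return args


# Placeholder paths; substitute the files produced by your own training run.
checkpoint_args = build_lstmeval_args(
    model="train_output_dir/newlstmmodel_checkpoint",
    eval_listfile="lang.eval_files.txt",
    traineddata="lang/lang.traineddata",
)
print(" ".join(checkpoint_args))
```

Returning a list rather than a single string means the result can be passed to a process launcher without shell quoting concerns.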

+
+
+
+

OPTIONS

+
+
+
+--model FILE +
+
+

+ Name of model file (training or recognition) (type:string default:) +

+
+
+--traineddata FILE +
+
+

+ If model is a training checkpoint, then traineddata must be the traineddata file that was given to the trainer (type:string default:) +

+
+
+--eval_listfile FILE +
+
+

+ File listing sample files in lstmf training format. (type:string default:) +

+
+
+--max_image_MB INT +
+
+

+ Max memory to use for images. (type:int default:2000) +

+
+
+--verbosity INT +
+
+

+ Amount of diagnostic information to output (0-2). (type:int default:1) +

+
+
+
+
+
+

HISTORY

+
+

lstmeval(1) was first made available for Tesseract 4.00.00alpha.

+
+
+
+

RESOURCES

+ +
+
+

SEE ALSO

+
+

tesseract(1)

+
+
+
+

COPYING

+
+

Copyright (C) 2012 Google, Inc. +Licensed under the Apache License, Version 2.0

+
+
+
+

AUTHOR

+
+

The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present).

+
+
+
+

+ + + diff --git a/Tesseract-OCR/lstmeval.exe b/Tesseract-OCR/lstmeval.exe new file mode 100644 index 0000000000000000000000000000000000000000..99ef42b606ecae00a1f6c8f4167305f256c87ea4 --- /dev/null +++ b/Tesseract-OCR/lstmeval.exe @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:896cfd770ad7b279668e52c95b136e9196bb36c74d5edfb37e0d688d5cbff6e3 +size 8597448 diff --git a/Tesseract-OCR/lstmtraining.1.html b/Tesseract-OCR/lstmtraining.1.html new file mode 100644 index 0000000000000000000000000000000000000000..1da64ba678b493320283ed087e489a6176501950 --- /dev/null +++ b/Tesseract-OCR/lstmtraining.1.html @@ -0,0 +1,999 @@ + + + + + +LSTMTRAINING(1) + + + + + +
+
+

SYNOPSIS

+
+

lstmtraining + --continue_from train_output_dir/continue_from_lang.lstm + --old_traineddata bestdata_dir/continue_from_lang.traineddata + --traineddata train_output_dir/lang/lang.traineddata + --max_iterations NNN + --debug_interval 0|-1 + --train_listfile train_output_dir/lang.training_files.txt + --model_output train_output_dir/newlstmmodel

+
+
+
+

DESCRIPTION

+
+

lstmtraining(1) trains LSTM-based networks using a list of lstmf files and a starter traineddata file as the main input. Users are not advised to train from scratch; fine-tuning (see the example command in the synopsis above) or replacing a layer can be used instead. Different options apply to different types of training. +Read the [training documentation](https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html) for details.
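
The fine-tuning invocation from the synopsis can be sketched as a Python helper that assembles the argument list. The helper name `build_finetune_args` and the example values are hypothetical; the flag names and the path shapes are taken from the synopsis. Nothing is executed here, so the sketch makes no claim about how lstmtraining behaves at run time.

```python
from typing import List


def build_finetune_args(
    continue_from: str,
    traineddata: str,
    train_listfile: str,
    model_output: str,
    max_iterations: int,
    old_traineddata: str = "",
    debug_interval: int = 0,
) -> List[str]:
    """Assemble a fine-tuning lstmtraining command (hypothetical helper)."""
    args = [
        "lstmtraining",
        "--continue_from", continue_from,
        "--traineddata", traineddata,
        "--train_listfile", train_listfile,
        "--model_output", model_output,
        "--max_iterations", str(max_iterations),
        "--debug_interval", str(debug_interval),
    ]
    # --old_traineddata only applies when the character set is being changed.
    if old_traineddata:
        args += ["--old_traineddata", old_traineddata]
    return args


# Placeholder values mirroring the synopsis; 400 iterations is an arbitrary example.
finetune_args = build_finetune_args(
    continue_from="train_output_dir/continue_from_lang.lstm",
    traineddata="train_output_dir/lang/lang.traineddata",
    train_listfile="train_output_dir/lang.training_files.txt",
    model_output="train_output_dir/newlstmmodel",
    max_iterations=400,
    old_traineddata="bestdata_dir/continue_from_lang.traineddata",
)
print(" ".join(finetune_args))
```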

+
+
+
+

OPTIONS

+
+
+
+'--debug_interval ' +
+
+

+ How often to display the alignment. (type:int default:0) +

+
+
+'--net_mode ' +
+
+

+ Controls network behavior. (type:int default:192) +

+
+
+'--perfect_sample_delay ' +
+
+

+ How many imperfect samples between perfect ones. (type:int default:0) +

+
+
+'--max_image_MB ' +
+
+

+ Max memory to use for images. (type:int default:6000) +

+
+
+'--append_index ' +
+
+

+ Index in continue_from Network at which to attach the new network defined by net_spec (type:int default:-1) +

+
+
+'--max_iterations ' +
+
+

+ If set, exit after this many iterations. A negative value is interpreted as epochs, 0 means infinite iterations. (type:int default:0) +

+
+
+'--target_error_rate ' +
+
+

+ Final error rate in percent. (type:double default:0.01) +

+
+
+'--weight_range ' +
+
+

+ Range of initial random weights. (type:double default:0.1) +

+
+
+'--learning_rate ' +
+
+

+ Weight factor for new deltas. (type:double default:0.001) +

+
+
+'--momentum ' +
+
+

+ Decay factor for repeating deltas. (type:double default:0.5) +

+
+
+'--adam_beta ' +
+
+

+ Decay factor for repeating deltas. (type:double default:0.999) +

+
+
+'--stop_training ' +
+
+

+ Just convert the training model to a runtime model. (type:bool default:false) +

+
+
+'--convert_to_int ' +
+
+

+ Convert the recognition model to an integer model. (type:bool default:false) +

+
+
+'--sequential_training ' +
+
+

+ Use the training files sequentially instead of round-robin. (type:bool default:false) +

+
+
+'--debug_network ' +
+
+

+ Get info on distribution of weight values (type:bool default:false) +

+
+
+'--randomly_rotate ' +
+
+

+ Train OSD and randomly turn training samples upside-down (type:bool default:false) +

+
+
+'--net_spec ' +
+
+

+ Network specification (type:string default:) +

+
+
+'--continue_from ' +
+
+

+ Existing model to extend (type:string default:) +

+
+
+'--model_output ' +
+
+

+ Basename for output models (type:string default:lstmtrain) +

+
+
+'--train_listfile ' +
+
+

+ File listing training files in lstmf training format. (type:string default:) +

+
+
+'--eval_listfile ' +
+
+

+ File listing eval files in lstmf training format. (type:string default:) +

+
+
+'--traineddata ' +
+
+

+ Starter traineddata with combined Dawgs/Unicharset/Recoder for language model (type:string default:) +

+
+
+'--old_traineddata ' +
+
+

+ When changing the character set, this specifies the traineddata with the old character set that is to be replaced (type:string default:) +
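
The three cases in the `--max_iterations` description above can be illustrated with a small Python function. This is only a sketch: the function name is hypothetical, and it assumes that the magnitude of a negative value is the number of epochs, which the option text implies but does not state outright.

```python
def describe_max_iterations(value: int) -> str:
    """Explain a --max_iterations value (hypothetical helper for illustration)."""
    if value == 0:
        # 0 is documented as "infinite iterations".
        return "no limit: train indefinitely"
    if value < 0:
        # Assumption: the magnitude of a negative value is the number of epochs.
        return f"stop after {-value} epoch(s)"
    return f"stop after {value} iteration(s)"


for flag_value in (0, -3, 400):
    print(flag_value, "->", describe_max_iterations(flag_value))
```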

+
+
+
+
+
+

HISTORY

+
+

lstmtraining(1) was first made available for Tesseract 4.00.00alpha.

+
+
+
+

RESOURCES

+ +
+
+

SEE ALSO

+
+

tesseract(1)

+
+
+
+

COPYING

+
+

Copyright (C) 2012 Google, Inc. +Licensed under the Apache License, Version 2.0

+
+
+
+

AUTHOR

+
+

The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present).

+
+
+
+

+ + + diff --git a/Tesseract-OCR/lstmtraining.exe b/Tesseract-OCR/lstmtraining.exe new file mode 100644 index 0000000000000000000000000000000000000000..0ecb47138e7072dd76d7bb4942cf5aadf2ce47a0 --- /dev/null +++ b/Tesseract-OCR/lstmtraining.exe @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:250a7196115bf70f946c903e6d4e3c1dbbb3ab062363d936aba46708dc3e069e +size 9648984 diff --git a/Tesseract-OCR/merge_unicharsets.1.html b/Tesseract-OCR/merge_unicharsets.1.html new file mode 100644 index 0000000000000000000000000000000000000000..543b8d08544c3341a4ec9db5521716d6f3db4d59 --- /dev/null +++ b/Tesseract-OCR/merge_unicharsets.1.html @@ -0,0 +1,833 @@ + + + + + +MERGE_UNICHARSETS(1) + + + + + +
+
+

SYNOPSIS

+
+

merge_unicharsets unicharset-in-1 ... unicharset-in-n unicharset-out

+
+
+
+

DESCRIPTION

+
+

merge_unicharsets(1) is a simple tool to merge two or more unicharsets. +It could be used to create a combined unicharset for a script-level engine, +like the new Latin or Devanagari.

+
+
+
+

IN/OUT ARGUMENTS

+
+
+
+unicharset-in-1 +
+
+

+ (Input) The name of the first unicharset file to be merged. +

+
+
+unicharset-in-n +
+
+

+ (Input) The name of the nth unicharset file to be merged. +

+
+
+unicharset-out +
+
+

+ (Output) The name of the merged unicharset file. +

+
+
+
+
+
+
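The argument order above can be sketched as a small helper that assembles the command line. This is only an illustration: the function name `build_merge_unicharsets_command` is hypothetical, and the check for at least two inputs is taken from the "two or more unicharsets" wording in the description.

```python
from typing import List, Sequence


def build_merge_unicharsets_command(inputs: Sequence[str], output: str) -> List[str]:
    """Return an argv list: all input unicharsets first, the output file last.

    Hypothetical helper; it only builds the argument list and does not run anything.
    """
    if len(inputs) < 2:
        # The description speaks of merging "two or more" unicharsets.
        raise ValueError("at least two input unicharsets are expected")
    return ["merge_unicharsets", *inputs, output]
```

The resulting list can be passed to `subprocess.run` on a system where the merge_unicharsets executable is on the search path.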

HISTORY

+
+

merge_unicharsets(1) was first made available for tesseract4.00.00alpha.

+
+
+
+

RESOURCES

+ +
+
+

SEE ALSO

+
+

tesseract(1)

+
+
+
+

COPYING

+
+

Copyright (C) 2012 Google, Inc. +Licensed under the Apache License, Version 2.0

+
+
+
+

AUTHOR

+
+

The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present).

+
+
+
+

+ + + diff --git a/Tesseract-OCR/merge_unicharsets.exe b/Tesseract-OCR/merge_unicharsets.exe new file mode 100644 index 0000000000000000000000000000000000000000..b96b3e9d1a258ced1eab478bf5f48a8c9449ec3f Binary files /dev/null and b/Tesseract-OCR/merge_unicharsets.exe differ diff --git a/Tesseract-OCR/mftraining.1.html b/Tesseract-OCR/mftraining.1.html new file mode 100644 index 0000000000000000000000000000000000000000..718cbd1efacae92e0c4059ab0120a7dece845e55 --- /dev/null +++ b/Tesseract-OCR/mftraining.1.html @@ -0,0 +1,847 @@ + + + + + +MFTRAINING(1) + + + + + +
+
+

SYNOPSIS

+
+

mftraining -U unicharset -O lang.unicharset FILE

+
+
+
+

DESCRIPTION

+
+

mftraining takes a list of .tr files, from which it generates the +files inttemp (the shape prototypes), shapetable, and pffmtable +(the number of expected features for each character). (A fourth file +called Microfeat is also written by this program, but it is not used.)

+
+
+
+

OPTIONS

+
+
+
+-U FILE +
+
+

+ (Input) The unicharset generated by unicharset_extractor(1) +

+
+
+-F font_properties_file +
+
+

+ (Input) font properties file, each line is of the following form, where each field other than the font name is 0 or 1: +

+
+
+
*font_name* *italic* *bold* *fixed_pitch* *serif* *fraktur*
+
+
+
+-X xheights_file +
+
+

+ (Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ] +

+
+
+
*font_name* *xheight*
+
+
+
+-D dir +
+
+

+ Directory to write output files to. +

+
+
+-O FILE +
+
+

+ (Output) The output unicharset that will be given to combine_tessdata(1) +

+
+
+
+
+
+
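The line formats given for the -F and -X files can be checked before training with a short parser. The following is a minimal sketch, not part of mftraining: the names `FontProperties`, `parse_font_properties_line` and `parse_xheights_line` are hypothetical, and it assumes that font names contain no whitespace and that the x height is written as an integer pixel value.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FontProperties:
    name: str
    italic: bool
    bold: bool
    fixed_pitch: bool
    serif: bool
    fraktur: bool


def parse_font_properties_line(line: str) -> FontProperties:
    """Parse one 'font_name italic bold fixed_pitch serif fraktur' line."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    name, *flags = fields
    # Every field other than the font name must be 0 or 1.
    if any(flag not in ("0", "1") for flag in flags):
        raise ValueError(f"flags must be 0 or 1: {line!r}")
    return FontProperties(name, *(flag == "1" for flag in flags))


def parse_xheights_line(line: str) -> Tuple[str, int]:
    """Parse one 'font_name xheight' line (integer x height is an assumption)."""
    name, xheight = line.split()
    return name, int(xheight)
```

For example, the illustrative line `timesitalic 1 0 0 1 0` would parse as an italic serif font that is neither bold, fixed pitch nor Fraktur.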

SEE ALSO

+
+

tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1), +shapeclustering(1), unicharset(5)

+ +
+
+
+

COPYING

+
+

Copyright (C) Hewlett-Packard Company, 1988 +Licensed under the Apache License, Version 2.0

+
+
+
+

AUTHOR

+
+

The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present).

+
+
+
+

+ + + diff --git a/Tesseract-OCR/mftraining.exe b/Tesseract-OCR/mftraining.exe new file mode 100644 index 0000000000000000000000000000000000000000..004517d0f7f495aa9d0e43d93e867146bd2b3348 --- /dev/null +++ b/Tesseract-OCR/mftraining.exe @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:5b8d1a9c800f4cf97ed734a3a94b23b23a42f12561bb32f967f6cbd6037b23aa +size 5308448 diff --git a/Tesseract-OCR/set_unicharset_properties.1.html b/Tesseract-OCR/set_unicharset_properties.1.html new file mode 100644 index 0000000000000000000000000000000000000000..146dd96c2e2d731193395c26836a5ddbe6ce8807 --- /dev/null +++ b/Tesseract-OCR/set_unicharset_properties.1.html @@ -0,0 +1,831 @@ + + + + + +SET_UNICHARSET_PROPERTIES(1) + + + + + +
+
+

SYNOPSIS

+
+

set_unicharset_properties --U input_unicharsetfile --script_dir /path/to/langdata --O output_unicharsetfile

+
+
+
+

DESCRIPTION

+
+

set_unicharset_properties(1) reads a unicharset file, puts the result in a UNICHARSET object, fills it with properties about the unichars it contains and writes the result back to another unicharset file.

+
+
+
+

OPTIONS

+
+
+
+--script_dir /path/to/langdata +
+
+

+ (Input) Specify the location of the directory for universal script unicharsets and font xheights (type:string default:) +

+
+
+--U unicharsetfile +
+
+

+ (Input) Specify the location of the unicharset to load as input. +

+
+
+--O unicharsetfile +
+
+

+ (Output) Specify the location of the unicharset to be written with updated properties. +

+
+
+
+
+
+

HISTORY

+
+

set_unicharset_properties(1) was first made available for tesseract version 3.03.

+
+
+ +
+

SEE ALSO

+
+

tesseract(1)

+
+
+
+

COPYING

+
+

Copyright (C) 2012 Google, Inc. +Licensed under the Apache License, Version 2.0

+
+
+
+

AUTHOR

+
+

The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present).

+
+
+
+

+ + + diff --git a/Tesseract-OCR/set_unicharset_properties.exe b/Tesseract-OCR/set_unicharset_properties.exe new file mode 100644 index 0000000000000000000000000000000000000000..4c9ff94ac57e77e4391ea0a85873e292e0f97e6d --- /dev/null +++ b/Tesseract-OCR/set_unicharset_properties.exe @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:03adb04c1f908e6b3b05e3424e4bd359afcd0cb8856b2b32cc8244ea1e715be4 +size 6483904 diff --git a/Tesseract-OCR/shapeclustering.1.html b/Tesseract-OCR/shapeclustering.1.html new file mode 100644 index 0000000000000000000000000000000000000000..673f603201f455a89a2fcd3263286e0cac1cfdc2 --- /dev/null +++ b/Tesseract-OCR/shapeclustering.1.html @@ -0,0 +1,850 @@ + + + + + +SHAPECLUSTERING(1) + + + + + +
+
+

SYNOPSIS

+
+

shapeclustering -D output_dir + -U unicharset -O mfunicharset + -F font_props -X xheights + FILE

+
+
+
+

DESCRIPTION

+
+

shapeclustering(1) takes extracted feature .tr files (generated by +tesseract(1) run in a special mode from box files) and produces a +file shapetable and an enhanced unicharset. This program is still +experimental, and is not required (yet) for training Tesseract.

+
+
+
+

OPTIONS

+
+
+
+-U FILE +
+
+

+ The unicharset generated by unicharset_extractor(1). +

+
+
+-D dir +
+
+

+ Directory to write output files to. +

+
+
+-F font_properties_file +
+
+

+ (Input) font properties file; each line is of the following form, where each field other than the font name is 0 or 1: +

+
+
+
'font_name' 'italic' 'bold' 'fixed_pitch' 'serif' 'fraktur'
+
+
+
+-X xheights_file +
+
+

+ (Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ] +

+
+
+
'font_name' 'xheight'
+
+
+
+-O FILE +
+
+

+ The output unicharset that will be given to combine_tessdata(1). +

+
+
+
+
+
+

SEE ALSO

+
+

tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1), +unicharset(5)

+ +
+
+
+

COPYING

+
+

Copyright (C) Google, 2011 +Licensed under the Apache License, Version 2.0

+
+
+
+

AUTHOR

+
+

The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present).

+
+
+
+

+ + + diff --git a/Tesseract-OCR/shapeclustering.exe b/Tesseract-OCR/shapeclustering.exe new file mode 100644 index 0000000000000000000000000000000000000000..bc1fd7ae23e30209bfcd0e18bb710b4e1f2c739f --- /dev/null +++ b/Tesseract-OCR/shapeclustering.exe @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:931b8248817e0d7fbab2b75eacfc5a42cd25cd966b3e37f0220fbc228d732f84 +size 4986944 diff --git a/Tesseract-OCR/tessdata/ScrollView.jar b/Tesseract-OCR/tessdata/ScrollView.jar new file mode 100644 index 0000000000000000000000000000000000000000..824eda1173884d8de8f597fe0da0c64cfa5ebe94 Binary files /dev/null and b/Tesseract-OCR/tessdata/ScrollView.jar differ diff --git a/Tesseract-OCR/tessdata/configs/alto b/Tesseract-OCR/tessdata/configs/alto new file mode 100644 index 0000000000000000000000000000000000000000..0dd12a7a709c14e267fb7141cd9ef705102b8a3a --- /dev/null +++ b/Tesseract-OCR/tessdata/configs/alto @@ -0,0 +1 @@ +tessedit_create_alto 1 diff --git a/Tesseract-OCR/tessdata/configs/ambigs.train b/Tesseract-OCR/tessdata/configs/ambigs.train new file mode 100644 index 0000000000000000000000000000000000000000..23035a1904cfb8a2e5ad143ac638447bc1b04b4c --- /dev/null +++ b/Tesseract-OCR/tessdata/configs/ambigs.train @@ -0,0 +1,7 @@ +tessedit_ambigs_training 1 +load_freq_dawg 0 +load_punc_dawg 0 +load_system_dawg 0 +load_number_dawg 0 +ambigs_debug_level 3 +load_fixed_length_dawgs 0 diff --git a/Tesseract-OCR/tessdata/configs/api_config b/Tesseract-OCR/tessdata/configs/api_config new file mode 100644 index 0000000000000000000000000000000000000000..5cd6ec0310213adbc59e5c48a49f858daf3cdc4f --- /dev/null +++ b/Tesseract-OCR/tessdata/configs/api_config @@ -0,0 +1 @@ +tessedit_zero_rejection T diff --git a/Tesseract-OCR/tessdata/configs/bigram b/Tesseract-OCR/tessdata/configs/bigram new file mode 100644 index 0000000000000000000000000000000000000000..5d6c2d061f4a0bae8ab3b2270da8e6744a048d11 --- /dev/null +++ b/Tesseract-OCR/tessdata/configs/bigram @@ -0,0 +1,5 
@@ +load_bigram_dawg True +tessedit_enable_bigram_correction True +tessedit_bigram_debug 3 +save_raw_choices True +save_alt_choices True diff --git a/Tesseract-OCR/tessdata/configs/box.train b/Tesseract-OCR/tessdata/configs/box.train new file mode 100644 index 0000000000000000000000000000000000000000..d39f2687ef8c50f090895ba1f26367e25613a685 --- /dev/null +++ b/Tesseract-OCR/tessdata/configs/box.train @@ -0,0 +1,12 @@ +disable_character_fragments T +file_type .bl +textord_fast_pitch_test T +tessedit_zero_rejection T +tessedit_minimal_rejection F +tessedit_write_rep_codes F +edges_children_fix F +edges_childarea 0.65 +edges_boxarea 0.9 +tessedit_resegment_from_boxes T +tessedit_train_from_boxes T +textord_no_rejects T diff --git a/Tesseract-OCR/tessdata/configs/box.train.stderr b/Tesseract-OCR/tessdata/configs/box.train.stderr new file mode 100644 index 0000000000000000000000000000000000000000..82754e9cc90b7da072c9074e7ee1492beaa72db6 --- /dev/null +++ b/Tesseract-OCR/tessdata/configs/box.train.stderr @@ -0,0 +1,13 @@ +file_type .bl +#tessedit_use_nn F +textord_fast_pitch_test T +tessedit_zero_rejection T +tessedit_minimal_rejection F +tessedit_write_rep_codes F +edges_children_fix F +edges_childarea 0.65 +edges_boxarea 0.9 +tessedit_resegment_from_boxes T +tessedit_train_from_boxes T +#textord_repeat_extraction F +textord_no_rejects T diff --git a/Tesseract-OCR/tessdata/configs/digits b/Tesseract-OCR/tessdata/configs/digits new file mode 100644 index 0000000000000000000000000000000000000000..6a329f892910ae9dd7af1f9fe8f7a1d48378fd8b --- /dev/null +++ b/Tesseract-OCR/tessdata/configs/digits @@ -0,0 +1 @@ +tessedit_char_whitelist 0123456789-. 
diff --git a/Tesseract-OCR/tessdata/configs/get.images b/Tesseract-OCR/tessdata/configs/get.images new file mode 100644 index 0000000000000000000000000000000000000000..7d00b613ffcbf7e4fa712d2c50a85c7643a027e3 --- /dev/null +++ b/Tesseract-OCR/tessdata/configs/get.images @@ -0,0 +1 @@ +tessedit_write_images T diff --git a/Tesseract-OCR/tessdata/configs/hocr b/Tesseract-OCR/tessdata/configs/hocr new file mode 100644 index 0000000000000000000000000000000000000000..5ab372eaf819b05bdd87ba419c874f6a1be4677b --- /dev/null +++ b/Tesseract-OCR/tessdata/configs/hocr @@ -0,0 +1,2 @@ +tessedit_create_hocr 1 +hocr_font_info 0 diff --git a/Tesseract-OCR/tessdata/configs/inter b/Tesseract-OCR/tessdata/configs/inter new file mode 100644 index 0000000000000000000000000000000000000000..252f1a171a154f9ade798e210015a720af039d00 --- /dev/null +++ b/Tesseract-OCR/tessdata/configs/inter @@ -0,0 +1,2 @@ +interactive_display_mode T +tessedit_display_outwords T diff --git a/Tesseract-OCR/tessdata/configs/kannada b/Tesseract-OCR/tessdata/configs/kannada new file mode 100644 index 0000000000000000000000000000000000000000..c6ac105788137bc4e89821e94843ea86ed5b4564 --- /dev/null +++ b/Tesseract-OCR/tessdata/configs/kannada @@ -0,0 +1,4 @@ +textord_skewsmooth_offset 8 +textord_skewsmooth_offset2 8 +textord_merge_desc 0.5 +textord_no_rejects 1 diff --git a/Tesseract-OCR/tessdata/configs/linebox b/Tesseract-OCR/tessdata/configs/linebox new file mode 100644 index 0000000000000000000000000000000000000000..bd9c114df65ddf13e640298075adb940225c5f96 --- /dev/null +++ b/Tesseract-OCR/tessdata/configs/linebox @@ -0,0 +1,2 @@ +tessedit_resegment_from_line_boxes 1 +tessedit_make_boxes_from_boxes 1 diff --git a/Tesseract-OCR/tessdata/configs/logfile b/Tesseract-OCR/tessdata/configs/logfile new file mode 100644 index 0000000000000000000000000000000000000000..a160f9be275a70fe3af1935fb8fe7af29efa8451 --- /dev/null +++ b/Tesseract-OCR/tessdata/configs/logfile @@ -0,0 +1 @@ +debug_file tesseract.log diff --git 
a/Tesseract-OCR/tessdata/configs/lstm.train b/Tesseract-OCR/tessdata/configs/lstm.train new file mode 100644 index 0000000000000000000000000000000000000000..5ff37726211ab360ceead7c76a1b52613cc2f277 --- /dev/null +++ b/Tesseract-OCR/tessdata/configs/lstm.train @@ -0,0 +1,11 @@ +file_type .bl +textord_fast_pitch_test T +tessedit_zero_rejection T +tessedit_minimal_rejection F +tessedit_write_rep_codes F +edges_children_fix F +edges_childarea 0.65 +edges_boxarea 0.9 +tessedit_train_line_recognizer T +textord_no_rejects T +tessedit_init_config_only T diff --git a/Tesseract-OCR/tessdata/configs/lstmbox b/Tesseract-OCR/tessdata/configs/lstmbox new file mode 100644 index 0000000000000000000000000000000000000000..a6f2cedc504e9010eec3bfb0b1336b75ef0c28e5 --- /dev/null +++ b/Tesseract-OCR/tessdata/configs/lstmbox @@ -0,0 +1 @@ +tessedit_create_lstmbox 1 diff --git a/Tesseract-OCR/tessdata/configs/lstmdebug b/Tesseract-OCR/tessdata/configs/lstmdebug new file mode 100644 index 0000000000000000000000000000000000000000..3fa3dee71aafe30913c1863a5e67529872984743 --- /dev/null +++ b/Tesseract-OCR/tessdata/configs/lstmdebug @@ -0,0 +1,4 @@ +stopper_debug_level 1 +classify_debug_level 1 +segsearch_debug_level 1 +language_model_debug_level 3 diff --git a/Tesseract-OCR/tessdata/configs/makebox b/Tesseract-OCR/tessdata/configs/makebox new file mode 100644 index 0000000000000000000000000000000000000000..3d90ac26f9542c6beac1082b2d900859906af8e9 --- /dev/null +++ b/Tesseract-OCR/tessdata/configs/makebox @@ -0,0 +1 @@ +tessedit_create_boxfile 1 diff --git a/Tesseract-OCR/tessdata/configs/pdf b/Tesseract-OCR/tessdata/configs/pdf new file mode 100644 index 0000000000000000000000000000000000000000..59645d71ce52a143d819f2057c8c4e9ce2d46e40 --- /dev/null +++ b/Tesseract-OCR/tessdata/configs/pdf @@ -0,0 +1 @@ +tessedit_create_pdf 1 diff --git a/Tesseract-OCR/tessdata/configs/quiet b/Tesseract-OCR/tessdata/configs/quiet new file mode 100644 index 
0000000000000000000000000000000000000000..35b59a9d41dd462c6d13b2301d4b2c31219c582f --- /dev/null +++ b/Tesseract-OCR/tessdata/configs/quiet @@ -0,0 +1 @@ +debug_file /dev/null diff --git a/Tesseract-OCR/tessdata/configs/rebox b/Tesseract-OCR/tessdata/configs/rebox new file mode 100644 index 0000000000000000000000000000000000000000..f8342b4c2c7eb733e1d4078f32a0aa5aee677cc3 --- /dev/null +++ b/Tesseract-OCR/tessdata/configs/rebox @@ -0,0 +1,2 @@ +tessedit_resegment_from_boxes 1 +tessedit_make_boxes_from_boxes 1 diff --git a/Tesseract-OCR/tessdata/configs/strokewidth b/Tesseract-OCR/tessdata/configs/strokewidth new file mode 100644 index 0000000000000000000000000000000000000000..e95b59263daf6d43f1b20682a4fa79d386484536 --- /dev/null +++ b/Tesseract-OCR/tessdata/configs/strokewidth @@ -0,0 +1,12 @@ +textord_show_blobs 0 +textord_debug_tabfind 3 +textord_tabfind_show_partitions 1 +textord_tabfind_show_initial_partitions 1 +textord_tabfind_show_columns 1 +textord_tabfind_show_blocks 1 +textord_tabfind_show_initialtabs 1 +textord_tabfind_show_finaltabs 1 +textord_tabfind_show_strokewidths 1 +textord_tabfind_show_vlines 0 +textord_tabfind_show_images 1 +tessedit_dump_pageseg_images 0 diff --git a/Tesseract-OCR/tessdata/configs/tsv b/Tesseract-OCR/tessdata/configs/tsv new file mode 100644 index 0000000000000000000000000000000000000000..dc52478177fd6fb7b1fe278e1374c2054f3e2442 --- /dev/null +++ b/Tesseract-OCR/tessdata/configs/tsv @@ -0,0 +1 @@ +tessedit_create_tsv 1 diff --git a/Tesseract-OCR/tessdata/configs/txt b/Tesseract-OCR/tessdata/configs/txt new file mode 100644 index 0000000000000000000000000000000000000000..a0cc952977f0f3562a5c94011c13044ace865519 --- /dev/null +++ b/Tesseract-OCR/tessdata/configs/txt @@ -0,0 +1,3 @@ +# This config file should be used with other config files which create renderers. 
+# usage example: tesseract eurotext.tif eurotext txt hocr pdf +tessedit_create_txt 1 diff --git a/Tesseract-OCR/tessdata/configs/unlv b/Tesseract-OCR/tessdata/configs/unlv new file mode 100644 index 0000000000000000000000000000000000000000..d2e22f5b93585032eef94f22966329144ba44d6f --- /dev/null +++ b/Tesseract-OCR/tessdata/configs/unlv @@ -0,0 +1,2 @@ +tessedit_write_unlv 1 +unlv_tilde_crunching T diff --git a/Tesseract-OCR/tessdata/configs/wordstrbox b/Tesseract-OCR/tessdata/configs/wordstrbox new file mode 100644 index 0000000000000000000000000000000000000000..38cd41cd60f89ec7ab3bc5161e094b20f06a3361 --- /dev/null +++ b/Tesseract-OCR/tessdata/configs/wordstrbox @@ -0,0 +1 @@ +tessedit_create_wordstrbox 1 diff --git a/Tesseract-OCR/tessdata/eng.traineddata b/Tesseract-OCR/tessdata/eng.traineddata new file mode 100644 index 0000000000000000000000000000000000000000..b15334db4df434b6e9b7e8119fe96f85f855589b --- /dev/null +++ b/Tesseract-OCR/tessdata/eng.traineddata @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:7d4322bd2a7749724879683fc3912cb542f19906c83bcc1a52132556427170b2 +size 4113088 diff --git a/Tesseract-OCR/tessdata/eng.user-patterns b/Tesseract-OCR/tessdata/eng.user-patterns new file mode 100644 index 0000000000000000000000000000000000000000..5daba44df897f1c6d67caeb6d0414f7b55625fa1 --- /dev/null +++ b/Tesseract-OCR/tessdata/eng.user-patterns @@ -0,0 +1,2 @@ +1-\d\d\d-GOOG-411 +www.\n\\\*.com diff --git a/Tesseract-OCR/tessdata/eng.user-words b/Tesseract-OCR/tessdata/eng.user-words new file mode 100644 index 0000000000000000000000000000000000000000..e0c5a630214ac69273e2b54107c62ed171fc50a0 --- /dev/null +++ b/Tesseract-OCR/tessdata/eng.user-words @@ -0,0 +1,5 @@ +the +quick +brown +fox +jumped diff --git a/Tesseract-OCR/tessdata/jaxb-api-2.3.1.jar b/Tesseract-OCR/tessdata/jaxb-api-2.3.1.jar new file mode 100644 index 0000000000000000000000000000000000000000..45658654712b88d45c9464286ffc2fcb07036bdf Binary files /dev/null and 
b/Tesseract-OCR/tessdata/jaxb-api-2.3.1.jar differ diff --git a/Tesseract-OCR/tessdata/osd.traineddata b/Tesseract-OCR/tessdata/osd.traineddata new file mode 100644 index 0000000000000000000000000000000000000000..f4f02c3c55ddbbfdb31fe2687f2852e17fda3d9b --- /dev/null +++ b/Tesseract-OCR/tessdata/osd.traineddata @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:9cf5d576fcc47564f11265841e5ca839001e7e6f38ff7f7aacf46d15a96b00ff +size 10562727 diff --git a/Tesseract-OCR/tessdata/pdf.ttf b/Tesseract-OCR/tessdata/pdf.ttf new file mode 100644 index 0000000000000000000000000000000000000000..d1472b20ef1aebbf5e11573867e9ac13873681b9 Binary files /dev/null and b/Tesseract-OCR/tessdata/pdf.ttf differ diff --git a/Tesseract-OCR/tessdata/piccolo2d-core-3.0.1.jar b/Tesseract-OCR/tessdata/piccolo2d-core-3.0.1.jar new file mode 100644 index 0000000000000000000000000000000000000000..df84ed5cc8059d640876d3be0ac4c11af0bf0853 Binary files /dev/null and b/Tesseract-OCR/tessdata/piccolo2d-core-3.0.1.jar differ diff --git a/Tesseract-OCR/tessdata/piccolo2d-extras-3.0.1.jar b/Tesseract-OCR/tessdata/piccolo2d-extras-3.0.1.jar new file mode 100644 index 0000000000000000000000000000000000000000..daf51c0ebbea07291c85cab9bc81b19ef0e99efc Binary files /dev/null and b/Tesseract-OCR/tessdata/piccolo2d-extras-3.0.1.jar differ diff --git a/Tesseract-OCR/tessdata/tessconfigs/batch b/Tesseract-OCR/tessdata/tessconfigs/batch new file mode 100644 index 0000000000000000000000000000000000000000..a681e4a443fa21ce6f32bbcf0334af3433888566 --- /dev/null +++ b/Tesseract-OCR/tessdata/tessconfigs/batch @@ -0,0 +1 @@ +# No content needed as all defaults are correct. 
diff --git a/Tesseract-OCR/tessdata/tessconfigs/batch.nochop b/Tesseract-OCR/tessdata/tessconfigs/batch.nochop new file mode 100644 index 0000000000000000000000000000000000000000..ebaab9438e309b4dfdfd8428676170ab2b64a858 --- /dev/null +++ b/Tesseract-OCR/tessdata/tessconfigs/batch.nochop @@ -0,0 +1,2 @@ +chop_enable 0 +wordrec_enable_assoc 0 diff --git a/Tesseract-OCR/tessdata/tessconfigs/matdemo b/Tesseract-OCR/tessdata/tessconfigs/matdemo new file mode 100644 index 0000000000000000000000000000000000000000..c34567be7565d519806076b795fceff9fdad1477 --- /dev/null +++ b/Tesseract-OCR/tessdata/tessconfigs/matdemo @@ -0,0 +1,7 @@ +################################################# +# Adaptive Matcher Using PreAdapted Templates +################################################# + +classify_enable_adaptive_debugger 1 +matcher_debug_flags 6 +matcher_debug_level 1 diff --git a/Tesseract-OCR/tessdata/tessconfigs/msdemo b/Tesseract-OCR/tessdata/tessconfigs/msdemo new file mode 100644 index 0000000000000000000000000000000000000000..9c1184a0c84bde5f58ea74a1f316af9eb2574b52 --- /dev/null +++ b/Tesseract-OCR/tessdata/tessconfigs/msdemo @@ -0,0 +1,12 @@ +################################################# +# Adaptive Matcher Using PreAdapted Templates +################################################# + +classify_enable_adaptive_debugger 1 +matcher_debug_flags 6 +matcher_debug_level 1 + +wordrec_display_splits 0 +wordrec_display_all_blobs 1 +wordrec_display_segmentations 2 +classify_debug_level 1 diff --git a/Tesseract-OCR/tessdata/tessconfigs/nobatch b/Tesseract-OCR/tessdata/tessconfigs/nobatch new file mode 100644 index 0000000000000000000000000000000000000000..8b137891791fe96927ad78e64b0aad7bded08bdc --- /dev/null +++ b/Tesseract-OCR/tessdata/tessconfigs/nobatch @@ -0,0 +1 @@ + diff --git a/Tesseract-OCR/tessdata/tessconfigs/segdemo b/Tesseract-OCR/tessdata/tessconfigs/segdemo new file mode 100644 index 
0000000000000000000000000000000000000000..eaff69f54b9ed62d49987634a5fba2c79cccf3cf --- /dev/null +++ b/Tesseract-OCR/tessdata/tessconfigs/segdemo @@ -0,0 +1,9 @@ +################################################# +# Adaptive Matcher Using PreAdapted Templates +################################################# + +wordrec_display_splits 0 +wordrec_display_all_blobs 1 +wordrec_display_segmentations 2 +classify_debug_level 1 +stopper_debug_level 1 diff --git a/Tesseract-OCR/tesseract-uninstall.exe b/Tesseract-OCR/tesseract-uninstall.exe new file mode 100644 index 0000000000000000000000000000000000000000..e32f899bd38ae0b847e8d299bb6fa6c10310ccdd Binary files /dev/null and b/Tesseract-OCR/tesseract-uninstall.exe differ diff --git a/Tesseract-OCR/tesseract.1.html b/Tesseract-OCR/tesseract.1.html new file mode 100644 index 0000000000000000000000000000000000000000..93951ed6992f5ddb6d95063e458aabb645e1815d --- /dev/null +++ b/Tesseract-OCR/tesseract.1.html @@ -0,0 +1,1387 @@ + + + + + +TESSERACT(1) + + + + + +
+
+

SYNOPSIS

+
+

tesseract FILE OUTPUTBASE [OPTIONS]… [CONFIGFILE]…

+
+
+
+

DESCRIPTION

+
+

tesseract(1) is a commercial quality OCR engine originally developed at HP +between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by +UNLV. It was open-sourced by HP and UNLV in 2005, and has been developed +at Google since then.

+
+
+
+

IN/OUT ARGUMENTS

+
+
+
+FILE +
+
+

+ The name of the input file. + This can either be an image file or a text file.
+ Most image file formats (anything readable by Leptonica) are supported.
+ A text file lists the names of all input images (one image name per line). + The results will be combined in a single file for each output file format + (txt, pdf, hocr, xml).
+ If FILE is stdin or - then the standard input is used. +

+
+
+OUTPUTBASE +
+
+

+ The basename of the output file (to which the appropriate extension + will be appended). By default the output will be a text file + with .txt added to the basename unless there are one or more + parameters set which explicitly specify the desired output.
+ If OUTPUTBASE is stdout or - then the standard output is used. +

+
+
+
+
+
+

OPTIONS

+
+
+
+-c CONFIGVAR=VALUE +
+
+

+ Set value for parameter CONFIGVAR to VALUE. Multiple -c arguments are allowed. +

+
+
+--dpi N +
+
+

+ Specify the resolution N in DPI for the input image(s). + A typical value for N is 300. Without this option, + the resolution is read from the metadata included in the image. + If an image does not include that information, Tesseract tries to guess it. +

+
+
+-l LANG +
+
+-l SCRIPT +
+
+

+ The language or script to use. + If none is specified, eng (English) is assumed. + Multiple languages may be specified, separated by plus characters. + Tesseract uses 3-character ISO 639-2 language codes + (see LANGUAGES AND SCRIPTS). +

+
+
+--psm N +
+
+

+ Set Tesseract to only run a subset of layout analysis and assume + a certain form of image. The options for N are: +

+
+
+
0 = Orientation and script detection (OSD) only.
+1 = Automatic page segmentation with OSD.
+2 = Automatic page segmentation, but no OSD, or OCR. (not implemented)
+3 = Fully automatic page segmentation, but no OSD. (Default)
+4 = Assume a single column of text of variable sizes.
+5 = Assume a single uniform block of vertically aligned text.
+6 = Assume a single uniform block of text.
+7 = Treat the image as a single text line.
+8 = Treat the image as a single word.
+9 = Treat the image as a single word in a circle.
+10 = Treat the image as a single character.
+11 = Sparse text. Find as much text as possible in no particular order.
+12 = Sparse text with OSD.
+13 = Raw line. Treat the image as a single text line,
+     bypassing hacks that are Tesseract-specific.
+
+
+
+--oem N +
+
+

+ Specify OCR Engine mode. The options for N are: +

+
+
+
0 = Original Tesseract only.
+1 = Neural nets LSTM only.
+2 = Tesseract + LSTM.
+3 = Default, based on what is available.
+
+
+
+--tessdata-dir PATH +
+
+

+ Specify the path to the tessdata directory. +

+
+
+--user-patterns FILE +
+
+

+ Specify the location of user patterns file. +

+
+
+--user-words FILE +
+
+

+ Specify the location of user words file. +

+
+
+CONFIGFILE +
+
+

+ The name of a config to use. The name can be a file in tessdata/configs + or tessdata/tessconfigs, or an absolute or relative file path. + A config is a plain text file which contains a list of parameters and + their values, one per line, with a space separating parameter from value.
+ Interesting config files include: +

+
    +
  • +

    +alto — Output in ALTO format (OUTPUTBASE.xml). +

    +
  • +
  • +

    +hocr — Output in hOCR format (OUTPUTBASE.hocr). +

    +
  • +
  • +

    +pdf — Output PDF (OUTPUTBASE.pdf). +

    +
  • +
  • +

    +tsv — Output TSV (OUTPUTBASE.tsv). +

    +
  • +
  • +

    +txt — Output plain text (OUTPUTBASE.txt). +

    +
  • +
  • +

    +get.images — Write processed input images to file (OUTPUTBASE.processedPAGENUMBER.tif). +

    +
  • +
  • +

    +logfile — Redirect debug messages to file (tesseract.log). +

    +
  • +
  • +

    +lstm.train — Output files used by LSTM training (OUTPUTBASE.lstmf). +

    +
  • +
  • +

    +makebox — Write box file (OUTPUTBASE.box). +

    +
  • +
  • +

    +quiet — Redirect debug messages to /dev/null. +

    +
  • +
+
+
+

It is possible to select several config files, for example +tesseract image.png demo alto hocr pdf txt will create four output files +demo.xml, demo.hocr, demo.pdf and demo.txt with the OCR results.

+

Nota bene: The options -l LANG, -l SCRIPT and --psm N +must occur before any CONFIGFILE.

+
+
+
+
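The ordering rule above (language and page segmentation options before any CONFIGFILE) is easy to get wrong when a command line is assembled by a script. The sketch below shows one way to keep the order fixed. The helper name `build_tesseract_command` is hypothetical, and it only builds the argument list from the options documented on this page.

```python
from typing import List, Optional, Sequence


def build_tesseract_command(
    image: str,
    outputbase: str,
    languages: Sequence[str] = (),
    psm: Optional[int] = None,
    configs: Sequence[str] = (),
) -> List[str]:
    """Return an argv list with -l and --psm placed before any config names."""
    if psm is not None and not 0 <= psm <= 13:
        # The --psm table on this page lists modes 0 through 13.
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    cmd = ["tesseract", image, outputbase]
    if languages:
        # Multiple languages or scripts are joined with plus characters.
        cmd += ["-l", "+".join(languages)]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    # Config names always go last.
    cmd += list(configs)
    return cmd
```

Calling `build_tesseract_command("image.png", "demo", ["eng", "deu"], psm=6, configs=["hocr", "pdf"])` yields the arguments for `tesseract image.png demo -l eng+deu --psm 6 hocr pdf`, which can then be handed to `subprocess.run`.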

SINGLE OPTIONS

+
+
+
+-h, --help +
+
+

+ Show help message. +

+
+
+--help-extra +
+
+

+ Show extra help for advanced users. +

+
+
+--help-psm +
+
+

+ Show page segmentation modes. +

+
+
+--help-oem +
+
+

+ Show OCR Engine modes. +

+
+
+-v, --version +
+
+

+ Returns the current version of the tesseract(1) executable. +

+
+
+--list-langs +
+
+

+ List available languages for tesseract engine. + Can be used with --tessdata-dir PATH. +

+
+
+--print-parameters +
+
+

+ Print tesseract parameters. +

+
+
+
+
+
+

LANGUAGES AND SCRIPTS

+
+

To recognize some text with Tesseract, it is normally necessary to specify +the language(s) or script(s) of the text (unless it is English text which is +supported by default) using -l LANG or -l SCRIPT.

+

Selecting a language automatically also selects the language specific +character set and dictionary (word list).

+

Selecting a script typically selects all characters of that script +which can be from different languages. The dictionary which is included +also contains a mix from different languages. +In most cases, a script also supports English. +So it is possible to recognize a language that has not been specifically +trained for by using traineddata for the script it is written in.

+

More than one language or script may be specified by using +. +Example: tesseract myimage.png myimage -l eng+deu+fra.

+

https://github.com/tesseract-ocr/tessdata_fast provides fast language and +script models which are also part of Linux distributions.

+

For Tesseract 4, tessdata_fast includes traineddata files for the +following languages:

+

afr (Afrikaans), +amh (Amharic), +ara (Arabic), +asm (Assamese), +aze (Azerbaijani), +aze_cyrl (Azerbaijani - Cyrillic), +bel (Belarusian), +ben (Bengali), +bod (Tibetan), +bos (Bosnian), +bre (Breton), +bul (Bulgarian), +cat (Catalan; Valencian), +ceb (Cebuano), +ces (Czech), +chi_sim (Chinese simplified), +chi_tra (Chinese traditional), +chr (Cherokee), +cos (Corsican), +cym (Welsh), +dan (Danish), +deu (German), +div (Dhivehi), +dzo (Dzongkha), +ell (Greek, Modern, 1453-), +eng (English), +enm (English, Middle, 1100-1500), +epo (Esperanto), +equ (Math / equation detection module), +est (Estonian), +eus (Basque), +fas (Persian), +fao (Faroese), +fil (Filipino), +fin (Finnish), +fra (French), +frk (Frankish), +frm (French, Middle, ca.1400-1600), +fry (West Frisian), +gla (Scottish Gaelic), +gle (Irish), +glg (Galician), +grc (Greek, Ancient, to 1453), +guj (Gujarati), +hat (Haitian; Haitian Creole), +heb (Hebrew), +hin (Hindi), +hrv (Croatian), +hun (Hungarian), +hye (Armenian), +iku (Inuktitut), +ind (Indonesian), +isl (Icelandic), +ita (Italian), +ita_old (Italian - Old), +jav (Javanese), +jpn (Japanese), +kan (Kannada), +kat (Georgian), +kat_old (Georgian - Old), +kaz (Kazakh), +khm (Central Khmer), +kir (Kirghiz; Kyrgyz), +kmr (Kurdish Kurmanji), +kor (Korean), +kor_vert (Korean vertical), +lao (Lao), +lat (Latin), +lav (Latvian), +lit (Lithuanian), +ltz (Luxembourgish), +mal (Malayalam), +mar (Marathi), +mkd (Macedonian), +mlt (Maltese), +mon (Mongolian), +mri (Maori), +msa (Malay), +mya (Burmese), +nep (Nepali), +nld (Dutch; Flemish), +nor (Norwegian), +oci (Occitan post 1500), +ori (Oriya), +osd (Orientation and script detection module), +pan (Panjabi; Punjabi), +pol (Polish), +por (Portuguese), +pus (Pushto; Pashto), +que (Quechua), +ron (Romanian; Moldavian; Moldovan), +rus (Russian), +san (Sanskrit), +sin (Sinhala; Sinhalese), +slk (Slovak), +slv (Slovenian), +snd (Sindhi), +spa (Spanish; Castilian), +spa_old (Spanish; Castilian - Old), +sqi (Albanian), 
+srp (Serbian), +srp_latn (Serbian - Latin), +sun (Sundanese), +swa (Swahili), +swe (Swedish), +syr (Syriac), +tam (Tamil), +tat (Tatar), +tel (Telugu), +tgk (Tajik), +tha (Thai), +tir (Tigrinya), +ton (Tonga), +tur (Turkish), +uig (Uighur; Uyghur), +ukr (Ukrainian), +urd (Urdu), +uzb (Uzbek), +uzb_cyrl (Uzbek - Cyrillic), +vie (Vietnamese), +yid (Yiddish), +yor (Yoruba)

+

To use a non-standard language pack named foo.traineddata, set the +TESSDATA_PREFIX environment variable so the file can be found at +TESSDATA_PREFIX/tessdata/foo.traineddata and give Tesseract the +argument -l foo.

+

For Tesseract 4, tessdata_fast includes traineddata files for the +following scripts:

+

Arabic, +Armenian, +Bengali, +Canadian_Aboriginal, +Cherokee, +Cyrillic, +Devanagari, +Ethiopic, +Fraktur, +Georgian, +Greek, +Gujarati, +Gurmukhi, +HanS (Han simplified), +HanS_vert (Han simplified, vertical), +HanT (Han traditional), +HanT_vert (Han traditional, vertical), +Hangul, +Hangul_vert (Hangul vertical), +Hebrew, +Japanese, +Japanese_vert (Japanese vertical), +Kannada, +Khmer, +Lao, +Latin, +Malayalam, +Myanmar, +Oriya (Odia), +Sinhala, +Syriac, +Tamil, +Telugu, +Thaana, +Thai, +Tibetan, +Vietnamese.

+

The same languages and scripts are available from +https://github.com/tesseract-ocr/tessdata_best. +tessdata_best provides slow language and script models. +These models are needed for training. They also can give better OCR results, +but the recognition takes much more time.

+

Both tessdata_fast and tessdata_best only support the LSTM OCR engine.

+

There is a third repository, https://github.com/tesseract-ocr/tessdata, +with models which support both the Tesseract 3 legacy OCR engine and the +Tesseract 4 LSTM OCR engine.

+
+
+
+

CONFIG FILES AND AUGMENTING WITH USER DATA

+
+

Tesseract config files consist of lines with parameter-value pairs (space +separated). The parameters are documented as flags in the source code like +the following one in tesseractclass.h:

+

STRING_VAR_H(tessedit_char_blacklist, "", + "Blacklist of chars not to recognize");

+

These parameters may enable or disable various features of the engine, and +may cause it to load (or not load) various data. For instance, let’s suppose +you want to OCR in English, but suppress the normal dictionary and load an +alternative word list and an alternative list of patterns — these two files +are the most commonly used extra data files.

+

If your language pack is in /path/to/eng.traineddata and the hocr config +is in /path/to/configs/hocr then create three new files:

+

/path/to/eng.user-words:

+
+
the
+quick
+brown
+fox
+jumped
+
+
+

/path/to/eng.user-patterns:

+
+
1-\d\d\d-GOOG-411
+www.\n\\\*.com
+
+
+

/path/to/configs/bazaar:

+
+
load_system_dawg     F
+load_freq_dawg       F
+user_words_suffix    user-words
+user_patterns_suffix user-patterns
+
+
+

Now, if you pass the word bazaar as a CONFIGFILE to +Tesseract, Tesseract will not bother loading the system dictionary nor +the dictionary of frequent words and will load and use the eng.user-words +and eng.user-patterns files you provided. The former is a simple word list, +one per line. The format of the latter is documented in dict/trie.h +on read_pattern_list().
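The three files above can be generated with a short script. This is only a sketch: the function name `write_bazaar_files` is hypothetical, and a temporary directory stands in for `/path/to`. The file names and contents are copied from the listings above.

```python
import tempfile
from pathlib import Path

USER_WORDS = ["the", "quick", "brown", "fox", "jumped"]
# Raw strings keep the backslashes exactly as they appear in the listing above.
USER_PATTERNS = [r"1-\d\d\d-GOOG-411", r"www.\n\\\*.com"]
BAZAAR_CONFIG = [
    "load_system_dawg     F",
    "load_freq_dawg       F",
    "user_words_suffix    user-words",
    "user_patterns_suffix user-patterns",
]


def write_bazaar_files(root):
    """Create the three files under `root` (which stands in for /path/to)."""
    root = Path(root)
    (root / "configs").mkdir(parents=True, exist_ok=True)
    files = {
        root / "eng.user-words": USER_WORDS,
        root / "eng.user-patterns": USER_PATTERNS,
        root / "configs" / "bazaar": BAZAAR_CONFIG,
    }
    for path, lines in files.items():
        # One entry per line, UTF-8, with a trailing newline.
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return sorted(files)


with tempfile.TemporaryDirectory() as tmp:
    for path in write_bazaar_files(tmp):
        print(path.relative_to(tmp))
```

With the files in place next to `eng.traineddata`, `bazaar` is passed as the CONFIGFILE argument as described above.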

+
+
+
+

ENVIRONMENT VARIABLES

+
+
+
+TESSDATA_PREFIX +
+
+

+ If the TESSDATA_PREFIX is set to a path, then that path is used to + find the tessdata directory with language and script recognition + models and config files. + Using --tessdata-dir PATH is the recommended alternative. +

+
+
+OMP_THREAD_LIMIT +
+
+

+ If the tesseract executable was built with multithreading support, + it will normally use four CPU cores for the OCR process. While this + can be faster for a single image, it performs poorly if the host + computer provides fewer than four CPU cores or if OCR is run on many images. + Only a single CPU core is used with OMP_THREAD_LIMIT=1. +

+
+
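When many images are processed in parallel, each worker process can be started with the thread limit set. The sketch below only prepares the environment; the helper name `single_core_env` is hypothetical, and no Tesseract process is launched.

```python
import os


def single_core_env(base=None):
    """Return a copy of the environment with OMP_THREAD_LIMIT=1."""
    env = dict(os.environ if base is None else base)
    env["OMP_THREAD_LIMIT"] = "1"
    return env


# The result is meant to be passed as env= to subprocess.run() for each worker.
print(single_core_env({"PATH": "/usr/bin"}))
```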
+
+
+
+

HISTORY

+
+

The engine was developed at Hewlett Packard Laboratories Bristol and at +Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some more +changes made in 1996 to port to Windows, and some C++izing in 1998. A +lot of the code was written in C, and then some more was written in C++. +The C++ code makes heavy use of a list system using macros. This predates +STL, was portable before STL, and is more efficient than STL lists, but has +the big negative that if you do get a segmentation violation, it is hard to +debug.

+

Version 2.00 brought Unicode (UTF-8) support, six languages, and the ability +to train Tesseract.

+

Tesseract was included in UNLV’s Fourth Annual Test of OCR Accuracy. +See https://github.com/tesseract-ocr/docs/blob/main/AT-1995.pdf. +Since Tesseract 2.00, +scripts are now included to allow anyone to reproduce some of these tests. +See https://tesseract-ocr.github.io/tessdoc/TestingTesseract.html for more +details.

+

Tesseract 3.00 added a number of new languages, including Chinese, Japanese, +and Korean. It also introduced a new, single-file based system of managing +language data.

+

Tesseract 3.02 added BiDirectional text support, the ability to recognize +multiple languages in a single image, and improved layout analysis.

+

Tesseract 4 adds a new neural net (LSTM) based OCR engine which is focused +on line recognition, but also still supports the legacy Tesseract OCR engine of +Tesseract 3 which works by recognizing character patterns. Compatibility with +Tesseract 3 is enabled by --oem 0. This also needs traineddata files which +support the legacy engine, for example those from the tessdata repository +(https://github.com/tesseract-ocr/tessdata).

+

For further details, see the release notes in the Tesseract documentation +(https://tesseract-ocr.github.io/tessdoc/ReleaseNotes.html).

+
+
+ +
+

SEE ALSO

+
+

ambiguous_words(1), cntraining(1), combine_tessdata(1), dawg2wordlist(1), +shape_training(1), mftraining(1), unicharambigs(5), unicharset(5), +unicharset_extractor(1), wordlist2dawg(1)

+
+
+
+

AUTHOR

+
+

Tesseract development was led at Hewlett-Packard and Google by Ray Smith. +The development team has included:

+

Ahmad Abdulkader, Chris Newton, Dan Johnson, Dar-Shyang Lee, David Eger, +Eric Wiseblatt, Faisal Shafait, Hiroshi Takenaka, Joe Liu, Joern Wanke, +Mark Seaman, Mickey Namiki, Nicholas Beato, Oded Fuhrmann, Phil Cheatle, +Pingping Xiu, Pong Eksombatchai (Chantat), Ranjith Unnikrishnan, Raquel +Romano, Ray Smith, Rika Antonova, Robert Moss, Samuel Charron, Sheelagh +Lloyd, Shobhit Saxena, and Thomas Kielbus.

+ +
+
+
+

COPYING

+
+

Licensed under the Apache License, Version 2.0

+
+
+
+

+ + + diff --git a/Tesseract-OCR/tesseract.exe b/Tesseract-OCR/tesseract.exe new file mode 100644 index 0000000000000000000000000000000000000000..353d03782b1ada4160fcad80c6ff4930a47de914 --- /dev/null +++ b/Tesseract-OCR/tesseract.exe @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:eff536715a0637f7f3416eea212cabfcf2917ec90331e5314021306d26c6db0c +size 1562688 diff --git a/Tesseract-OCR/text2image.1.html b/Tesseract-OCR/text2image.1.html new file mode 100644 index 0000000000000000000000000000000000000000..28e850a035b6af2ffe07c2e91e2ef6c350d22acc --- /dev/null +++ b/Tesseract-OCR/text2image.1.html @@ -0,0 +1,1121 @@ + + + + + +TEXT2IMAGE(1) + + + + + +
+
+

SYNOPSIS

+
+

text2image --text FILE --outputbase PATH --fonts_dir PATH [OPTION]

+
+
+
+

DESCRIPTION

+
+

text2image(1) generates OCR training pages. Given a text file it outputs an image with a given font and degradation.

+
+
+
+

OPTIONS

+
+
+
+--text FILE +
+
+

+ File name of text input to use for creating synthetic training data. (type:string default:) +

+
+
+--outputbase FILE +
+
+

+ Basename for output image/box file (type:string default:) +

+
+
+--fontconfig_tmpdir PATH +
+
+

+ Overrides fontconfig default temporary dir (type:string default:/tmp) +

+
+
+--fonts_dir PATH +
+
+

+ If empty, the system default font location is used. Otherwise, this overrides the system default font location (type:string default:) +

+
+
+--font FONTNAME +
+
+

+ Font description name to use (type:string default:Arial) +

+
+
+--writing_mode MODE +
+
+

+ Specify one of the following writing modes. + horizontal : Render regular horizontal text. (default) + vertical : Render vertical text. Glyph orientation is selected by Pango. + vertical-upright : Render vertical text. Glyph orientation is set to be upright. (type:string default:horizontal) +

+
+
+--tlog_level INT +
+
+

+ Minimum logging level for tlog() output (type:int default:0) +

+
+
+--max_pages INT +
+
+

+ Maximum number of pages to output (0=unlimited) (type:int default:0) +

+
+
+--degrade_image BOOL +
+
+

+ Degrade rendered image with speckle noise, dilation/erosion and rotation (type:bool default:true) +

+
+
+--rotate_image BOOL +
+
+

+ Rotate the image in a random way. (type:bool default:true) +

+
+
+--strip_unrenderable_words BOOL +
+
+

+ Remove unrenderable words from source text (type:bool default:true) +

+
+
+--ligatures BOOL +
+
+

+ Rebuild and render ligatures (type:bool default:false) +

+
+
+--exposure INT +
+
+

+ Exposure level in photocopier (type:int default:0) +

+
+
+--resolution INT +
+
+

+ Pixels per inch (type:int default:300) +

+
+
+--xsize INT +
+
+

+ Width of output image (type:int default:3600) +

+
+
+--ysize INT +
+
+

+ Height of output image (type:int default:4800) +

+
+
+--margin INT +
+
+

+ Margin around the edges of the image (type:int default:100) +

+
+
+--ptsize INT +
+
+

+ Size of printed text (type:int default:12) +

+
+
+--leading INT +
+
+

+ Inter-line space (in pixels) (type:int default:12) +

+
+
+--box_padding INT +
+
+

+ Padding around produced bounding boxes (type:int default:0) +

+
+
+--char_spacing DOUBLE +
+
+

+ Inter-character space in ems (type:double default:0) +

+
+
+--underline_start_prob DOUBLE +
+
+

+ Fraction of words to underline (value in [0,1]) (type:double default:0) +

+
+
+--underline_continuation_prob DOUBLE +
+
+

+ Fraction of words to underline (value in [0,1]) (type:double default:0) +

+
+
+--render_ngrams BOOL +
+
+

+ Put each space-separated entity from the input file into one bounding box. The ngrams in the input file will be randomly permuted before rendering (so that there is sufficient variety of characters on each line). (type:bool default:false) +

+
+
+--output_word_boxes BOOL +
+
+

+ Output word bounding boxes instead of character boxes. This is used for Cube training, and implied by --render_ngrams. (type:bool default:false) +

+
+
+--unicharset_file FILE +
+
+

+ File with characters in the unicharset. If --render_ngrams is true and --unicharset_file is specified, ngrams with characters that are not in unicharset will be omitted (type:string default:) +

+
+
+--bidirectional_rotation BOOL +
+
+

+ Rotate the generated characters both ways. (type:bool default:false) +

+
+
+--only_extract_font_properties BOOL +
+
+

+ Assumes that the input file contains a list of ngrams. Renders each ngram, extracts spacing properties and records them in output_base/[font_name].fontinfo file. (type:bool default:false) +

+
+
+
+
+
+

Use these flags to output zero-padded, square individual character images

+
+
+
+--output_individual_glyph_images BOOL +
+
+

+ If true also outputs individual character images (type:bool default:false) +

+
+
+--glyph_resized_size INT +
+
+

+ Each glyph is square with this side length in pixels (type:int default:0) +

+
+
+--glyph_num_border_pixels_to_pad INT +
+
+

+ Final_size=glyph_resized_size+2*glyph_num_border_pixels_to_pad (type:int default:0) +

+
+
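The size formula given for --glyph_num_border_pixels_to_pad can be checked with a few lines of Python. The function name `final_glyph_size` and the sample values are hypothetical; the arithmetic is the formula stated above.

```python
def final_glyph_size(glyph_resized_size, glyph_num_border_pixels_to_pad):
    """Side length in pixels of the padded square glyph image."""
    if glyph_resized_size < 0 or glyph_num_border_pixels_to_pad < 0:
        raise ValueError("sizes must be non-negative")
    # The border is added on both sides, hence the factor of two.
    return glyph_resized_size + 2 * glyph_num_border_pixels_to_pad


print(final_glyph_size(32, 4))  # 32 + 2*4 = 40
```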
+
+
+
+

Use these flags to find fonts that can render a given text

+
+
+
+--find_fonts BOOL +
+
+

+ Search for all fonts that can render the text (type:bool default:false) +

+
+
+--render_per_font BOOL +
+
+

+ If find_fonts==true, render each font to its own image. Image filenames are of the form output_name.font_name.tif (type:bool default:true) +

+
+
+--min_coverage DOUBLE +
+
+

+ If find_fonts==true, the minimum coverage the font has of the characters in the text file to include it, between 0 and 1. (type:double default:1) +

+
+
+

Example Usage: +``` +text2image --find_fonts \ +--fonts_dir /usr/share/fonts \ +--text ../langdata/hin/hin.training_text \ +--min_coverage .9 \ +--render_per_font \ +--outputbase ../langdata/hin/hin \ +|& grep raw | sed -e 's/ :.*/" \\/g' | sed -e 's/^/ "/' >../langdata/hin/fontslist.txt +```

+
+
+
+

SINGLE OPTIONS

+
+
+
+--list_available_fonts BOOL +
+
+

+ List available fonts and quit. (type:bool default:false) +

+
+
+
+
+
+

HISTORY

+
+

text2image(1) was first made available for tesseract 3.03.

+
+
+
+

RESOURCES

+ +
+
+

SEE ALSO

+
+

tesseract(1)

+
+
+
+

COPYING

+
+

Copyright (C) 2012 Google, Inc. +Licensed under the Apache License, Version 2.0

+
+
+
+

AUTHOR

+
+

The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present).

+
+
+
+

+ + + diff --git a/Tesseract-OCR/text2image.exe b/Tesseract-OCR/text2image.exe new file mode 100644 index 0000000000000000000000000000000000000000..7212075aef06e6341590431d5c6d7772afb864ab --- /dev/null +++ b/Tesseract-OCR/text2image.exe @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:4c35215b499656185246ff16b7f5d1efc6eba41f5f9ef9f6c9014c56f4330e51 +size 9977320 diff --git a/Tesseract-OCR/unicharambigs.5.html b/Tesseract-OCR/unicharambigs.5.html new file mode 100644 index 0000000000000000000000000000000000000000..9d0c91ef45f7c04c6850ed6da4b3fd062f6c3278 --- /dev/null +++ b/Tesseract-OCR/unicharambigs.5.html @@ -0,0 +1,893 @@ + + + + + +UNICHARAMBIGS(5) + + + + + +
+
+

NAME

+
+

unicharambigs - Tesseract unicharset ambiguities

+
+
+
+

DESCRIPTION

+
+

The unicharambigs file (a component of traineddata, see combine_tessdata(1) ) +is used by Tesseract to represent possible ambiguities between characters, +or groups of characters.

+

The file contains a number of lines, laid out as follows:

+
+
+
[num] <TAB> [char(s)] <TAB> [num] <TAB> [char(s)] <TAB> [num]
+
+
+ + + + + + + + + + + + + + + + + + + + +
+Field one +
+
+

+the number of characters contained in field two +

+
+Field two +
+
+

+the character sequence to be replaced +

+
+Field three +
+
+

+the number of characters contained in field four +

+
+Field four +
+
+

+the character sequence used to replace field two +

+
+Field five +
+
+

+contains either 1 or 0. 1 denotes a mandatory +replacement, 0 denotes an optional replacement. +

+
+

Characters appearing in fields two and four should appear in +unicharset. The numbers in fields one and three refer to the +number of unichars (not bytes).

+
+
+
+

EXAMPLE

+
+
+
+
v1
+2       ' '     1       "     1
+1       m       2       r n   0
+3       i i i   1       m     0
+
+

The first line is a version identifier. +In this example, all instances of the 2 character sequence '' will +always be replaced by the 1 character sequence "; a 1 character +sequence m may be replaced by the 2 character sequence rn, and +the 3 character sequence iii may be replaced by the 1 character +sequence m.
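A v1 line can be split into its five fields as follows. This is a sketch rather than Tesseract's own reader: the names `Ambig` and `parse_v1_line` are hypothetical, and it assumes that fields are separated by tabs and that the unichars within a field are separated by single spaces, as in the example above.

```python
from typing import List, NamedTuple


class Ambig(NamedTuple):
    wrong: List[str]        # field two, split into unichars
    replacement: List[str]  # field four, split into unichars
    mandatory: bool         # field five: 1 = mandatory, 0 = optional


def parse_v1_line(line):
    """Parse one tab-separated v1 unicharambigs line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError("expected 5 tab-separated fields")
    n_wrong, wrong, n_repl, repl, kind = fields
    wrong_chars = wrong.split(" ")
    repl_chars = repl.split(" ")
    # Fields one and three must match the number of unichars that follow.
    if len(wrong_chars) != int(n_wrong) or len(repl_chars) != int(n_repl):
        raise ValueError("unichar count does not match the declared number")
    if kind not in ("0", "1"):
        raise ValueError("field five must be 0 or 1")
    return Ambig(wrong_chars, repl_chars, kind == "1")


print(parse_v1_line("1\tm\t2\tr n\t0"))
```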

+

Version 3.03 and on supports a new, simpler format for the unicharambigs +file:

+
+
+
v2
+'' " 1
+m rn 0
+iii m 0
+
+

In this format, the "error" and "correction" are simple UTF-8 strings +separated by a space, and, after another space, the same type specifier +as v1 (0 for optional and 1 for mandatory substitution). Note the downside +of this simpler format is that Tesseract has to encode the UTF-8 strings +into the components of the unicharset. In complex scripts, this encoding +may be ambiguous. In this case, the encoding is chosen so as to use the +fewest UTF-8 characters for each component, i.e. the shortest unicharset +components will make up the encoding.
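The v2 layout is simple enough to split on whitespace. The sketch below uses a hypothetical function name, `parse_v2_line`, and does not attempt the unicharset encoding step mentioned above; it only separates the three fields.

```python
def parse_v2_line(line):
    """Split a v2 line into (error, correction, mandatory)."""
    parts = line.split()
    if len(parts) != 3 or parts[2] not in ("0", "1"):
        raise ValueError("expected: <error> <correction> <0|1>")
    error, correction, kind = parts
    return error, correction, kind == "1"


for sample in ("'' \" 1", "m rn 0", "iii m 0"):
    print(parse_v2_line(sample))
```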

+
+
+
+

HISTORY

+
+

The unicharambigs file first appeared in Tesseract 3.00; prior to that, a +similar format, called DangAmbigs (dangerous ambiguities) was used: the +format was almost identical, except only mandatory replacements could be +specified, and field 5 was absent.

+
+
+
+

BUGS

+
+

This is a documentation "bug": it’s not currently clear what should be done +in the case of ligatures (such as fi) which may also appear as regular +letters in the unicharset.

+
+
+ +
+

AUTHOR

+
+

The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present).

+
+
+
+

+ + + diff --git a/Tesseract-OCR/unicharset.5.html b/Tesseract-OCR/unicharset.5.html new file mode 100644 index 0000000000000000000000000000000000000000..8692456a395efc5157dfff2eedd3b944c0815e1f --- /dev/null +++ b/Tesseract-OCR/unicharset.5.html @@ -0,0 +1,965 @@ + + + + + +UNICHARSET(5) + + + + + +
+
+

DESCRIPTION

+
+

Tesseract’s unicharset file contains information on each symbol +(unichar) the Tesseract OCR engine is trained to recognize.

+

A unicharset file (e.g. eng.unicharset) is distributed as part of a +Tesseract language pack (e.g. eng.traineddata). For information on +extracting the unicharset file, see combine_tessdata(1).

+

The first line of a unicharset file contains the number of unichars in +the file. After this line, each subsequent line provides information for +a single unichar. The first such line contains a placeholder reserved for +the space character. Each unichar is referred to within Tesseract by its +Unichar ID, which is the line number (minus 1) within the unicharset file. +Therefore, space gets unichar 0.

+

Each unichar line in the unicharset file (v2+) may have four space-separated fields:

+
+
+
'character' 'properties' 'script' 'id'
+
+

Starting with Tesseract v3.02, more information may be given for each unichar:

+
+
+
'character' 'properties' 'glyph_metrics' 'script' 'other_case' 'direction' 'mirror' 'normed_form'
+
+

Entries:

+
+
+character +
+
+

+The UTF-8 encoded string to be produced for this unichar. +

+
+
+properties +
+
+

+An integer mask of character properties, one per bit. + From least to most significant bit, these are: isalpha, islower, isupper, + isdigit, ispunctuation. +

+
+
+glyph_metrics +
+
+

+Ten comma-separated integers representing various standards + for where this glyph is to be found within a baseline-normalized coordinate + system where 128 is normalized to x-height. +

+
    +
  • +

    +min_bottom, max_bottom: the ranges where the bottom of the character can + be found. +

    +
  • +
  • +

    +min_top, max_top: the ranges where the top of the character may be found. +

    +
  • +
  • +

    +min_width, max_width: horizontal width of the character. +

    +
  • +
  • +

    +min_bearing, max_bearing: how far from the usual start position does the + leftmost part of the character begin. +

    +
  • +
  • +

    +min_advance, max_advance: how far from the printer’s cell left do we + advance to begin the next character. +

    +
  • +
+
+
+script +
+
+

+Name of the script (Latin, Common, Greek, Cyrillic, Han, null). +

+
+
+other_case +
+
+

+The Unichar ID of the other case version of this character + (upper or lower). +

+
+
+direction +
+
+

+The Unicode BiDi direction of this character, as defined by + ICU’s enum UCharDirection. (0 = Left to Right, 1 = Right to Left, + 2 = European Number…) +

+
+
+mirror +
+
+

+The Unichar ID of the BiDirectional mirror of this character. + For example the mirror of open paren is close paren, but Latin Capital C + has no mirror, so it remains a Latin Capital C. +

+
+
+normed_form +
+
+

+The UTF-8 representation of a "normalized form" of this unichar + for the purpose of blaming a module for errors given ground truth text. + For instance, a left or right single quote may normalize to an ASCII quote. +

+
+
+
+
+
+

EXAMPLE (v2)

+
+
+
+
; 10 Common 46
+b 3 Latin 59
+W 5 Latin 40
+7 8 Common 66
+= 0 Common 93
+
+

";" is a punctuation character. Its properties are thus represented by the +binary number 10000 (10 in hexadecimal).

+

"b" is an alphabetic character and a lower case character. Its properties are +thus represented by the binary number 00011 (3 in hexadecimal).

+

"W" is an alphabetic character and an upper case character. Its properties are +thus represented by the binary number 00101 (5 in hexadecimal).

+

"7" is just a digit. Its properties are thus represented by the binary number +01000 (8 in hexadecimal).

+

"=" is not punctuation nor a digit nor an alphabetic character. Its properties +are thus represented by the binary number 00000 (0 in hexadecimal).

+

Japanese or Chinese alphabetic character properties are represented by the +binary number 00001 (1 in hexadecimal): they are alphabetic, but neither +upper nor lower case.
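The property values discussed above can be decoded mechanically. In this sketch the function name `decode_properties` is hypothetical; the bit order is the one listed under "properties", and the field is read as hexadecimal, as in the examples above.

```python
# Bit order from least to most significant, as listed under "properties".
PROPERTY_BITS = ("isalpha", "islower", "isupper", "isdigit", "ispunctuation")


def decode_properties(field):
    """Return the property names set in a hexadecimal properties field."""
    mask = int(field, 16)
    return [name for bit, name in enumerate(PROPERTY_BITS) if mask & (1 << bit)]


for field in ("10", "3", "5", "8", "0", "1"):
    print(field, decode_properties(field))
```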

+
+
+
+

EXAMPLE (v3.02)

+
+
+
+
110
+NULL 0 NULL 0
+N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
+Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y
+1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
+9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9
+a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a
+. . .
+
+
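To make the field layout of the v3.02 example concrete, here is a small parsing sketch. It is not Tesseract's reader: the names `Unichar` and `parse_unicharset` are hypothetical, fields are assumed to be separated by single spaces, the properties field is read as hexadecimal as in the v2 example, and shorter lines such as the `NULL 0 NULL 0` placeholder are skipped while still counting toward the Unichar ID.

```python
from typing import NamedTuple, Tuple


class Unichar(NamedTuple):
    unichar_id: int
    character: str
    properties: int
    glyph_metrics: Tuple[int, ...]
    script: str
    other_case: int
    direction: int
    mirror: int
    normed_form: str


def parse_unicharset(text):
    """Return (declared_count, entries) for v3.02-style unicharset text."""
    lines = text.splitlines()
    declared = int(lines[0])  # first line: number of unichars in the file
    entries = []
    # The Unichar ID is the position after the count line, starting at 0.
    for unichar_id, line in enumerate(lines[1:]):
        fields = line.split(" ")
        if len(fields) != 8:
            continue  # skip lines that do not carry all eight fields
        char, props, metrics, script, other, direction, mirror, normed = fields
        entries.append(Unichar(
            unichar_id, char, int(props, 16),
            tuple(int(m) for m in metrics.split(",")),
            script, int(other), int(direction), int(mirror), normed,
        ))
    return declared, entries


SAMPLE = """110
NULL 0 NULL 0
N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y
1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9
a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a"""

declared, entries = parse_unicharset(SAMPLE)
for entry in entries:
    print(entry.unichar_id, entry.character, entry.script, entry.mirror)
```

In this excerpt every character's mirror field equals its own Unichar ID, which matches the description of characters that have no BiDi mirror.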
+
+
+

CAVEATS

+
+

Although the unicharset reader maintains the ability to read unicharsets +of older formats and will assign default values to missing fields, +the accuracy will be degraded.

+

Further, most other data files are indexed by the unicharset file, +so changing it without re-generating the others is likely to have dire +consequences.

+
+
+
+

HISTORY

+
+

The unicharset format first appeared with Tesseract 2.00, which was the +first version to support languages other than English. The unicharset file +contained only the first two fields, and the "ispunctuation" property was +absent (punctuation was regarded as "0", as "=" is in the above example).

+
+
+
+

SEE ALSO

+
+

tesseract(1), combine_tessdata(1), unicharset_extractor(1)

+ +
+
+
+

AUTHOR

+
+

The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present).

+
+
+
+

+ + + diff --git a/Tesseract-OCR/unicharset_extractor.1.html b/Tesseract-OCR/unicharset_extractor.1.html new file mode 100644 index 0000000000000000000000000000000000000000..9cd19245ce3ed1ced1101473e75697097c6a78fe --- /dev/null +++ b/Tesseract-OCR/unicharset_extractor.1.html @@ -0,0 +1,804 @@ + + + + + +UNICHARSET_EXTRACTOR(1) + + + + + +
+
+

NAME

+
+

unicharset_extractor - Reads box or plain text files to extract the unicharset.

+
+
+
+

SYNOPSIS

+
+

unicharset_extractor [--output_unicharset filename] [--norm_mode mode] box_or_text_file […]

+

Where mode means: + 1=combine graphemes (use for Latin and other simple scripts) + 2=split graphemes (use for Indic/Khmer/Myanmar) + 3=pure unicode (use for Arabic/Hebrew/Thai/Tibetan)

+
+
+
+

DESCRIPTION

+
+

Tesseract needs to know the set of possible characters it can output. +To generate the unicharset data file, use the unicharset_extractor +program on training pages bounding box files or a plain text file:

+
+
+
unicharset_extractor fontfile_1.box fontfile_2.box ...
+
+

The unicharset will be put into the file ./unicharset if no output filename is provided.

+

NOTE Use the appropriate norm_mode based on the language.
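One way to keep the norm_mode choice explicit is a small lookup keyed by script group. This is a sketch only: the table `NORM_MODE_BY_SCRIPT` and the function `extractor_argv` are hypothetical, the groupings are taken from the SYNOPSIS above, and the command is built but not run.

```python
# Script groups and modes as listed in the SYNOPSIS above.
NORM_MODE_BY_SCRIPT = {
    "Latin": 1,                              # combine graphemes
    "Indic": 2, "Khmer": 2, "Myanmar": 2,    # split graphemes
    "Arabic": 3, "Hebrew": 3, "Thai": 3, "Tibetan": 3,  # pure unicode
}


def extractor_argv(script, inputs, output="unicharset"):
    """Build a unicharset_extractor command line for the given script group."""
    mode = NORM_MODE_BY_SCRIPT[script]
    return [
        "unicharset_extractor",
        "--output_unicharset", output,
        "--norm_mode", str(mode),
        *inputs,
    ]


print(" ".join(extractor_argv("Khmer", ["fontfile_1.box", "fontfile_2.box"])))
```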

+
+
+
+

SEE ALSO

+ +
+
+

HISTORY

+
+

unicharset_extractor first appeared in Tesseract 2.00.

+
+
+
+

COPYING

+
+

Copyright (C) 2006, Google Inc. +Licensed under the Apache License, Version 2.0

+
+
+
+

AUTHOR

+
+

The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present).

+
+
+
+

+ + + diff --git a/Tesseract-OCR/unicharset_extractor.exe b/Tesseract-OCR/unicharset_extractor.exe new file mode 100644 index 0000000000000000000000000000000000000000..1920ecefb9a3564731d8d402d2c193ea6df9ff6e --- /dev/null +++ b/Tesseract-OCR/unicharset_extractor.exe @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:234657d037a96de05a8d41bf798193699d7875bd0ee60199e744e7461bee59fd +size 3771208 diff --git a/Tesseract-OCR/winpath.exe b/Tesseract-OCR/winpath.exe new file mode 100644 index 0000000000000000000000000000000000000000..428e40e914e1fd615ecd931333cb2d192c060499 Binary files /dev/null and b/Tesseract-OCR/winpath.exe differ diff --git a/Tesseract-OCR/wordlist2dawg.1.html b/Tesseract-OCR/wordlist2dawg.1.html new file mode 100644 index 0000000000000000000000000000000000000000..a56322bc19d9ba605ab3ee76d6e96a3442748702 --- /dev/null +++ b/Tesseract-OCR/wordlist2dawg.1.html @@ -0,0 +1,820 @@ + + + + + +WORDLIST2DAWG(1) + + + + + +
+
+

SYNOPSIS

+
+

wordlist2dawg WORDLIST DAWG lang.unicharset

+

wordlist2dawg -t WORDLIST DAWG lang.unicharset

+

wordlist2dawg -r 1 WORDLIST DAWG lang.unicharset

+

wordlist2dawg -r 2 WORDLIST DAWG lang.unicharset

+

wordlist2dawg -l <short> <long> WORDLIST DAWG lang.unicharset

+
+
+
+

DESCRIPTION

+
+

wordlist2dawg(1) converts a wordlist to a Directed Acyclic Word Graph +(DAWG) for use with Tesseract. A DAWG is a compressed, space and time +efficient representation of a word list.

+
+
+
+

OPTIONS

+
+

-t + Verify that a given dawg file is equivalent to a given wordlist.

+

-r 1 + Reverse a word if it contains an RTL character.

+

-r 2 + Reverse all words.

+

-l <short> <long> + Produce a file with several dawgs in it, one each for words + of length <short>, <short+1>,… <long>

+
+
+
+

ARGUMENTS

+
+

WORDLIST + A plain text file in UTF-8, one word per line.

+

DAWG + The output DAWG to write.

+

lang.unicharset + The unicharset of the language. This is the unicharset + generated by mftraining(1).
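The invocation forms in the SYNOPSIS can be assembled programmatically. The helper below is a sketch with a hypothetical name, `wordlist2dawg_argv`; because the SYNOPSIS shows each option on its own, the helper assumes at most one option per call and refuses combinations. It only builds the argument list.

```python
def wordlist2dawg_argv(wordlist, dawg, unicharset, *,
                       verify=False, reverse=None, lengths=None):
    """Build one of the wordlist2dawg command lines shown in the SYNOPSIS."""
    chosen = sum([bool(verify), reverse is not None, lengths is not None])
    if chosen > 1:
        raise ValueError("the SYNOPSIS shows one option at a time")
    argv = ["wordlist2dawg"]
    if verify:
        argv.append("-t")
    if reverse is not None:
        if reverse not in (1, 2):
            raise ValueError("-r takes 1 or 2")
        argv += ["-r", str(reverse)]
    if lengths is not None:
        short, long_ = lengths
        argv += ["-l", str(short), str(long_)]
    return argv + [wordlist, dawg, unicharset]


print(" ".join(wordlist2dawg_argv("words.txt", "eng.word-dawg",
                                  "eng.unicharset", reverse=1)))
```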

+
+
+
+

SEE ALSO

+
+

tesseract(1), combine_tessdata(1), dawg2wordlist(1)

+ +
+
+
+

COPYING

+
+

Copyright (C) 2006 Google, Inc. +Licensed under the Apache License, Version 2.0

+
+
+
+

AUTHOR

+
+

The Tesseract OCR engine was written by Ray Smith and his research groups +at Hewlett Packard (1985-1995) and Google (2006-present).

+
+
+
+

+ + + diff --git a/Tesseract-OCR/wordlist2dawg.exe b/Tesseract-OCR/wordlist2dawg.exe new file mode 100644 index 0000000000000000000000000000000000000000..95f32845acb867da4c1cae440e6f24b445d5ed0b --- /dev/null +++ b/Tesseract-OCR/wordlist2dawg.exe @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:aca149a053588d6176a07b427f8af28b77f78d7ba65cf2f65c14fa603dc727f8 +size 1019632 diff --git a/Tesseract-OCR/zlib1.dll b/Tesseract-OCR/zlib1.dll new file mode 100644 index 0000000000000000000000000000000000000000..0032dbe5c08b6ad27f96b9a53d36430be0b0004d Binary files /dev/null and b/Tesseract-OCR/zlib1.dll differ