The command tesseract -h lists the parameters of the OCR command-line tool.
Notes on some of those parameters:
--psm: page segmentation mode, i.e. how Tesseract is told to split the page (single block, single line, sparse text, and so on) before recognition.
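For example, a typical invocation (image.png, result and chi_sim are placeholder names) that picks the language and the page segmentation mode:
tesseract image.png result -l chi_sim --psm 6   # --psm 6 assumes a single uniform block of text; output goes to result.txt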
combine_tessdata
NAME
combine_tessdata - combine/extract/overwrite/list/compact Tesseract data # "list" here corresponds to the -d option
SYNOPSIS
combine_tessdata [OPTION] FILE...
DESCRIPTION
combine_tessdata(1) is the main program to combine/extract/overwrite/list/compact tessdata components in [lang].traineddata files.
To combine all the individual tessdata components (unicharset, DAWGs, classifier templates, ambiguities, language configs) located at, say, /home/$USER/temp/eng.*, run:
combine_tessdata /home/$USER/temp/eng.
The result will be a combined tessdata file /home/$USER/temp/eng.traineddata
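Once combined, the new file can be sanity-checked with an ordinary tesseract run; a minimal sketch, assuming some test image test.png and pointing --tessdata-dir at the directory that holds the freshly built eng.traineddata:
tesseract test.png out -l eng --tessdata-dir /home/$USER/temp   # should load /home/$USER/temp/eng.traineddata and write out.txt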
Specify option -e if you would like to extract individual components from a combined traineddata file. For example, to extract the language config file and the unicharset from tessdata/eng.traineddata, run:
combine_tessdata -e tessdata/eng.traineddata /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset
The desired config file and unicharset will be written to /home/$USER/temp/eng.config and /home/$USER/temp/eng.unicharset
Specify option -o to overwrite individual components of the given [lang].traineddata file. For example, to overwrite the language config and unichar ambiguities files in tessdata/eng.traineddata, use:
combine_tessdata -o tessdata/eng.traineddata /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharambigs
As a result, tessdata/eng.traineddata will contain the new language config and unichar ambigs, plus all the original DAWGs, classifier templates, etc.
Note: the file names of the files to extract to and to overwrite from should have the appropriate file suffixes (extensions) indicating their tessdata component type (.unicharset for the unicharset, .unicharambigs for unichar ambigs, etc). See k*FileSuffix variable in ccutil/tessdatamanager.h.
Specify option -u to unpack all the components to the specified path:
combine_tessdata -u tessdata/eng.traineddata /home/$USER/temp/eng.
This will create /home/$USER/temp/eng.* files with individual tessdata components from tessdata/eng.traineddata.
OPTIONS
-c .traineddata FILE...  Compacts the LSTM component in the .traineddata file to int.
-d .traineddata FILE...  Lists directory of components from the .traineddata file.
-e .traineddata FILE...  Extracts the specified components from the .traineddata file.
-o .traineddata FILE...  Overwrites the specified components of the .traineddata file with those provided on the command line.
-u .traineddata PATHPREFIX  Unpacks the .traineddata using the provided prefix.
CAVEATS
Prefix refers to the full file prefix, including the period (.) # i.e. pass the prefix with its trailing dot, e.g. /home/$USER/temp/eng.
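The -e, -o and -u options have worked examples above; a minimal sketch of the remaining two (the traineddata path is only illustrative):
combine_tessdata -d tessdata/eng.traineddata   # list the components packed inside the combined file
combine_tessdata -c tessdata/eng.traineddata   # compact the LSTM component to int, as described under -c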
COMPONENTS
The components in a Tesseract lang.traineddata file as of Tesseract 4.0 are briefly described below. For more information on many of these files, see https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract and https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
lang.config (Optional) Language-specific overrides to default config variables. For 4.0 traineddata files, lang.config provides control parameters which can affect layout analysis, and sub-languages. # Not sure whether the layout-analysis parameters here are related to the text segmentation step.
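As an illustration of what such overrides look like, a hypothetical lang.config fragment with one "variable value" pair per line (both variables are standard Tesseract parameters; note that the whitelist only affects the legacy engine in early 4.0 releases):
preserve_interword_spaces 1
tessedit_char_whitelist 0123456789.-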
lang.unicharset (Required - 3.0x legacy tesseract) The list of symbols that Tesseract recognizes, with properties. See unicharset(5).
lang.unicharambigs (Optional - 3.0x legacy tesseract) This file contains information on pairs of recognized symbols which are often confused. For example, rn and m. # For Chinese recognition this could presumably help with look-alike characters such as 日 and 曰.
lang.inttemp (Required - 3.0x legacy tesseract) Character shape templates for each unichar. Produced by mftraining(1).
lang.pffmtable (Required - 3.0x legacy tesseract) The number of features expected for each unichar. Produced by mftraining(1) from .tr files.
lang.normproto (Required - 3.0x legacy tesseract) Character normalization prototypes generated by cntraining(1) from .tr files.
lang.punc-dawg (Optional - 3.0x legacy tesseract) A dawg made from punctuation patterns found around words. The "word" part is replaced by a single space.
lang.word-dawg (Optional - 3.0x legacy tesseract) A dawg made from dictionary words from the language.
lang.number-dawg (Optional - 3.0x legacy tesseract) A dawg made from tokens which originally contained digits. Each digit is replaced by a space character. # Presumably this means the dawg stores the pattern of each numeric token with the digit positions generalized, rather than the literal digits, so any digit can match at those positions.
lang.freq-dawg (Optional - 3.0x legacy tesseract) A dawg made from the most frequent words which would have gone into word-dawg.
lang.fixed-length-dawgs (Optional - 3.0x legacy tesseract) Several dawgs of different fixed lengths — useful for languages like Chinese.
lang.shapetable (Optional - 3.0x legacy tesseract) When present, a shapetable is an extra layer between the character classifier and the word recognizer that allows the character classifier to return a collection of unichar ids and fonts instead of a single unichar-id and font. # Presumably keeping several character/font candidates like this helps accuracy when glyphs are ambiguous.
lang.bigram-dawg (Optional - 3.0x legacy tesseract) A dawg of word bigrams where the words are separated by a space and each digit is replaced by a ?. # A bigram is a pair of adjacent words; see https://en.wikipedia.org/wiki/N-gram
lang.unambig-dawg (Optional - 3.0x legacy tesseract) .
lang.params-model (Optional - 3.0x legacy tesseract) .
lang.lstm (Required - 4.0 LSTM) Neural net trained recognition model generated by lstmtraining.
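A sketch of how this component is typically used for fine-tuning, following the TrainingTesseract-4.00 wiki linked above (all paths and the training list file are placeholders; note that an int-compacted model produced by -c generally cannot be fine-tuned, the float model is needed):
combine_tessdata -e tessdata/eng.traineddata /home/$USER/temp/eng.lstm   # pull the LSTM model out of the combined file
lstmtraining --continue_from /home/$USER/temp/eng.lstm \
             --traineddata tessdata/eng.traineddata \
             --train_listfile /home/$USER/temp/eng.training_files.txt \
             --model_output /home/$USER/temp/finetuned \
             --max_iterations 400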
lang.lstm-punc-dawg (Optional - 4.0 LSTM) A dawg made from punctuation patterns found around words. The "word" part is replaced by a single space. Uses lang.lstm-unicharset.
lang.lstm-word-dawg (Optional - 4.0 LSTM) A dawg made from dictionary words from the language. Uses lang.lstm-unicharset.
lang.lstm-number-dawg (Optional - 4.0 LSTM) A dawg made from tokens which originally contained digits. Each digit is replaced by a space character. Uses lang.lstm-unicharset. # Same digit-replacement idea as lang.number-dawg above: digit positions are generalized so the dawg encodes numeric token patterns, not literal numbers.
lang.lstm-unicharset (Required - 4.0 LSTM) The unicode character set that Tesseract recognizes, with properties. Same unicharset must be used to train the LSTM and build the lstm-*-dawgs files.
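For instance, to rebuild the word dawg against that same unicharset, one possible sketch (assuming wordlist2dawg(1) takes the word list, the output dawg and the unicharset in that order; words.txt is a placeholder word list):
combine_tessdata -e tessdata/eng.traineddata /home/$USER/temp/eng.lstm-unicharset   # the unicharset the LSTM was trained with
wordlist2dawg /home/$USER/temp/words.txt /home/$USER/temp/eng.lstm-word-dawg /home/$USER/temp/eng.lstm-unicharset
combine_tessdata -o tessdata/eng.traineddata /home/$USER/temp/eng.lstm-word-dawg   # write the rebuilt dawg back into the combined file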
lang.lstm-recoder (Required - 4.0 LSTM) Unicharcompress, aka the recoder, which maps the unicharset further to the codes actually used by the neural network recognizer. This is created as part of the starter traineddata by combine_lang_model. # The lstm-recoder component can also be extracted from an existing traineddata file with combine_tessdata, as shown below.
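For example, reusing the -e syntax from the DESCRIPTION section:
combine_tessdata -e tessdata/eng.traineddata /home/$USER/temp/eng.lstm-recoder   # extract just the recoder component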
lang.version (Optional) Version string for the traineddata file. First appeared in version 4.0 of Tesseract. Old versions of traineddata files will report Version string:Pre-4.0.0. 4.0 versions of traineddata files may include the network spec used for LSTM training as part of the version string.
HISTORY
combine_tessdata(1) first appeared in version 3.00 of Tesseract.
SEE ALSO
tesseract(1), wordlist2dawg(1), cntraining(1), mftraining(1), unicharset(5), unicharambigs(5)
COPYING
Copyright (C) 2009, Google Inc. Licensed under the Apache License, Version 2.0
AUTHOR
The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985-1995) and Google (2006-present).