Tesseract 3.02 OCR Text Recognition: Investigation Notes

  • Installation and usage:

Tesseract download page:

https://code.google.com/p/tesseract-ocr/

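The latest version at the time of writing is 3.02.

After downloading and extracting the Windows build, open a command prompt, change to the extracted directory, and run the program from there.

Command format: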

Usage: tesseract.exe imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]

pagesegmode values are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
-l lang and/or -psm pagesegmode must occur before anyconfigfile.

Single options:
  -v --version: version info
  --list-langs: list available languages for tesseract engine

Example command:

F:\Tesseract-OCR>tesseract.exe 2013-09-05_154628.jpg eng -l eng -psm 6
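
When many images have to be recognized, the command format above can be assembled programmatically. The following is a minimal Python sketch, not part of Tesseract itself: the helper name `build_tesseract_command` is hypothetical, and it assumes `tesseract.exe` can be found from the current directory or the PATH. It only mirrors the usage string shown above.

```python
def build_tesseract_command(image, output_base, lang=None, psm=None,
                            exe="tesseract.exe"):
    """Return the argument list for one Tesseract 3.02 run.

    Hypothetical helper: it only mirrors the usage string
    tesseract.exe imagename outputbase [-l lang] [-psm pagesegmode].
    """
    # The usage text lists pagesegmode values 0 through 10.
    if psm is not None and not 0 <= psm <= 10:
        raise ValueError("pagesegmode must be between 0 and 10")
    cmd = [exe, image, output_base]
    # -l and -psm come right after outputbase, before any configfile.
    if lang is not None:
        cmd += ["-l", lang]
    if psm is not None:
        cmd += ["-psm", str(psm)]
    return cmd


cmd = build_tesseract_command("2013-09-05_154628.jpg", "eng", lang="eng", psm=6)
print(" ".join(cmd))
```

To actually run the recognition, the list can be passed to `subprocess.run(cmd, check=True)`; the recognized text is then written to a text file named after the output base (`eng.txt` in this example).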

Related command-line tools:

Command                      Function
ambiguous_words.exe
classifier_tester.exe
cntraining.exe
combine_tessdata.exe         Combines the training files
dawg2wordlist.exe
mftraining.exe
shapeclustering.exe
tesseract.exe                Recognition program
unicharset_extractor.exe
wordlist2dawg.exe

 

 

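  • Training language data

For the language data files that are required, refer to the source code:

tesseract-ocr\ccutil\tessdatamanager.h

Format requirements for the configuration files that belong to the language data: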

ASCII or UTF-8 encoding without BOM

Unix end-of-line marker ('\n')

The last character must be an end-of-line marker ('\n'). Some text editors show this as an empty line at the end of the file. If you omit it, you will get an error message containing "last_char == '\n':Error:Assert failed..."
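
The three format requirements above can be enforced with a small script before training. This is a minimal sketch under stated assumptions: the function names `normalize_config_bytes` and `normalize_config_file` are hypothetical, and the input is assumed to already be ASCII or UTF-8 (anything else raises `UnicodeDecodeError` rather than being converted).

```python
import codecs


def normalize_config_bytes(raw):
    """Return raw as UTF-8 without BOM, with LF line ends and a trailing LF."""
    # Drop a leading UTF-8 byte order mark if one is present.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    text = raw.decode("utf-8")  # fails loudly on other encodings
    # Convert Windows (CRLF) and old Mac (CR) line ends to Unix (LF).
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # The last character must be an end-of-line marker.
    if not text.endswith("\n"):
        text += "\n"
    return text.encode("utf-8")


def normalize_config_file(path):
    """Rewrite the file at path in place in the required format."""
    with open(path, "rb") as f:
        raw = f.read()
    with open(path, "wb") as f:
        f.write(normalize_config_bytes(raw))
```

Working on bytes and opening the file in binary mode keeps Python on Windows from translating the line ends back to CRLF when writing.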

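Steps:

1. Generate the training images

Some guidelines:

Make sure each character appears about 10 times in general, 20 times for common characters, and 5 times for rare ones;

Do not put all the special characters together; use combinations that are closer to real usage;

Very important: keep some spacing between characters and between lines, otherwise training may fail. (This may have been fixed in versions after 3.0.)

Training data must be grouped by font: text in the same font goes into the same TIFF file (multi-page TIFFs are supported);

Unless the font is very small (height under 15 px), there is no need to train at different sizes;

Never mix several fonts in the same image file.

(See the boxtiff sample files on the download page.)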
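
The frequency guideline can be checked on the training text before it is rendered to an image. The sketch below is only an illustration: the function name `underrepresented_chars` and the default threshold of 10 are assumptions taken from the guideline above, not something Tesseract provides.

```python
from collections import Counter


def underrepresented_chars(text, min_count=10):
    """Return {char: count} for characters seen fewer than min_count times.

    Whitespace is ignored because it is not a trained character.
    """
    counts = Counter(ch for ch in text if not ch.isspace())
    return {ch: n for ch, n in counts.items() if n < min_count}


# Hypothetical training text: 'a' appears 10 times, 'b' only 3 times.
sample = "aaaaaaaaaa bbb"
print(underrepresented_chars(sample))
```

Characters reported here either need more samples in the training page or, for rare characters, can be rechecked with a lower threshold such as `min_count=5`.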

Next, print and scan (or use some electronic rendering method) to create an image of your training page. Up to 32 training files can be used (each may have multiple pages). It is best to create a mix of fonts and styles (but in separate files), including italic and bold.

Generate the TIFF files.

2. Create the box files

Command to generate a box file:

tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop makebox

Example:

tesseract eng.timesitalic.exp0.tif eng.timesitalic.exp0 batch.nochop makebox
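
Because every training image follows the `[lang].[fontname].exp[num]` naming pattern, the makebox command can be generated for each font instead of being typed by hand. A minimal sketch, assuming the `tesseract` executable is on the PATH; the helper name `makebox_command` is hypothetical.

```python
def makebox_command(lang, fontname, num, exe="tesseract"):
    """Build the makebox command for [lang].[fontname].exp[num].tif."""
    # Image and output base share the same stem; only the image has .tif.
    base = "{}.{}.exp{}".format(lang, fontname, num)
    return [exe, base + ".tif", base, "batch.nochop", "makebox"]


print(" ".join(makebox_command("eng", "timesitalic", 0)))
```

For the arguments shown, the printed line is the same as the example command above.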

 

 

3. Obtain a new character set

 

  • Other

Reference documentation:

The doc directory of the extracted package contains the API documentation.

 

--end--
