Training and Using tesseract-ocr 3

Date: 2019-11-17


As is well known, this is an excellent character recognition program. The open-source project can be downloaded from http://code.google.com/p/tesseract-ocr/downloads/list.

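Version 3 is recommended over version 2. Version 2 can be embedded directly in a project, but some obvious bugs and other problems often cause the host program to fail or even crash. The command-line version of 3 is therefore the recommended choice.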

Besides the tesseract installer, the download page also offers language data packs. You can also select language packs during installation.

I. Training

In many cases the default language data recognizes text with high accuracy, but sometimes we need to train our own. The training steps are as follows.

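Notes:

A. All examples use DOS commands: save the commands under each step as a .BAT file and run it, or open CMD, change to the tesseract directory and run them there.

B. Keep the file names consistent (DDT is used throughout).

0. Copy the files (the executables) from the training directory into the tesseract3 directory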

copy .\training\*.exe .\

1. Mark the bounding boxes

tesseract ddt.tif ddt -l eng digits batch.nochop makebox

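Some explanation:

ddt.tif is the image to recognize. jpg, gif, tiff and other formats are supported; tif is recommended.

ddt is the output file name (the .box extension is added automatically).

-l eng selects the language data used to mark the boxes.

Everything after that is a config file: additional tesseract parameters are loaded from files rather than typed as arguments.

digits restricts recognition to the digits 0-9 (you can edit it to include more characters). Remove this argument whenever you do not need the restriction. Limiting the character set this way, though, greatly reduces the chance of misrecognized characters turning the editing of ddt.box into a headache.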
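A config file of this kind is a plain text file of `name value` lines. As a minimal sketch, a custom character restriction could be generated as shown here. The helper name, the file name `ddtchars` and the example character set are hypothetical; `tessedit_char_whitelist` is the variable the stock digits config sets, and the sketch assumes the file goes where your installation keeps its other config files (normally tessdata\configs).

```python
from pathlib import Path


def write_whitelist_config(config_dir, name, allowed_chars):
    """Write a config file that limits recognition to allowed_chars.

    The file holds a single "variable value" line, the same layout
    the stock digits config uses.
    """
    if any(ch.isspace() for ch in allowed_chars):
        raise ValueError("whitelist must not contain whitespace")
    path = Path(config_dir) / name
    path.write_text("tessedit_char_whitelist " + allowed_chars + "\n",
                    encoding="utf-8")
    return path
```

The new name would then take the place of digits on the command line, for example `tesseract ddt.tif ddt -l eng ddtchars batch.nochop makebox`.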

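Note:

This step is critical, and it is also where problems most often appear, even if you downloaded bbtesseract from http://code.google.com/p/bbtesseract/downloads/list. I feel I should write my own box-detection tool, but I have not had the time; the commands could easily be wrapped in a program to give them a graphical interface. So edit the generated ddt.box file carefully and make sure every character is recognized and every box is correct.

One small trick: I once made a tif containing 1.2-34567089, and this step recognized only the 2-9 part. I changed the tif to 001.2-34567089 and everything was recognized. This may give you some ideas.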
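Checking ddt.box by eye is tedious, so part of it can be scripted. The sketch below assumes the usual box-file layout, one character per line in the form `char left bottom right top` with the origin at the bottom-left corner of the image (newer versions append a page number, which is ignored here). The function names are hypothetical, not part of tesseract.

```python
from typing import NamedTuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_lines(lines):
    """Parse box-file lines, skipping blank or too-short lines."""
    boxes = []
    for line in lines:
        parts = line.split()
        if len(parts) < 5:
            continue
        left, bottom, right, top = (int(v) for v in parts[1:5])
        boxes.append(Box(parts[0], left, bottom, right, top))
    return boxes


def missing_characters(boxes, expected_text):
    """Return expected characters (spaces ignored) that have no box."""
    found = [b.char for b in boxes]
    missing = []
    for ch in expected_text.replace(" ", ""):
        if ch in found:
            found.remove(ch)
        else:
            missing.append(ch)
    return missing


def degenerate_boxes(boxes):
    """Return boxes with zero or negative width or height."""
    return [b for b in boxes if b.right <= b.left or b.top <= b.bottom]
```

Running `missing_characters` against the text you know is in the tif shows which characters the makebox step dropped, as in the 1.2-34567089 case above.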

2. Build the language data

tesseract ddt.tif ddt -l eng digits nobatch box.train
unicharset_extractor ddt.box
rem mftraining reads the extracted unicharset and writes ddt.unicharset itself
mftraining -U unicharset -O ddt.unicharset ddt.tr
rename inttemp ddt.inttemp
rename pffmtable ddt.pffmtable
rename Microfeat ddt.Microfeat
cntraining ddt.tr
rename normproto ddt.normproto
combine_tessdata ddt.

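This block combines several steps, but the "tutorials" others have copied around are long-winded enough, so I will not repeat them.

Note: the rename commands are necessary because the generated files have no language prefix, only what should be the extension. Keep that in mind and there should be no problems.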
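The rename commands can also be expressed as a small script. This is only a sketch of the same idea: the function name is hypothetical, and unicharset is left out of the list on the assumption that the -O option of mftraining already writes the prefixed ddt.unicharset.

```python
import os

# Files the training tools write without a language prefix.
GENERATED = ("inttemp", "pffmtable", "Microfeat", "normproto")


def prefix_training_outputs(workdir, lang, names=GENERATED):
    """Rename e.g. inttemp to <lang>.inttemp; return the new names."""
    renamed = []
    for name in names:
        src = os.path.join(workdir, name)
        if os.path.isfile(src):
            new_name = lang + "." + name
            os.replace(src, os.path.join(workdir, new_name))
            renamed.append(new_name)
    return renamed
```

Files that are not present are skipped, so the returned list shows at a glance whether an earlier training tool failed to produce its output.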

3. Test the language data

copy ddt.traineddata .\tessdata\ddt.traineddata

tesseract ddt.tif ddt -l ddt
notepad ddt.txt

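If the test fails, check the following:

A. Is the tif too narrow? If so, I suggest adding a line below the text, turning one line into two. What to add? Any characters from your character set, preferably about as wide as the image.

B. If recognition is still wrong, go back and check ddt.box carefully.

If training failed, remember to clean up the files generated earlier. The following commands can be used: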

copy ddt.tif tmp.tif
del ddt.* /f /s
copy tmp.tif ddt.tif
del tmp.tif

Then start over from step 1.
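The same cleanup can be sketched without the temporary copy by deleting everything named ddt.* except the source image. The function name is hypothetical. Unlike `del /s`, this version does not descend into subdirectories, so a ddt.traineddata already copied into tessdata would have to be removed separately.

```python
import glob
import os


def clean_training_files(workdir, lang, keep_ext=".tif"):
    """Delete <lang>.* in workdir except the image; return removed names."""
    removed = []
    pattern = os.path.join(glob.escape(workdir), lang + ".*")
    for path in glob.glob(pattern):
        if not os.path.isfile(path) or path.lower().endswith(keep_ext):
            continue
        os.remove(path)
        removed.append(os.path.basename(path))
    return sorted(removed)
```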

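II. Usage

In use, the one thing to watch is that an image with a single line and few characters may not be recognized. In that case add a filler line below it, roughly as wide as the image.

Note:

Tesseract may fail to find the language data (especially after uninstalling and reinstalling). If so, set the value of

HKEY_CURRENT_USER\Environment\TESSDATA_PREFIX to your tesseract directory.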
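When tesseract is launched from a script, one alternative to editing the registry is to pass TESSDATA_PREFIX only to the child process. The sketch below is one way to do that; the function name is hypothetical, and the trailing separator is added on the assumption that this tesseract version appends tessdata directly to the value.

```python
import os


def tesseract_env(tess_dir, base_env=None):
    """Return a copy of the environment with TESSDATA_PREFIX set."""
    env = dict(os.environ if base_env is None else base_env)
    if tess_dir.endswith(("/", "\\")):
        prefix = tess_dir
    else:
        prefix = tess_dir + os.sep
    env["TESSDATA_PREFIX"] = prefix
    return env
```

The result can be handed to `subprocess.run([...], env=tesseract_env(r"C:\tesseract3"))`, which leaves the user's own environment untouched.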

III. Example

Finally, here is sample code that uses tesseract3 from VB.NET.

Public Class TessOCR

    Dim path As String = My.Application.Info.DirectoryPath & "\tesseract3\"

    Sub New()
        My.Computer.Registry.CurrentUser.OpenSubKey("Environment", True).SetValue("TESSDATA_PREFIX", path)
    End Sub

    Public Function Tess3OCR(ByVal Rect As Rectangle, ByVal clr As Integer) As String
        'Build the image. Use SourceCopy when copying from the screen so the image format suits the OCR; otherwise it fails or exits outright
        Dim bmp As Bitmap = New Bitmap(Rect.Width, Rect.Height * 2)
        Dim gr As Graphics = Graphics.FromImage(bmp)
        gr.Clear(Color.White)
        gr.CopyFromScreen(Rect.Location, Point.Empty, Rect.Size, CopyPixelOperation.SourceCopy)
        'Convert to black text on a white background
        For y As Integer = 0 To bmp.Height - 1
            For x As Integer = 0 To bmp.Width - 1
                If bmp.GetPixel(x, y).ToArgb = clr Then bmp.SetPixel(x, y, Color.Black) Else bmp.SetPixel(x, y, Color.White)
            Next
        Next
        Dim str As String = IIf(clr = AngleColor, "45.000000", "0.000000")
        gr.DrawString(str, New Font("Arial Black", 14), Brushes.Black, 0, Rect.Height)

        bmp.Save(path & "tmp.tif", System.Drawing.Imaging.ImageFormat.Tiff)
        Shell(path & "tesseract " & path & "tmp.tif " & path & "tmp -l ddt digits", AppWinStyle.Hide, True)
        My.Computer.FileSystem.DeleteFile(path & "tmp.tif")
        Dim ret As String = My.Computer.FileSystem.ReadAllText(path & "tmp.txt").Split(vbCrLf)(0)
        My.Computer.FileSystem.DeleteFile(path & "tmp.txt")
        Return ret
    End Function

End Class

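In the constructor (New) I modify the registry to avoid errors; a better approach would be to record the original value first and restore it when the class is destroyed. The code then notes a possible issue with copying from the screen; if you are only grabbing something like a CAPTCHA image, you can ignore it. Next the image gets a simple correction. Note that it must end up as black text on a white background, otherwise nothing is recognized. After that I add a line of filler text below the captured area and handle it when returning the result by keeping only the first line of output. One more point: the last argument of the Shell function says to wait for the called process to finish. In VB6 you would need API calls to wait for it; do not use Sleep or other fixed delays, which would make the program less robust.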
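The correction step can be illustrated independently of VB.NET. The sketch below works on a plain grid of pixel values rather than a real bitmap, and its names are hypothetical. It mirrors the loop in the sample: pixels matching the text color become black, everything else becomes white, and blank rows are appended where a filler line could be drawn.

```python
def to_black_on_white(pixels, text_color, filler_rows=0):
    """Map text_color to black (0) and all other pixels to white (255).

    pixels is a list of rows; filler_rows blank white rows are appended
    below the image as room for a filler line.
    """
    black, white = 0, 255
    out = [[black if p == text_color else white for p in row]
           for row in pixels]
    width = len(pixels[0]) if pixels else 0
    out.extend([white] * width for _ in range(filler_rows))
    return out
```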

Reposted from: http://blog.csdn.net/foxwit/article/details/6547465

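How to Use the Tesseract OCR Engine

I have been working with OCR lately and studied Google's OCR engine, Tesseract, which is a very good recognition tool. tesseract-3.0 already supports page layout analysis and is quite powerful. leptonica and libtiff are optional dependencies, but I recommend installing both before tesseract: without libtiff, only bmp files can be processed.

This section only covers recognizing Chinese. After installing libtiff, leptonica and tesseract in that order, download the Simplified and Traditional Chinese trained data from the tesseract download page and put them in a tessdata folder under some directory. Then set the environment variable TESSDATA_PREFIX to the directory that holds tessdata. Finally, create a file named ocr.cpp with the following code: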

#include <mfcpch.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include "applybox.h"
#include "control.h"
#include "tessvars.h"
#include "tessedit.h"
#include "baseapi.h"
#include "thresholder.h"
#include "pageres.h"
#include "imgs.h"
#include "varabled.h"
#include "tprintf.h"
#include "stderr.h"
#include "notdll.h"
#include "mainblk.h"
#include "output.h"
#include "globals.h"
#include "helpers.h"
#include "blread.h"
#include "tfacep.h"
#include "callnet.h"
#include "allheaders.h"

int main(int argc, char **argv) {
    if (argc != 3) {
        printf("usage:%s <bmp file> <txt file>\n", argv[0]);
        return -1;
    }
    char *image_file = argv[1];
    char *txt_file = argv[2];
    STRING text_out;
    struct timeval beg, end;
    tesseract::TessBaseAPI api;
    IMAGE image;

    api.Init(argv[0], "chi_sim", NULL, 0, false);  // initialize the api object
    api.SetPageSegMode(tesseract::PSM_AUTO);  // enable automatic page layout analysis
    api.SetAccuracyVSpeed(tesseract::AVS_FASTEST);  // ask for the fastest speed

    if (image.read_header(image_file) < 0) {  // read the bmp file's metadata
        printf("Read of file %s failed.\n", image_file);
        exit(1);
    }
    if (image.read(image.get_ysize()) < 0) {  // read the bmp file
        printf("Read of image %s error\n", image_file);
        exit(1);
    }
    invert_image(&image);  // invert every pixel, i.e. 1 becomes 0 and 0 becomes 1

    // number of bytes occupied by each row of pixels
    int bytes_per_line = check_legal_image_size(image.get_xsize(),
                                                image.get_ysize(),
                                                image.get_bpp());
    // hand the image to the engine
    api.SetImage(image.get_buffer(), image.get_xsize(), image.get_ysize(),
                 image.get_bpp() / 8, bytes_per_line);

    gettimeofday(&beg, NULL);
    char* text = api.GetUTF8Text();  // recognize the text in the image
    gettimeofday(&end, NULL);
    // print the recognition time
    printf("%s:recognize sec=%f\n", argv[0],
           end.tv_sec - beg.tv_sec + (double)(end.tv_usec - beg.tv_usec) / 1000000.0);

    text_out += text;
    delete [] text;

    FILE* fout = fopen(txt_file, "w");
    if (fout == NULL) {
        printf("Cannot open output file %s\n", txt_file);
        exit(1);
    }
    fwrite(text_out.string(), 1, text_out.length(), fout);  // write the result to the output file
    fclose(fout);
    return 0;
}

Then write a makefile as follows:

all:ocr

CFLAGS=-Wall -g

LDFLAGS= -lz -lm -ltesseract_textord \
-ltesseract_wordrec -ltesseract_classify -ltesseract_dict -ltesseract_ccstruct \
-ltesseract_ccstruct -ltesseract_cutil -ltesseract_viewer -ltesseract_ccutil \
-ltesseract_api -ltesseract_image -ltesseract_main -llept

LD_LIBRARY_PATH =

INCLUDES= -I/usr/local/include/tesseract/ -I/usr/local/include/leptonica/

%.o:%.cpp
	g++ -c $(CFLAGS) $(INCLUDES) -o $@ $<

ocr:ocr.o
	g++ -o $@ $^ -g $(LD_LIBRARY_PATH) $(LDFLAGS)

clean:
	rm ocr.o

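Run make in that directory to build the executable ocr, then run ./ocr 1.bmp 1.txt to write the recognition result for 1.bmp to 1.txt; the program prints the recognition time. Note that Chinese recognition in tesseract is very slow, and runs of several minutes are normal. Does anyone know how to tune it?

More frustrating still, tesseract does not support multithreading, so multiple instances cannot run in the same process.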
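Since several instances cannot share one process, one possible workaround is to start a separate tesseract process per image and let a small driver wait for them. The sketch below assumes the command-line form used earlier (`tesseract <image> <output base> -l <lang>`); the function names are hypothetical. The threads here only wait on child processes, so each recognition still runs in its own process.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ocr_command(image, out_base, lang, exe="tesseract"):
    """Build the argument list for one command-line OCR run."""
    return [exe, image, out_base, "-l", lang]


def run_all(commands, workers=2):
    """Run each command as its own process; return exit codes in order."""
    def run(cmd):
        return subprocess.run(cmd).returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, commands))
```

A batch could then be started with `run_all([ocr_command(f, f[:-4], "chi_sim") for f in files])`, where `files` is a list of .tif paths. Whether this helps depends on how many cores are free, since each process is still slow on its own.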

Other reference blogs:

1. http://blog.csdn.net/zhoushuyan/article/details/5948289

2. http://www.blogjava.net/wangxinsh55/archive/2011/03/22/346787.html

3. http://haiquan.iteye.com/blog/945701

4. http://www.cnblogs.com/brooks-dotnet/archive/2010/10/05/1844203.html

5. http://www.cnblogs.com/physoft/archive/2011/07/15/2107417.html

6. http://hi.baidu.com/kuliuheng/blog/item/aae32d32216a9fcda2cc2ba1.html

7. http://code.google.com/p/leptonica/downloads/list

8. http://tesseract-ocr.repairfaq.org/

9. http://blog.wudilabs.org/entry/f25efc5f/
