Basic usage of tesserocr

Date: 2019-12-11


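tesserocr is an OCR library for Python. It is a Python API wrapper around tesseract, so tesseract is its actual core.

For installing tesseract, see http://www.javashuo.com/article/p-yfjbydij-mb.html

Installing tesserocr on Windows is a pain: a plain pip install fails with an error, so the only option is to install from a .whl file. Reportedly, pip install only works on Linux; I have not verified this.

whl download page: https://github.com/simonflueckiger/tesserocr-windows_build/releases

The page lists which tesserocr version corresponds to which tesseract version. Pick the matching pair, otherwise unexpected characters can appear in the output.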
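Since the versions have to match, it can help to ask the installed wheel which Tesseract build it was compiled against. The snippet below is a minimal sketch: the helper name tesseract_build_info is made up for this example, while tesserocr.tesseract_version() is the library call that reports the version string.

```python
def tesseract_build_info():
    """Return the Tesseract version string reported by tesserocr, or None.

    The helper name is hypothetical; it only wraps tesserocr.tesseract_version().
    """
    try:
        import tesserocr  # not installed yet -> report None instead of failing
    except ImportError:
        return None
    return tesserocr.tesseract_version()


# Compare the printed version with the table on the releases page.
print(tesseract_build_info())
```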

Installing the whl and testing

λ pip install tesserocr-2.4.0-cp36-cp36m-win_amd64.whl

Script:

import tesserocr
from PIL import Image


img = Image.open('1.png')
result = tesserocr.image_to_text(img)
print(result)

Pitfall encountered:

If you follow the official documentation, install only the tesserocr .whl file, and then try to run the following test code:

import tesserocr
from PIL import Image


img = Image.open('1.png')
result = tesserocr.image_to_text(img)
print(result)

you get the following error:

Traceback (most recent call last):
  File "c:/Users/iwhal/Documents/GitHub/python_notes/notes_of_crawler/code_of_learn_is_ignored/test_of_tesserocr .py", line 4, in <module>
    print(tesserocr.image_to_text(image))
  File "tesserocr.pyx", line 2401, in tesserocr._tesserocr.image_to_text
RuntimeError: Failed to init API, possibly an invalid tessdata path:

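The traceback says that the tessdata path is invalid, so the API cannot be initialized.

The cause: the stand-alone packages include all the libraries needed on Windows, but they do not include the language data files. The data files must all be placed together in a tessdata\ folder, and that folder must sit inside C:\Python36.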
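Before running any OCR, a quick check of the tessdata folder shows whether the language data is actually there. This is a small sketch using only the standard library; the function name find_traineddata is invented for illustration, and the C:\Python36\tessdata path is the one assumed by this article.

```python
from pathlib import Path


def find_traineddata(tessdata_dir):
    """Return the sorted language codes that have a .traineddata file.

    An empty list means the folder is missing or holds no language data,
    which is the situation that produces the "Failed to init API" error.
    """
    folder = Path(tessdata_dir)
    if not folder.is_dir():
        return []
    # "eng.traineddata" -> "eng"
    return sorted(p.stem for p in folder.glob("*.traineddata"))


# Path taken from the article; adjust it to your own Python install.
print(find_traineddata(r"C:\Python36\tessdata"))
```

If the printed list does not contain eng, the default English recognition has nothing to load.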

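There are two ways to get the data files:

Method 1: Install "tesseract-ocr-w64-setup-v4.0.0-beta.1.20180608.exe" as described in the next section (this build is needed to match tesserocr-2.2.2). Then copy the tessdata\ folder from C:\Program Files (x86)\Tesseract-OCR\ into C:\Python36\.
Method 2: Without installing tesseract, clone the main branch of the tesseract repository and copy its tessdata\ folder into C:\Python36\. Then download the eng.traineddata language file from the tessdata_fast repository and place it in C:\Python36\tessdata\.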
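The copy step in both methods can also be scripted. The following is only a sketch: copy_tessdata is a hypothetical helper, and the paths in the commented usage line are the ones named above. It walks the folder by hand rather than relying on newer shutil.copytree options, so it should also run on the Python 3.6 used here.

```python
import shutil
from pathlib import Path


def copy_tessdata(src_dir, python_dir):
    """Copy a tessdata folder (files and subfolders) into python_dir/tessdata.

    Returns the destination folder as a Path. Existing files are overwritten.
    """
    src = Path(src_dir)
    if not src.is_dir():
        raise FileNotFoundError("tessdata source folder not found: " + str(src))
    dest = Path(python_dir) / "tessdata"
    dest.mkdir(parents=True, exist_ok=True)
    for item in src.rglob("*"):
        target = dest / item.relative_to(src)
        if item.is_dir():
            target.mkdir(parents=True, exist_ok=True)
        else:
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(str(item), str(target))
    return dest


# Example call for Method 1 (paths from the article, not verified here):
# copy_tessdata(r"C:\Program Files (x86)\Tesseract-OCR\tessdata", r"C:\Python36")
```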

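So the key to solving this problem is getting hold of tesseract's tessdata\ folder. Installing tesseract itself is not strictly required, but the tesseract version the data comes from must be the right one.

Now try running the earlier code again: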

import tesserocr
from PIL import Image


img = Image.open('1.png')
result = tesserocr.image_to_text(img)
print(result)
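If copying the folder into C:\Python36\ is not convenient, tesserocr.image_to_text also accepts a path argument for the data location. The sketch below rests on assumptions: the helper name ocr_with_tessdata is invented, and whether path should name the tessdata folder itself or its parent (and whether it needs a trailing separator) has differed between Tesseract versions, so treat the value passed here as something to verify against your own build.

```python
from pathlib import Path


def ocr_with_tessdata(image_path, tessdata_dir, lang="eng"):
    """Run OCR on one image using an explicitly given tessdata folder.

    Fails early with FileNotFoundError if <lang>.traineddata is missing,
    instead of waiting for the "Failed to init API" RuntimeError.
    """
    data_file = Path(tessdata_dir) / (lang + ".traineddata")
    if not data_file.is_file():
        raise FileNotFoundError("missing language data: " + str(data_file))

    # Imported lazily so the check above works even without tesserocr installed.
    import tesserocr
    from PIL import Image

    img = Image.open(image_path)
    # Assumption: this build expects the tessdata folder itself as `path`.
    return tesserocr.image_to_text(img, lang=lang, path=str(tessdata_dir))


# Example call (paths from the article):
# print(ocr_with_tessdata("1.png", r"C:\Python36\tessdata"))
```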
