Recognizing Text in Images - A Comparison of Tesseract and Baidu Cloud OCR

Date: 2019-12-06

Tags: image text recognition, tesseract, Baidu, OCR, comparison


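These days "artificial intelligence" is a buzzword you hear everywhere, and plenty of people have probably heard of OCR as well.

OCR (Optical Character Recognition) is the process in which an electronic device (such as a scanner or a digital camera) examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then uses character recognition methods to translate those shapes into computer text.

This article records two attempts at using OCR from Python.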

Tesseract

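Tesseract is an open-source OCR engine originally developed at HP Labs and now maintained by Google. Its selling points are that it is open source, free, and supports many languages and platforms.

Project URL: https://github.com/tesseract-...

Installation and usage

Installing Tesseract is fairly simple. On macOS it can be installed with brew.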

brew install --with-training-tools tesseract

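On Windows it can be installed with an exe installer; the download link can be found in the wiki of the GitHub project. After installation, remember to add the directory containing the Tesseract executable to PATH so that it is easy to call later.

Also note that the default installation includes the English language data. If you need Simplified or Traditional Chinese, tick those languages during installation.

Alternatively, download them from the project after installation: https://github.com/tesseract-...

Put the downloaded language files into the tessdata directory under the installation directory. On Windows you also need to add the tessdata directory to the environment variables.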

TESSDATA_PREFIX=C:\Program Files (x86)\Tesseract-OCR\tessdata
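
Before moving on, it can help to confirm that the executable and the language files are where Tesseract expects them. The following is a minimal sketch, not part of the original setup steps: the helper names `missing_languages` and `check_setup` are hypothetical, and it assumes, as in the setting above, that TESSDATA_PREFIX points at the tessdata directory itself and that each language is stored as a `<lang>.traineddata` file.

```python
import os
import shutil
from pathlib import Path


def missing_languages(tessdata_dir, langs=("eng", "chi_sim")):
    """Return the language codes that have no .traineddata file in tessdata_dir."""
    tessdata = Path(tessdata_dir)
    return [lang for lang in langs
            if not (tessdata / f"{lang}.traineddata").is_file()]


def check_setup(langs=("eng", "chi_sim")):
    """Return a list of problems found; an empty list means the setup looks fine."""
    problems = []
    # The tesseract executable must be reachable through PATH.
    if shutil.which("tesseract") is None:
        problems.append("tesseract executable not found on PATH")
    # Assumption: TESSDATA_PREFIX holds the path of the tessdata directory.
    tessdata_dir = os.environ.get("TESSDATA_PREFIX")
    if not tessdata_dir:
        problems.append("TESSDATA_PREFIX is not set")
    else:
        for lang in missing_languages(tessdata_dir, langs):
            problems.append(f"missing language data: {lang}.traineddata")
    return problems
```

Calling `print(check_setup())` lists whatever is still missing for English and Simplified Chinese.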

If everything is in place, you can now use the tesseract command on the command line.

# tesseract
Usage:
  tesseract --help | --help-psm | --help-oem | --version
  tesseract --list-langs [--tessdata-dir PATH]
  tesseract --print-parameters [options...] [configfile...]
  tesseract imagename|stdin outputbase|stdout [options...] [configfile...]

OCR options:
  --tessdata-dir PATH   Specify the location of tessdata path.
  --user-words PATH     Specify the location of user words file.
  --user-patterns PATH  Specify the location of user patterns file.
  -l LANG[+LANG]        Specify language(s) used for OCR.
  -c VAR=VALUE          Set value for config variables.
                        Multiple -c arguments are allowed.
  --psm NUM             Specify page segmentation mode.
  --oem NUM             Specify OCR Engine mode.
NOTE: These options must occur before any configfile.

From the command line you can already carry out simple image text recognition tasks.

tesseract test.png outfile -l chi_sim
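
The same command can also be scripted. The sketch below is an illustration added here rather than something from the original article: it builds the argument list shown above and, following the usage text (`outputbase|stdout`), passes `stdout` as the output base so that the text can be captured instead of written to a file. The function names are hypothetical, and it assumes that `tesseract` is on PATH and that its output is UTF-8.

```python
import subprocess


def build_tesseract_command(image_path, lang="chi_sim", output_base="stdout"):
    """Build the argument list for the tesseract command-line tool."""
    # Same shape as the command above: tesseract IMAGE OUTPUTBASE -l LANG
    return ["tesseract", str(image_path), output_base, "-l", lang]


def run_tesseract(image_path, lang="chi_sim"):
    """Run tesseract and return the recognized text from standard output."""
    completed = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return completed.stdout.decode("utf-8")
```

For example, `run_tesseract("test.png")` would return the recognized text as a string.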

Calling it from Python

Once Tesseract is installed, it is easy to call from Python. You need to install two packages.

pip install pillow
pip install pytesseract

A simple function that converts an image to text looks like this.

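from PIL import Image
import pytesseract

# Tesseract language codes used in this article.
class Languages:
    CHS = 'chi_sim'  # Simplified Chinese
    CHT = 'chi_tra'  # Traditional Chinese
    ENG = 'eng'      # English

def img_to_str(image_path, lang=Languages.ENG):
    # Open the image with Pillow and hand it to Tesseract for recognition.
    return pytesseract.image_to_string(Image.open(image_path), lang=lang)

print(img_to_str('image/test1.png', lang=Languages.CHS))
print(img_to_str('image/test2.png', lang=Languages.CHS))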
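
The results in this article start with a line of the form `process image file "..." in N seconds`, which the function as listed does not print, so a timing helper was presumably left out. A minimal sketch of what such a helper could look like follows; the decorator name `timed` and the use of `time.perf_counter` are assumptions, not the author's code.

```python
import functools
import time


def timed(func):
    """Print how long func took; its first argument is taken to be the image path."""
    @functools.wraps(func)
    def wrapper(image_path, *args, **kwargs):
        start = time.perf_counter()
        result = func(image_path, *args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f'process image file "{image_path}" in {elapsed} seconds')
        return result
    return wrapper
```

Placing `@timed` on the line above `def img_to_str(...)` would print the elapsed time on every call.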

Test image - test1.png:

Recognition result:

process image file "image/test1.png" in 1.4782530478747697 seconds

8 所 調 人 , 在 - 方 。
深 從 久 , 定 中 央
。 所 澈 伊 人 , 圭 水 淳
。 淇 渡 從 之 , 定 圭 北 中 阪 。
。 所 澈 伊人 , 圭 水 浩
從 丿 , 定 圭 水 中 瀝 。

Test image - test2.png:

Recognition result:

process image file "image/test2.png" in 1.2131140296607923 seconds

清 明 時 節 雨 紛 紛 , 路 上 行 人 欲 斷 魂
信 問 酒 家 何 處 有 , 牧 奕 通 指 槍 花 村 。

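Summary

Tesseract does acceptably on clear, standard Chinese fonts, but with anything slightly more complex the results are poor, and it is also slow. Personally, I think its only real advantage is that it is free. If you don't mind investing more time, consider using the training feature it provides to build your own language data; recognition in a specific scenario should then improve noticeably.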

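Baidu Cloud OCR

This was an accidental discovery: Baidu Cloud offers a free quota for its OCR API, currently 500 calls per day, which is just about enough for research or a small application. The main goal here is to test how well it works.

Documentation: https://cloud.baidu.com/doc/O...

Installation and usage

First you need to register a Baidu Cloud (BCE) account, then create a new text recognition application from the console.

After that you get the AppID, API Key and Secret Key needed to call the API. From there, just follow the official documentation step by step.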

pip install baidu-aip

Wrapping and calling

Reference documentation: https://cloud.baidu.com/doc/O...

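from aip import AipOcr

# Credentials from the Baidu Cloud console; replace the placeholders with your own.
config = {
    'appId': 'your-id',
    'apiKey': 'your-key',
    'secretKey': 'your-secret-key'
}

client = AipOcr(**config)

def get_file_content(file):
    # The API expects the raw bytes of the image file.
    with open(file, 'rb') as fp:
        return fp.read()

def img_to_str(image_path):
    image = get_file_content(image_path)
    result = client.basicGeneral(image)
    # A successful response carries a 'words_result' list, one entry per line of text.
    if 'words_result' in result:
        return '\n'.join([w['words'] for w in result['words_result']])
    # No text in the response: return an empty string rather than an implicit None.
    return ''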

Test image - test1.png:

Recognition result:

process image file "image/test1.png" in 0.6331169034812572 seconds

蒹葭
先秦:佚名
蒹葭蒼蒼,白露爲霜。所謂伊人,在水一方。
溯洄從之,道阻且長。溯游從之,宛在水中央。
蒹葭萋萋,白露未晞。所謂伊人,在水之湄。
溯洄從之,道陽且躋。溯游從之,宛在水中坻。
蒹葭采采,白露未已。所謂伊人,在水之涘。
溯洄從之,道阻且右。溯游從之,宛在水中沚。

Test image - test2.png:

Recognition result:

process image file "image/test2.png" in 0.6621812639450142 seconds

清明時節雨紛紛,路上行人慾斷魂。
借問酒家何處有,牧童遙指杏花村。
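
To put a rough number on how far an OCR result is from the expected text, the two strings can be compared character by character. The sketch below is an illustration added here, not part of the original test: it uses `difflib.SequenceMatcher` from the standard library, the helper names are hypothetical, and the reference string for test2.png was typed by hand. The Tesseract string is copied from the result shown earlier.

```python
import difflib


def normalize(text):
    """Remove all whitespace, including the spaces Tesseract puts between characters."""
    return "".join(text.split())


def char_similarity(ocr_text, reference):
    """Return a similarity ratio between 0.0 and 1.0 (1.0 means identical)."""
    a, b = normalize(ocr_text), normalize(reference)
    return difflib.SequenceMatcher(None, a, b).ratio()


# Hand-typed expected text for test2.png (an assumption of this sketch).
reference = "清明時節雨紛紛,路上行人欲斷魂。借問酒家何處有,牧童遙指杏花村。"

# Tesseract's result for test2.png, as listed earlier in this article.
tesseract_out = (
    "清 明 時 節 雨 紛 紛 , 路 上 行 人 欲 斷 魂\n"
    "信 問 酒 家 何 處 有 , 牧 奕 通 指 槍 花 村 。"
)

print(char_similarity(tesseract_out, reference))
```

The ratio is only a coarse measure, but it makes it possible to compare engines across many images without reading every result by eye.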

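Summary

The test results speak for themselves. All I can say is that Baidu Cloud's OCR is really impressive: not a single wrong character. When it comes to Chinese, Baidu still understands it a little better than Google. Baidu OCR also offers more parameters for handling images flexibly, such as custom rotation, returning a confidence value, and recognition of specific kinds of ID documents.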
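
As an illustration of those extra parameters, here is a hedged sketch of asking for rotation detection and per-line confidence. It assumes that `basicGeneral` accepts an options dictionary as its second argument and that the option names `detect_direction` and `probability`, as well as the `probability.average` field in the response, match the official documentation; check them there before relying on this. The function name is hypothetical, and `client` is expected to be an `AipOcr` instance like the one created earlier.

```python
def img_to_words_with_confidence(client, image_bytes):
    """Return (text, average_confidence) pairs, one per recognized line."""
    options = {
        "detect_direction": "true",  # ask the service to detect image rotation
        "probability": "true",       # ask for per-line confidence values
    }
    result = client.basicGeneral(image_bytes, options)
    pairs = []
    for item in result.get("words_result", []):
        # 'probability' should only be present when requested; fall back to None.
        average = item.get("probability", {}).get("average")
        pairs.append((item["words"], average))
    return pairs
```

Lines with a low average confidence could then be flagged for manual review instead of being trusted blindly.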
