Using Tesseract, CAPTCHA Recognition Is So Easy

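Preface:

When scraping data from websites, you will often be asked to enter a CAPTCHA, either because your requests are too frequent or for other reasons. There are generally three ways to deal with this:

First, the simplest but most time-consuming: enter the CAPTCHA by hand.

Second, use a commercial API service to recognize and submit the CAPTCHA.

Third, use Tesseract to recognize the CAPTCHA.

Here we use Tesseract.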

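About Tesseract:

Tesseract is an open-source OCR engine whose development has been sponsored by Google. It supports training for new languages and can recognize Chinese (the language pack must be downloaded separately).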

Using Tesseract in Python:

Setting up Tesseract for Python takes three steps:

1. Install pytesseract and its dependencies with pip

pip install pytesseract

pytesseract needs to read images, so Pillow must be installed as well:

pip install Pillow

2. Install Tesseract

Download and install: https://tesseract-ocr.googlecode.com/files/tesseract-ocr-setup-3.02.02.exe

3. Edit pytesseract.py

# tesseract_cmd = 'tesseract'
tesseract_cmd = "C:/Program Files (x86)/Tesseract-OCR/tesseract.exe"  # where Tesseract is installed

This prevents the error saying that no matching file was found.

# f = open(output_file_name)
f = open(output_file_name, encoding='utf-8')

This prevents Unicode encoding errors.
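As an alternative to hard-coding the executable path in step 3, the path can be looked up at run time. The following is a minimal sketch and not part of the original setup: the helper name `find_tesseract` is hypothetical, and the fallback path is simply the install directory mentioned above.

```python
import os
import shutil

# Install location used in this article; adjust it for your own machine.
DEFAULT_WINDOWS_PATH = "C:/Program Files (x86)/Tesseract-OCR/tesseract.exe"


def find_tesseract(extra_candidates=(DEFAULT_WINDOWS_PATH,)):
    """Return a path to the tesseract executable, or None if none is found.

    Hypothetical helper: checks PATH first, then the given fallback paths.
    """
    found = shutil.which("tesseract")
    if found:
        return found
    for path in extra_candidates:
        if os.path.isfile(path):
            return path
    return None


if __name__ == "__main__":
    print(find_tesseract())
```

If a path is found, it can be assigned to `pytesseract.pytesseract.tesseract_cmd` in your own script, which avoids editing the installed package file.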

Once these three steps are done, the basic features of Tesseract are ready to use.
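The encoding change in step 3 matters because, without an explicit `encoding` argument, Python's `open()` falls back to a platform-dependent default, which on Windows is often a legacy code page rather than UTF-8. Here is a small sketch using only the standard library; the file name and sample text are made up for illustration.

```python
import os
import tempfile

sample_text = "驗證碼 7364"  # made-up content standing in for OCR output

# Write the text as UTF-8 to a temporary file (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "ocr_output.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(sample_text)

# Reading with an explicit encoding does not depend on the system default.
with open(path, encoding="utf-8") as f:
    recovered = f.read()

print(recovered == sample_text)
```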

Now let's see how to use Tesseract for CAPTCHA recognition in real code:

The original CAPTCHA image:

The sample CAPTCHA:

#coding:utf-8
'''
CAPTCHA recognition
'''
from PIL import Image
import pytesseract

# Binarization lookup table: values below the threshold map to 0, the rest to 1
threshold = 140
table = []
for i in range(256):
    if i < threshold:
        table.append(0)
    else:
        table.append(1)

# Recognize the CAPTCHA
def get_vcode():
    # Open the original image
    image = Image.open("getimgbysig.jpg")
    # image = Image.open("e:/a.jpg")
    # Convert the image to grayscale and save a copy
    bimage = image.convert('L')
    bimage.save('g' + "getimgbysig.jpg")
    # Binarize the grayscale image and save a copy
    out = bimage.point(table, '1')
    out.save('b' + "getimgbysig.jpg")
    # Run OCR on the original, grayscale, and binarized images
    icode = pytesseract.image_to_string(image)
    bcode = pytesseract.image_to_string(bimage)
    vcode = pytesseract.image_to_string(out)
    print(icode, bcode, vcode)

if __name__ == '__main__':
    get_vcode()

Result: 7364
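To see what the threshold lookup table in the script does, here is a pure-Python sketch that applies the same rule to a handful of made-up grayscale values, without Pillow. The function names are hypothetical; on a real image, `point(table, '1')` performs a comparable per-pixel lookup.

```python
def build_threshold_table(threshold=140):
    """Return a 256-entry table: 0 for values below the threshold, 1 otherwise."""
    return [0 if value < threshold else 1 for value in range(256)]


def binarize(pixels, table):
    """Map each grayscale value (0-255) through the lookup table."""
    return [table[p] for p in pixels]


if __name__ == "__main__":
    table = build_threshold_table(140)
    sample_pixels = [0, 80, 139, 140, 200, 255]  # made-up grayscale values
    print(binarize(sample_pixels, table))  # [0, 0, 0, 1, 1, 1]
```

Raising or lowering the threshold moves the cut-off between what counts as ink and what counts as background, so the value 140 is worth tuning per CAPTCHA style.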

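For simple, clear digits, Tesseract can recognize them quite accurately even without any training. For blurred or distorted digits, letters, or Chinese characters, Tesseract needs to be trained first, which is left for another time.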
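Because the script prints three candidate results (from the original, grayscale, and binarized images), a small post-processing step can help choose among them. The sketch below assumes the CAPTCHA is a fixed-length string of digits, which holds for the sample above but not in general. `pick_code` is a hypothetical helper, and the sample strings are made up.

```python
def pick_code(candidates, length=4):
    """Return the first candidate that is exactly `length` digits after
    whitespace is removed, or None if no candidate qualifies."""
    for raw in candidates:
        cleaned = "".join(raw.split())  # drop spaces and newlines
        if len(cleaned) == length and cleaned.isdigit():
            return cleaned
    return None


if __name__ == "__main__":
    # Made-up OCR outputs for the three image variants.
    print(pick_code(["73 6", "7364\n", "7364"]))
```

Returning None when nothing qualifies lets the caller decide whether to retry with a different threshold or request a new CAPTCHA.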

This article was shared from the WeChat official account 州的先生 (zmister2016).