Python Crawler Beginner Tutorial 55-100: Advanced Python Crawler Techniques, CAPTCHA Edition

Date: 2019-11-18



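Exploring CAPTCHAs

If you are a data mining enthusiast, CAPTCHAs are a pit you cannot avoid, and fighting with all kinds of them is bound to be part of your growth. Over the next few articles I will collect as many kinds of CAPTCHA as I can and try to solve each one. Some of the techniques involved are ones I have not seen before myself. Come on, let's code together.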

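Digit + letter CAPTCHAs

I picked a CAPTCHA at random from Baidu image search, shown below.

Today we use the simplest approach to CAPTCHA recognition: pytesseract, one of the simpler OCR libraries available in Python.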

Installing the libraries

Before using pytesseract, install the required modules with pip. You need two of them: the pytesseract library and the image-processing library pillow.

pip install pytesseract
pip install pillow

With these two libraries installed, if you write some recognition code and run it, you will usually get the following error:

pytesseract.pytesseract.TesseractNotFoundError: tesseract is not installed or it's not in your path

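That is because one piece is still missing.

Install the Tesseract-OCR software, an open-source OCR engine whose development has been sponsored by Google. pytesseract is only a wrapper that calls this program.

Download: https://github.com/tesseract-ocr/tesseract/wiki

Chinese language data: https://github.com/tesseract-ocr/tessdata

Choose the version you need and download it.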
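Before going further, it can help to confirm that the Tesseract executable is actually reachable, since that is what the TesseractNotFoundError above complains about. The following is a minimal sketch using only the standard library; the helper name `find_tesseract` and its list of candidate names are illustrative and not part of pytesseract.

```python
import shutil


def find_tesseract(candidates=("tesseract", "tesseract.exe")):
    """Return the full path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


location = find_tesseract()
if location is None:
    print("Tesseract was not found on PATH; install it or add its folder to PATH.")
else:
    print("Tesseract found at:", location)
```

If this reports that nothing was found even though Tesseract is installed, the install folder is probably missing from PATH.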

Basic pillow operations

Command	Description
open()	Opens an image file
save()	Saves the image to a file
convert()	Method of an image instance; takes a mode argument that specifies the color mode

Example of open():

from PIL import Image
im = Image.open("1.png")
im.show()

The mode argument of convert() can take the following values:

· 1 (1-bit pixels, black and white, stored with one pixel per byte)
· L (8-bit pixels, grayscale)
· P (8-bit pixels, mapped to any other mode using a colour palette)
· RGB (3x8-bit pixels, true colour)
· RGBA (4x8-bit pixels, true colour with transparency mask)
· CMYK (4x8-bit pixels, colour separation)
· YCbCr (3x8-bit pixels, colour video format)
· I (32-bit signed integer pixels)
· F (32-bit floating point pixels)
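To make the effect of convert() concrete, here is a small sketch that builds a two-pixel image in memory instead of reading a file, so the file name is not an assumption. It only shows the RGB to L case, which is the one used later for CAPTCHAs.

```python
from PIL import Image

# Build a tiny 2x1 RGB image in memory so no file on disk is needed:
# the left pixel is pure white, the right pixel is pure black.
rgb = Image.new("RGB", (2, 1), (255, 255, 255))
rgb.putpixel((1, 0), (0, 0, 0))

# convert() returns a new image; rgb itself is left unchanged.
gray = rgb.convert("L")  # 8-bit grayscale, one value per pixel
print(rgb.mode, "->", gray.mode)
print(gray.getpixel((0, 0)), gray.getpixel((1, 0)))  # white stays 255, black stays 0
```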

Filter

from PIL import Image, ImageFilter
im = Image.open('1.png')
# Each filter() call returns a new image; im itself is not modified
# Gaussian blur
im.filter(ImageFilter.GaussianBlur(radius=2))
# Simple blur
im.filter(ImageFilter.BLUR)
# Edge enhancement
im.filter(ImageFilter.EDGE_ENHANCE)
# Edge detection
im.filter(ImageFilter.FIND_EDGES)
# Emboss
im.filter(ImageFilter.EMBOSS)
# Contour
im.filter(ImageFilter.CONTOUR)
# Sharpen
im.filter(ImageFilter.SHARPEN)
# Smooth
im.filter(ImageFilter.SMOOTH)
# Detail
im.filter(ImageFilter.DETAIL)

Format

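The format attribute identifies the file format of the image; if the image was not opened from a file, its value is None.
The size attribute is a tuple giving the width and height of the image in pixels.
The mode attribute gives the image mode. Common modes are L for grayscale, RGB for true colour, and CMYK for pre-press images. If a file cannot be opened, Image.open() raises an IOError (an alias of OSError in Python 3).

For more on this, see the following blog post, which is well written: http://www.javashuo.com/article/p-vvwzkfmu-c.html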
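These three attributes can be checked without any image file on disk. The sketch below creates an image in memory, then round-trips it through an in-memory PNG to show format changing from None to a real value; the 120x40 size is an arbitrary choice for illustration.

```python
import io

from PIL import Image

# An image created in memory was never read from a file, so format is None.
created = Image.new("RGB", (120, 40), (255, 255, 255))
print(created.format, created.size, created.mode)

# Write it to an in-memory PNG and read it back: now format is filled in.
buffer = io.BytesIO()
created.save(buffer, format="PNG")
buffer.seek(0)
reopened = Image.open(buffer)
print(reopened.format, reopened.size, reopened.mode)
```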

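CAPTCHA recognition

Note: if you still get the error after installing everything, find the module file pytesseract.py and edit it.

The file is usually located at C:\Program Files\Python36\Lib\site-packages\pytesseract\pytesseract.py

In that file, change tesseract_cmd = 'tesseract' to your own install path. Use a raw string, otherwise the \t in \tesseract.exe is read as a tab character.
For example: tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract.exe'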
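Changes made inside site-packages are lost when the package is upgraded. pytesseract also allows the same variable to be set from your own script through `pytesseract.pytesseract.tesseract_cmd`, which avoids touching the library. A minimal sketch follows; the helper name `use_tesseract_at` is hypothetical, and the path you pass in depends on where Tesseract was installed on your machine.

```python
import os


def use_tesseract_at(path):
    """Point pytesseract at a specific Tesseract executable.

    `path` is whatever location Tesseract was installed to on your machine.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(f"No Tesseract executable at: {path}")
    # Imported here so the path check above works even before pytesseract is installed.
    import pytesseract
    pytesseract.pytesseract.tesseract_cmd = path
    return path
```

Call it once at the top of the script, before the first call to image_to_string(), and pass the path as a raw string so the backslashes survive.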

If you run into the following error, take note:

Error opening data file \Program Files (x86)\Tesseract-OCR\tessdata/chi_sim.traineddata Please make sure the TESSDATA_PREFIX environment variable

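The fix is fairly easy. As the message says, the TESSDATA_PREFIX environment variable is missing, so you only need to add it to your system environment variables.

Add TESSDATA_PREFIX=C:\Program Files (x86)\Tesseract-OCR as an environment variable.

Restart your IDE or open a new CMD window, then run the code again. Note that at this point you need to run your .py script as administrator.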
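The variable can also be set for the current process only, which avoids the restart, and it is worth checking that the language file is really where Tesseract will look. The sketch below uses only the standard library; both helper names are hypothetical. It checks two folder layouts because, depending on the Tesseract version, TESSDATA_PREFIX is expected to point either at the tessdata folder itself or at its parent, so treat the exact location as something to confirm against your own error message.

```python
import os


def find_traineddata(prefix, lang="chi_sim"):
    """Return the path of <lang>.traineddata under prefix, or None if missing.

    Both layouts are checked because, depending on the Tesseract version,
    TESSDATA_PREFIX refers either to the tessdata folder or to its parent.
    """
    filename = lang + ".traineddata"
    for folder in (prefix, os.path.join(prefix, "tessdata")):
        candidate = os.path.join(folder, filename)
        if os.path.isfile(candidate):
            return candidate
    return None


def set_tessdata_prefix(prefix):
    """Set TESSDATA_PREFIX for this process and any child process it starts."""
    os.environ["TESSDATA_PREFIX"] = prefix
```

Because pytesseract runs Tesseract as a child process, a value set this way before the first OCR call is inherited by it.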

The steps are:

Open the image with Image.open()
Recognize the image with pytesseract

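import pytesseract
from PIL import Image

def main():
    # Open the CAPTCHA image
    image = Image.open("1.jpg")
    # lang="chi_sim" requires the Simplified Chinese language data to be installed
    text = pytesseract.image_to_string(image, lang="chi_sim")
    print(text)

if __name__ == '__main__':
    main()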
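The string returned by image_to_string() often carries whitespace or a trailing newline, and sometimes stray punctuation. For a digit + letter CAPTCHA, a small post-processing step can tidy this up. The sketch below assumes the CAPTCHA contains only ASCII letters and digits; the helper name `clean_captcha_text` is made up for illustration.

```python
import re


def clean_captcha_text(raw):
    """Keep only ASCII letters and digits from raw OCR output."""
    return re.sub(r"[^0-9A-Za-z]", "", raw)


print(clean_captcha_text(" 73 64\n"))  # prints 7364
```

This would not be suitable for the Chinese test above, since it discards every non-ASCII character.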

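In my tests, English letters and digits are recognized with almost no problems, but the Chinese results are painful to look at; only text with generous spacing gets recognized. Not great.
Of course, the 7364 from earlier was recognized with no trouble at all.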

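Recognizing CAPTCHAs with interference

Next we recognize the CAPTCHA below. As before, we first simply try it: running the code prints nothing, so the image needs to be preprocessed.

The basic principle is always the same:

Colour to grayscale
Grayscale to binary
Recognize the binary image

Colour to grayscale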

im = im.convert('L')

Grayscale to binary. The solution is fairly standard: threshold segmentation, where threshold is the cut-off point.

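def initTable(threshold=140):
    # Build a 256-entry lookup table, one entry per gray level:
    # levels below the threshold map to 0 (black), the rest to 1 (white)
    table = []
    for i in range(256):
        if i < threshold:
            table.append(0)
        else:
            table.append(1)
    return table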

Usage:

binaryImage = im.point(initTable(), '1')
binaryImage.show()
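The three steps (colour to grayscale, grayscale to binary, recognition) can be put together in one place. This is a sketch rather than the author's exact code: it writes the threshold as a lambda that maps to 0 or 255 instead of using the lookup table, the function names `binarize` and `recognize` are made up, and lang="eng" assumes the English language data is installed.

```python
from PIL import Image


def binarize(image, threshold=140):
    """Convert an image to grayscale, then to pure black and white."""
    gray = image.convert("L")
    # Pixels at or above the threshold become white (255), the rest black (0).
    return gray.point(lambda p: 255 if p >= threshold else 0, mode="1")


def recognize(path, threshold=140, lang="eng"):
    """Binarize the image at `path` and hand the result to Tesseract."""
    # Imported here so binarize() can be tried without pytesseract installed.
    import pytesseract
    with Image.open(path) as image:
        binary = binarize(image, threshold)
        return pytesseract.image_to_string(binary, lang=lang).strip()
```

The threshold of 140 is only a starting point; it usually needs tuning per CAPTCHA style, and looking at the binarized image with show() is the quickest way to judge it.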

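After the adjustment:

We still need to deal with the interference lines. Going further than this is a job for deeper image processing. For the simple CAPTCHAs on small sites this approach is enough. That is it for this post; in the next one we continue studying CAPTCHAs.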
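As a taste of what that deeper processing can look like, one common first step is to drop dark pixels that have almost no dark neighbours. This is not the author's method, only a sketch in pure Python on a grid of 0 (dark) and 1 (white) values; the function name and the min_neighbors default are assumptions. It removes speckles and tiny fragments, but a continuous interference line would survive it.

```python
def remove_isolated_pixels(grid, min_neighbors=2):
    """Return a copy of grid with sparsely connected dark pixels turned white.

    grid is a list of rows; 0 means dark (ink), 1 means white (background).
    A dark pixel is kept only if at least min_neighbors of its 8 neighbours
    are also dark.
    """
    height = len(grid)
    width = len(grid[0]) if height else 0
    cleaned = [row[:] for row in grid]
    for y in range(height):
        for x in range(width):
            if grid[y][x] != 0:
                continue  # white pixels are left alone
            dark = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width and grid[ny][nx] == 0:
                        dark += 1
            if dark < min_neighbors:
                cleaned[y][x] = 1
    return cleaned
```

Neighbour counts are read from the original grid and written to a copy, so removing one pixel does not change the decision for the next.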

Reference links

tesserocr GitHub: https://github.com/sirfz/tesserocr
tesserocr PyPI: https://pypi.python.org/pypi/tesserocr
pytesseract GitHub: https://github.com/madmaze/pytesseract
pytesseract PyPI: https://pypi.org/project/pytesseract/
tesseract download: http://digi.bib.uni-mannheim.de/tesseract
tesseract GitHub: https://github.com/tesseract-ocr/tesseract
tesseract language data: https://github.com/tesseract-ocr/tessdata
tesseract documentation: https://github.com/tesseract-ocr/tesseract/wiki/Documentation