Recognizing Image CAPTCHAs

Date: 2019-11-06


OCR technology:

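(1) While crawling, you will inevitably run into all kinds of CAPTCHAs. Most of them are still image CAPTCHAs, which can be recognized directly with OCR.
(2) OCR, or Optical Character Recognition, is the process of scanning characters and translating them into electronic text based on their shapes.
(3) tesserocr is an OCR library for Python, but it is really just a Python API wrapper around tesseract, so tesseract is its core. tesseract must therefore be installed before tesserocr.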

Installing tesserocr on Windows:

1. First install tesseract. Download: https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-setup-3.05.01.exe
2. Then install tesserocr with pip3: pip3 install tesserocr pillow

Installing tesserocr on Linux:

yum install -y tesseract
git clone https://github.com/tesseract-ocr/tessdata.git
sudo mv tessdata/* /usr/share/tesseract/tessdata
pip3 install tesserocr pillow
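
After installation, a quick check can confirm that Python can see tesseract and its language data. The sketch below is only an illustration: describe_ocr_setup is a hypothetical helper name, and it assumes the tesseract_version() and get_languages() functions exposed by tesserocr. The import is done lazily so the check reports a missing package instead of crashing.

```python
def describe_ocr_setup():
    """Return a short, human-readable summary of the local tesserocr setup."""
    try:
        import tesserocr  # imported lazily so the check itself never crashes
    except ImportError:
        return "tesserocr is not installed"
    # get_languages() returns (tessdata path, list of language codes);
    # tesseract_version() reports the underlying tesseract build.
    path, languages = tesserocr.get_languages()
    return "{}\ntessdata: {}\nlanguages: {}".format(
        tesserocr.tesseract_version().strip(), path, ", ".join(languages))


print(describe_ocr_setup())
```

If the language list comes back empty, the tessdata files are probably not in the directory tesseract expects.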

Recognizing an image CAPTCHA with Python:

import tesserocr
from PIL import Image

image = Image.open('1.png')                 # Opens and identifies the given image file
result = tesserocr.image_to_text(image)     # Recognize OCR text from an image object
print(result)
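
The raw OCR output often contains stray spaces, punctuation, or a trailing newline, so it usually needs cleaning before it is submitted. A minimal sketch of that cleanup follows. It assumes the CAPTCHA consists only of ASCII letters and digits; clean_captcha_text and its expected_length parameter are hypothetical names introduced for illustration.

```python
import re


def clean_captcha_text(raw, expected_length=None):
    """Keep only ASCII letters and digits from raw OCR output.

    Returns None when expected_length is given and the cleaned text
    has a different length, which signals a likely misread.
    """
    cleaned = re.sub(r"[^0-9A-Za-z]", "", raw)
    if expected_length is not None and len(cleaned) != expected_length:
        return None
    return cleaned


print(clean_captcha_text("JR 42\n"))   # JR42
```

Here the function would be applied to result before printing or submitting it. A None return value is a reasonable cue to fetch a new CAPTCHA and try again.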

Recognizing a noisy image CAPTCHA with Python:

import tesserocr
from PIL import Image

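image = Image.open('2.png')

# Convert to grayscale so every pixel is a single value from 0 to 255
image = image.convert('L')

# Lookup table for binarization: gray levels below the threshold map to 0
# (black), all others map to 1 (white)
threshold = 127
table = [0 if i < threshold else 1 for i in range(256)]

# Apply the table and convert to a 1-bit image, so that light-colored
# interference blends into the white background
image = image.point(table, '1')
result = tesserocr.image_to_text(image)
print(result)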
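
The threshold of 127 is only a starting point, and the best value depends on how dark the interference is compared with the characters. The sketch below wraps the same binarization in two small helpers so that several thresholds can be tried. build_threshold_table and binarize are hypothetical names, and binarize assumes it receives a Pillow image object.

```python
def build_threshold_table(threshold):
    """Map each of the 256 gray levels to 0 (black) or 1 (white)."""
    return [0 if level < threshold else 1 for level in range(256)]


def binarize(image, threshold=127):
    """Grayscale a Pillow image, then reduce it to black and white."""
    return image.convert('L').point(build_threshold_table(threshold), '1')


table = build_threshold_table(127)
print(table[126], table[127])   # 0 1
```

One way to use this is to call tesserocr.image_to_text(binarize(image, t)) for a few candidate values of t, such as 100, 127, and 160, and keep the threshold that gives the most plausible text. Those candidate values are illustrative only.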
