Python Crawler Beginner Tutorial 56-100: Advanced Crawler Techniques, CAPTCHA Part 2: Open-Platform OCR

Date: 2019-11-16


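Today's CAPTCHA journey

The CAPTCHA recognition you will learn today relies on the OCR API that a third-party AI platform exposes to developers. OCR text recognition is fairly mature by now and many third-party providers offer it. Today we use Baidu's.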

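Registering on the Baidu AI platform

Official site: ai.baidu.com/
Next, apply for access.

After that, create a simple application and the service is ready to use. We then look for the text recognition service.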

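Reading the text recognition documentation

You need to be comfortable reading third-party documentation. Open the documentation we need:

cloud.baidu.com/doc/OCR/OCR…

This page spells out essentially everything we need to do.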

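Writing the code to obtain the access token

Most mainstream APIs today require you to obtain an access token before calling them.

The code is below. The key point is to build the request parameters exactly as the documentation describes.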

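    def get_accesstoken(self):
        # Request an OAuth access token using the API key and secret
        res = requests.post(self.url.format(self.key, self.secret), headers=self.header)
        content = res.text
        # An empty response body makes the method return None
        if content:
            return json.loads(content)["access_token"]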
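
The method above raises a KeyError when the credentials are rejected, because the response then carries no access_token field. A minimal sketch of a more defensive parser follows. The helper name parse_token_response is hypothetical, and the error / error_description field names are an assumption based on the usual OAuth 2.0 error format, so check them against the Baidu documentation.

```python
import json


def parse_token_response(text):
    """Extract the access token from a token-endpoint response body.

    Hypothetical helper: the "error" / "error_description" keys are assumed
    to follow the usual OAuth 2.0 error format.
    """
    if not text:
        raise ValueError("empty response from the token endpoint")
    payload = json.loads(text)
    if "access_token" not in payload:
        reason = payload.get("error_description") or payload.get("error") or "unknown error"
        raise ValueError("token request failed: {}".format(reason))
    return payload["access_token"]


if __name__ == "__main__":
    # Made-up sample body, not a real token
    sample = '{"access_token": "demo-token", "expires_in": 2592000}'
    print(parse_token_response(sample))
```

Inside the class, get_accesstoken could pass res.text to such a helper instead of indexing the dictionary directly.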

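Once you have the access token, you can move on to the next steps. The imports and the class that the methods in this article belong to are as follows: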

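import base64
import json
import urllib.parse

import requests
from PIL import Image  # needed by opt_image()


class GetCode(object):

    def __init__(self):
        # OAuth token endpoint; the placeholders take the API key and secret
        self.url = "https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials&client_id={}&client_secret={}"
        # General (basic) text recognition endpoint; the placeholder takes the access token
        self.api = "https://aip.baidubce.com/rest/2.0/ocr/v1/general_basic?access_token={}"
        self.header = {
            "Content-Type": "application/json; charset=UTF-8"
        }

        # Replace with the credentials of your own Baidu AI application
        self.key = "YOUR_KEY"
        self.secret = "YOUR_SECRET"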

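CAPTCHA recognition stage

A plain CAPTCHA without interference can be recognized directly, but some CAPTCHAs contain noise and need basic preprocessing first. We use an approach similar to the previous article: convert the image to grayscale, then binarize it. Parts of the code are hard-coded, and in the end the preprocessing did not improve recognition as much as I had hoped.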

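    def init_table(self, threshold=155):
        # Lookup table for binarization: values below the threshold
        # map to 0 (black), all others to 1 (white)
        table = []
        for i in range(256):
            if i < threshold:
                table.append(0)
            else:
                table.append(1)
        return table

    def opt_image(self):
        # Input and output file names are hard-coded
        im = Image.open("66.png")
        # Convert to grayscale, then binarize with the lookup table
        im = im.convert('L')
        im = im.point(self.init_table(), '1')
        im.save('66_s.png')
        return "66_s.png"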
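
To see what the lookup table does, the sketch below applies the same thresholding rule to a handful of grayscale values in plain Python, without Pillow. The function names and the sample pixel values are made up for illustration. In the method above, im.point maps every pixel of the grayscale image through the table in the same way.

```python
def build_threshold_table(threshold=155):
    """Return a 256-entry table: 0 below the threshold, 1 at or above it."""
    return [0 if value < threshold else 1 for value in range(256)]


def binarize(pixels, threshold=155):
    """Map each 0-255 grayscale value to 0 (black) or 1 (white)."""
    table = build_threshold_table(threshold)
    return [table[p] for p in pixels]


if __name__ == "__main__":
    # Made-up grayscale values around the default threshold of 155
    row = [12, 40, 154, 155, 200, 255]
    print(binarize(row))  # [0, 0, 0, 1, 1, 1]
```

Lowering the threshold sends more light-gray pixels to white, which can erase faint noise but may also thin the characters, so the value usually needs tuning per CAPTCHA style.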

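Calling the CAPTCHA API

Here we call Baidu's recognition API directly instead of using the module Baidu provides. Just write the code according to the corresponding documentation, and pay particular attention to the notes in the official documentation.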

    def get_file_content(self, file_path):
        with open(file_path, 'rb') as fp:
            # Base64-encode the raw image bytes
            base64_data = base64.b64encode(fp.read())
            s = base64_data.decode()

            data = {}
            data['image'] = s

            # URL-encode the form field so it can be sent as the request body
            encoded_data = urllib.parse.urlencode(data)
            return encoded_data

    def show_code(self):
        # Preprocess the image, then encode it for the request
        image = self.get_file_content(self.opt_image())
        headers = {
            "Content-Type": "application/x-www-form-urlencoded"
        }
        res = requests.post(self.api.format(self.get_accesstoken()), headers=headers, data=image)
        print(res.text)
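
The request body is the Base64-encoded image sent as a URL-encoded form field. The sketch below shows only that encoding step using the standard library, with made-up bytes standing in for a real image file. The name build_ocr_payload is hypothetical.

```python
import base64
import urllib.parse


def build_ocr_payload(image_bytes):
    """Base64-encode image bytes and wrap them in a URL-encoded form body."""
    b64_text = base64.b64encode(image_bytes).decode("ascii")
    return urllib.parse.urlencode({"image": b64_text})


if __name__ == "__main__":
    fake_image = b"\xfb\xff\xfe not a real image"  # stand-in bytes
    body = build_ocr_payload(fake_image)
    # Base64 characters such as "+" and "/" are percent-encoded in the form body
    print(body)
    # Decoding the form body and then the Base64 text restores the original bytes
    decoded = urllib.parse.parse_qs(body)["image"][0]
    print(base64.b64decode(decoded) == fake_image)
```

Skipping the URL-encoding step is a common source of failed requests, since an unescaped "+" in a form body is read as a space.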

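Calling CAPTCHA recognition through Baidu's module

Install the Baidu AI SDK

pip install baidu-aip

Once installed, it is ready to use:

Declare some constants, which you can obtain after creating an application on Baidu
Initialize the text recognition class
Call the corresponding method

Reference code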

from aip import AipOcr


# Constants from your Baidu AI application
APP_ID = '15736693'
API_KEY = 'YOUR_KEY'
SECRET_KEY = 'YOUR_SECRET'

# Initialize the text recognition client
aipOcr = AipOcr(APP_ID, API_KEY, SECRET_KEY)

# Path of the image to read
filePath = "1.jpg"

def get_file_content(filePath):
    # Return the raw bytes of the image file
    with open(filePath, 'rb') as fp:
        return fp.read()

# Optional request parameters
options = {
    'detect_direction': 'true',
    'language_type': 'CHN_ENG',
}

# Call the web image text recognition API
result = aipOcr.webImage(get_file_content(filePath), options)


print(result)
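
The printed result is a dictionary. A minimal sketch for pulling the recognized characters out of it follows, assuming the response carries a words_result list whose items have a words field, and an error_code / error_msg pair on failure. Verify these field names against the documentation. The helper name and the sample data are made up.

```python
def extract_text(result):
    """Join the recognized lines of an OCR result dictionary into one string."""
    if "error_code" in result:
        raise RuntimeError(
            "OCR failed: {} {}".format(result["error_code"], result.get("error_msg", ""))
        )
    lines = [item["words"] for item in result.get("words_result", [])]
    # CAPTCHAs contain no meaningful spaces, so drop any the OCR inserted
    return "".join(lines).replace(" ", "")


if __name__ == "__main__":
    # Made-up sample shaped like a successful response
    sample = {"log_id": 1, "words_result_num": 1, "words_result": [{"words": "7 3 6 4"}]}
    print(extract_text(sample))  # 7364
```

The same helper works for the raw HTTP version if res.text is first parsed with json.loads.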

