Python Web Scraping for Beginners (4): CAPTCHAs, Part 2 (Cracking a Simple CAPTCHA)

Date: 2019-11-10


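I wrote Part 1 on CAPTCHAs before the New Year and had meant to write Part 2 long ago, but the holiday was busy, and cracking CAPTCHAs is fiddly: different methods give different accuracy. I kept looking for a better approach, but the good ones are fairly specialized and involve image processing, which I only partly understand, so this post got delayed. My apologies to everyone who has been waiting. If you are interested, you can read up on that area yourself. Since we are all beginners here, this time I will walk through the process using a fairly simple CAPTCHA. Enough preamble, let's begin.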

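1.) Obtaining the CAPTCHA

The previous post already covered how to fetch a CAPTCHA, so I will not repeat it here. Below is a CAPTCHA I fetched from another website (at the end I provide an archive of CAPTCHA samples; anyone who wants to practice can download it and look for an approach with higher accuracy).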

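2.) Analyzing the CAPTCHA

a.) Analyzing the sample space

As the CAPTCHA above shows, the image contains 5 characters in total: operand 1, the operator, operand 2, and the two-character word "等於" ("equals"). So only the first three characters need to be extracted. The operands range over 0~9, and the operator is either 加 (add) or 乘 (multiply). The sample space therefore has 12 classes in total: 10 for the operands and 2 for the operator.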
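The sample space described above can be written down explicitly. This is only a minimal sketch; the names `OPERANDS`, `OPERATORS`, and `CLASSES` are my own and do not appear in the article's code.

```python
# Hypothetical names, for illustration only.
OPERANDS = [str(d) for d in range(10)]  # the digits 0..9
OPERATORS = ["加", "乘"]                 # add, multiply
CLASSES = OPERANDS + OPERATORS           # everything the recognizer must tell apart

print(len(CLASSES))  # 12
```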

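b.) Determining the crop regions

Windows users can open the CAPTCHA in the built-in Paint tool and will see the following information.

First, the CAPTCHA is 80*30 pixels, that is, 80 pixels wide and 30 pixels high. If we lay a coordinate system over it, the origin (0,0) is the top-left corner, the x axis points right (0 <= x < 80), and the y axis points down (0 <= y < 30). (10,17) is the coordinate of the current mouse position (the crosshair in the image), which helps us work out the crop regions. The crop regions I used are:

It is best to keep operand 1 and operand 2 the same size, so that the two operands can share sample data. region = (3,4,16,17), where (3,4) is the top-left corner and (16,17) is the bottom-right corner, which together define a rectangle of size (16-3, 17-4), i.e. 13 pixels in both width and height.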
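To double-check the arithmetic, the size of a (left, upper, right, lower) box can be computed directly. The sketch below uses the three regions that appear in splitImage.py in the appendix; the helper name `box_size` and the dictionary `REGIONS` are my own, introduced only for illustration.

```python
def box_size(box):
    """Return (width, height) of a (left, upper, right, lower) box."""
    left, upper, right, lower = box
    return right - left, lower - upper

# Regions taken from splitImage.py; the dictionary itself is hypothetical.
REGIONS = {
    "operand1": (3, 4, 16, 17),
    "operator": (18, 4, 33, 17),
    "operand2": (33, 4, 46, 17),
}

for name, box in REGIONS.items():
    print(name, box_size(box))
```

Both operand boxes come out at 13 by 13 pixels, which is what lets them share one set of samples; the operator box is slightly wider.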

3.) Processing the CAPTCHA (here I use Python's "PIL" image processing library)

a.) Converting to grayscale

PIL offers thorough support for this. We can call:

　　　　img.convert("L")

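This returns a copy of img converted to a 256-level grayscale image. convert() is a method of the image instance that takes a mode argument specifying a color mode. mode can take values such as the following: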

　　　　· 1 (1-bit pixels, black and white, stored with one pixel per byte)

　　　　· L (8-bit pixels, black and white)

　　　　· P (8-bit pixels, mapped to any other mode using a colour palette)

　　　　· RGB (3x8-bit pixels, true colour)

　　　　· RGBA (4x8-bit pixels, true colour with transparency mask)

　　　　· CMYK (4x8-bit pixels, colour separation)

　　　　· YCbCr (3x8-bit pixels, colour video format)

　　　　· I (32-bit signed integer pixels)

　　　　· F (32-bit floating point pixels)
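For reference, the Pillow documentation describes the conversion from a color image to mode "L" as the ITU-R 601-2 luma transform, L = R * 299/1000 + G * 587/1000 + B * 114/1000. The sketch below applies that formula to a single pixel in plain Python. The function name `luma` is mine, and the library's internal rounding may differ slightly from this integer version.

```python
def luma(rgb):
    """Approximate the "L" value of one RGB pixel (ITU-R 601-2 luma transform)."""
    r, g, b = rgb
    return (r * 299 + g * 587 + b * 114) // 1000

print(luma((255, 255, 255)))  # 255
print(luma((0, 0, 0)))        # 0
```

Green contributes the most and blue the least, so dark blue text on a light background still ends up clearly darker than the background after conversion.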

The code is as follows:

from PIL import Image
image = Image.open("H:\\authcode\\origin\\code3.jpg")
imgry = image.convert("L")
imgry.show()

Result:

Next, binarize the image:

from PIL import Image
image = Image.open("H:\\authcode\\origin\\code3.jpg")
imgry = image.convert("L")
# imgry.show()
threshold = 100
table = []
for i in range(256):
    if i < threshold:
        table.append(0)
    else:
        table.append(1)
out = imgry.point(table,'1')
out.show()

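Result:

At this point the image is fairly pure black and white.

Notes on the code:

a). threshold = 100 is the cutoff; the right value depends on the image. A more rigorous approach is to choose it from the image's grayscale histogram. In practice, you can simply try different values and see which works best.

b). The other functions all come with PIL; if you have questions, look them up in its documentation.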
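One common way to choose a threshold from the grayscale histogram is Otsu's method, which picks the cutoff that best separates the dark and light pixel groups. This is not what the article's code does (it hard-codes 100); it is only a sketch of the histogram-based alternative mentioned above. The function name `otsu_threshold` is mine, and the histogram is assumed to be a plain list of 256 counts, such as the list `imgry.histogram()` returns for an "L" image.

```python
def otsu_threshold(hist):
    """Return the threshold t that maximizes between-class variance.

    Levels < t form the dark class, levels >= t the light class.
    """
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class
    sum0 = 0  # sum of gray levels in the dark class
    for t in range(1, len(hist)):
        w0 += hist[t - 1]
        sum0 += (t - 1) * hist[t - 1]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, nothing to separate
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Made-up histogram: 100 dark pixels at level 20, 100 light pixels at level 200.
hist = [0] * 256
hist[20] = 100
hist[200] = 100

t = otsu_threshold(hist)
table = [0 if i < t else 1 for i in range(256)]  # same lookup table as above
print(t, table[20], table[200])
```

The resulting `table` has the same shape as the one built by the loop in the article, so it could be passed to `imgry.point(table, '1')` in the same way.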

b.) Cropping the image

The code is as follows:

from PIL import Image
image = Image.open("H:\\authcode\\origin\\code3.jpg")
imgry = image.convert("L")
# imgry.show()
threshold = 100
table = []
for i in range(256):
    if i < threshold:
        table.append(0)
    else:
        table.append(1)
out = imgry.point(table,'1')
# out.show()
region = (3,4,16,17)
result = out.crop(region)
result.show()

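Result:

Changing the value of region crops out different pieces of the image, which can then be sorted by class. I put each digit into its own folder; the result looks like this: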

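4.) Extracting features

The feature-extraction algorithm is up to you. What I did here: for each segmented character, draw two horizontal lines and two vertical lines, and record how many points each line has in common with the character (embarrassingly, my approach does not have a high recognition rate; it is enough to convey the idea).

That is the idea. The equations of the four lines are: (vertical lines) x=3 and x=6, (horizontal lines) y=2 and y=11.

The code is as follows: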

def yCount1(image):
    # Count black pixels (value 0) on the vertical line x = 3.
    count = 0
    x = 3
    for y in range(0,13):
        pixel = image.getpixel((x,y))
        if(pixel==0):
            count = count+1
    return count
def yCount2(image):
    # Count black pixels (value 0) on the vertical line x = 6.
    count = 0
    x = 6
    for y in range(0,13):
        pixel = image.getpixel((x,y))
        if(pixel==0):
            count = count+1
    return count
def xCount1(image):
    # Count black pixels (value 0) on the horizontal line y = 2.
    count = 0
    y = 2
    for x in range(0,13):
        pixel = image.getpixel((x,y))
        if(pixel==0):
            count = count+1
    return count
def xCount2(image):
    # Count black pixels (value 0) on the horizontal line y = 11.
    count = 0
    y = 11
    for x in range(0,13):
        pixel = image.getpixel((x,y))
        if(pixel==0):
            count = count+1
    return count
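The four functions above differ only in which line they scan, and each one counts the black pixels (value 0) lying on that line. Here is a compact sketch of the same idea on a plain nested list instead of a PIL image, where `grid[y][x]` is 0 for black and 1 for white. The names `column_count`, `row_count`, and `feature_string` are my own, not the article's.

```python
def column_count(grid, x):
    """Count black pixels on the vertical line at column x."""
    return sum(1 for row in grid if row[x] == 0)

def row_count(grid, y):
    """Count black pixels on the horizontal line at row y."""
    return sum(1 for pixel in grid[y] if pixel == 0)

def feature_string(grid):
    """Join the four counts in the order yCount1:yCount2:xCount1:xCount2."""
    counts = (
        column_count(grid, 3),
        column_count(grid, 6),
        row_count(grid, 2),
        row_count(grid, 11),
    )
    return ":".join(str(n) for n in counts)

# Made-up 13x13 test image: all white except a vertical black bar at x = 3.
grid = [[1] * 13 for _ in range(13)]
for row in grid:
    row[3] = 0

print(feature_string(grid))  # 13:0:1:1
```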

Extracting features for the 10 digits (0~9) gives results like the following:

2:5:3:3-0
2:2:2:3-0
5:2:2:4-0
2:2:2:0-0
2:4:2:0-0
6:2:3:3-0
0:3:3:2-0
2:5:3:3-0
2:1:3:5-1
2:1:3:5-1
1:6:3:4-1
1:8:3:2-1
1:8:3:3-1
1:6:3:4-1
1:5:3:3-1
1:3:3:5-1
2:1:3:5-1
1:6:3:3-1
1:7:3:2-1
1:5:3:3-1
1:7:3:4-1
1:8:3:2-1
2:1:2:5-1
2:1:1:2-1
1:8:3:2-1
2:1:2:5-1
1:7:0:1-1
2:1:2:5-1
6:1:2:1-1
0:6:3:1-1
0:6:2:1-1
1:7:2:1-1
5:1:2:3-1
1:3:3:5-1
2:7:2:2-1
6:1:2:1-1
2:1:2:3-1
5:1:1:0-1
1:6:3:3-1
1:7:3:2-1
1:7:3:4-1
5:1:2:3-1
2:1:1:1-1
1:6:0:1-1
4:1:2:3-1
1:1:2:4-1
5:1:2:1-1
0:5:2:2-1
2:1:2:4-1
1:5:3:5-1
5:1:3:3-1
1:8:3:2-1
1:5:3:3-1
2:1:2:5-1
2:1:1:2-1
2:1:2:5-1
2:1:2:5-1
2:1:2:5-1
2:1:2:5-1
1:8:3:2-1
2:1:2:5-1
1:5:3:3-1
2:1:3:5-1
3:2:2:2-2
4:1:1:1-2
3:3:2:6-2
3:3:4:4-2
2:3:2:3-2
3:3:2:6-2
2:3:3:3-2
2:3:3:3-2
3:5:3:6-2

The last number is the label for that feature. For example, 3:5:3:6-2 means that if an image produces 3:5:3:6, we take the value in the image to be 2.
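A sample line in this format can be split into its four counts and its label as follows. This is a small sketch; the function name `parse_sample` is hypothetical.

```python
def parse_sample(line):
    """Split a line like "3:5:3:6-2" into ((3, 5, 3, 6), "2")."""
    feature, label = line.strip().rsplit("-", 1)
    counts = tuple(int(n) for n in feature.split(":"))
    return counts, label

print(parse_sample("3:5:3:6-2\n"))  # ((3, 5, 3, 6), '2')
```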

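This approach is error-prone.

First, one feature may belong to more than one digit. For example, 1:2:3:4 might belong to 2 or to 3, and in that case errors will occur. (Workaround: take the label that occurs most often, although this will still produce errors.)

Second, a feature may not appear in our sample space at all. (Workaround: enlarge the sample space.)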
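Both cases can be handled with a small lookup keyed by the feature string: a majority vote when a feature carries several labels, and an explicit "unknown" when it is missing. The sketch below is one way to do that, not the article's exact code. The names `build_index` and `classify` are mine, and the three sample lines are made up to mirror the 1:2:3:4 example.

```python
from collections import Counter, defaultdict

def build_index(lines):
    """Map each feature string to a Counter of the labels seen for it."""
    index = defaultdict(Counter)
    for line in lines:
        feature, label = line.strip().rsplit("-", 1)
        index[feature][label] += 1
    return index

def classify(index, feature):
    """Return the most frequent label for feature, or None if it is unseen."""
    votes = index.get(feature)
    if not votes:
        return None
    return votes.most_common(1)[0][0]

# Made-up samples: the same feature labeled "2" twice and "3" once.
samples = ["1:2:3:4-2", "1:2:3:4-2", "1:2:3:4-3"]
index = build_index(samples)

print(classify(index, "1:2:3:4"))  # 2
print(classify(index, "9:9:9:9"))  # None
```

Matching the whole feature string, rather than searching for it as a substring, avoids confusing 1:2:3:4 with a line such as 11:2:3:4, which matters because a count can reach two digits.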

5.) Verification

With the steps above done, we can test the cracking.
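One simple way to judge a test run is to measure accuracy over samples whose answers are already known. The sketch below assumes a `solve` callable that maps a sample to a predicted answer; in practice that role would be played by something like `crackcode.getCodeResult` from the appendix applied to an opened image. Here `fake_solver` and the labeled data are stand-ins made up for illustration, with the last expected answer deliberately wrong to show a miss.

```python
def accuracy(solve, labeled):
    """Return the fraction of (sample, expected) pairs that solve gets right."""
    correct = sum(1 for sample, expected in labeled if solve(sample) == expected)
    return correct / len(labeled)

def fake_solver(expr):
    """Stand-in solver: evaluates an (a, op, b) tuple instead of reading an image."""
    a, op, b = expr
    return a + b if op == "+" else a * b

# Made-up labeled samples; the third expected value is intentionally wrong.
labeled = [((3, "+", 4), 7), ((2, "*", 5), 10), ((1, "+", 1), 3)]

print(accuracy(fake_solver, labeled))
```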

The code is as follows (crackcode is code I wrote myself):

Appendix:

crackcode.py

#encoding=utf8
import checknumber
import splitImage
import checkoperation
def getCodeResult(image):
    image1 = splitImage.getNumImage(image,1)
    image2 = splitImage.getNumImage(image,2)
    image3 = splitImage.getNumImage(image,3)
    num1 = checknumber.getnum(image1)
    num2 = checknumber.getnum(image2)
    operation =checkoperation.getoperation(image3)
    # print(str(num1)+":"+str(operation)+":"+str(num2))
    # Operator label 2 means multiply; any other label is treated as add.
    if(int(operation) != 2):
        result =  int(num1) + int(num2)
    else:
        result =  int(num1) * int(num2)
    return result

checknumber.py

#encoding=utf8
from PIL import Image
import test
import collections

f = open("../src/school")
lines = f.readlines()
ips={}
for i in range(0,len(lines)):
    ips[i] = lines[i]
def getnum(image):
    # newimage = test.handimage(image)
    newimage = image
    # Build the feature string "a:b:c:d" (str() replaces the Python 2 backtick syntax).
    result = ":".join(str(n) for n in (test.yCount1(newimage), test.yCount2(newimage), test.xCount1(newimage), test.xCount2(newimage)))
    result_ips = []
    for x in range(len(ips)):
        if(ips[x].find(result)>-1):
            result_ips.append(ips[x].strip("\n").split('-')[1])
    d = collections.Counter(result_ips)
    if(len(d.most_common(1))==0):
        return -1
    else:
        return d.most_common(1)[0][0]

　　splitImage.py

#encoding=utf8
from PIL import Image

def getNumImage(image,type):
    imgry = image.convert("L")
    threshold = 100
    table = []
    for i in range(256):
        if i < threshold:
            table.append(0)
        else:
            table.append(1)
    out = imgry.point(table,'1')
    if(type == 1):  # operand 1
        region = (3,4,16,17)
    elif(type == 2):  # operand 2
        region = (33,4,46,17)
    else:  # operator
        region = (18,4,33,17)
    return out.crop(region)

　　checkoperation.py

#encoding=utf8
from PIL import Image
import test
import collections

f = open("../src/operation")
lines = f.readlines()
ips={}
for i in range(0,len(lines)):
    ips[i] = lines[i]
def getoperation(image):
    # newimage = test.handimage(image)
    newimage = image
    # Build the feature string "a:b:c:d" (str() replaces the Python 2 backtick syntax).
    result = ":".join(str(n) for n in (test.yCount1(newimage), test.yCount2(newimage), test.xCount1(newimage), test.xCount2(newimage)))
    result_ips = []
    for x in range(len(ips)):
        if(ips[x].find(result)>-1):
            result_ips.append(ips[x].strip("\n").split('-')[1])
    d = collections.Counter(result_ips)
    if(len(d.most_common(1))==0):
        return -1
    else:
        return d.most_common(1)[0][0]

　　test.py

#encoding=utf8
from pytesseract import *
from PIL import Image

def handimage(image):
    height = image.size[1]
    width = image.size[0]
    # print height,width
    for h in range(height):
        for w in range(width):
            pixel = image.getpixel((w,h))
            if(pixel<127):
                image.putpixel((w,h),0)
            else:
                image.putpixel((w,h),255)
    for h in range(height):
        for w in range(width):
            pixel = image.getpixel((w,h))
            # print pixel
    return image
def yCount1(image):
    count = 0
    x = 3
    for y in range(0,13):
        pixel = image.getpixel((x,y))
        if(pixel==0):
            count = count+1
    return count
def yCount2(image):
    count = 0
    x = 6
    for y in range(0,13):
        pixel = image.getpixel((x,y))
        if(pixel==0):
            count = count+1
    return count
def xCount1(image):
    count = 0
    y = 2
    for x in range(0,13):
        pixel = image.getpixel((x,y))
        if(pixel==0):
            count = count+1
    return count
def xCount2(image):
    count = 0
    y = 11
    for x in range(0,13):
        pixel = image.getpixel((x,y))
        if(pixel==0):
            count = count+1
    return count

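school and operation are the sample files for the operands and the operator respectively; you can collect them yourself.
The CAPTCHA samples are on Baidu Cloud, 500 of them:
Link: http://pan.baidu.com/s/1hrv5w7y Password: igo6
That concludes the CAPTCHA-cracking process.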

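Notes:

a). The code is for study and discussion only.

b). If you find mistakes, corrections are welcome.

c). Please credit the source when reposting.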
