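Character CAPTCHA Recognition: Project Notes

Project overview:

I have recently been working on an interesting project that requires recognizing the CAPTCHAs of a particular website.

The site's CAPTCHA is shown in the figure; each image is 30x106x3 pixels (height x width x channels).

About 1000 CAPTCHAs were labeled by hand:

The goal is to recognize new CAPTCHAs with machine learning. Four approaches were designed, plus one alternative that was quickly discarded: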

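  • KNN + original sample images: requires denoising, binarization, cropping, and segmentation of each image. Needs less data than a CNN.

  • CNN + original sample images: drawback: few samples; advantage: high data quality.

  • CNN + synthetic look-alike CAPTCHAs: drawback: making the generated images resemble the real ones takes considerable skill; advantage: plenty of samples.

  • CNN + single-character sample images: advantage: small input images and fewer output classes.

  • Other: for example pytesseract with denoising and binarization. A quick trial gave very low accuracy, so it was dropped.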

Approach 1: KNN + original sample images

Steps:

  • Denoising. Original image:

    import numpy as np

    class NoiseDel():
        # Remove interfering noise
        def noise_del(self, img):
            height = img.shape[0]
            width = img.shape[1]
            channels = img.shape[2]

            # Clear noise pixels along the four borders
            for row in [0, height-1]:
                for column in range(0, width):
                    if img[row, column, 0] == 0 and img[row, column, 1] == 0:
                        img[row, column, 0] = 255
                        img[row, column, 1] = 255
            for column in [0, width-1]:
                for row in range(0, height):
                    if img[row, column, 0] == 0 and img[row, column, 1] == 0:
                        img[row, column, 0] = 255
                        img[row, column, 1] = 255

            # Clear noise pixels in the interior
            for row in range(1, height-1):
                for column in range(1, width-1):
                    if img[row, column, 0] == 0 and img[row, column, 1] == 0:
                        a = img[row - 1, column]  # above
                        b = img[row + 1, column]  # below
                        c = img[row, column - 1]  # left
                        d = img[row, column + 1]  # right
                        ps = [p for p in [a, b, c, d] if 1 < p[1] < 255]
                        # If the pixels above and below, or left and right, are white, set white
                        if (a[1] == 255 and b[1] == 255) or (c[1] == 255 and d[1] == 255):
                            img[row, column, 0] = 255
                            img[row, column, 1] = 255
                        # Otherwise set gray: the mean of the neighbours whose
                        # green value lies strictly between 1 and 255
                        elif len(ps) > 1:
                            kk = np.array(ps).mean(axis=0)
                            img[row, column, 0] = kk[0]
                            img[row, column, 1] = kk[1]
                            img[row, column, 2] = kk[2]
                        else:
                            img[row, column, 0] = 255
                            img[row, column, 1] = 255
            return img

        # Convert to grayscale
        def convert2gray(self, img):
            if len(img.shape) > 2:
                gray = np.mean(img, -1)
                # The conversion above is faster; the standard conversion is:
                # r, g, b = img[:,:,0], img[:,:,1], img[:,:,2]
                # gray = 0.2989 * r + 0.5870 * g + 0.1140 * b
                return gray
            else:
                return img

        # Binarize in place; with cov=True the result is inverted
        def binarizing(self, img, threshold, cov=False):
            rows, cols = img.shape
            if cov:
                for r in range(rows):
                    for c in range(cols):
                        if img[r, c] > threshold:
                            img[r, c] = 0
                        else:
                            img[r, c] = 255
            else:
                for r in range(rows):
                    for c in range(cols):
                        if img[r, c] < threshold:
                            img[r, c] = 0
                        else:
                            img[r, c] = 255
            return img

    After denoising:

  • Crop to the minimal bounding box

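    import numpy as np
    from PIL import Image
    from noise_prc import NoiseDel

    nd = NoiseDel()

    def cut_box(img, resize=(64, 18)):
        # Grayscale, then inverted binarization (character pixels become non-zero)
        image = nd.convert2gray(img)
        image = nd.binarizing(image, 190, True)

        image = Image.fromarray(image)
        img0 = Image.fromarray(img)
        # Bounding box of the non-zero region, padded by 2 pixels on each side
        box = image.getbbox()
        box1 = (box[0]-2, box[1]-2, box[2]+2, box[3]+2)
        image = img0.crop(box1)
        # PIL's resize expects (width, height)
        image = image.resize(resize)

        return np.array(image)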

    After cropping:

  • Split into characters:

    def seg_img(img):
        # Split the cropped image into 4 equal-width character slices
        h, w, c = img.shape
        d = int(w/4)
        img_list = []
        for i in range(4):
            img_list.append(img[:, i*d:(i+1)*d])
        return img_list

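    After segmentation:

    Processing the 1000 labeled images yields a single-character image dataset for each letter and digit:

  • KNN training and prediction:

    Each image is converted to grayscale first.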

    import os
    from PIL import Image
    import numpy as np
    from cut_prc import cut_box, seg_img
    from noise_prc import NoiseDel
    from sklearn import neighbors, svm, tree, linear_model
    from sklearn.model_selection import train_test_split

    nd = NoiseDel()

    def predict_img(img, clf):
        # Denoise, crop, split into characters, then classify each character
        text = ''
        image = nd.noise_del(img)
        image = cut_box(image)
        image_list = seg_img(image)

        for im in image_list:
            im = nd.convert2gray(im)
            im = im.reshape(-1)
            c = clf.predict([im,])[0]
            text += c
        return text

    if __name__=="__main__":

        # Load the training data: one sub-folder per character class
        path = 'data/png_cut'
        classes = os.listdir(path)
        x = []
        y = []
        for c in classes:
            c_f = os.path.join(path, c)
            if os.path.isdir(c_f):
                files = os.listdir(c_f)
                for file in files:
                    img = Image.open(os.path.join(c_f, file))
                    img = np.array(img)
                    img = nd.convert2gray(img)
                    img = img.reshape(-1)
                    x.append(img)
                    y.append(c.replace('_',''))

        x = np.array(x)
        y = np.array(y)

        # Split into training and test data
        x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.02)

        # Train the KNN classifier
        clf = neighbors.KNeighborsClassifier(n_neighbors=5)
        clf.fit(x_train, y_train)

        # Predict on the test set
        test_pre = clf.predict(x_test)

        print("KNN accuracy score:", (test_pre == y_test).mean())

        # Predict on the new sample set
        newpath = 'data/png_new'
        newfiles = os.listdir(newpath)
        for file in newfiles:
            image = Image.open(os.path.join(newpath,file))
            image = np.array(image)
            text = predict_img(image, clf)
            print(text)

        # # Train an SVM classifier
        # clf = svm.LinearSVC()
        # clf.fit(x_train, y_train)
        #
        # test_pre = clf.predict(x_test)
        # print("SVM accuracy score:", (test_pre == y_test).mean())
        #
        # # Train a decision tree classifier
        # clf = tree.DecisionTreeClassifier()
        # clf.fit(x_train, y_train)
        #
        # test_pre = clf.predict(x_test)
        # print("DT accuracy score:", (test_pre == y_test).mean())
        #
        # # Train a logistic regression classifier
        # clf = linear_model.LogisticRegression()
        # clf.fit(x_train, y_train)
        #
        # test_pre = clf.predict(x_test)
        # print("LR accuracy score:", (test_pre == y_test).mean())

  • Results (single-character accuracy): KNN scores highest at roughly 80%, while SVM, DT, and LR are all lower.

    KNN accuracy score: 0.8170731707317073
    SVM accuracy score: 0.6341463414634146
    DT accuracy score: 0.4878048780487805
    LR accuracy score: 0.5975609756097561

    KNN predictions on new images:

    mHFM
    crdN
    wa5Y
    swFn
    ApB9
    eBrN
    rJpH
    fd9e
    kTVt
    t7ng
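
As a side note, the per-pixel loops in the preprocessing steps can also be written with NumPy array operations. The following is only a sketch under assumptions: the helper names `to_gray`, `binarize`, and `split_equal` are made up for illustration and are not part of the project's `noise_prc` or `cut_prc` modules, and unlike `binarizing` they return a new `uint8` array rather than modifying the input in place. The demo image is synthetic, not a real CAPTCHA.

```python
import numpy as np

def to_gray(img):
    """Average the channels of an H x W x C image; pass 2-D input through."""
    return img.mean(axis=-1) if img.ndim > 2 else img

def binarize(gray, threshold, invert=False):
    """Vectorized thresholding (hypothetical helper, returns a new array)."""
    if invert:
        # Pixels brighter than the threshold become 0, the rest 255.
        return np.where(gray > threshold, 0, 255).astype(np.uint8)
    # Pixels darker than the threshold become 0, the rest 255.
    return np.where(gray < threshold, 0, 255).astype(np.uint8)

def split_equal(img, parts=4):
    """Cut the image into `parts` equal-width slices along the width axis."""
    step = img.shape[1] // parts
    return [img[:, i * step:(i + 1) * step] for i in range(parts)]

# Synthetic stand-in for a cropped 18 x 64 RGB image:
# white background with one dark rectangle.
demo = np.full((18, 64, 3), 255, dtype=np.uint8)
demo[4:14, 20:28] = 30

mask = binarize(to_gray(demo), 190, invert=True)
chars = split_equal(demo)
print(mask.shape, [c.shape for c in chars])
```

With `invert=True` the dark rectangle ends up as 255 and the white background as 0, which is the convention `getbbox()` needs in the cropping step.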

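Approach 2: CNN + original sample images

Steps:

  • Prepare the sample data: 1020 images, converted to grayscale, 30*106 pixels each, with all labels in lowercase (the annotators were too lazy to mark case);

  • Split the data: train 80%, val 20%

  • Network: the input dimension is 30*106; three CNN layers with 16, 128, and 16 output features respectively; an FC layer with 512 outputs; and a final fully connected output of 4x63, where each row holds the predicted probabilities for one character.

  • Result: character accuracy on the validation set peaked at 50%.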
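
To make the 4x63 output layout concrete, here is a small sketch of how a label string could be encoded into that shape and how a prediction could be decoded back. The alphabet is an assumption: the post does not list the 63 symbols, so digits, lowercase, uppercase, and `_` as a filler are used here purely for illustration, and `text_to_vector` / `vector_to_text` are hypothetical helper names.

```python
import string
import numpy as np

# Assumed 63-symbol alphabet (not confirmed by the original project).
CHARSET = string.digits + string.ascii_lowercase + string.ascii_uppercase + "_"
CAPTCHA_LEN = 4

def text_to_vector(text):
    """One-hot encode a label: one row per position, flattened to 4*63."""
    vec = np.zeros((CAPTCHA_LEN, len(CHARSET)), dtype=np.float32)
    for pos, ch in enumerate(text):
        vec[pos, CHARSET.index(ch)] = 1.0
    return vec.reshape(-1)

def vector_to_text(scores):
    """Take the highest-scoring class in each of the 4 rows."""
    idx = np.asarray(scores).reshape(CAPTCHA_LEN, len(CHARSET)).argmax(axis=1)
    return "".join(CHARSET[i] for i in idx)

label = text_to_vector("wa5y")
print(label.shape, vector_to_text(label))
```

The same decoding works on raw network scores, since only the position of the maximum in each row matters.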

Approach 3: CNN + synthetic look-alike CAPTCHAs

CAPTCHAs generated by the third-party library look like this:

from captcha.image import ImageCaptcha  # pip install captcha

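Download matching fonts (these are fairly hard to find), then modify image.py in the third-party library. CAPTCHAs generated after the modification:

The result is fairly close to the CAPTCHAs we need, but there are still differences.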

fonts = ["font/TruenoBdOlIt.otf", "font/Euro Bold.ttf", "STCAIYUN.TTF"]
image = ImageCaptcha(width=106, height=30, fonts=[fonts[0]], font_sizes=(18, 18, 18))
captcha = image.generate(captcha_text)
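
The snippet uses `captcha_text` without showing where it comes from. One plausible way to produce it is to draw random characters, as sketched below. This is an assumption rather than the project's actual code: the function name `random_captcha_text`, the 4-character length, and the letters-plus-digits alphabet are all illustrative choices.

```python
import random
import string

def random_captcha_text(length=4,
                        charset=string.ascii_letters + string.digits,
                        rng=random):
    """Return a random label string drawn uniformly from `charset`."""
    return "".join(rng.choice(charset) for _ in range(length))

captcha_text = random_captcha_text()
print(captcha_text)
```

Passing a seeded `random.Random` instance as `rng` makes the generated training set reproducible.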

image.py

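(omitted)

Training a CNN on the automatically generated CAPTCHAs brought both training and validation accuracy to about 98%, but character accuracy on the 1000 original images used as a test set peaked at only 40%. This shows that the generated CAPTCHAs still differ considerably from the target ones.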

step: 18580/20000...  loss: 0.0105... 
step: 18600/20000...  loss: 0.0121... 
step: 18600/20000...  --------- val_acc: 0.9675     best: 0.9775  --------- test_acc2: 0.4032 
step: 18620/20000...  loss: 0.0131... 
step: 18640/20000...  loss: 0.0139... 
step: 18660/20000...  loss: 0.0135... 
step: 18680/20000...  loss: 0.0156... 
step: 18700/20000...  loss: 0.0109... 
step: 18700/20000...  --------- val_acc: 0.9625     best: 0.9775  --------- test_acc2: 0.3995

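Approach 4: CNN + single-character sample set

With only 1000 samples, an end-to-end CNN that outputs the whole character sequence can hardly reach the required accuracy. Approach 3 tried to work around this by generating samples automatically, but the gap in quality between generated and real samples made the predictions inaccurate. This approach therefore segments the original 1000 samples into single characters, which gives about 4000 samples, a much smaller input dimension, and far fewer output classes. On analysis, this approach looks reasonably feasible.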
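
A quick back-of-the-envelope comparison shows how much the problem shrinks. The numbers come from the post itself (30*106 input and 4x63 output in approach 2; `image_h = 18`, `image_w = 16`, `max_captcha = 1`, and `char_set_len = 50` in the `Config` of this approach); the variable names are just for illustration.

```python
full_input = 30 * 106    # grayscale pixels in one whole CAPTCHA
full_output = 4 * 63     # 4 positions x 63 classes (approach 2)
char_input = 18 * 16     # pixels in one segmented character
char_output = 1 * 50     # max_captcha * char_set_len
samples = 1000 * 4       # 4 characters per labeled image

print(full_input, full_output, char_input, char_output, samples)
```

So each training example carries roughly a tenth of the pixels and a fifth of the output units, while the number of examples grows about fourfold.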

The sample set is prepared the same way as for KNN: processing the 1000 labeled images yields a single-character image dataset for each letter and digit:

import os
import numpy as np
import tensorflow as tf
from PIL import Image

from noise_prc import NoiseDel
from cut_prc import cut_box,seg_img
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer

nd = NoiseDel()

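class Config():
    file_name = 'char_02'  # folder where the model is stored

    w_alpha = 0.01  # CNN weight scale
    b_alpha = 0.1  # CNN bias scale
    image_h = 18
    image_w = 16


    cnn_f = [16,32,32,512]  # [conv1 output features, conv2 output features, conv3 output features, fully connected output size]

    max_captcha = 1  # maximum CAPTCHA length
    char_set_len = 50  # size of the character set

    lr = 0.001
    batch_size = 32  # samples per training batch
    max_steps = 200000  # total number of training batches

    log_every_n = 20  # print the loss every n steps
    save_every_n = 100  # validate and save the model every n steps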


class Model():

    def __init__(self, config):
        self.config = config

        self.input()
        self.cnn()

        # Initialize the session
        self.saver = tf.train.Saver()
        self.session = tf.Session()
        self.session.run(tf.global_variables_initializer())

    def input(self):
        self.X = tf.placeholder(tf.float32, [None, self.config.image_h * self.config.image_w])
        self.Y = tf.placeholder(tf.float32, [None, self.config.max_captcha * self.config.char_set_len])
        self.keep_prob = tf.placeholder(tf.float32)  # dropout

        # Two global variables: the step counter and the best accuracy so far
        self.global_step = tf.Variable(0, trainable=False, name="global_step")
        self.global_loss = tf.Variable(0, dtype=tf.float32, trainable=False, name="global_loss")

    def cnn(self):
        x = tf.reshape(self.X, shape=[-1, self.config.image_h , self.config.image_w, 1])

        # 3 conv layer
        w_c1 = tf.Variable(self.config.w_alpha * tf.random_normal([3, 3, 1, self.config.cnn_f[0]]))
        b_c1 = tf.Variable(self.config.b_alpha * tf.random_normal([self.config.cnn_f[0]]))
        conv1 = tf.nn.relu(tf.nn.bias_add(tf.nn.conv2d(x, w_c1, strides=[1, 1, 1, 1], padding='SAME'), b_c1))
        conv1 = tf.nn.max_pool(conv1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
        conv1 = tf.nn.dropout(conv1, self.keep_prob)

        w_c2 = tf.Variable(self.config.w_alpha * tf.random_normal([3, 3, self.config.cnn_f[0], self.config.cnn_f[1]]))
        b_c2 = tf.Variable(self.config.b_alpha * tf.random_normal([self.config.cnn_f[1]]))
        conv2 = tf.nn.relu(tf.nn.bias_add(tf.nn.conv2d(conv1, w_c2, strides=[1, 1, 1, 1], padding='SAME'), b_c2))
        conv2 = tf.nn.max_pool(conv2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
        conv2 = tf.nn.dropout(conv2, self.keep_prob)

        w_c3 = tf.Variable(self.config.w_alpha * tf.random_normal([3, 3, self.config.cnn_f[1], self.config.cnn_f[2]]))
        b_c3 = tf.Variable(self.config.b_alpha * tf.random_normal([self.config.cnn_f[2]]))
        conv3 = tf.nn.relu(tf.nn.bias_add(tf.nn.conv2d(conv2, w_c3, strides=[1, 1, 1, 1], padding='SAME'), b_c3))
        conv3 = tf.nn.max_pool(conv3, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
        conv3 = tf.nn.dropout(conv3, self.keep_prob)

        # Fully connected layer
        # Static height, width and channel count of the last conv output
        h, w, f = conv3.get_shape().as_list()[1:]

        w_d = tf.Variable(self.config.w_alpha * tf.random_normal([h* w * f, self.config.cnn_f[3]]))
        b_d = tf.Variable(self.config.b_alpha * tf.random_normal([self.config.cnn_f[3]]))
        dense = tf.reshape(conv3, [-1, w_d.get_shape().as_list()[0]])
        dense = tf.nn.relu(tf.add(tf.matmul(dense, w_d), b_d))
        dense = tf.nn.dropout(dense, self.keep_prob)

        w_out = tf.Variable(self.config.w_alpha * tf.random_normal([self.config.cnn_f[3], self.config.max_captcha * self.config.char_set_len]))
        b_out = tf.Variable(self.config.b_alpha * tf.random_normal([self.config.max_captcha * self.config.char_set_len]))
        out = tf.add(tf.matmul(dense, w_out), b_out)
        # out = tf.nn.softmax(out)

        # loss
        # loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(output, Y))
        self.loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=out, labels=self.Y))

        # optimizer: to speed up training, the learning rate should start large and then decay gradually
        self.optimizer = tf.train.AdamOptimizer(learning_rate=self.config.lr).minimize(self.loss,global_step=self.global_step)

        predict = tf.reshape(out, [-1, self.config.max_captcha, self.config.char_set_len])
        self.max_idx_p = tf.argmax(predict, 2)
        max_idx_l = tf.argmax(tf.reshape(self.Y, [-1, self.config.max_captcha, self.config.char_set_len]), 2)
        correct_pred = tf.equal(self.max_idx_p, max_idx_l)
        self.accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))


    def load(self, checkpoint):
        self.saver.restore(self.session, checkpoint)
        print('Restored from: {}'.format(checkpoint))


    def train(self, get_next_batch, model_path, x_train, y_train, x_test, y_test):
        # as_default() leaves the session open after training, so test() can still use it
        sess = self.session
        with sess.as_default():
            while True:
                batch_x, batch_y = get_next_batch(x_train, y_train,self.config.batch_size)
                _, loss_ = sess.run([self.optimizer, self.loss], feed_dict={self.X: batch_x, self.Y: batch_y, self.keep_prob: 0.75})

                if self.global_step.eval() % self.config.log_every_n == 0:
                    print('step: {}/{}... '.format(self.global_step.eval(), self.config.max_steps),
                          'loss: {:.4f}... '.format(loss_))

                # Compute the accuracy every 100 steps
                if self.global_step.eval() % self.config.save_every_n == 0:
                    # batch_x_test, batch_y_test = get_next_batch(100)
                    acc = sess.run(self.accuracy, feed_dict={self.X: x_test, self.Y: y_test, self.keep_prob: 1.})
                    print('step: {}/{}... '.format(self.global_step.eval(), self.config.max_steps),
                          '--------- acc: {:.4f} '.format(acc),
                          '   best: {:.4f} '.format(self.global_loss.eval()))

                    if acc > self.global_loss.eval():
                        print('save best model...')
                        update = tf.assign(self.global_loss, acc)  # update the best value
                        sess.run(update)
                        self.saver.save(sess, os.path.join(model_path, 'model'), global_step=self.global_step)
                    if self.global_step.eval() >= self.config.max_steps:
                        #self.saver.save(sess, os.path.join(model_path, 'model_last'), global_step=self.global_step)
                        break

    def test(self, batch_x_test):
        sess = self.session
        max_idx_p = sess.run(self.max_idx_p, feed_dict={self.X: batch_x_test, self.keep_prob: 1.})
        return max_idx_p


def get_next_batch( train_x, train_y, batch_size=32):
    n = train_x.shape[0]
    chi_list = np.random.choice(n, batch_size)
    return train_x[chi_list],train_y[chi_list]


def img_cut_to_arry(img):
    imgs = []
    image = nd.noise_del(img)
    image = cut_box(image)
    image_list = seg_img(image)
    for im in image_list:
        im = nd.convert2gray(im)
        im = im.reshape(-1)
        imgs.append(im)
    return imgs


if __name__=="__main__":
    # nd = NoiseDel()
    # Load the training data
    path = 'data/png_cut'
    classes = os.listdir(path)
    x = []
    y = []
    for c in classes:
        c_f = os.path.join(path, c)
        if os.path.isdir(c_f):
            files = os.listdir(c_f)
            for file in files:
                img = Image.open(os.path.join(c_f, file))
                img = np.array(img)
                img = nd.convert2gray(img)
                img = img.reshape(-1)
                x.append(img)
                y.append(c.replace('_',''))

    lb = LabelBinarizer()
    ly = lb.fit_transform(y)  # one-hot
    x = np.array(x)
    y = np.array(ly)

    # Split into training and test data
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.02)


    # Create the model directory
    model_path = os.path.join('c_models', Config.file_name)
    if os.path.exists(model_path) is False:
        os.makedirs(model_path)

    # Load the most recently saved model
    model = Model(Config)
    checkpoint_path = tf.train.latest_checkpoint(model_path)
    if checkpoint_path:
        model.load(checkpoint_path)

    # train and val
    print('start to training...')
    model.train(get_next_batch, model_path, x_train,y_train, x_test, y_test)


    # test

    # Predict on the new sample set
    newpath = 'data/png_new'
    newfiles = os.listdir(newpath)
    for file in newfiles:
        pre_text=''
        image = Image.open(os.path.join(newpath,file))
        image = np.array(image)
        imgs= img_cut_to_arry(image)
        for img in imgs:
            p = model.test([img,])
            p_arr = np.zeros([1,50])
            p_arr[0,p] =1
            c = lb.inverse_transform(p_arr)
            pre_text += c[0]


        print(pre_text)
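
One detail that is easy to miss in the model above is how large the flattened feature vector is when it reaches the fully connected layer. The stride-1 `SAME` convolutions keep the spatial size, and each 2x2 max-pool with stride 2 and `SAME` padding rounds the halved size up. A minimal sketch of that arithmetic, assuming the `Config` values shown above (18x16 input, 32 features after the third conv layer); `same_pool_out` is a hypothetical helper name:

```python
import math

def same_pool_out(size, stride=2):
    """Output length of a pooling layer with SAME padding."""
    return math.ceil(size / stride)

h, w = 18, 16          # image_h, image_w from Config
features = 32          # cnn_f[2], output features of the third conv layer
for _ in range(3):     # three conv + pool stages
    h, w = same_pool_out(h), same_pool_out(w)

flat = h * w * features
print(h, w, flat)
```

The height goes 18, 9, 5, 3 and the width 16, 8, 4, 2, so the dense layer receives 3 * 2 * 32 = 192 inputs.
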
  • Results: character prediction accuracy above 95%
step: 2500/200000...  loss: 0.0803... 
step: 2500/200000...  --------- acc: 0.0854     best: 0.1341 
step: 2520/200000...  loss: 0.0818... 
step: 2540/200000...  loss: 0.0844... 
step: 2560/200000...  loss: 0.0827... 
step: 2580/200000...  loss: 0.0794... 
step: 2600/200000...  loss: 0.0823... 
step: 2600/200000...  --------- acc: 0.1951     best: 0.1341 
save best model...
step: 2620/200000...  loss: 0.0775... 
step: 2640/200000...  loss: 0.0754... 
step: 2660/200000...  loss: 0.0823... 
step: 2680/200000...  loss: 0.0678... 
step: 2700/200000...  loss: 0.0763... 
step: 2700/200000...  --------- acc: 0.3049     best: 0.1951 
.
.
.
.
step: 41400/200000...  --------- acc: 0.8659     best: 0.9512 
step: 41450/200000...  loss: 0.0091... 
step: 41500/200000...  loss: 0.0134... 
step: 41550/200000...  loss: 0.0151... 
step: 41600/200000...  loss: 0.0256... 
step: 41600/200000...  --------- acc: 0.9390     best: 0.9512

Predictions on new images:

mHPM
srdN
wa5Y
eWpn
AgB9
eHr8
rJpH
fd9e
bTYt
tTwg
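
Note that all the accuracies above are per character, while a CAPTCHA only counts as solved when all four characters are right. As a rough estimate, and assuming character errors are independent (which may well not hold in practice), the whole-CAPTCHA accuracy is the per-character accuracy raised to the fourth power. The function name below is illustrative.

```python
def whole_captcha_accuracy(char_acc, length=4):
    """Estimate full-string accuracy under an independence assumption."""
    return char_acc ** length

# Per-character figures taken from the logs in this post.
knn_full = whole_captcha_accuracy(0.8170731707317073)
cnn_full = whole_captcha_accuracy(0.9512)
print(round(knn_full, 3), round(cnn_full, 3))
```

Under that assumption, KNN would solve somewhat under half of the CAPTCHAs and the single-character CNN roughly four out of five.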

Finally, the source code is available at: github
