In the article CNN大戰驗證碼 (CNN vs. CAPTCHA), we used TensorFlow to build a simple CNN model to crack the CAPTCHA of a certain website. The CAPTCHA looks like this:
In this article, we will use Keras to build a slightly more complex CNN model to crack the same CAPTCHA.
The preprocessing of the CAPTCHA images will not be described in detail here; interested readers can refer to the article CNN大戰驗證碼.
In this project we have a total of 1668 samples. Each sample is an image of a single character, 16*20 pixels in size. The features of a sample are the pixels of the character image, where 0 stands for white and 1 for black, so each sample has 320 features taking the value 0 or 1, named v1 to v320; the label of a sample is the character itself. Part of the dataset is shown below:
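To make the sample layout concrete, here is a minimal sketch (not part of the original training script) showing how one row of data.csv, with the columns v1 to v320 plus label assumed above, can be reshaped back into a 20*16 character image:

import numpy as np
import pandas as pd

# Read the dataset described above (path taken from the training script below)
df = pd.read_csv('F://verifycode_data/data.csv')

# Take the first sample: 320 pixel features v1..v320 plus its character label
row = df.iloc[0]
pixels = row[['v' + str(i + 1) for i in range(320)]].values.astype(int)

# Reshape the flat 0/1 pixel vector back into a 20x16 (height x width) character image
char_image = pixels.reshape(20, 16)
print(row['label'], char_image.shape)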
With Keras we can build a CNN model quickly and conveniently. The CNN model built in this article is as follows:
The dataset is split into a training set and a test set at a ratio of 7:3 (1167 training samples and 501 test samples). The code for training the model is as follows:
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
from keras.utils import np_utils, plot_model
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.callbacks import EarlyStopping
from keras.layers import Conv2D, MaxPooling2D

# Read the data
df = pd.read_csv('F://verifycode_data/data.csv')

# Label values
vals = range(31)
keys = ['1','2','3','4','5','6','7','8','9','A','B','C','D','E','F','G','H','J','K','L','N','P','Q','R','S','T','U','V','X','Y','Z']
label_dict = dict(zip(keys, vals))

x_data = df[['v'+str(i+1) for i in range(320)]]
y_data = pd.DataFrame({'label': df['label']})
y_data['class'] = y_data['label'].apply(lambda x: label_dict[x])

# Split the data into a training set and a test set (7:3)
X_train, X_test, Y_train, Y_test = train_test_split(x_data, y_data['class'], test_size=0.3, random_state=42)
x_train = np.array(X_train).reshape((1167, 20, 16, 1))
x_test = np.array(X_test).reshape((501, 20, 16, 1))

# One-hot encode the labels
n_classes = 31
y_train = np_utils.to_categorical(Y_train, n_classes)
y_val = np_utils.to_categorical(Y_test, n_classes)

input_shape = x_train[0].shape

# CNN model
model = Sequential()

# Convolutional and pooling layers
model.add(Conv2D(32, kernel_size=(3, 3), input_shape=input_shape, padding='same'))
model.add(Activation('relu'))
model.add(Conv2D(32, kernel_size=(3, 3), padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
# Dropout layer
model.add(Dropout(0.25))

model.add(Conv2D(64, kernel_size=(3, 3), padding='same'))
model.add(Activation('relu'))
model.add(Conv2D(64, kernel_size=(3, 3), padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
model.add(Dropout(0.25))

model.add(Conv2D(128, kernel_size=(3, 3), padding='same'))
model.add(Activation('relu'))
model.add(Conv2D(128, kernel_size=(3, 3), padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
model.add(Dropout(0.25))

model.add(Flatten())

# Fully connected layers
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(128, activation='relu'))
model.add(Dense(n_classes, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Plot the model structure
plot_model(model, to_file=r'./model.png', show_shapes=True)

# Train the model
callbacks = [EarlyStopping(monitor='val_acc', patience=5, verbose=1)]
batch_size = 64
n_epochs = 100
history = model.fit(x_train, y_train, batch_size=batch_size, epochs=n_epochs, \
                    verbose=1, validation_data=(x_test, y_val), callbacks=callbacks)
mp = 'F://verifycode_data/verifycode_Keras.h5'
model.save(mp)

# Plot the accuracy curve on the validation (test) set
val_acc = history.history['val_acc']
plt.plot(range(len(val_acc)), val_acc, label='CNN model')
plt.title('Validation accuracy on verifycode dataset')
plt.xlabel('epochs')
plt.ylabel('accuracy')
plt.legend()
plt.show()
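As a quick sanity check, the saved model can be reloaded and re-evaluated on the held-out set. The following is a minimal sketch, assuming the x_test and y_val arrays and the save path defined in the training script above:

from keras.models import load_model

# Reload the model saved above and re-evaluate it on the held-out set
cnn = load_model('F://verifycode_data/verifycode_Keras.h5')
loss, acc = cnn.evaluate(x_test, y_val, verbose=0)
print('held-out loss: %.4f, accuracy: %.4f' % (loss, acc))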
In the code above, we use the early stopping technique when training the model. Early stopping is a callback used to stop training early: training is halted once the monitored quantity stops improving (i.e. improves by less than some threshold) for a given number of epochs. Here we monitor the accuracy on the validation set ('val_acc') with a patience of 5 epochs.
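For illustration, a stand-alone EarlyStopping configuration might look like the sketch below; note that the min_delta argument (the improvement threshold) is shown only as an example and is left at its default of 0 in the training code above:

from keras.callbacks import EarlyStopping

# Stop training once 'val_acc' has not improved by at least min_delta
# for `patience` consecutive epochs (min_delta=0.001 is illustrative only)
early_stop = EarlyStopping(monitor='val_acc', min_delta=0.001, patience=5, verbose=1)
# then: model.fit(..., callbacks=[early_stop])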
Running the training code above produces the following output:
......(earlier output omitted)
Epoch 22/100
64/1167 [>.............................] - ETA: 3s - loss: 0.0399 - acc: 1.0000
128/1167 [==>...........................] - ETA: 3s - loss: 0.1195 - acc: 0.9844
192/1167 [===>..........................] - ETA: 2s - loss: 0.1085 - acc: 0.9792
256/1167 [=====>........................] - ETA: 2s - loss: 0.1132 - acc: 0.9727
320/1167 [=======>......................] - ETA: 2s - loss: 0.1045 - acc: 0.9750
384/1167 [========>.....................] - ETA: 2s - loss: 0.1006 - acc: 0.9740
448/1167 [==========>...................] - ETA: 2s - loss: 0.1522 - acc: 0.9643
512/1167 [============>.................] - ETA: 1s - loss: 0.1450 - acc: 0.9648
576/1167 [=============>................] - ETA: 1s - loss: 0.1368 - acc: 0.9653
640/1167 [===============>..............] - ETA: 1s - loss: 0.1353 - acc: 0.9641
704/1167 [=================>............] - ETA: 1s - loss: 0.1280 - acc: 0.9659
768/1167 [==================>...........] - ETA: 1s - loss: 0.1243 - acc: 0.9674
832/1167 [====================>.........] - ETA: 0s - loss: 0.1577 - acc: 0.9639
896/1167 [======================>.......] - ETA: 0s - loss: 0.1488 - acc: 0.9665
960/1167 [=======================>......] - ETA: 0s - loss: 0.1488 - acc: 0.9656
1024/1167 [=========================>....] - ETA: 0s - loss: 0.1427 - acc: 0.9668
1088/1167 [==========================>...] - ETA: 0s - loss: 0.1435 - acc: 0.9669
1152/1167 [============================>.] - ETA: 0s - loss: 0.1383 - acc: 0.9688
1167/1167 [==============================] - 4s 3ms/step - loss: 0.1380 - acc: 0.9683 - val_loss: 0.0835 - val_acc: 0.9760
Epoch 00022: early stopping
As we can see, training stopped after 22 epochs. After the last epoch, the accuracy on the training set was 96.83% and the accuracy on the test (validation) set was 97.60%. The accuracy curve on the test set is shown in the figure below:
After the model is trained, we use it to predict new CAPTCHAs. The 100 new CAPTCHA images are shown below:
Using the trained CNN model, we predict these new CAPTCHAs. The prediction code in Python is as follows:
# -*- coding: utf-8 -*-
import os
import cv2
import numpy as np

def split_picture(imagepath):

    # Read the image in grayscale mode
    gray = cv2.imread(imagepath, 0)

    # Turn the image border white
    height, width = gray.shape
    for i in range(width):
        gray[0, i] = 255
        gray[height-1, i] = 255
    for j in range(height):
        gray[j, 0] = 255
        gray[j, width-1] = 255

    # Median filtering
    blur = cv2.medianBlur(gray, 3)  # kernel size 3*3

    # Binarization
    ret, thresh1 = cv2.threshold(blur, 200, 255, cv2.THRESH_BINARY)

    # Extract single characters
    chars_list = []
    image, contours, hierarchy = cv2.findContours(thresh1, 2, 2)  # OpenCV 3.x returns three values
    for cnt in contours:
        # Minimal bounding rectangle
        x, y, w, h = cv2.boundingRect(cnt)
        if x != 0 and y != 0 and w*h >= 100:
            chars_list.append((x, y, w, h))

    sorted_chars_list = sorted(chars_list, key=lambda x: x[0])
    for i, item in enumerate(sorted_chars_list):
        x, y, w, h = item
        cv2.imwrite('F://test_verifycode/chars/%d.jpg' % (i+1), thresh1[y:y+h, x:x+w])

def remove_edge_picture(imagepath):

    image = cv2.imread(imagepath, 0)
    height, width = image.shape
    corner_list = [image[0, 0] < 127,
                   image[height-1, 0] < 127,
                   image[0, width-1] < 127,
                   image[height-1, width-1] < 127
                   ]
    if sum(corner_list) >= 3:
        os.remove(imagepath)

def resplit_with_parts(imagepath, parts):
    image = cv2.imread(imagepath, 0)
    os.remove(imagepath)
    height, width = image.shape

    file_name = imagepath.split('/')[-1].split(r'.')[0]
    # Re-split the image into `parts` pieces
    step = width//parts  # step size
    start = 0            # starting position
    for i in range(parts):
        cv2.imwrite('F://test_verifycode/chars/%s.jpg' % (file_name+'-'+str(i)), \
                    image[:, start:start+step])
        start += step

def resplit(imagepath):

    image = cv2.imread(imagepath, 0)
    height, width = image.shape

    if width >= 64:
        resplit_with_parts(imagepath, 4)
    elif width >= 48:
        resplit_with_parts(imagepath, 3)
    elif width >= 26:
        resplit_with_parts(imagepath, 2)

# rename and convert to 16*20 size
def convert(dir, file):

    imagepath = dir+'/'+file
    # Read the image
    image = cv2.imread(imagepath, 0)
    # Binarization
    ret, thresh = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)
    img = cv2.resize(thresh, (16, 20), interpolation=cv2.INTER_AREA)
    # Save the image
    cv2.imwrite('%s/%s' % (dir, file), img)

# Read the image data and convert it to 0-1 values
def Read_Data(dir, file):

    imagepath = dir+'/'+file
    # Read the image
    image = cv2.imread(imagepath, 0)
    # Binarization
    ret, thresh = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)
    # Flatten the pixels into 0-1 values
    bin_values = [1 if pixel == 255 else 0 for pixel in thresh.ravel()]

    return bin_values

def predict(VerifyCodePath):

    dir = 'F://test_verifycode/chars'
    files = os.listdir(dir)

    # Remove any existing files
    if files:
        for file in files:
            os.remove(dir + '/' + file)

    split_picture(VerifyCodePath)

    files = os.listdir(dir)
    if not files:
        print('The folder is empty!')
    else:
        # Remove noise images
        for file in files:
            remove_edge_picture(dir + '/' + file)

        # Re-split images with connected characters
        for file in os.listdir(dir):
            resplit(dir + '/' + file)

        # Resize all images to 16*20
        for file in os.listdir(dir):
            convert(dir, file)

        # Vectors representing the characters in the images
        files = sorted(os.listdir(dir), key=lambda x: x[0])
        table = np.array([Read_Data(dir, file) for file in files]).reshape(-1, 20, 16, 1)

        # Path of the saved model
        mp = 'F://verifycode_data/verifycode_Keras.h5'
        # Load the model
        from keras.models import load_model
        cnn = load_model(mp)
        # Model prediction
        y_pred = cnn.predict(table)
        predictions = np.argmax(y_pred, axis=1)

        # Label dictionary
        keys = range(31)
        vals = ['1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H',
                'J', 'K', 'L', 'N', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'X', 'Y', 'Z']
        label_dict = dict(zip(keys, vals))

        return ''.join([label_dict[pred] for pred in predictions])

def main():

    dir = 'F://VerifyCode/'
    correct = 0
    for i, file in enumerate(os.listdir(dir)):
        true_label = file.split('.')[0]
        VerifyCodePath = dir+file
        pred = predict(VerifyCodePath)

        if true_label == pred:
            correct += 1
        print(i+1, (true_label, pred), true_label == pred, correct)

    total = len(os.listdir(dir))
    print('\nTotal images: %d\nCorrectly recognized: %d\nRecognition accuracy: %.2f%%.' \
          % (total, correct, correct*100/total))

main()
The prediction results of the CNN model are as follows:
Using TensorFlow backend.
2018-10-25 15:13:50.390130: I C:\tf_jenkins\workspace\rel-win\M\windows\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
1 ('ZK6N', 'ZK6N') True 1
2 ('4JPX', '4JPX') True 2
3 ('5GP5', '5GP5') True 3
4 ('5RQ8', '5RQ8') True 4
5 ('5TQP', '5TQP') True 5
6 ('7S62', '7S62') True 6
7 ('8R2Z', '8R2Z') True 7
8 ('8RFV', '8RFV') True 8
9 ('9BBT', '9BBT') True 9
10 ('9LNE', '9LNE') True 10
11 ('67UH', '67UH') True 11
12 ('74UK', '74UK') True 12
13 ('A5T2', 'A5T2') True 13
14 ('AHYV', 'AHYV') True 14
15 ('ASEY', 'ASEY') True 15
16 ('B371', 'B371') True 16
17 ('CCQL', 'CCQL') True 17
18 ('CFD5', 'GFD5') False 17
19 ('CJLJ', 'CJLJ') True 18
20 ('D4QV', 'D4QV') True 19
21 ('DFQ8', 'DFQ8') True 20
22 ('DP18', 'DP18') True 21
23 ('E3HC', 'E3HC') True 22
24 ('E8VB', 'E8VB') True 23
25 ('DE1U', 'DE1U') True 24
26 ('FK1R', 'FK1R') True 25
27 ('FK91', 'FK91') True 26
28 ('FSKP', 'FSKP') True 27
29 ('FVZP', 'FVZP') True 28
30 ('GC6H', 'GC6H') True 29
31 ('GH62', 'GH62') True 30
32 ('H9FQ', 'H9FQ') True 31
33 ('H67Q', 'H67Q') True 32
34 ('HEKC', 'HEKC') True 33
35 ('HV2B', 'HV2B') True 34
36 ('J65Z', 'J65Z') True 35
37 ('JZCX', 'JZCX') True 36
38 ('KH5D', 'KH5D') True 37
39 ('KXD2', 'KXD2') True 38
40 ('1GDH', '1GDH') True 39
41 ('LCL3', 'LCL3') True 40
42 ('LNZR', 'LNZR') True 41
43 ('LZU5', 'LZU5') True 42
44 ('N5AK', 'N5AK') True 43
45 ('N5Q3', 'N5Q3') True 44
46 ('N96Z', 'N96Z') True 45
47 ('NCDG', 'NCDG') True 46
48 ('NELS', 'NELS') True 47
49 ('P96U', 'P96U') True 48
50 ('PD42', 'PD42') True 49
51 ('PECG', 'PEQG') False 49
52 ('PPZF', 'PPZF') True 50
53 ('PUUL', 'PUUL') True 51
54 ('Q2DN', 'D2DN') False 51
55 ('QCQ9', 'QCQ9') True 52
56 ('QDB1', 'QDBJ') False 52
57 ('QZUD', 'QZUD') True 53
58 ('R3T5', 'R3T5') True 54
59 ('S1YT', 'S1YT') True 55
60 ('SP7L', 'SP7L') True 56
61 ('SR2K', 'SR2K') True 57
62 ('SUP5', 'SVP5') False 57
63 ('T2SP', 'T2SP') True 58
64 ('U6V9', 'U6V9') True 59
65 ('UC9P', 'UC9P') True 60
66 ('UFYD', 'UFYD') True 61
67 ('V9NJ', 'V9NH') False 61
68 ('V35X', 'V35X') True 62
69 ('V98F', 'V98F') True 63
70 ('VD28', 'VD28') True 64
71 ('YGHE', 'YGHE') True 65
72 ('YNKD', 'YNKD') True 66
73 ('YVXV', 'YVXV') True 67
74 ('ZFBS', 'ZFBS') True 68
75 ('ET6X', 'ET6X') True 69
76 ('TKVC', 'TKVC') True 70
77 ('2UCU', '2UCU') True 71
78 ('HNBK', 'HNBK') True 72
79 ('X8FD', 'X8FD') True 73
80 ('ZGNX', 'ZGNX') True 74
81 ('LQCU', 'LQCU') True 75
82 ('JNZY', 'JNZVY') False 75
83 ('RX34', 'RX34') True 76
84 ('811E', '811E') True 77
85 ('ETDX', 'ETDX') True 78
86 ('4CPR', '4CPR') True 79
87 ('FE91', 'FE91') True 80
88 ('B7XH', 'B7XH') True 81
89 ('1RUA', '1RUA') True 82
90 ('UBCX', 'UBCX') True 83
91 ('KVT5', 'KVT5') True 84
92 ('HZ3A', 'HZ3A') True 85
93 ('3XLR', '3XLR') True 86
94 ('VC7T', 'VC7T') True 87
95 ('7PG1', '7PQ1') False 87
96 ('4F21', '4F21') True 88
97 ('3HLJ', '3HLJ') True 89
98 ('1KT7', '1KT7') True 90
99 ('1RHE', '1RHE') True 91
100 ('1TTA', '1TTA') True 92

Total images: 100
Correctly recognized: 92
Recognition accuracy: 92.00%.
As we can see, the trained CNN model achieves an accuracy of over 90% (92 out of 100) on new CAPTCHAs.
In the article CNN大戰驗證碼, I built the CNN model with TensorFlow: the code was lengthy and training took more than two hours. Building the model with Keras keeps the code concise, and with the early stopping technique the training time is much shorter while the model's accuracy is preserved, which shows the advantages of Keras.
This project is open source. The GitHub repository is: https://github.com/percent4/C... .
Note: I have started a WeChat official account: Python爬蟲與算法 (WeChat ID: easy_web_scrape). Everyone is welcome to follow it!