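I have recently been working on an interesting project that requires recognizing the captchas of a certain website.
The site's captcha is a small color image of size 30x106x3 (height x width x channels).
About 1,000 captchas have been labeled by hand.
A machine learning method is now needed to recognize new captchas. Four schemes were considered: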
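1. KNN + original sample images; requires denoising, binarization, cropping and similar preprocessing. Needs less data than a CNN.
2. CNN + original sample images; drawback: few samples; advantage: high data quality.
3. CNN + synthetic look-alike captchas; drawback: making the generated captchas resemble the originals takes considerable skill; advantage: plenty of samples.
4. CNN + single-character sample images; advantage: small input images and few output classes.
Other: e.g. pytesseract + denoising + binarization. A quick try gave very low accuracy, so it was dropped.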
Steps:
Denoising. Original image:
import numpy as np


class NoiseDel():
    # Remove interfering noise
    def noise_del(self, img):
        height = img.shape[0]
        width = img.shape[1]

        # Clear noise pixels on the four borders.
        # Note: only channels 0 and 1 are overwritten; channel 2 is left unchanged.
        for row in [0, height - 1]:
            for column in range(0, width):
                if img[row, column, 0] == 0 and img[row, column, 1] == 0:
                    img[row, column, 0] = 255
                    img[row, column, 1] = 255
        for column in [0, width - 1]:
            for row in range(0, height):
                if img[row, column, 0] == 0 and img[row, column, 1] == 0:
                    img[row, column, 0] = 255
                    img[row, column, 1] = 255

        # Clear noise pixels in the interior
        for row in range(1, height - 1):
            for column in range(1, width - 1):
                if img[row, column, 0] == 0 and img[row, column, 1] == 0:
                    a = img[row - 1, column]  # above
                    b = img[row + 1, column]  # below
                    c = img[row, column - 1]  # left
                    d = img[row, column + 1]  # right
                    ps = [p for p in [a, b, c, d] if 1 < p[1] < 255]
                    # If the pixels above and below, or left and right, are white, set to white
                    if (a[1] == 255 and b[1] == 255) or (c[1] == 255 and d[1] == 255):
                        img[row, column, 0] = 255
                        img[row, column, 1] = 255
                    # Otherwise fill with the mean of the grey neighbours
                    elif len(ps) > 1:
                        kk = np.array(ps).mean(axis=0)
                        img[row, column, 0] = kk[0]
                        img[row, column, 1] = kk[1]
                        img[row, column, 2] = kk[2]
                    else:
                        img[row, column, 0] = 255
                        img[row, column, 1] = 255
        return img

    # Grayscale conversion
    def convert2gray(self, img):
        if len(img.shape) > 2:
            gray = np.mean(img, -1)
            # The conversion above is faster; the standard conversion is:
            # r, g, b = img[:,:,0], img[:,:,1], img[:,:,2]
            # gray = 0.2989 * r + 0.5870 * g + 0.1140 * b
            return gray
        else:
            return img

    # Binarization (in place); cov=True inverts the result,
    # so pixels above the threshold become 0
    def binarizing(self, img, threshold, cov=False):
        rows, cols = img.shape
        for r in range(rows):
            for c in range(cols):
                if cov:
                    img[r, c] = 0 if img[r, c] > threshold else 255
                else:
                    img[r, c] = 0 if img[r, c] < threshold else 255
        return img
After denoising:
Crop to the minimal bounding box
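import numpy as np
from PIL import Image

from noise_prc import NoiseDel

nd = NoiseDel()


def cut_box(img, resize=(64, 18)):
    # Grayscale, then inverted binarization so the characters become non-zero
    image = nd.convert2gray(img)
    image = nd.binarizing(image, 190, True)
    image = Image.fromarray(image)
    img0 = Image.fromarray(img)
    # Bounding box of the non-zero region, padded by 2 pixels on each side
    box = image.getbbox()
    box1 = (box[0] - 2, box[1] - 2, box[2] + 2, box[3] + 2)
    image = img0.crop(box1)
    # PIL takes (width, height), so the result is 18 rows by 64 columns
    image = image.resize(resize)
    return np.array(image)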
After cropping:
Splitting into characters:
def seg_img(img):
    # Split the cropped image into 4 equal-width character images
    h, w, c = img.shape
    d = int(w / 4)
    img_list = []
    for i in range(4):
        img_list.append(img[:, i * d:(i + 1) * d])
    return img_list
After splitting:
Processing the 1,000 labeled images this way yields a single-character image dataset for each letter and digit:
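The script that builds this per-character dataset is not shown in the post. As a rough sketch of the grouping step, assume each labeled captcha has already been denoised, cropped and split into four segments; the function name `group_by_char` and the `(label, segments)` input format are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def group_by_char(samples):
    """Group character segments by their label character.

    samples: iterable of (label, segments) pairs, where label is the
    captcha text and segments is the list of per-character images
    (hypothetical input format, not taken from the original code).
    """
    dataset = defaultdict(list)
    for label, segments in samples:
        # Skip captchas whose label length does not match the segment count
        if len(label) != len(segments):
            continue
        for ch, seg in zip(label, segments):
            dataset[ch].append(seg)
    return dataset


# Strings stand in for image arrays here
demo = group_by_char([("ab1a", ["s0", "s1", "s2", "s3"])])
print(sorted(demo.keys()))
```

Each key of the returned dictionary would then correspond to one class folder of character images.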
KNN training and prediction:
Convert the images to grayscale
import os

import numpy as np
from PIL import Image
from sklearn import neighbors, svm, tree, linear_model
from sklearn.model_selection import train_test_split

from cut_prc import cut_box, seg_img
from noise_prc import NoiseDel

nd = NoiseDel()


def predict_img(img, clf):
    # Denoise, crop, split into characters, then classify each character
    text = ''
    image = nd.noise_del(img)
    image = cut_box(image)
    image_list = seg_img(image)
    for im in image_list:
        im = nd.convert2gray(im)
        im = im.reshape(-1)
        c = clf.predict([im, ])[0]
        text += c
    return text


if __name__ == "__main__":
    # Load the training data: one sub-directory per character class
    path = 'data/png_cut'
    classes = os.listdir(path)
    x = []
    y = []
    for c in classes:
        c_f = os.path.join(path, c)
        if os.path.isdir(c_f):
            files = os.listdir(c_f)
            for file in files:
                img = Image.open(os.path.join(c_f, file))
                img = np.array(img)
                img = nd.convert2gray(img)
                img = img.reshape(-1)
                x.append(img)
                y.append(c.replace('_', ''))
    x = np.array(x)
    y = np.array(y)

    # Split into training and test data
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.02)

    # Train the KNN classifier
    clf = neighbors.KNeighborsClassifier(n_neighbors=5)
    clf.fit(x_train, y_train)

    # Predict on the test set
    test_pre = clf.predict(x_test)
    print("KNN accuracy score:", (test_pre == y_test).mean())

    # Predict on the new samples
    newpath = 'data/png_new'
    newfiles = os.listdir(newpath)
    for file in newfiles:
        image = Image.open(os.path.join(newpath, file))
        image = np.array(image)
        text = predict_img(image, clf)
        print(text)

    # # Train an SVM classifier
    # clf = svm.LinearSVC()
    # clf.fit(x_train, y_train)
    #
    # test_pre = clf.predict(x_test)
    # print("SVM accuracy score:", (test_pre == y_test).mean())
    #
    # # Train a decision tree classifier
    # clf = tree.DecisionTreeClassifier()
    # clf.fit(x_train, y_train)
    #
    # test_pre = clf.predict(x_test)
    # print("DT accuracy score:", (test_pre == y_test).mean())
    #
    # # Train a logistic regression classifier
    # clf = linear_model.LogisticRegression()
    # clf.fit(x_train, y_train)
    #
    # test_pre = clf.predict(x_test)
    # print("LR accuracy score:", (test_pre == y_test).mean())
Results (single-character accuracy): KNN is the highest at roughly 80% (0.817 in this run), while SVM, DT and LR are all lower.
KNN accuracy score: 0.8170731707317073
SVM accuracy score: 0.6341463414634146
DT accuracy score: 0.4878048780487805
LR accuracy score: 0.5975609756097561
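These scores are per character, while a captcha is only solved when all four characters are right. As a back-of-the-envelope estimate, and assuming (this is an assumption, not something measured in the post) that errors on the four characters are independent, whole-captcha accuracy is the per-character accuracy raised to the fourth power. The helper name `captcha_accuracy` is made up for this sketch.

```python
def captcha_accuracy(char_acc, n_chars=4):
    """Estimate whole-captcha accuracy from per-character accuracy,
    assuming the characters are predicted independently."""
    return char_acc ** n_chars


# Per-character scores copied from the run above
scores = {
    "KNN": 0.8170731707317073,
    "SVM": 0.6341463414634146,
    "DT": 0.4878048780487805,
    "LR": 0.5975609756097561,
}
for name, acc in scores.items():
    print("{}: char={:.3f} captcha~{:.3f}".format(name, acc, captcha_accuracy(acc)))
```

Under that assumption, even the best classifier here would solve fewer than half of the captchas outright (about 45% for KNN).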
KNN predictions on new images:
mHFM crdN wa5Y swFn ApB9 eBrN rJpH fd9e kTVt t7ng
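Steps:
Process the sample data: 1,020 images, converted to grayscale, 30*106 pixels; the labels are lowercase characters (the annotators got lazy);
Split the data: train 80%, val 20%;
Network model: the input dimension is 30*106; a three-layer CNN whose layers output 16, 128 and 16 feature maps respectively; an FC layer with 512 outputs; and a final fully connected layer with a 4x63 output, where each row holds the predicted probabilities for one character.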
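To make the 4x63 output concrete, here is a minimal sketch of turning such an output into text by taking the highest-scoring class in each row. The post does not list the 63 symbols, so `CHARSET` below (digits, letters and one placeholder symbol) is purely an assumption, as is the function name `decode`.

```python
import string

# Hypothetical 63-symbol character set: 10 digits, 52 letters and one
# placeholder symbol. The real set used in the post is not given.
CHARSET = string.digits + string.ascii_lowercase + string.ascii_uppercase + "_"


def decode(scores):
    """scores: 4 rows of 63 per-class scores, one row per character.
    Returns the text formed by the argmax character of each row."""
    text = ""
    for row in scores:
        best = max(range(len(row)), key=row.__getitem__)
        text += CHARSET[best]
    return text


def one_hot(ch):
    """Build a 63-long score row that peaks at the given character."""
    row = [0.0] * len(CHARSET)
    row[CHARSET.index(ch)] = 1.0
    return row


print(decode([one_hot(c) for c in "mHFM"]))
```

The same argmax-per-row idea is what the accuracy computation in a model of this shape compares against the one-hot labels.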
Result: character accuracy on the validation set peaked at 50%.
Captchas generated by the third-party library look like this:
from captcha.image import ImageCaptcha # pip install captcha
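Download the matching fonts (fairly hard to find), then modify the image.py file in the third-party library. Captchas generated after the modification:
They look fairly similar to the captchas we need, but there are still differences.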
fonts = ["font/TruenoBdOlIt.otf", "font/Euro Bold.ttf", "STCAIYUN.TTF"]
image = ImageCaptcha(width=106, height=30, fonts=[fonts[0]], font_sizes=(18, 18, 18))
captcha = image.generate(captcha_text)
image.py
(omitted)
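Training a CNN on the automatically generated captchas brought both training and validation accuracy to about 98%, but character accuracy on the 1,000 original images peaked at only 40%. The generated captchas evidently still differ considerably from the target captchas.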
step: 18580/20000... loss: 0.0105...
step: 18600/20000... loss: 0.0121...
step: 18600/20000... --------- val_acc: 0.9675 best: 0.9775 --------- test_acc2: 0.4032
step: 18620/20000... loss: 0.0131...
step: 18640/20000... loss: 0.0139...
step: 18660/20000... loss: 0.0135...
step: 18680/20000... loss: 0.0156...
step: 18700/20000... loss: 0.0109...
step: 18700/20000... --------- val_acc: 0.9625 best: 0.9775 --------- test_acc2: 0.3995
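With only 1,000 samples, it is hard for a CNN that outputs the character sequence end to end to reach the required accuracy. Scheme 3 therefore generated samples automatically, but the generated samples differ from the real ones, which made the predictions inaccurate. So the original 1,000 samples are instead split into a single-character set of about 4,000 samples; the input dimension shrinks considerably, and so does the number of output classes. On analysis, this scheme looks reasonably feasible.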
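The size of that reduction can be worked out from numbers already given in the post: the 30*106 input and 4x63 output of the whole-captcha CNN, the 64x18 crop split into four 16-wide characters, and the 50-class character set (`char_set_len`) used by the single-character model. A short sketch of the arithmetic:

```python
# Whole-captcha CNN: 30x106 grayscale input, 4 characters x 63 classes
full_input = 30 * 106
full_output = 4 * 63

# Single-character CNN: 18x16 crops (a 64x18 crop split into 4), 50 classes
char_input = 18 * (64 // 4)
char_output = 50

# Each labeled captcha contributes 4 character samples
n_char_samples = 1000 * 4

print("input size:  {} -> {}".format(full_input, char_input))
print("output size: {} -> {}".format(full_output, char_output))
print("samples:     1000 -> {}".format(n_char_samples))
```

So each training example becomes roughly eleven times smaller while the number of examples quadruples, which is the trade the paragraph above describes.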
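The sample set is prepared the same way as for KNN above: the 1,000 labeled images are processed into a single-character image dataset for each letter and digit: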
# Written against the TensorFlow 1.x API (tf.placeholder, tf.Session).
import os

import numpy as np
import tensorflow as tf
from PIL import Image
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer

from cut_prc import cut_box, seg_img
from noise_prc import NoiseDel

nd = NoiseDel()


class Config():
    file_name = 'char_02'  # directory that stores the model
    w_alpha = 0.01  # CNN weight coefficient
    b_alpha = 0.1  # CNN bias coefficient
    image_h = 18
    image_w = 16
    # [conv1 output features, conv2 output features, conv3 output features, FC output size]
    cnn_f = [16, 32, 32, 512]
    max_captcha = 1  # maximum captcha length
    char_set_len = 50  # size of the character set; must match the number of label classes
    lr = 0.001
    batch_size = 32  # size of each training batch
    max_steps = 200000  # total number of training batches
    log_every_n = 20  # log every n steps
    save_every_n = 100  # validate and save the model every n steps


class Model():
    def __init__(self, config):
        self.config = config
        self.input()
        self.cnn()
        # Initialize the session
        self.saver = tf.train.Saver()
        self.session = tf.Session()
        self.session.run(tf.global_variables_initializer())

    def input(self):
        cfg = self.config
        self.X = tf.placeholder(tf.float32, [None, cfg.image_h * cfg.image_w])
        self.Y = tf.placeholder(tf.float32, [None, cfg.max_captcha * cfg.char_set_len])
        self.keep_prob = tf.placeholder(tf.float32)  # dropout
        # Two global variables; despite its name, global_loss stores the best accuracy so far
        self.global_step = tf.Variable(0, trainable=False, name="global_step")
        self.global_loss = tf.Variable(0, dtype=tf.float32, trainable=False, name="global_loss")
        # Build the update op once instead of adding a new op to the graph on every save
        self.best_acc = tf.placeholder(tf.float32, [])
        self.update_best = tf.assign(self.global_loss, self.best_acc)

    def conv_block(self, x, in_ch, out_ch):
        # 3x3 convolution + ReLU, 2x2 max pooling, dropout
        w = tf.Variable(self.config.w_alpha * tf.random_normal([3, 3, in_ch, out_ch]))
        b = tf.Variable(self.config.b_alpha * tf.random_normal([out_ch]))
        conv = tf.nn.relu(tf.nn.bias_add(tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME'), b))
        conv = tf.nn.max_pool(conv, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
        return tf.nn.dropout(conv, self.keep_prob)

    def cnn(self):
        cfg = self.config
        x = tf.reshape(self.X, shape=[-1, cfg.image_h, cfg.image_w, 1])

        # 3 conv layers
        conv1 = self.conv_block(x, 1, cfg.cnn_f[0])
        conv2 = self.conv_block(conv1, cfg.cnn_f[0], cfg.cnn_f[1])
        conv3 = self.conv_block(conv2, cfg.cnn_f[1], cfg.cnn_f[2])

        # Fully connected layer; the static shape is known, so use plain Python ints
        h, w, f = conv3.get_shape().as_list()[1:]
        flat = h * w * f
        w_d = tf.Variable(cfg.w_alpha * tf.random_normal([flat, cfg.cnn_f[3]]))
        b_d = tf.Variable(cfg.b_alpha * tf.random_normal([cfg.cnn_f[3]]))
        dense = tf.reshape(conv3, [-1, flat])
        dense = tf.nn.relu(tf.add(tf.matmul(dense, w_d), b_d))
        dense = tf.nn.dropout(dense, self.keep_prob)

        n_out = cfg.max_captcha * cfg.char_set_len
        w_out = tf.Variable(cfg.w_alpha * tf.random_normal([cfg.cnn_f[3], n_out]))
        b_out = tf.Variable(cfg.b_alpha * tf.random_normal([n_out]))
        out = tf.add(tf.matmul(dense, w_out), b_out)

        # Loss: sigmoid cross-entropy against the one-hot labels
        self.loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=out, labels=self.Y))

        # Optimizer. To speed up training, the learning rate should start large
        # and then decay slowly; a fixed rate is used here.
        self.optimizer = tf.train.AdamOptimizer(learning_rate=cfg.lr).minimize(
            self.loss, global_step=self.global_step)

        predict = tf.reshape(out, [-1, cfg.max_captcha, cfg.char_set_len])
        self.max_idx_p = tf.argmax(predict, 2)
        max_idx_l = tf.argmax(tf.reshape(self.Y, [-1, cfg.max_captcha, cfg.char_set_len]), 2)
        correct_pred = tf.equal(self.max_idx_p, max_idx_l)
        self.accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

    def load(self, checkpoint):
        self.saver.restore(self.session, checkpoint)
        print('Restored from: {}'.format(checkpoint))

    def train(self, get_next_batch, model_path, x_train, y_train, x_test, y_test):
        # Use the session directly: `with self.session` would close it on exit
        # and break the later call to test().
        sess = self.session
        cfg = self.config
        while True:
            batch_x, batch_y = get_next_batch(x_train, y_train, cfg.batch_size)
            _, loss_ = sess.run([self.optimizer, self.loss],
                                feed_dict={self.X: batch_x, self.Y: batch_y, self.keep_prob: 0.75})
            step = sess.run(self.global_step)
            if step % cfg.log_every_n == 0:
                print('step: {}/{}... '.format(step, cfg.max_steps),
                      'loss: {:.4f}... '.format(loss_))

            # Compute the accuracy every save_every_n steps
            if step % cfg.save_every_n == 0:
                acc = sess.run(self.accuracy,
                               feed_dict={self.X: x_test, self.Y: y_test, self.keep_prob: 1.})
                best = sess.run(self.global_loss)
                print('step: {}/{}... '.format(step, cfg.max_steps),
                      '--------- acc: {:.4f} '.format(acc),
                      ' best: {:.4f} '.format(best))
                if acc > best:
                    print('save best model...')
                    sess.run(self.update_best, feed_dict={self.best_acc: acc})  # update the best value
                    self.saver.save(sess, os.path.join(model_path, 'model'), global_step=self.global_step)
            if step >= cfg.max_steps:
                break

    def test(self, batch_x_test):
        return self.session.run(self.max_idx_p,
                                feed_dict={self.X: batch_x_test, self.keep_prob: 1.})


def get_next_batch(train_x, train_y, batch_size=32):
    # Sample a random batch (with replacement)
    n = train_x.shape[0]
    chi_list = np.random.choice(n, batch_size)
    return train_x[chi_list], train_y[chi_list]


def img_cut_to_arry(img):
    # Denoise, crop and split a captcha into flattened grayscale character images
    imgs = []
    image = nd.noise_del(img)
    image = cut_box(image)
    image_list = seg_img(image)
    for im in image_list:
        im = nd.convert2gray(im)
        im = im.reshape(-1)
        imgs.append(im)
    return imgs


if __name__ == "__main__":
    # Load the training data: one sub-directory per character class
    path = 'data/png_cut'
    classes = os.listdir(path)
    x = []
    y = []
    for c in classes:
        c_f = os.path.join(path, c)
        if os.path.isdir(c_f):
            files = os.listdir(c_f)
            for file in files:
                img = Image.open(os.path.join(c_f, file))
                img = np.array(img)
                img = nd.convert2gray(img)
                img = img.reshape(-1)
                x.append(img)
                y.append(c.replace('_', ''))

    lb = LabelBinarizer()
    ly = lb.fit_transform(y)  # one-hot

    x = np.array(x)
    y = np.array(ly)

    # Split into training and test data
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.02)

    # Create the model directory
    model_path = os.path.join('c_models', Config.file_name)
    os.makedirs(model_path, exist_ok=True)

    # Load the most recently saved model, if any
    model = Model(Config)
    checkpoint_path = tf.train.latest_checkpoint(model_path)
    if checkpoint_path:
        model.load(checkpoint_path)

    # Train and validate
    print('start to training...')
    model.train(get_next_batch, model_path, x_train, y_train, x_test, y_test)

    # Predict on the new samples
    newpath = 'data/png_new'
    newfiles = os.listdir(newpath)
    for file in newfiles:
        pre_text = ''
        image = Image.open(os.path.join(newpath, file))
        image = np.array(image)
        imgs = img_cut_to_arry(image)
        for img in imgs:
            p = model.test([img, ])
            # Map the predicted class index back to its character
            pre_text += lb.classes_[p[0, 0]]
        print(pre_text)
step: 2500/200000... loss: 0.0803...
step: 2500/200000... --------- acc: 0.0854 best: 0.1341
step: 2520/200000... loss: 0.0818...
step: 2540/200000... loss: 0.0844...
step: 2560/200000... loss: 0.0827...
step: 2580/200000... loss: 0.0794...
step: 2600/200000... loss: 0.0823...
step: 2600/200000... --------- acc: 0.1951 best: 0.1341
save best model...
step: 2620/200000... loss: 0.0775...
step: 2640/200000... loss: 0.0754...
step: 2660/200000... loss: 0.0823...
step: 2680/200000... loss: 0.0678...
step: 2700/200000... loss: 0.0763...
step: 2700/200000... --------- acc: 0.3049 best: 0.1951
. . . .
step: 41400/200000... --------- acc: 0.8659 best: 0.9512
step: 41450/200000... loss: 0.0091...
step: 41500/200000... loss: 0.0134...
step: 41550/200000... loss: 0.0151...
step: 41600/200000... loss: 0.0256...
step: 41600/200000... --------- acc: 0.9390 best: 0.9512
Predictions on new images:
mHPM srdN wa5Y eWpn AgB9 eHr8 rJpH fd9e bTYt tTwg
Finally, the source code is available on GitHub.