[OCR Technology Series, Part 4] Deep-Learning-Based Character Recognition (3,755 Chinese Characters)

Date: 2019-11-24


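The previous post covered synthesizing a character dataset. We now have a printed-character image dataset for 3,755 Chinese characters (the level-1 character set), and we can use it to build a recognition system for those 3,755 characters. Character recognition with deep learning naturally uses a CNN, but which classic architecture: VGG, ResNet, or something else? Deeper networks should in principle yield better models, but given the training difficulty and the prediction speed needed for later online deployment, I decided to start with a fairly shallow network (an improved LeNet) for basic character recognition, and to try other architectures later as the project requires. The deep learning framework used for this task is TensorFlow.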

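Building the Network

The first step is, of course, to build the network and the computation graph.

Character recognition is really a multi-class classification task; recognizing 3,755 characters means classifying into 3,755 classes. The network we define is very simple, essentially an improved LeNet; note that we add batch normalization. For the loss function we use sparse_softmax_cross_entropy_with_logits, and for the optimizer we use Adam with the learning rate set to 0.1.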
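
To make the choice of the "sparse" loss concrete, here is a minimal pure-Python sketch (not the TensorFlow implementation) showing that cross-entropy computed from an integer class id equals the one computed from a one-hot vector. The 4-class logits and the helper names are made up for illustration; the real model has 3,755 classes.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_cross_entropy(logits, label):
    """Loss for one sample when the label is an integer class id."""
    return -math.log(softmax(logits)[label])

def dense_cross_entropy(logits, one_hot):
    """Same loss when the label is given as a one-hot vector."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

# Toy example with 4 classes (hypothetical values).
logits = [2.0, 0.5, -1.0, 0.1]
label = 0
one_hot = [1.0, 0.0, 0.0, 0.0]

loss_sparse = sparse_cross_entropy(logits, label)
loss_dense = dense_cross_entropy(logits, one_hot)
print(loss_sparse, loss_dense)
```

With integer labels there is no need to build a 3,755-wide one-hot vector per sample, which is why the integer form is convenient here.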

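# TensorFlow 1.x imports used by the code in this post
import tensorflow as tf
import tensorflow.contrib.slim as slim
from tensorflow.python.ops import control_flow_ops

# Network: conv2d->max_pool2d->conv2d->max_pool2d->conv2d->max_pool2d->conv2d->conv2d->max_pool2d->fully_connected->fully_connected

def build_graph(top_k):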
    keep_prob = tf.placeholder(dtype=tf.float32, shape=[], name='keep_prob')
    images = tf.placeholder(dtype=tf.float32, shape=[None, 64, 64, 1], name='image_batch')
    labels = tf.placeholder(dtype=tf.int64, shape=[None], name='label_batch')
    is_training = tf.placeholder(dtype=tf.bool, shape=[], name='train_flag')
    with tf.device('/gpu:5'):
        # Default arguments for slim.conv2d and slim.fully_connected: batch_norm
        with slim.arg_scope([slim.conv2d, slim.fully_connected],
                            normalizer_fn=slim.batch_norm,
                            normalizer_params={'is_training': is_training}):
            conv3_1 = slim.conv2d(images, 64, [3, 3], 1, padding='SAME', scope='conv3_1')
            max_pool_1 = slim.max_pool2d(conv3_1, [2, 2], [2, 2], padding='SAME', scope='pool1')
            conv3_2 = slim.conv2d(max_pool_1, 128, [3, 3], padding='SAME', scope='conv3_2')
            max_pool_2 = slim.max_pool2d(conv3_2, [2, 2], [2, 2], padding='SAME', scope='pool2')
            conv3_3 = slim.conv2d(max_pool_2, 256, [3, 3], padding='SAME', scope='conv3_3')
            max_pool_3 = slim.max_pool2d(conv3_3, [2, 2], [2, 2], padding='SAME', scope='pool3')
            conv3_4 = slim.conv2d(max_pool_3, 512, [3, 3], padding='SAME', scope='conv3_4')
            conv3_5 = slim.conv2d(conv3_4, 512, [3, 3], padding='SAME', scope='conv3_5')
            max_pool_4 = slim.max_pool2d(conv3_5, [2, 2], [2, 2], padding='SAME', scope='pool4')

            flatten = slim.flatten(max_pool_4)
            fc1 = slim.fully_connected(slim.dropout(flatten, keep_prob), 1024,
                                       activation_fn=tf.nn.relu, scope='fc1')
            logits = slim.fully_connected(slim.dropout(fc1, keep_prob), FLAGS.charset_size, activation_fn=None,
                                          scope='fc2')
        # Labels are not one-hot encoded, so use sparse_softmax_cross_entropy_with_logits
        loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=labels))
        accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(logits, 1), labels), tf.float32))

        update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
        if update_ops:
            updates = tf.group(*update_ops)
            loss = control_flow_ops.with_dependencies([updates], loss)

        global_step = tf.get_variable("step", [], initializer=tf.constant_initializer(0.0), trainable=False)
        optimizer = tf.train.AdamOptimizer(learning_rate=0.1)
        train_op = slim.learning.create_train_op(loss, optimizer, global_step=global_step)
        probabilities = tf.nn.softmax(logits)

        # Log the loss and accuracy curves
        tf.summary.scalar('loss', loss)
        tf.summary.scalar('accuracy', accuracy)
        merged_summary_op = tf.summary.merge_all()
        # Return the top-k predictions with their probabilities, plus the top-k accuracy
        predicted_val_top_k, predicted_index_top_k = tf.nn.top_k(probabilities, k=top_k)
        accuracy_in_top_k = tf.reduce_mean(tf.cast(tf.nn.in_top_k(probabilities, labels, top_k), tf.float32))

    return {'images': images,
            'labels': labels,
            'keep_prob': keep_prob,
            'top_k': top_k,
            'global_step': global_step,
            'train_op': train_op,
            'loss': loss,
            'is_training': is_training,
            'accuracy': accuracy,
            'accuracy_top_k': accuracy_in_top_k,
            'merged_summary_op': merged_summary_op,
            'predicted_distribution': probabilities,
            'predicted_index_top_k': predicted_index_top_k,
            'predicted_val_top_k': predicted_val_top_k}
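
As a sanity check on the architecture, the following sketch traces the feature-map shape through the layers of build_graph. It assumes what the listing shows: 64x64x1 inputs, stride-1 convolutions, 2x2 stride-2 pooling, and SAME padding everywhere, under which the output size is ceil(input / stride). The helper name and the layer table are illustrative only.

```python
import math

def same_out(size, stride):
    """Spatial size after a SAME-padded op with the given stride."""
    return math.ceil(size / stride)

# (layer name, output channels or None for pooling, stride), in build_graph order
layers = [
    ("conv3_1", 64, 1), ("pool1", None, 2),
    ("conv3_2", 128, 1), ("pool2", None, 2),
    ("conv3_3", 256, 1), ("pool3", None, 2),
    ("conv3_4", 512, 1), ("conv3_5", 512, 1), ("pool4", None, 2),
]

size, channels = 64, 1
shapes = []
for name, out_channels, stride in layers:
    size = same_out(size, stride)
    if out_channels is not None:
        channels = out_channels
    shapes.append((name, size, size, channels))

flatten_dim = size * size * channels
for name, h, w, c in shapes:
    print(f"{name}: {h}x{w}x{c}")
print("flatten:", flatten_dim)
```

The four pooling layers reduce 64x64 to 4x4, so fc1 receives a 4 * 4 * 512 = 8192-dimensional vector.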

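Model Training

Before training, we should design how to feed data to the network efficiently.

First we build a dataflow graph made up of pipeline stages connected by queues. The first stage generates file names, which we read and enqueue into a file-name queue. The second stage reads data from the files (using a Reader), produces samples, and places them in a sample queue. Depending on your setup, you can also replicate the second stage so that the copies are independent of one another and read from multiple files in parallel. The second stage ends with an enqueue operation that puts samples into a queue, from which the next stage dequeues them. Because threads are started to run these enqueue operations, the training loop can keep dequeuing samples from the sample queue.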

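A borrowed figure illustrates the data-loading flow:

Enqueue operations could all run in the main thread, but a Session can run multiple threads at once. In a data-input scenario, an enqueue operation reads input from disk into memory, which is slow. QueueRunner creates a set of new threads to perform the enqueue operations so that the main thread can go on consuming data. When training a neural network, this means that training and data reading are asynchronous: the main thread trains the network while another thread reads data from disk into memory.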
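
The asynchronous pattern described here can be sketched with the Python standard library alone. This is an analogy, not TensorFlow's QueueRunner: a background thread plays the role of the enqueue thread, and the main thread dequeues samples and groups them into batches. All names are illustrative.

```python
import queue
import threading

def producer(sample_queue, items):
    """Background thread: 'reads' items and enqueues them, then signals the end."""
    for item in items:
        sample_queue.put(item)      # blocks while the queue is full
    sample_queue.put(None)          # sentinel: no more data

def consume_in_batches(sample_queue, batch_size):
    """Main thread: dequeue samples and group them into full batches."""
    batches, batch = [], []
    while True:
        item = sample_queue.get()   # blocks until a sample is available
        if item is None:
            break
        batch.append(item)
        if len(batch) == batch_size:
            batches.append(batch)
            batch = []
    return batches                  # a trailing partial batch is dropped

sample_queue = queue.Queue(maxsize=8)
worker = threading.Thread(target=producer, args=(sample_queue, list(range(10))))
worker.start()
batches = consume_in_batches(sample_queue, batch_size=4)
worker.join()
print(batches)
```

The bounded queue is what decouples the two sides: the producer stalls only when it is far ahead, and the consumer stalls only when no sample is ready.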

# Batch generation
def input_pipeline(self, batch_size, num_epochs=None, aug=False):
    # Convert NumPy arrays to tensors
    images_tensor = tf.convert_to_tensor(self.image_names, dtype=tf.string)
    labels_tensor = tf.convert_to_tensor(self.labels, dtype=tf.int64)
    # Slice image_list and label_list into per-sample (image, label) pairs
    input_queue = tf.train.slice_input_producer([images_tensor, labels_tensor], num_epochs=num_epochs)

    labels = input_queue[1]
    images_content = tf.read_file(input_queue[0])
    images = tf.image.convert_image_dtype(tf.image.decode_png(images_content, channels=1), tf.float32)
    if aug:
        images = self.data_augmentation(images)
    new_size = tf.constant([FLAGS.image_size, FLAGS.image_size], dtype=tf.int32)
    images = tf.image.resize_images(images, new_size)
    image_batch, label_batch = tf.train.shuffle_batch([images, labels], batch_size=batch_size, capacity=50000,
                                                      min_after_dequeue=10000)
    # print 'image_batch', image_batch.get_shape()
    return image_batch, label_batch
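
The capacity and min_after_dequeue arguments control how well the batches are mixed. The sketch below is a simplified model of a shuffle buffer, not TensorFlow's implementation: it keeps more than min_after_dequeue samples buffered while input remains and draws each sample uniformly at random from the buffer. The function name and the tiny sizes are assumptions for illustration.

```python
import random

def shuffled_batches(samples, batch_size, min_after_dequeue, seed=0):
    """Yield full batches drawn at random from a bounded shuffle buffer."""
    rng = random.Random(seed)
    buffer, batch = [], []
    source = iter(samples)
    exhausted = False
    while True:
        # Refill: keep more than min_after_dequeue elements while input remains.
        while not exhausted and len(buffer) <= min_after_dequeue:
            try:
                buffer.append(next(source))
            except StopIteration:
                exhausted = True
        if not buffer:
            break
        # Dequeue one element chosen uniformly from the buffer.
        batch.append(buffer.pop(rng.randrange(len(buffer))))
        if len(batch) == batch_size:
            yield batch
            batch = []

batches = list(shuffled_batches(range(20), batch_size=5, min_after_dequeue=8))
print(batches)
```

A larger min_after_dequeue mixes samples from further apart in the input at the cost of more memory and a longer warm-up before the first batch.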

With data reading set up as described above, the training code follows that design:

def train():
    print('Begin training')
    # Set the paths the data is read from
    train_feeder = DataIterator(data_dir='./dataset/train/')
    test_feeder = DataIterator(data_dir='./dataset/test/')
    model_name = 'chinese-rec-model'
    with tf.Session(config=tf.ConfigProto(gpu_options=gpu_options, allow_soft_placement=True)) as sess:
        # Get batch data
        train_images, train_labels = train_feeder.input_pipeline(batch_size=FLAGS.batch_size, aug=True)
        test_images, test_labels = test_feeder.input_pipeline(batch_size=FLAGS.batch_size)
        graph = build_graph(top_k=1)  # top_k = 1 during training
        saver = tf.train.Saver()
        sess.run(tf.global_variables_initializer())
        # Set up the thread coordinator
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)

        train_writer = tf.summary.FileWriter(FLAGS.log_dir + '/train', sess.graph)
        test_writer = tf.summary.FileWriter(FLAGS.log_dir + '/val')
        start_step = 0
        # Optionally resume training from the checkpoint saved at some step
        if FLAGS.restore:
            ckpt = tf.train.latest_checkpoint(FLAGS.checkpoint_dir)
            if ckpt:
                saver.restore(sess, ckpt)
                print("restore from the checkpoint {0}".format(ckpt))
                start_step += int(ckpt.split('-')[-1])

        logger.info(':::Training Start:::')
        try:
            i = 0
            while not coord.should_stop():
                i += 1
                start_time = time.time()
                train_images_batch, train_labels_batch = sess.run([train_images, train_labels])
                feed_dict = {graph['images']: train_images_batch,
                             graph['labels']: train_labels_batch,
                             graph['keep_prob']: 0.8,
                             graph['is_training']: True}
                _, loss_val, train_summary, step = sess.run(
                    [graph['train_op'], graph['loss'], graph['merged_summary_op'], graph['global_step']],
                    feed_dict=feed_dict)
                train_writer.add_summary(train_summary, step)
                end_time = time.time()
                logger.info("the step {0} takes {1} loss {2}".format(step, end_time - start_time, loss_val))
                if step > FLAGS.max_steps:
                    break
                if step % FLAGS.eval_steps == 1:
                    test_images_batch, test_labels_batch = sess.run([test_images, test_labels])
                    feed_dict = {graph['images']: test_images_batch,
                                 graph['labels']: test_labels_batch,
                                 graph['keep_prob']: 1.0,
                                 graph['is_training']: False}
                    accuracy_test, test_summary = sess.run([graph['accuracy'], graph['merged_summary_op']],
                                                           feed_dict=feed_dict)
                    if step > 300:
                        test_writer.add_summary(test_summary, step)
                    logger.info('===============Eval a batch=======================')
                    logger.info('the step {0} test accuracy: {1}'
                                .format(step, accuracy_test))
                    logger.info('===============Eval a batch=======================')
                if step % FLAGS.save_steps == 1:
                    logger.info('Save the ckpt of {0}'.format(step))
                    saver.save(sess, os.path.join(FLAGS.checkpoint_dir, model_name),
                               global_step=graph['global_step'])
        except tf.errors.OutOfRangeError:
            logger.info('==================Train Finished================')
            saver.save(sess, os.path.join(FLAGS.checkpoint_dir, model_name), global_step=graph['global_step'])
        finally:
            # Stop and clean up the threads once the maximum number of iterations is reached
            coord.request_stop()
        coord.join(threads)

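Run the command below to train the model. I used a TITAN X, so training did not take long: about one hour. The loss and accuracy curves during training are shown in the figure below.

Then run the command, setting the maximum number of iterations to 16002, validating every 100 steps, and saving the model every 500 steps.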

python Chinese_OCR.py --mode=train --max_steps=16002 --eval_steps=100 --save_steps=500
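
The training loop tests `step % FLAGS.eval_steps == 1` and `step % FLAGS.save_steps == 1`, so evaluation and checkpointing happen one step after each multiple. The sketch below lists the steps that trigger them, assuming the fetched step simply counts 1, 2, 3, ... up to max_steps; the helper name is made up.

```python
def trigger_steps(max_steps, every):
    """Steps in 1..max_steps for which `step % every == 1`, as tested in train()."""
    return [step for step in range(1, max_steps + 1) if step % every == 1]

eval_at = trigger_steps(16002, 100)
save_at = trigger_steps(16002, 500)
print(len(eval_at), eval_at[:3], eval_at[-1])
print(len(save_at), save_at[:3], save_at[-1])
```

Step 16001 satisfies both conditions, which is a plausible reason for choosing 16002 rather than 16000 as the limit, although the post does not say so.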

Model Evaluation

We need to evaluate the model by computing its top-1 and top-5 accuracy.
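
For readers unfamiliar with the metric, here is a small pure-Python sketch of top-k accuracy: a sample counts as correct when its true label is among the k classes with the highest predicted probability. The probabilities and labels are invented toy values with 4 classes.

```python
def top_k_indices(probs, k):
    """Indices of the k largest probabilities, highest first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

def top_k_accuracy(batch_probs, labels, k):
    """Fraction of samples whose true label is among the top-k predictions."""
    hits = sum(1 for probs, label in zip(batch_probs, labels)
               if label in top_k_indices(probs, k))
    return hits / len(labels)

batch_probs = [
    [0.70, 0.20, 0.05, 0.05],   # true class 0: top-1 hit
    [0.10, 0.30, 0.60, 0.00],   # true class 1: only a top-2 hit
    [0.40, 0.35, 0.20, 0.05],   # true class 3: miss even at k=2
]
labels = [0, 1, 3]

top1 = top_k_accuracy(batch_probs, labels, k=1)
top2 = top_k_accuracy(batch_probs, labels, k=2)
print(top1, top2)
```

Top-k accuracy can never be lower than top-1 accuracy, which is why top-5 is a useful measure of how close the near misses are.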

Run the command:

python Chinese_OCR.py --mode=validation

Validation starts.

Finally, the top-1 and top-5 accuracy are reported as follows:

def validation():
    print('Begin validation')
    test_feeder = DataIterator(data_dir='./dataset/test/')

    final_predict_val = []
    final_predict_index = []
    groundtruth = []

    with tf.Session(config=tf.ConfigProto(gpu_options=gpu_options,allow_soft_placement=True)) as sess:
        test_images, test_labels = test_feeder.input_pipeline(batch_size=FLAGS.batch_size, num_epochs=1)
        graph = build_graph(top_k=5)
        saver = tf.train.Saver()

        sess.run(tf.global_variables_initializer())
        sess.run(tf.local_variables_initializer())  # initialize test_feeder's inside state

        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)

        ckpt = tf.train.latest_checkpoint(FLAGS.checkpoint_dir)
        if ckpt:
            saver.restore(sess, ckpt)
            print("restore from the checkpoint {0}".format(ckpt))

        logger.info(':::Start validation:::')
        try:
            i = 0
            acc_top_1, acc_top_k = 0.0, 0.0
            while not coord.should_stop():
                i += 1
                start_time = time.time()
                test_images_batch, test_labels_batch = sess.run([test_images, test_labels])
                feed_dict = {graph['images']: test_images_batch,
                             graph['labels']: test_labels_batch,
                             graph['keep_prob']: 1.0,
                             graph['is_training']: False}
                batch_labels, probs, indices, acc_1, acc_k = sess.run([graph['labels'],
                                                                       graph['predicted_val_top_k'],
                                                                       graph['predicted_index_top_k'],
                                                                       graph['accuracy'],
                                                                       graph['accuracy_top_k']], feed_dict=feed_dict)
                final_predict_val += probs.tolist()
                final_predict_index += indices.tolist()
                groundtruth += batch_labels.tolist()
                acc_top_1 += acc_1
                acc_top_k += acc_k
                end_time = time.time()
                logger.info("the batch {0} takes {1} seconds, accuracy = {2}(top_1) {3}(top_k)"
                            .format(i, end_time - start_time, acc_1, acc_k))

        except tf.errors.OutOfRangeError:
            logger.info('==================Validation Finished================')
            acc_top_1 = acc_top_1 * FLAGS.batch_size / test_feeder.size
            acc_top_k = acc_top_k * FLAGS.batch_size / test_feeder.size
            logger.info('top 1 accuracy {0} top k accuracy {1}'.format(acc_top_1, acc_top_k))
        finally:
            coord.request_stop()
        coord.join(threads)
    return {'prob': final_predict_val, 'indices': final_predict_index, 'groundtruth': groundtruth}
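
The final accuracy is obtained by summing the per-batch accuracies and scaling by batch_size / dataset size. The sketch below mirrors that arithmetic with hypothetical numbers. It assumes the input pipeline emits only full batches, which is the default for tf.train.shuffle_batch (allow_smaller_final_batch=False), so left-over images are never evaluated but still count in the denominator.

```python
def aggregate_accuracy(batch_accuracies, batch_size, dataset_size):
    """Mirror of `acc * FLAGS.batch_size / test_feeder.size` summed over batches."""
    return sum(batch_accuracies) * batch_size / dataset_size

# Hypothetical run: 3 full batches of 128 from a 400-image test set;
# the remaining 16 images never form a full batch.
batch_accuracies = [1.0, 1.0, 1.0]
reported = aggregate_accuracy(batch_accuracies, batch_size=128, dataset_size=400)
evaluated_only = sum(batch_accuracies) / len(batch_accuracies)
print(reported, evaluated_only)
```

When the test-set size is a multiple of the batch size the two numbers agree; otherwise the reported value slightly understates the accuracy on the images that were actually evaluated.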

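Character Prediction

The previous step only used our own generated dataset as the test set to check model performance. That check is not very reliable, because the characters we need to recognize in practice are not as stable and regular as synthesized ones. So let us try the model on characters from some real scenarios to really test its generalization ability.

First, write the prediction code: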

def inference(name_list):
    print('inference')
    image_set=[]
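    # Resize and normalize each image
    for image in name_list:
        temp_image = Image.open(image).convert('L')
        # Image.LANCZOS is the same filter as the Image.ANTIALIAS alias, which newer Pillow releases removed
        temp_image = temp_image.resize((FLAGS.image_size, FLAGS.image_size), Image.LANCZOS)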
        temp_image = np.asarray(temp_image) / 255.0
        temp_image = temp_image.reshape([-1, 64, 64, 1])
        image_set.append(temp_image)
        
    # allow_soft_placement: if the specified device does not exist, let TF assign a device automatically
    with tf.Session(config=tf.ConfigProto(gpu_options=gpu_options,allow_soft_placement=True)) as sess:
        logger.info('========start inference============')
        # images = tf.placeholder(dtype=tf.float32, shape=[None, 64, 64, 1])
        # Pass a shadow label 0. This label will not affect the computation graph.
        graph = build_graph(top_k=3)
        saver = tf.train.Saver()
        # Automatically pick up the most recently saved model
        ckpt = tf.train.latest_checkpoint(FLAGS.checkpoint_dir)
        if ckpt:       
            saver.restore(sess, ckpt)
        val_list=[]
        idx_list=[]
        # Predict each image
        for item in image_set:
            temp_image = item
            predict_val, predict_index = sess.run([graph['predicted_val_top_k'], graph['predicted_index_top_k']],
                                              feed_dict={graph['images']: temp_image,
                                                         graph['keep_prob']: 1.0,
                                                         graph['is_training']: False})
            val_list.append(predict_val)
            idx_list.append(predict_index)
    #return predict_val, predict_index
    return val_list,idx_list

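A note here: I put the character images to be recognized into a folder called tmp, with the images numbered in order; at recognition time we read all images in that directory into memory and recognize them one by one.

# Get the file names of the images in the folder to be predicted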
def get_file_list(path):
    list_name=[]
    files = os.listdir(path)
    files.sort()
    for file in files:
        file_path = os.path.join(path, file)
        list_name.append(file_path)
    return list_name
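
One thing to watch with sequentially numbered images: files.sort() orders names as strings, so "10.png" comes before "2.png" unless the numbers are zero-padded. The post does not show the actual file names, so the names below are hypothetical; the sketch shows a natural-sort key that could be passed to sort() if the numbering is not padded.

```python
import re

def natural_key(name):
    """Split a file name into text and integer chunks so that '2' sorts before '10'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]

names = ["10.png", "2.png", "1.png", "11.png"]   # hypothetical file names
lexicographic = sorted(names)
natural = sorted(names, key=natural_key)
print(lexicographic)
print(natural)
```

Reading order matters here because the recognized characters are later joined back into a passage in file order.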

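Now let us use the trained model to predict Chinese characters and see how it does. I first used a screenshot tool to capture a passage of text from a paper PDF, then used a character segmentation algorithm to cut the passage into single characters, as shown below. Segmentation failed for a small number of characters, so some were discarded.

A text passage captured from the paper with a screenshot tool.

The segmented single characters, white on a black background.

Run the command to start recognition.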

python Chinese_OCR.py --mode=inference

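Because I used a GPU, prediction is very fast: excluding system initialization time, predicting all the images took less than 1 second.

The logged information is, in order: the path of the image being recognized, the model's top-3 predicted characters (in descending order of confidence), the corresponding character ids, and the corresponding probabilities.

Finally, combining all the recognized characters in order into a passage shows that the recognition is entirely correct, which says our deep-learning-based OCR system works quite well!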

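Summary

At this point, an OCR system supporting 3,755 Chinese characters is complete, and in testing it works quite well. The model has not been heavily optimized, yet its top-1 accuracy in the model evaluation reached 99.9%, which is an excellent result. So recognition works well under fairly ideal conditions, but for complex scenes or heavily disturbed character images the results may be less satisfactory, and further optimization for the specific scenario is needed.

The complete code is available on my GitHub.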