[TensorFlow Series] Training Inception_resnet_v2 on Your Own Dataset and Monitoring It with TensorBoard

Date: 2020-04-24


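[Preface]

Training the convolutional neural network (CNN) models that TensorFlow (TF) already implements on your own dataset is a way to check how accurate mature models such as Inception_V3, VGG16, and Inception_resnet_v2 are on different data. This article evaluates Inception_resnet_v2 on photos taken in vegetable markets. The data covers three classes: celery, jimaocai (雞毛菜), and qingcai (青菜), with about 600 samples per class, shot in several markets and drawn from different sources.

Note: my original plan was to retrain someone else's pre-trained model on my own dataset so that it could handle a new classification task. Doing so requires modifying the code to change the network structure and using transfer learning (fine-tuning).

This article records the work along the way; I hope it is of some help : )

Test environment: CentOS 7.3 (64-bit), Python 3.5.4 (Anaconda)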

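Contents

I. Preparation

1. Install Python
2. Install TensorFlow
3. Download the TF-slim code library
4. Prepare the data
5. Download the model

II. Training

1. Read the data
2. Build the model
3. Start training
4. Run the script to train on your own data
5. Visualize the logs
[Problem] TensorBoard has been updated and the old package path cannot be found

III. Validation

IV. Testing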

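I. Preparation

1. Install Python

Anaconda is recommended. It can create virtual environments, the conda command makes environment and package management easy, and installing a package resolves all of its dependencies and installs them in one step. Link: https://www.anaconda.com/download/

2. Install TensorFlow

Enter the current Anaconda environment. My Anaconda installation is the Python 2.7 version, so I create a Python 3.5 virtual environment:

conda create -n py35 python=3.5 【py35 is the name of the virtual environment; type y to install】

source activate py35 【activate the py35 environment】

conda install tensorflow 【installs the CPU version of TensorFlow; if you have a GPU, you can install the GPU version (tensorflow-gpu) instead】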

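3. Download the TF-slim code library

cd $WORKSPACE   【change to your own working directory】
git clone https://github.com/tensorflow/models/

4. Prepare the data

Store all training samples in separate folders, one folder per class: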

zsy_train
|---jimaocai
|   |---0.jpg
|   |---...
|---qc
|---qingcai

The code below generates list.txt, which maps the images in each folder to a numeric label.

import os

class_names_to_ids = {'jimaocai': 0, 'qc': 1, 'qingcai': 2}
# Root folder holding one sub-folder per class (zsy_train in the layout above).
data_dir = 'zsy_train/'
output_path = 'list.txt'

with open(output_path, 'w') as fd:
    for class_name in class_names_to_ids.keys():
        images_list = os.listdir(os.path.join(data_dir, class_name))
        for image_name in images_list:
            # One line per image: "<class_name>/<image_name> <label>"
            fd.write('{}/{} {}\n'.format(class_name, image_name, class_names_to_ids[class_name]))
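
Before moving on, it can be worth checking that list.txt contains roughly the expected number of images per class. The sketch below is not part of the original workflow; the helper name count_per_label is hypothetical, and it assumes each line has the "<class_name>/<image_name> <label>" form written above.

```python
from collections import Counter


def count_per_label(lines):
    """Count how many list.txt entries carry each numeric label."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Each entry is "<class_name>/<image_name> <label>"; the label is the last field.
        label = int(line.rsplit(' ', 1)[1])
        counts[label] += 1
    return dict(counts)


# Small in-memory sample; for the real file, pass open('list.txt') instead.
sample = ['jimaocai/0.jpg 0', 'jimaocai/1.jpg 0', 'qc/0.jpg 1', 'qingcai/0.jpg 2']
print(count_per_label(sample))
```

A class whose count is far from the others (about 600 each in this article) usually points to a wrong folder name or stray files in the image directory.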

To make the labels easy to look up later, you can also define labels.txt:

jimaocai
qc
qingcai

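Randomly split the data into a training set and a validation set (350 samples picked at random from the total serve as the validation set):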

import random

_NUM_VALIDATION = 350
_RANDOM_SEED = 0
list_path = 'list.txt'
train_list_path = 'list_train.txt'
val_list_path = 'list_val.txt'

with open(list_path) as fd:
    lines = fd.readlines()

# A fixed seed keeps the split reproducible across runs.
random.seed(_RANDOM_SEED)
random.shuffle(lines)

# After shuffling, the first _NUM_VALIDATION lines form the validation set
# and the remainder forms the training set.
with open(train_list_path, 'w') as fd:
    fd.writelines(lines[_NUM_VALIDATION:])
with open(val_list_path, 'w') as fd:
    fd.writelines(lines[:_NUM_VALIDATION])
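
The size of the training split matters later, because it is the value passed as --num_samples when training. The following is a minimal sketch of the same shuffle-and-slice idea as a function, so the two sizes and the absence of overlap can be checked; split_lines is a hypothetical name, and the ten fake entries are made up for illustration.

```python
import random


def split_lines(lines, num_validation, seed=0):
    """Shuffle a copy of lines with a fixed seed and slice off the validation part."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    # First num_validation entries -> validation, the rest -> training.
    return shuffled[num_validation:], shuffled[:num_validation]


lines = ['img_{}.jpg {}\n'.format(i, i % 3) for i in range(10)]
train, val = split_lines(lines, num_validation=3)
print(len(train), len(val))
```

With the real list.txt, len(train) is the number to use for --num_samples during training and len(val) the one to use during validation.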

Generate the TFRecord data:

import sys
# Older checkouts used '../models/slim/'; the slim code now lives under models/research/slim.
sys.path.insert(0, './models/research/slim/')  # put the slim directory at the front of the module search path (index 0)
from datasets import dataset_utils
import math
import os
import tensorflow as tf

# Convert the images named in the list file into TFRecord shards.
# def convert_dataset(list_path, data_dir, output_dir, _NUM_SHARDS=5):
def convert_dataset(list_path, data_dir, output_dir, _NUM_SHARDS=3):      
    fd = open(list_path)
    lines = [line.split() for line in fd]
    fd.close()
    num_per_shard = int(math.ceil(len(lines) / float(_NUM_SHARDS)))
    with tf.Graph().as_default():
        decode_jpeg_data = tf.placeholder(dtype=tf.string)
        decode_jpeg = tf.image.decode_jpeg(decode_jpeg_data, channels=3)
        with tf.Session('') as sess:
            for shard_id in range(_NUM_SHARDS):
                output_path = os.path.join(output_dir,
#                     'data_{:05}-of-{:05}.tfrecord'.format(shard_id, _NUM_SHARDS))
                    'data_{:03}-of-{:03}.tfrecord'.format(shard_id, _NUM_SHARDS))
                tfrecord_writer = tf.python_io.TFRecordWriter(output_path)
                start_ndx = shard_id * num_per_shard
                end_ndx = min((shard_id + 1) * num_per_shard, len(lines))
                for i in range(start_ndx, end_ndx):
                    sys.stdout.write('\r>> Converting image {}/{} shard {}'.format(
                        i + 1, len(lines), shard_id))
                    sys.stdout.flush()
                    image_data = tf.gfile.FastGFile(os.path.join(data_dir, lines[i][0]), 'rb').read()
                    image = sess.run(decode_jpeg, feed_dict={decode_jpeg_data: image_data})
                    height, width = image.shape[0], image.shape[1]
                    example = dataset_utils.image_to_tfexample(
                        image_data, b'jpg', height, width, int(lines[i][1]))
                    tfrecord_writer.write(example.SerializeToString())
                tfrecord_writer.close()
    sys.stdout.write('\n')
    sys.stdout.flush()
    
os.system('mkdir -p train')
convert_dataset('list_train.txt', 'zsy_train', 'train/')
os.system('mkdir -p val')
convert_dataset('list_val.txt', 'zsy_train', 'val/')
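
The sharding arithmetic in convert_dataset is easy to get wrong by one, so here is a small stand-alone sketch of just that part, with no TensorFlow involved. The function name shard_ranges is hypothetical; the formula is the one used above (ceil of the item count divided by the shard count, with the last shard clipped to the list length).

```python
import math


def shard_ranges(num_items, num_shards):
    """Return the [start, end) index range that each shard covers."""
    num_per_shard = int(math.ceil(num_items / float(num_shards)))
    ranges = []
    for shard_id in range(num_shards):
        start = shard_id * num_per_shard
        # The last shard may be shorter than the others.
        end = min((shard_id + 1) * num_per_shard, num_items)
        ranges.append((start, end))
    return ranges


print(shard_ranges(1781, 3))  # the training set size used in this article
print(shard_ranges(350, 3))   # the validation set size
```

For 1781 items and 3 shards this gives (0, 594), (594, 1188), (1188, 1781), so every sample lands in exactly one shard.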

The resulting folder structure is as follows:

WORKSPACE
├── zsy_train
├── labels.txt
├── list_train.txt
├── list.txt
├── list_val.txt
├── train
│   ├── data_000-of-003.tfrecord
│   ├── ...
│   └── data_002-of-003.tfrecord
└── val
    ├── data_000-of-003.tfrecord
    ├── ...
    └── data_002-of-003.tfrecord

5. Download the model

The official repository provides pre-trained models; Inception-ResNet-v2 is used as the example here.

mkdir -p $WORKSPACE/checkpoints && cd $WORKSPACE/checkpoints
wget http://download.tensorflow.org/models/inception_resnet_v2_2016_08_30.tar.gz
tar zxf inception_resnet_v2_2016_08_30.tar.gz

II. Training

1. Read the data

To read your own data, write the following code to models/research/slim/datasets/dataset_classification.py:

import os
import tensorflow as tf
slim = tf.contrib.slim

def get_dataset(dataset_dir, num_samples, num_classes, labels_to_names_path=None, file_pattern='*.tfrecord'):
    file_pattern = os.path.join(dataset_dir, file_pattern)
    keys_to_features = {
        'image/encoded': tf.FixedLenFeature((), tf.string, default_value=''),
        'image/format': tf.FixedLenFeature((), tf.string, default_value='jpg'),
        'image/class/label': tf.FixedLenFeature(
            [], tf.int64, default_value=tf.zeros([], dtype=tf.int64)),
    }
    items_to_handlers = {
        'image': slim.tfexample_decoder.Image(),
        'label': slim.tfexample_decoder.Tensor('image/class/label'),
    }
    decoder = slim.tfexample_decoder.TFExampleDecoder(keys_to_features, items_to_handlers)
    items_to_descriptions = {
        'image': 'A color image of varying size.',
        'label': 'A single integer between 0 and ' + str(num_classes - 1),
    }
    labels_to_names = None
    if labels_to_names_path is not None:
        fd = open(labels_to_names_path)
        labels_to_names = {i : line.strip() for i, line in enumerate(fd)}
        fd.close()
    return slim.dataset.Dataset(
            data_sources=file_pattern,
            reader=tf.TFRecordReader,
            decoder=decoder,
            num_samples=num_samples,
            items_to_descriptions=items_to_descriptions,
            num_classes=num_classes,
            labels_to_names=labels_to_names)
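
The labels_to_names part of get_dataset relies on the line order of labels.txt: line 0 must name the class with label 0, and so on. As a quick illustration, here is the same dictionary comprehension in isolation; read_labels is a hypothetical helper name, and the three class names are the ones from labels.txt above.

```python
def read_labels(lines):
    """Map each 0-based line index to the stripped class name on that line."""
    return {i: line.strip() for i, line in enumerate(lines)}


# For the real file, pass open('labels.txt') instead of this in-memory list.
labels_to_names = read_labels(['jimaocai\n', 'qc\n', 'qingcai\n'])
print(labels_to_names)
```

If the order in labels.txt differs from class_names_to_ids used when generating list.txt, the names shown for predictions will be swapped even though the numeric labels are correct.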

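2. Build the model

Which model to build depends on what you want to achieve. The official repository provides download links for each model (download link: https://github.com/tensorflow/models/tree/master/research/slim); just download the one you need, extract it, and place it in checkpoints.

3. Start training

Because an existing model is being trained on our own dataset, the original project code (train_image_classifier.py) needs a few adjustments.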

Change

from datasets import dataset_factory

to:

from datasets import dataset_classification

Change

dataset = dataset_factory.get_dataset(
    FLAGS.dataset_name, FLAGS.dataset_split_name, FLAGS.dataset_dir)

to:

dataset = dataset_classification.get_dataset(
    FLAGS.dataset_dir, FLAGS.num_samples, FLAGS.num_classes, FLAGS.labels_to_names_path)

After

tf.app.flags.DEFINE_string(
    'dataset_dir', None, 'The directory where the dataset files are stored.')

add:

tf.app.flags.DEFINE_integer(
    'num_samples', 1781, 'Number of samples.')
tf.app.flags.DEFINE_integer(
    'num_classes', 3, 'Number of classes.')
tf.app.flags.DEFINE_string(
    'labels_to_names_path', None, 'Label names file path.')

4. Run the script to train on your own data

cd $WORKSPACE/models/research/slim    # change to the slim working directory

# --train_dir: training log directory; while the model trains, point tensorboard at it to monitor progress live
# --dataset_dir: training dataset, already converted to TFRecord format
# --num_samples: number of training samples, i.e. the size of the training set, excluding the 350 samples drawn at random for validation
# --num_classes: number of classes
# --checkpoint_path: location of the pre-trained model
# --clone_on_cpu=True: required when training on CPU
python train_image_classifier.py \
    --train_dir=/root/workspace_mrt/model_lab/train_logs \
    --dataset_dir=../../../train \
    --num_samples=1781 \
    --num_classes=3 \
    --labels_to_names_path=../../../labels.txt \
    --model_name=inception_resnet_v2 \
    --checkpoint_path=../../../checkpoints/inception_resnet_v2_2016_08_30.ckpt \
    --checkpoint_exclude_scopes=InceptionResnetV2/Logits,InceptionResnetV2/AuxLogits \
    --trainable_scopes=InceptionResnetV2/Logits,InceptionResnetV2/AuxLogits \
    --clone_on_cpu=True

# For fine-tuning, add --checkpoint_path, --checkpoint_exclude_scopes and --trainable_scopes
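
To make the role of --checkpoint_exclude_scopes more concrete: as far as I understand the training script, variables whose names start with one of the listed scopes are skipped when restoring from the checkpoint, so the new 3-class output layers start from fresh weights. The sketch below only illustrates that prefix filtering on plain strings; it is not the script's actual code, the function name is hypothetical, and the three variable names are illustrative examples.

```python
def variables_to_restore(variable_names, exclude_scopes):
    """Keep only the names that do not start with any excluded scope prefix."""
    exclusions = [scope.strip() for scope in exclude_scopes.split(',')]
    return [name for name in variable_names
            if not any(name.startswith(exclusion) for exclusion in exclusions)]


# Illustrative variable names, not an exhaustive list from the real model.
names = [
    'InceptionResnetV2/Conv2d_1a_3x3/weights',
    'InceptionResnetV2/Logits/Logits/weights',
    'InceptionResnetV2/AuxLogits/Logits/weights',
]
print(variables_to_restore(
    names, 'InceptionResnetV2/Logits,InceptionResnetV2/AuxLogits'))
```

Only the convolutional layer's name survives the filter here, which matches the intent of the command above: reuse the pre-trained feature layers and train the two logits scopes on the new classes.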

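5. Visualize the logs

To visualize the loss and other metrics during training, use TensorBoard with the following command:

tensorboard --logdir=${TRAIN_DIR}
In this tutorial, that corresponds to the following command: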
tensorboard --logdir=/root/workspace_mrt/model_lab/train_logs

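[Problem] TensorBoard has been updated and the old package path cannot be found

Running

tensorboard --logdir=/root/workspace_mrt/model_lab/train_logs

produces the following error:

ImportError: No module named 'tensorflow.tensorboard.tensorboard'

The cause is that the location and ownership of the package changed when TensorFlow was updated. The following steps fix the problem.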

cd /root/anaconda2/envs/py35/bin    # go to the bin directory of the Python environment in use and edit the tensorboard launcher script so that it matches the current version
vim tensorboard

Change

import tensorflow.tensorboard.tensorboard

to:

import tensorboard.main

Change

sys.exit(tensorflow.tensorboard.tensorboard.main())

to:

sys.exit(tensorboard.main.main())

Save and quit with :wq, then run

tensorboard --logdir=/root/workspace_mrt/model_lab/train_logs

again; it now runs without error. As the log output indicates, open ip:6006 in a browser to reach the TensorBoard interface.
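
Before editing the launcher script, it may help to confirm which of the two module paths is actually importable in the active environment. This is a small sketch using only the standard library; module_available is a hypothetical helper name.

```python
import importlib.util


def module_available(name):
    """Return True if the dotted module name can be located, without running it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # A missing parent package raises ImportError instead of returning None.
        return False


for name in ('tensorboard.main', 'tensorflow.tensorboard.tensorboard'):
    print(name, module_available(name))
```

If the first name is reported as available and the second is not, the edit described above applies to your installation.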

III. Validation

To use your own dataset, modify models/research/slim/eval_image_classifier.py.

Change

from datasets import dataset_factory

to:

from datasets import dataset_classification

Change

dataset = dataset_factory.get_dataset( FLAGS.dataset_name, FLAGS.dataset_split_name, FLAGS.dataset_dir)

to:

dataset = dataset_classification.get_dataset(
    FLAGS.dataset_dir, FLAGS.num_samples, FLAGS.num_classes, FLAGS.labels_to_names_path)

After

tf.app.flags.DEFINE_string(
    'dataset_dir', None, 'The directory where the dataset files are stored.')

add:

tf.app.flags.DEFINE_integer(
    'num_samples', 350, 'Number of samples.')
tf.app.flags.DEFINE_integer(
    'num_classes', 3, 'Number of classes.')
tf.app.flags.DEFINE_string(
    'labels_to_names_path', None, 'Label names file path.')

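Run the following command to validate. Here --checkpoint_path points to the training directory so that the fine-tuned 3-class checkpoint is evaluated; the downloaded ImageNet checkpoint has a different number of output classes and cannot be restored into a 3-class model.

python eval_image_classifier.py \
    --checkpoint_path=/root/workspace_mrt/model_lab/train_logs \
    --eval_dir=/root/workspace_mrt/model_lab/eval_logs \
    --dataset_dir=../../../val \
    --num_samples=350 \
    --num_classes=3 \
    --model_name=inception_resnet_v2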

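You can validate while training is still running; take care to use a different GPU or to allocate GPU memory sensibly.

The validation logs can be visualized in the same way. If TensorBoard is already showing the training logs, use a different port, for example: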

tensorboard --logdir ../../../eval_logs/ --port 6007

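IV. Testing

Using models/research/slim/eval_image_classifier.py as a reference, you can write a script that reads images in batches and runs inference with the model, models/research/slim/test_image_classifier.py: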

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import json
import math
import time
import numpy as np
import tensorflow as tf
from nets import nets_factory
from preprocessing import preprocessing_factory
slim = tf.contrib.slim

tf.app.flags.DEFINE_string(
    'master', '', 'The address of the TensorFlow master to use.')
tf.app.flags.DEFINE_string(
    'checkpoint_path', None,
    'The directory where the model was written to or an absolute path to a '
    'checkpoint file.')
tf.app.flags.DEFINE_string(
    'test_list', '', 'Test image list.')
tf.app.flags.DEFINE_string(
    'test_dir', '.', 'Test image directory.')
tf.app.flags.DEFINE_integer(
    'batch_size', 16, 'Batch size.')
tf.app.flags.DEFINE_integer(
    'num_classes', 3, 'Number of classes.')
tf.app.flags.DEFINE_integer(
    'labels_offset', 0,
    'An offset for the labels in the dataset. This flag is primarily used to '
    'evaluate the VGG and ResNet architectures which do not use a background '
    'class for the ImageNet dataset.')
tf.app.flags.DEFINE_string(
    'model_name', 'inception_resnet_v2', 'The name of the architecture to evaluate.')
tf.app.flags.DEFINE_string(
    'preprocessing_name', None, 'The name of the preprocessing to use. If left '
    'as `None`, then the model_name flag is used.')
tf.app.flags.DEFINE_integer(
    'test_image_size', None, 'Eval image size')
FLAGS = tf.app.flags.FLAGS
def main(_):
    if not FLAGS.test_list:
        raise ValueError('You must supply the test list with --test_list')
    tf.logging.set_verbosity(tf.logging.INFO)
    with tf.Graph().as_default():
        tf_global_step = slim.get_or_create_global_step()
        ####################
        # Select the model #
        ####################
        network_fn = nets_factory.get_network_fn(
            FLAGS.model_name,
            num_classes=(FLAGS.num_classes - FLAGS.labels_offset),
            is_training=False)
        #####################################
        # Select the preprocessing function #
        #####################################
        preprocessing_name = FLAGS.preprocessing_name or FLAGS.model_name
        image_preprocessing_fn = preprocessing_factory.get_preprocessing(
            preprocessing_name,
            is_training=False)
        test_image_size = FLAGS.test_image_size or network_fn.default_image_size
        if tf.gfile.IsDirectory(FLAGS.checkpoint_path):
            checkpoint_path = tf.train.latest_checkpoint(FLAGS.checkpoint_path)
        else:
            checkpoint_path = FLAGS.checkpoint_path
        batch_size = FLAGS.batch_size
        tensor_input = tf.placeholder(tf.float32, [None, test_image_size, test_image_size, 3])
        logits, _ = network_fn(tensor_input)
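        # top_k needs k <= number of classes, so cap it (there are only 3 classes here).
        logits = tf.nn.top_k(logits, min(5, FLAGS.num_classes - FLAGS.labels_offset))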
        config = tf.ConfigProto()
        config.gpu_options.allow_growth = True
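        # Keep only the image path; lines from list_val.txt also carry a trailing label.
        test_ids = [line.split()[0] for line in open(FLAGS.test_list) if line.strip()]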
        tot = len(test_ids)
        results = list()
        with tf.Session(config=config) as sess:
            sess.run(tf.global_variables_initializer())
            saver = tf.train.Saver()
            saver.restore(sess, checkpoint_path)
            time_start = time.time()
            for idx in range(0, tot, batch_size):
                images = list()
                idx_end = min(tot, idx + batch_size)
                print(idx)
                for i in range(idx, idx_end):
                    image_id = test_ids[i]
                    test_path = os.path.join(FLAGS.test_dir, image_id)
                    image = open(test_path, 'rb').read()
                    image = tf.image.decode_jpeg(image, channels=3)
                    processed_image = image_preprocessing_fn(image, test_image_size, test_image_size)
                    processed_image = sess.run(processed_image)
                    images.append(processed_image)
                images = np.array(images)
                predictions = sess.run(logits, feed_dict = {tensor_input : images}).indices
                for i in range(idx, idx_end):
                    # Print each image's own id (not the last one read) with its ranked class ids.
                    print('{} {}'.format(test_ids[i], predictions[i - idx].tolist()))
            time_total = time.time() - time_start
            print('total time: {}, total images: {}, average time: {}'.format(
                time_total, len(test_ids), time_total / len(test_ids)))
if __name__ == '__main__':
    tf.app.run()

Run the following command to test:

CUDA_VISIBLE_DEVICES="0" python test_image_classifier.py \
    --checkpoint_path=../../../train_logs/ \
    --test_list=../../../list_val.txt \
    --test_dir=../../../zsy_train \
    --batch_size=16 \
    --num_classes=3 \
    --model_name=inception_resnet_v2
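
The test script only prints ranked class ids per image, so a top-1 accuracy has to be computed separately. The sketch below is one possible way to do that from the captured output and list_val.txt. It assumes prediction lines have the form "<image path> [id, id, ...]" as printed by the script, skips every other line (progress indices, timing), and uses a hypothetical function name.

```python
import json


def top1_accuracy(prediction_lines, list_lines):
    """Compare the first ranked class id of each prediction with the true label."""
    truth = {}
    for line in list_lines:
        if line.strip():
            path, label = line.split()
            truth[path] = int(label)

    correct = 0
    total = 0
    for line in prediction_lines:
        if ' [' not in line:
            continue  # not a prediction line (progress index, timing summary, ...)
        head, _, tail = line.partition(' [')
        image_id = head.split()[0]
        ranked = json.loads('[' + tail)
        total += 1
        if ranked and ranked[0] == truth[image_id]:
            correct += 1
    return correct / float(total) if total else 0.0


predictions = [
    '0',
    'jimaocai/0.jpg [0, 2, 1]',
    'qc/0.jpg [2, 1, 0]',
    'total time: 1.0, total images: 2, average time: 0.5',
]
labels = ['jimaocai/0.jpg 0', 'qc/0.jpg 1']
print(top1_accuracy(predictions, labels))
```

In practice the prediction lines would come from redirecting the script's standard output to a file and reading it back, with list_val.txt supplying the true labels.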