CIFAR-10 is a small dataset for recognizing common objects, compiled by Hinton's students Alex Krizhevsky and Ilya Sutskever. It contains RGB color images in 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. Each image is 32 × 32, and the dataset has 50000 training images and 10000 test images. The training procedure described here follows the official example: https://www.tensorflow.org/tutorials/images/deep_cnn
The download script is as follows:
# coding:utf-8
import os
import sys
import tarfile

import tensorflow as tf
from six.moves import urllib

# tf.app.flags.FLAGS is TensorFlow's global flag store; it also handles command-line arguments
FLAGS = tf.app.flags.FLAGS
# Define tf.app.flags.FLAGS.data_dir as the CIFAR-10 data path
tf.app.flags.DEFINE_string('data_dir', '/tmp/cifar10_data',
                           """Path to the CIFAR-10 data directory.""")
# Change the path to cifar10_data
FLAGS.data_dir = 'cifar10_data/'

DATA_URL = 'http://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz'


# If the data files do not exist yet, download them
def maybe_download_and_extract():
    """Download and extract the tarball from Alex's website."""
    dest_directory = FLAGS.data_dir
    if not os.path.exists(dest_directory):
        os.makedirs(dest_directory)
    filename = DATA_URL.split('/')[-1]
    filepath = os.path.join(dest_directory, filename)
    if not os.path.exists(filepath):
        def _progress(count, block_size, total_size):
            sys.stdout.write('\r>> Downloading %s %.1f%%' %
                             (filename, float(count * block_size) / float(total_size) * 100.0))
            sys.stdout.flush()
        filepath, _ = urllib.request.urlretrieve(DATA_URL, filepath, _progress)
        print()
        statinfo = os.stat(filepath)
        print('Successfully downloaded', filename, statinfo.st_size, 'bytes.')
    extracted_dir_path = os.path.join(dest_directory, 'cifar-10-batches-bin')
    if not os.path.exists(extracted_dir_path):
        tarfile.open(filepath, 'r:gz').extractall(dest_directory)


if __name__ == '__main__':
    maybe_download_and_extract()
A .txt file in the archive stores the English name of each class, and each .bin file holds 10,000 images.
How TensorFlow programs read data is described in the official Chinese documentation: http://tensorfly.cn/tfdoc/how_tos/reading_data.html
The usual approach is to read the data into memory and then hand it to the GPU or CPU for computation. Suppose reading takes 0.1 s and computing takes 0.9 s; then in every second the GPU sits idle for 0.1 s, which significantly lowers throughput.
The solution is to put reading and computation in two separate threads, with the reading thread pushing data into an in-memory queue.
The reading thread continuously loads images from the file system into the memory queue, while another thread does the computation; whenever the compute thread needs data, it simply dequeues it from the memory queue. This eliminates the GPU idle time caused by I/O.
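This is just a producer-consumer pattern. Below is a minimal sketch in plain Python (not part of the TensorFlow pipeline; the file names and timings are made up to mirror the 0.1 s / 0.9 s example above): one thread keeps filling a bounded queue while the main thread consumes from it.

import threading
import time
from six.moves import queue  # Queue on Python 2, queue on Python 3

q = queue.Queue(maxsize=8)                   # the "memory queue"
filenames = ['A.jpg', 'B.jpg', 'C.jpg'] * 5  # pretend file list

def reader():
    for name in filenames:
        time.sleep(0.1)                      # pretend the disk read takes 0.1 s
        q.put('pixels of ' + name)           # push the "image" into the queue
    q.put(None)                              # sentinel: nothing left to read

threading.Thread(target=reader).start()

while True:
    item = q.get()                           # the compute thread just dequeues
    if item is None:
        break
    time.sleep(0.9)                          # pretend the computation takes 0.9 s

Because reading (0.1 s) is faster than computing (0.9 s), the queue stays ahead of the consumer and the compute thread never has to wait for I/O.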
Machine learning has the notion of an epoch: one epoch means every image in the training set has been used once. To handle epochs, a "filename queue" is added in front of the memory queue.
TensorFlow reads files through this two-queue scheme, filename queue + memory queue, which makes epoch management straightforward.
Take three images A, B and C with epoch = 1 as an example: the filename queue is filled with A, B, C once, and the memory queue keeps dequeuing filenames from it and loading the corresponding images; once the filename queue is exhausted, the epoch is over.
The test code is as follows:
# coding:utf-8
import os
if not os.path.exists('read'):
    os.makedirs('read/')

# Import TensorFlow
import tensorflow as tf

# Create a new Session
with tf.Session() as sess:
    # We want to read three images: A.jpg, B.jpg, C.jpg
    filename = ['A.jpg', 'B.jpg', 'C.jpg']
    # string_input_producer creates a filename queue
    filename_queue = tf.train.string_input_producer(filename, shuffle=False, num_epochs=5)
    # The reader pulls data from the filename queue via reader.read
    reader = tf.WholeFileReader()
    key, value = reader.read(filename_queue)
    # tf.train.string_input_producer defines an epoch (local) variable, which must be initialized
    tf.local_variables_initializer().run()
    # The queue only starts filling after start_queue_runners is called
    threads = tf.train.start_queue_runners(sess=sess)
    i = 0
    while True:
        i += 1
        # Fetch the image data and save it
        image_data = sess.run(value)
        with open('read/test_%d.jpg' % i, 'wb') as f:
            f.write(image_data)
    # The program eventually raises an OutOfRangeError: the epochs are done and the queue is closed
The output is:
[root@node5 chapter_02]# python test.py
2018-10-30 16:28:27.836579: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
Traceback (most recent call last):
  File "test.py", line 26, in <module>
    image_data = sess.run(value)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.OutOfRangeError: FIFOQueue '_0_input_producer' is closed and has insufficient elements (requested 1, current size 0)
	 [[Node: ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/device:CPU:0"](WholeFileReaderV2, input_producer)]]

Caused by op u'ReaderReadV2', defined at:
  File "test.py", line 17, in <module>
    key, value = reader.read(filename_queue)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/io_ops.py", line 195, in read
    return gen_io_ops._reader_read_v2(self._reader_ref, queue_ref, name=name)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 673, in _reader_read_v2
    queue_handle=queue_handle, name=name)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

OutOfRangeError (see above for traceback): FIFOQueue '_0_input_producer' is closed and has insufficient elements (requested 1, current size 0)
	 [[Node: ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/device:CPU:0"](WholeFileReaderV2, input_producer)]]
Each sample consists of 3073 bytes: the first byte is the label and the remaining 3072 bytes are the image data. There are no separator bytes between samples, so each of these binary files is exactly 30730000 bytes (3073 × 10000).
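To make the layout concrete, here is a small sketch (my own illustration, not part of the original tutorial) that parses the first record of a batch file with numpy; the file path assumes the download location used above.

import numpy as np

record_bytes = 1 + 32 * 32 * 3  # 1 label byte + 3072 image bytes = 3073

# Path assumed from the download step above
with open('cifar10_data/cifar-10-batches-bin/data_batch_1.bin', 'rb') as f:
    raw = np.frombuffer(f.read(record_bytes), dtype=np.uint8)

label = int(raw[0])                                    # class id in 0..9
image = raw[1:].reshape(3, 32, 32).transpose(1, 2, 0)  # CHW -> HWC, shape (32, 32, 3)
print(label, image.shape)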
How do we read the CIFAR-10 data with TensorFlow?
# coding: utf-8
import os

import tensorflow as tf
import scipy.misc
from six.moves import xrange  # so the xrange calls below also work on Python 3


# Read files from the queue
def read_cifar10(filename_queue):
    """Reads and parses examples from CIFAR10 data files.

    Recommendation: if you want N-way read parallelism, call this function
    N times. This will give you N independent Readers reading different
    files & positions within those files, which will give better mixing of
    examples.

    Args:
        filename_queue: A queue of strings with the filenames to read from.

    Returns:
        An object representing a single example, with the following fields:
            height: number of rows in the result (32)
            width: number of columns in the result (32)
            depth: number of color channels in the result (3)
            key: a scalar string Tensor describing the filename & record number for this example.
            label: an int32 Tensor with the label in the range 0..9.
            uint8image: a [height, width, depth] uint8 Tensor with the image data
    """
    class CIFAR10Record(object):
        pass
    result = CIFAR10Record()

    label_bytes = 1  # 2 for CIFAR-100
    result.height = 32
    result.width = 32
    result.depth = 3
    image_bytes = result.height * result.width * result.depth
    # Every record consists of a label followed by the image, with a fixed number of bytes for each.
    record_bytes = label_bytes + image_bytes

    # Read a record, getting filenames from the filename_queue.
    # No header or footer in the CIFAR-10 format, so we leave header_bytes
    # and footer_bytes at their default of 0.
    reader = tf.FixedLengthRecordReader(record_bytes=record_bytes)
    result.key, value = reader.read(filename_queue)

    # Convert from a string to a vector of uint8 that is record_bytes long.
    record_bytes = tf.decode_raw(value, tf.uint8)

    # The first bytes represent the label, which we convert from uint8->int32.
    result.label = tf.cast(tf.strided_slice(record_bytes, [0], [label_bytes]), tf.int32)

    # The remaining bytes after the label represent the image, which we reshape
    # from [depth * height * width] to [depth, height, width].
    depth_major = tf.reshape(
        tf.strided_slice(record_bytes, [label_bytes], [label_bytes + image_bytes]),
        [result.depth, result.height, result.width])
    # Convert from [depth, height, width] to [height, width, depth].
    result.uint8image = tf.transpose(depth_major, [1, 2, 0])

    return result


def inputs_origin(data_dir):
    # There are five filenames, data_batch_1.bin through data_batch_5.bin;
    # all of them are training images.
    filenames = [os.path.join(data_dir, 'data_batch_%d.bin' % i) for i in xrange(1, 6)]
    # Make sure every file exists
    for f in filenames:
        if not tf.gfile.Exists(f):
            raise ValueError('Failed to find file: ' + f)
    # Wrap the list of filenames into a TensorFlow queue
    filename_queue = tf.train.string_input_producer(filenames)
    # The uint8image attribute of the returned read_input is the image Tensor
    read_input = read_cifar10(filename_queue)
    # Convert the image to floats
    reshaped_image = tf.cast(read_input.uint8image, tf.float32)
    # reshaped_image is the tensor of a single image: every sess.run(reshaped_image)
    # pulls out one image.
    return reshaped_image


if __name__ == '__main__':
    # Create a session. For why `with tf.Session() as sess` is not used here, see
    # https://blog.csdn.net/chengqiuming/article/details/80293220
    sess = tf.Session()
    # Call inputs_origin; cifar10_data/cifar-10-batches-bin is where the data was downloaded
    reshaped_image = inputs_origin('cifar10_data/cifar-10-batches-bin')
    # start_queue_runners is essential: the queue created by
    # tf.train.string_input_producer only starts filling after this call,
    # and without it the program cannot proceed.
    threads = tf.train.start_queue_runners(sess=sess)
    # Initialize variables
    sess.run(tf.global_variables_initializer())
    # Create the folder cifar10_data/raw/
    if not os.path.exists('cifar10_data/raw/'):
        os.makedirs('cifar10_data/raw/')
    # Save 30 images
    for i in range(30):
        # Every sess.run(reshaped_image) fetches one image
        image_array = sess.run(reshaped_image)
        # Save the image
        scipy.misc.toimage(image_array).save('cifar10_data/raw/%d.jpg' % i)
For image training data, data augmentation (Data Augmentation) means artificially enlarging the training set with transformations such as translation, scaling and color changes, so the model sees more data and trains better.
Commonly used image augmentation methods are shown below.
The precondition for using any augmentation method is that it must not change the original label of the image.
# Randomly crop the image from 32x32 to 24x24
distorted_image = tf.random_crop(reshaped_image, [height, width, 3])
# Randomly flip the image: each image has a 50% chance of being flipped horizontally
distorted_image = tf.image.random_flip_left_right(distorted_image)
# Randomly change brightness and contrast
distorted_image = tf.image.random_brightness(distorted_image, max_delta=63)
distorted_image = tf.image.random_contrast(distorted_image, lower=0.2, upper=1.8)
The original training image is reshaped_image, and the result is an augmented training sample distorted_image; during training, distorted_image is used directly.
The code is organized as follows:
This file (cifar10_input.py) contains three functions related to the training pipeline: read_cifar10, _generate_image_and_label_batch and distorted_inputs. Let's look at each in turn.
# encoding=utf-8
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os

from six.moves import xrange  # pylint: disable=redefined-builtin
import tensorflow as tf

# Note this is not the original 32x32 size, because the images are cropped later.
# Changing this value alters the whole model architecture, and the model would
# then have to be retrained from scratch.
IMAGE_SIZE = 24

# Global constants
NUM_CLASSES = 10
NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN = 50000
NUM_EXAMPLES_PER_EPOCH_FOR_EVAL = 10000
read_cifar10 takes an image from the filename queue; each run fetches one image.
def read_cifar10(filename_queue):
    """Read image data byte by byte from the filename queue.

    Returns an object with fields: height, width, depth, key (the filename),
    label (an int32 Tensor) and uint8image (a [height, width, depth] uint8
    Tensor with the image data).

    Recommendation: if you want N-way read parallelism, call this function
    N times. This will give you N independent Readers reading different files
    & positions within those files, which will give better mixing of examples.
    """
    class CIFAR10Record(object):
        pass
    result = CIFAR10Record()

    # Dimensions of the images in the CIFAR-10 dataset.
    # See http://www.cs.toronto.edu/~kriz/cifar.html for details.
    label_bytes = 1  # 2 for CIFAR-100
    result.height = 32
    result.width = 32
    result.depth = 3
    image_bytes = result.height * result.width * result.depth
    # Every record is laid out as <label><image>
    record_bytes = label_bytes + image_bytes

    # Read a fixed number of bytes; key is the filename, value contains label and image
    reader = tf.FixedLengthRecordReader(record_bytes=record_bytes)
    result.key, value = reader.read(filename_queue)

    # Convert from a string to a vector of uint8 that is record_bytes long.
    record_bytes = tf.decode_raw(value, tf.uint8)

    # The first byte is the label (two bytes for CIFAR-100); convert uint8 -> int32.
    result.label = tf.cast(tf.strided_slice(record_bytes, [0], [label_bytes]), tf.int32)

    # The bytes after the label are the image; reshape [depth * height * width]
    # into [depth, height, width].
    depth_major = tf.reshape(
        tf.strided_slice(record_bytes, [label_bytes], [label_bytes + image_bytes]),
        [result.depth, result.height, result.width])
    # Transpose from [depth, height, width] to [height, width, depth].
    result.uint8image = tf.transpose(depth_major, [1, 2, 0])

    return result
TF ops worth noting here: tf.FixedLengthRecordReader (reads fixed-length records from the files in a queue), tf.decode_raw (reinterprets a string as a vector of numbers), tf.strided_slice (extracts a sub-range of a tensor) and tf.transpose (permutes dimensions).
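If tf.decode_raw and tf.strided_slice are unfamiliar, this toy snippet (my own illustration) shows what they do to a 4-byte record whose first byte is the label:

import tensorflow as tf

record = tf.constant(b'\x07abc')               # 1 label byte (value 7) + 3 "pixel" bytes
as_bytes = tf.decode_raw(record, tf.uint8)     # -> [7, 97, 98, 99]
label = tf.strided_slice(as_bytes, [0], [1])   # the first byte
pixels = tf.strided_slice(as_bytes, [1], [4])  # the remaining bytes

with tf.Session() as sess:
    print(sess.run([label, pixels]))           # [array([7], ...), array([97, 98, 99], ...)]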
_generate_image_and_label_batch generates batches of training data.
def _generate_image_and_label_batch(image, label, min_queue_examples, batch_size, shuffle):
    """Construct a batch of images and labels.

    Args:
        image: 3-D Tensor of [height, width, 3] of type.float32.
        label: 1-D Tensor of type.int32
        min_queue_examples: int32, minimum number of samples to retain
            in the queue that provides of batches of examples.
        batch_size: Number of examples per batch.
        shuffle: boolean indicating whether to shuffle the examples.

    Returns:
        images: Images. 4D tensor of [batch_size, height, width, 3] size.
        labels: Labels. 1D tensor of [batch_size] size.
    """
    # Create a queue that shuffles the examples, and then read 'batch_size'
    # images + labels from the example queue.
    num_preprocess_threads = 16
    if shuffle:
        images, label_batch = tf.train.shuffle_batch(
            [image, label],
            batch_size=batch_size,
            num_threads=num_preprocess_threads,
            capacity=min_queue_examples + 3 * batch_size,
            min_after_dequeue=min_queue_examples)
    else:
        images, label_batch = tf.train.batch(
            [image, label],
            batch_size=batch_size,
            num_threads=num_preprocess_threads,
            capacity=min_queue_examples + 3 * batch_size)

    # Display the training images in the visualizer.
    tf.summary.image('images', images)

    return images, tf.reshape(label_batch, [batch_size])
TF ops worth noting here: tf.train.shuffle_batch and tf.train.batch (assemble single examples into batches using a pool of preprocessing threads) and tf.summary.image (records the images for TensorBoard).
The result looks like this:
distorted_inputs uses the two functions above to produce the data that is actually trained on.
def distorted_inputs(data_dir, batch_size):
    """Call read_cifar10 to read images and apply data augmentation, then call
    _generate_image_and_label_batch to produce a batch of data.

    Returns:
        images: Images. 4D tensor of [batch_size, IMAGE_SIZE, IMAGE_SIZE, 3] size.
        labels: Labels. 1D tensor of [batch_size] size.
    """
    filenames = [os.path.join(data_dir, 'data_batch_%d.bin' % i) for i in xrange(1, 6)]
    for f in filenames:
        if not tf.gfile.Exists(f):
            raise ValueError('Failed to find file: ' + f)

    # Filename queue
    filename_queue = tf.train.string_input_producer(filenames)

    # Read one image from the filename queue
    read_input = read_cifar10(filename_queue)
    reshaped_image = tf.cast(read_input.uint8image, tf.float32)

    height = IMAGE_SIZE
    width = IMAGE_SIZE

    # Data augmentation
    distorted_image = tf.random_crop(reshaped_image, [height, width, 3])
    distorted_image = tf.image.random_flip_left_right(distorted_image)
    distorted_image = tf.image.random_brightness(distorted_image, max_delta=63)
    distorted_image = tf.image.random_contrast(distorted_image, lower=0.2, upper=1.8)

    # Subtract off the mean and divide by the variance of the pixels.
    float_image = tf.image.per_image_standardization(distorted_image)

    # Set the shapes of tensors.
    float_image.set_shape([height, width, 3])
    read_input.label.set_shape([1])

    # Ensure that the random shuffling has good mixing properties.
    min_fraction_of_examples_in_queue = 0.4
    min_queue_examples = int(NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN *
                             min_fraction_of_examples_in_queue)
    print('Filling queue with %d CIFAR images before starting to train.' % min_queue_examples)

    # Generate a batch of images and labels by building up a queue of examples.
    return _generate_image_and_label_batch(float_image, read_input.label,
                                           min_queue_examples, batch_size,
                                           shuffle=True)
cifar10.py is the file that actually implements the training network. Let's first look at how cifar10_train.py drives it, and then study how each step is implemented.
Complete code:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from datetime import datetime
import time

import tensorflow as tf

import cifar10

# tf.app.flags.FLAGS is TensorFlow's global flag store; it also handles command-line arguments
FLAGS = tf.app.flags.FLAGS

tf.app.flags.DEFINE_string('train_dir', '/tmp/cifar10_train',
                           "Directory where to write event logs and checkpoint.")
tf.app.flags.DEFINE_integer('max_steps', 100000,
                            "Number of batches to run.")
tf.app.flags.DEFINE_boolean('log_device_placement', False,
                            "Whether to log device placement.")
tf.app.flags.DEFINE_integer('log_frequency', 100,
                            "How often to log results to the console.")


def train():
    """Train CIFAR-10 for a number of steps."""
    with tf.Graph().as_default():
        global_step = tf.contrib.framework.get_or_create_global_step()

        # Get images and labels for CIFAR-10.
        images, labels = cifar10.distorted_inputs()

        # Build a Graph that computes the logits predictions from the inference model.
        logits = cifar10.inference(images)

        # Calculate loss.
        loss = cifar10.loss(logits, labels)

        # Build a Graph that trains the model with one batch of examples and
        # updates the model parameters.
        train_op = cifar10.train(loss, global_step)

        class _LoggerHook(tf.train.SessionRunHook):
            """Logs the loss and runtime."""

            def begin(self):
                self._step = -1
                self._start_time = time.time()

            def before_run(self, run_context):
                self._step += 1
                return tf.train.SessionRunArgs(loss)  # Asks for loss value.

            def after_run(self, run_context, run_values):
                if self._step % FLAGS.log_frequency == 0:
                    current_time = time.time()
                    duration = current_time - self._start_time
                    self._start_time = current_time

                    loss_value = run_values.results
                    examples_per_sec = FLAGS.log_frequency * FLAGS.batch_size / duration
                    sec_per_batch = float(duration / FLAGS.log_frequency)

                    format_str = ('%s: step %d, loss = %.2f '
                                  '(%.1f examples/sec; %.3f sec/batch)')
                    print(format_str % (datetime.now(), self._step, loss_value,
                                        examples_per_sec, sec_per_batch))

        with tf.train.MonitoredTrainingSession(
                checkpoint_dir=FLAGS.train_dir,
                hooks=[tf.train.StopAtStepHook(last_step=FLAGS.max_steps),
                       tf.train.NanTensorHook(loss),
                       _LoggerHook()],
                config=tf.ConfigProto(
                    log_device_placement=FLAGS.log_device_placement)) as mon_sess:
            while not mon_sess.should_stop():
                mon_sess.run(train_op)


def main(argv=None):  # pylint: disable=unused-argument
    cifar10.maybe_download_and_extract()
    if tf.gfile.Exists(FLAGS.train_dir):
        tf.gfile.DeleteRecursively(FLAGS.train_dir)
    tf.gfile.MakeDirs(FLAGS.train_dir)
    train()


if __name__ == '__main__':
    tf.app.run()
This file (cifar10.py) is the key one: it implements the whole network architecture.
# encoding=utf-8
# pylint: disable=missing-docstring
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
import re
import sys
import tarfile

from six.moves import urllib
import tensorflow as tf

import cifar10_input

FLAGS = tf.app.flags.FLAGS

# Basic model parameters.
tf.app.flags.DEFINE_integer('batch_size', 128,
                            "Number of images to process in a batch.")
tf.app.flags.DEFINE_string('data_dir', '/tmp/cifar10_data',
                           "Path to the CIFAR-10 data directory.")
tf.app.flags.DEFINE_boolean('use_fp16', False,
                            "Train the model using fp16.")

# Global constants describing the CIFAR-10 data set.
IMAGE_SIZE = cifar10_input.IMAGE_SIZE
NUM_CLASSES = cifar10_input.NUM_CLASSES
NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN = cifar10_input.NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN
NUM_EXAMPLES_PER_EPOCH_FOR_EVAL = cifar10_input.NUM_EXAMPLES_PER_EPOCH_FOR_EVAL

# Constants describing the training process.
MOVING_AVERAGE_DECAY = 0.9999     # The decay to use for the moving average.
NUM_EPOCHS_PER_DECAY = 350.0      # Epochs after which learning rate decays.
LEARNING_RATE_DECAY_FACTOR = 0.1  # Learning rate decay factor.
INITIAL_LEARNING_RATE = 0.1       # Initial learning rate.

# If a model is trained with multiple GPUs, prefix all Op names with tower_name
# to differentiate the operations. Note that this prefix is removed from the
# names of the summaries when visualizing a model.
TOWER_NAME = 'tower'

DATA_URL = 'http://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz'
Some helper functions:
def _activation_summary(x):
    """Helper to create summaries for activations.

    Creates a summary that provides a histogram of activations.
    Creates a summary that measures the sparsity of activations.

    Args:
        x: Tensor
    """
    # Remove 'tower_[0-9]/' from the name in case this is a multi-GPU training
    # session. This helps the clarity of presentation on tensorboard.
    tensor_name = re.sub('%s_[0-9]*/' % TOWER_NAME, '', x.op.name)
    tf.summary.histogram(tensor_name + '/activations', x)
    tf.summary.scalar(tensor_name + '/sparsity', tf.nn.zero_fraction(x))
def _variable_on_cpu(name, shape, initializer):
    """Helper to create a Variable stored on CPU memory.

    Args:
        name: name of the variable
        shape: list of ints
        initializer: initializer for Variable

    Returns:
        Variable Tensor
    """
    with tf.device('/cpu:0'):
        dtype = tf.float16 if FLAGS.use_fp16 else tf.float32
        var = tf.get_variable(name, shape, initializer=initializer, dtype=dtype)
    return var
def _variable_with_weight_decay(name, shape, stddev, wd):
    """Helper to create an initialized Variable with weight decay.

    Note that the Variable is initialized with a truncated normal distribution.
    A weight decay is added only if one is specified.

    Args:
        name: name of the variable
        shape: list of ints
        stddev: standard deviation of a truncated Gaussian
        wd: add L2Loss weight decay multiplied by this float. If None, weight
            decay is not added for this Variable.

    Returns:
        Variable Tensor
    """
    dtype = tf.float16 if FLAGS.use_fp16 else tf.float32
    var = _variable_on_cpu(name, shape,
                           tf.truncated_normal_initializer(stddev=stddev, dtype=dtype))
    if wd is not None:
        weight_decay = tf.multiply(tf.nn.l2_loss(var), wd, name='weight_loss')
        tf.add_to_collection('losses', weight_decay)
    return var
def maybe_download_and_extract():
    """Download and extract the tarball from Alex's website."""
    dest_directory = FLAGS.data_dir
    if not os.path.exists(dest_directory):
        os.makedirs(dest_directory)
    filename = DATA_URL.split('/')[-1]
    filepath = os.path.join(dest_directory, filename)
    if not os.path.exists(filepath):
        def _progress(count, block_size, total_size):
            sys.stdout.write('\r>> Downloading %s %.1f%%' %
                             (filename, float(count * block_size) / float(total_size) * 100.0))
            sys.stdout.flush()
        filepath, _ = urllib.request.urlretrieve(DATA_URL, filepath, _progress)
        print()
        statinfo = os.stat(filepath)
        print('Successfully downloaded', filename, statinfo.st_size, 'bytes.')
    extracted_dir_path = os.path.join(dest_directory, 'cifar-10-batches-bin')
    if not os.path.exists(extracted_dir_path):
        tarfile.open(filepath, 'r:gz').extractall(dest_directory)
This wraps the distorted_inputs function from cifar10_input.py with one more layer: based on the use_fp16 flag, it decides whether to cast the data to float16 for computation.
def distorted_inputs():
    """Construct distorted input for CIFAR training using the Reader ops.

    Returns:
        images: Images. 4D tensor of [batch_size, IMAGE_SIZE, IMAGE_SIZE, 3] size.
        labels: Labels. 1D tensor of [batch_size] size.

    Raises:
        ValueError: If no data_dir
    """
    if not FLAGS.data_dir:
        raise ValueError('Please supply a data_dir')
    data_dir = os.path.join(FLAGS.data_dir, 'cifar-10-batches-bin')
    images, labels = cifar10_input.distorted_inputs(data_dir=data_dir,
                                                    batch_size=FLAGS.batch_size)
    if FLAGS.use_fp16:
        images = tf.cast(images, tf.float16)
        labels = tf.cast(labels, tf.float16)
    return images, labels
def inference(images):
    """Build the CIFAR-10 model.

    Args:
        images: Images returned from distorted_inputs() or inputs().

    Returns:
        Logits.
    """
    # We instantiate all variables using tf.get_variable() instead of
    # tf.Variable() in order to share variables across multiple GPU training runs.
    # If we only ran this model on a single GPU, we could simplify this function
    # by replacing all instances of tf.get_variable() with tf.Variable().

    # First convolutional layer
    with tf.variable_scope('conv1') as scope:
        kernel = _variable_with_weight_decay('weights', shape=[5, 5, 3, 64],
                                             stddev=5e-2, wd=0.0)
        conv = tf.nn.conv2d(images, kernel, [1, 1, 1, 1], padding='SAME')
        biases = _variable_on_cpu('biases', [64], tf.constant_initializer(0.0))
        pre_activation = tf.nn.bias_add(conv, biases)
        conv1 = tf.nn.relu(pre_activation, name=scope.name)
        _activation_summary(conv1)

    pool1 = tf.nn.max_pool(conv1, ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1],
                           padding='SAME', name='pool1')
    # Local response normalization (LRN); most modern models no longer use it
    norm1 = tf.nn.lrn(pool1, 4, bias=1.0, alpha=0.001 / 9.0, beta=0.75, name='norm1')

    # Second convolutional layer
    with tf.variable_scope('conv2') as scope:
        kernel = _variable_with_weight_decay('weights', shape=[5, 5, 64, 64],
                                             stddev=5e-2, wd=0.0)
        conv = tf.nn.conv2d(norm1, kernel, [1, 1, 1, 1], padding='SAME')
        biases = _variable_on_cpu('biases', [64], tf.constant_initializer(0.1))
        pre_activation = tf.nn.bias_add(conv, biases)
        conv2 = tf.nn.relu(pre_activation, name=scope.name)
        _activation_summary(conv2)

    norm2 = tf.nn.lrn(conv2, 4, bias=1.0, alpha=0.001 / 9.0, beta=0.75, name='norm2')
    pool2 = tf.nn.max_pool(norm2, ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1],
                           padding='SAME', name='pool2')

    # Fully connected layers
    with tf.variable_scope('local3') as scope:
        # No more convolutions after this point, so flatten pool2 to make the
        # fully connected layer straightforward.
        reshape = tf.reshape(pool2, [FLAGS.batch_size, -1])
        dim = reshape.get_shape()[1].value
        weights = _variable_with_weight_decay('weights', shape=[dim, 384],
                                              stddev=0.04, wd=0.004)
        biases = _variable_on_cpu('biases', [384], tf.constant_initializer(0.1))
        local3 = tf.nn.relu(tf.matmul(reshape, weights) + biases, name=scope.name)
        _activation_summary(local3)

    with tf.variable_scope('local4') as scope:
        weights = _variable_with_weight_decay('weights', shape=[384, 192],
                                              stddev=0.04, wd=0.004)
        biases = _variable_on_cpu('biases', [192], tf.constant_initializer(0.1))
        local4 = tf.nn.relu(tf.matmul(local3, weights) + biases, name=scope.name)
        _activation_summary(local4)

    # No explicit softmax here: we only output the unscaled logits (softmax_linear).
    # tf.nn.sparse_softmax_cross_entropy_with_logits accepts the unscaled logits
    # and performs the softmax internally for efficiency.
    with tf.variable_scope('softmax_linear') as scope:
        weights = _variable_with_weight_decay('weights', [192, NUM_CLASSES],
                                              stddev=1 / 192.0, wd=0.0)
        biases = _variable_on_cpu('biases', [NUM_CLASSES], tf.constant_initializer(0.0))
        softmax_linear = tf.add(tf.matmul(local4, weights), biases, name=scope.name)
        _activation_summary(softmax_linear)

    return softmax_linear
Two convolutional layers and three fully connected layers.
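As a sanity check on the architecture, here is my own trace of the tensor shapes through inference(), derived from the code above with the default batch_size of 128:

images           [128, 24, 24, 3]
conv1 (5x5, 64)  [128, 24, 24, 64]
pool1 (3x3, s=2) [128, 12, 12, 64]   # followed by LRN (norm1)
conv2 (5x5, 64)  [128, 12, 12, 64]   # followed by LRN (norm2)
pool2 (3x3, s=2) [128, 6, 6, 64]
reshape          [128, 2304]         # 6 * 6 * 64 = 2304
local3           [128, 384]
local4           [128, 192]
softmax_linear   [128, 10]           # unscaled logits, one per class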
def loss(logits, labels):
    """Add L2Loss to all the trainable variables.

    Add summary for "Loss" and "Loss/avg".

    Args:
        logits: Logits from inference().
        labels: Labels from distorted_inputs or inputs(). 1-D tensor of shape [batch_size]

    Returns:
        Loss tensor of type float.
    """
    # Calculate the average cross entropy loss across the batch.
    labels = tf.cast(labels, tf.int64)
    cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=logits, name='cross_entropy_per_example')
    cross_entropy_mean = tf.reduce_mean(cross_entropy, name='cross_entropy')
    tf.add_to_collection('losses', cross_entropy_mean)

    # The total loss is defined as the cross entropy loss plus all of the weight
    # decay terms (L2 loss).
    return tf.add_n(tf.get_collection('losses'), name='total_loss')
def _add_loss_summaries(total_loss):
    """Add summaries for losses in CIFAR-10 model.

    Generates moving average for all losses and associated summaries for
    visualizing the performance of the network.

    Args:
        total_loss: Total loss from loss().

    Returns:
        loss_averages_op: op for generating moving averages of losses.
    """
    # Compute the moving average of all individual losses and the total loss.
    loss_averages = tf.train.ExponentialMovingAverage(0.9, name='avg')
    losses = tf.get_collection('losses')
    loss_averages_op = loss_averages.apply(losses + [total_loss])

    # Attach a scalar summary to all individual losses and the total loss;
    # do the same for the averaged version of the losses.
    for l in losses + [total_loss]:
        # Name each loss as '(raw)' and name the moving average version of the
        # loss as the original loss name.
        tf.summary.scalar(l.op.name + ' (raw)', l)
        tf.summary.scalar(l.op.name, loss_averages.average(l))

    return loss_averages_op
def train(total_loss, global_step):
    """Train CIFAR-10 model.

    Create an optimizer and apply to all trainable variables. Add moving
    average for all trainable variables.

    Args:
        total_loss: Total loss from loss().
        global_step: Integer Variable counting the number of training steps processed.

    Returns:
        train_op: op for training.
    """
    # Variables that affect learning rate.
    num_batches_per_epoch = NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN / FLAGS.batch_size
    decay_steps = int(num_batches_per_epoch * NUM_EPOCHS_PER_DECAY)

    # Decay the learning rate exponentially based on the number of steps.
    lr = tf.train.exponential_decay(INITIAL_LEARNING_RATE,
                                    global_step,
                                    decay_steps,
                                    LEARNING_RATE_DECAY_FACTOR,
                                    staircase=True)
    tf.summary.scalar('learning_rate', lr)

    # Generate moving averages of all losses and associated summaries.
    loss_averages_op = _add_loss_summaries(total_loss)

    # Compute gradients.
    with tf.control_dependencies([loss_averages_op]):
        opt = tf.train.GradientDescentOptimizer(lr)
        grads = opt.compute_gradients(total_loss)

    # Apply gradients.
    apply_gradient_op = opt.apply_gradients(grads, global_step=global_step)

    # Add histograms for trainable variables.
    for var in tf.trainable_variables():
        tf.summary.histogram(var.op.name, var)

    # Add histograms for gradients.
    for grad, var in grads:
        if grad is not None:
            tf.summary.histogram(var.op.name + '/gradients', grad)

    # Track the moving averages of all trainable variables.
    variable_averages = tf.train.ExponentialMovingAverage(MOVING_AVERAGE_DECAY, global_step)
    variables_averages_op = variable_averages.apply(tf.trainable_variables())

    with tf.control_dependencies([apply_gradient_op, variables_averages_op]):
        train_op = tf.no_op(name='train')

    return train_op
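One detail worth working out from the constants above (my own back-of-the-envelope calculation, assuming the default batch_size of 128 and max_steps of 100000):

num_batches_per_epoch = 50000 / 128.0  # about 390.6
decay_steps = int(390.625 * 350)       # = 136718
# With staircase=True, lr = 0.1 * 0.1 ** (global_step // 136718),
# so the learning rate stays at 0.1 for the entire default 100000-step run;
# the first decay would only kick in after roughly 137K steps.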
That covers the training code. Run python cifar10_train.py --train_dir cifar10_train/ --data_dir cifar10_data/ to start training, and run tensorboard --logdir cifar10_train/ to watch the training progress in TensorBoard.
I trained for 100K steps (256 epochs of data), which took roughly 3.5 hours.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from datetime import datetime
import math
import time

import numpy as np
import tensorflow as tf

import cifar10

FLAGS = tf.app.flags.FLAGS

tf.app.flags.DEFINE_string('eval_dir', '/tmp/cifar10_eval',
                           "Directory where to write event logs.")
tf.app.flags.DEFINE_string('eval_data', 'test',
                           "Either 'test' or 'train_eval'.")
tf.app.flags.DEFINE_string('checkpoint_dir', '/tmp/cifar10_train',
                           "Directory where to read model checkpoints.")
tf.app.flags.DEFINE_integer('eval_interval_secs', 60 * 5,
                            "How often to run the eval.")
tf.app.flags.DEFINE_integer('num_examples', 10000,
                            "Number of examples to run.")
tf.app.flags.DEFINE_boolean('run_once', False,
                            "Whether to run eval only once.")


def eval_once(saver, summary_writer, top_k_op, summary_op):
    """Run Eval once.

    Args:
        saver: Saver.
        summary_writer: Summary writer.
        top_k_op: Top K op.
        summary_op: Summary op.
    """
    with tf.Session() as sess:
        ckpt = tf.train.get_checkpoint_state(FLAGS.checkpoint_dir)
        if ckpt and ckpt.model_checkpoint_path:
            # Restores from checkpoint
            saver.restore(sess, ckpt.model_checkpoint_path)
            # Assuming model_checkpoint_path looks something like:
            # /my-favorite-path/cifar10_train/model.ckpt-0,
            # extract global_step from it.
            global_step = ckpt.model_checkpoint_path.split('/')[-1].split('-')[-1]
        else:
            print('No checkpoint file found')
            return

        # Start the queue runners.
        coord = tf.train.Coordinator()
        try:
            threads = []
            for qr in tf.get_collection(tf.GraphKeys.QUEUE_RUNNERS):
                threads.extend(qr.create_threads(sess, coord=coord, daemon=True, start=True))

            num_iter = int(math.ceil(FLAGS.num_examples / FLAGS.batch_size))
            true_count = 0  # Counts the number of correct predictions.
            total_sample_count = num_iter * FLAGS.batch_size
            step = 0
            while step < num_iter and not coord.should_stop():
                predictions = sess.run([top_k_op])
                true_count += np.sum(predictions)
                step += 1

            # Compute precision @ 1.
            precision = true_count / total_sample_count
            print('%s: precision @ 1 = %.3f' % (datetime.now(), precision))

            summary = tf.Summary()
            summary.ParseFromString(sess.run(summary_op))
            summary.value.add(tag='Precision @ 1', simple_value=precision)
            summary_writer.add_summary(summary, global_step)
        except Exception as e:  # pylint: disable=broad-except
            coord.request_stop(e)

        coord.request_stop()
        coord.join(threads, stop_grace_period_secs=10)


def evaluate():
    """Eval CIFAR-10 for a number of steps."""
    with tf.Graph().as_default() as g:
        # Get images and labels for CIFAR-10.
        eval_data = FLAGS.eval_data == 'test'
        images, labels = cifar10.inputs(eval_data=eval_data)

        # Build a Graph that computes the logits predictions from the
        # inference model.
        logits = cifar10.inference(images)

        # Calculate predictions.
        top_k_op = tf.nn.in_top_k(logits, labels, 1)

        # Restore the moving average version of the learned variables for eval.
        variable_averages = tf.train.ExponentialMovingAverage(cifar10.MOVING_AVERAGE_DECAY)
        variables_to_restore = variable_averages.variables_to_restore()
        saver = tf.train.Saver(variables_to_restore)

        # Build the summary operation based on the TF collection of Summaries.
        summary_op = tf.summary.merge_all()
        summary_writer = tf.summary.FileWriter(FLAGS.eval_dir, g)

        while True:
            eval_once(saver, summary_writer, top_k_op, summary_op)
            if FLAGS.run_once:
                break
            time.sleep(FLAGS.eval_interval_secs)


def main(argv=None):  # pylint: disable=unused-argument
    cifar10.maybe_download_and_extract()
    if tf.gfile.Exists(FLAGS.eval_dir):
        tf.gfile.DeleteRecursively(FLAGS.eval_dir)
    tf.gfile.MakeDirs(FLAGS.eval_dir)
    evaluate()


if __name__ == '__main__':
    tf.app.run()
Run the evaluation with: python cifar10_eval.py --data_dir cifar10_data/ --eval_dir cifar10_eval/ --checkpoint_dir cifar10_train/
The results can be viewed in TensorBoard: tensorboard --logdir cifar10_eval/ --port 6007
Why start a second TensorBoard for evaluation? So the test accuracy can be tracked against the step count. Training and evaluation run at the same time, and the evaluation job keeps reloading the newest checkpoint from the model directory. In practice the model reaches about 86% accuracy at around 60K steps, 86.3% at 100K steps, and stabilizes at roughly 86.6% after 150K steps.
Putting the rest on hold for now... I need to work through the Chapter 3 training material first (work requires it!), and then come back to the tf summary ops and the learning-rate-decay / gradient-descent functions.