『TensorFlow』SSD Source Code Study, Part 2: Forward Architecture of the VGG-Based SSD Network

Date: 2019-11-18



Forked project repository: SSD

1. SSD Basics

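Detecting objects on top of a classifier essentially means scanning the whole image with the classifier to locate where the features are. The key question is which scanning algorithm to use. For example, we can divide the image into a grid and run the classifier cell by cell, but this approach has several problems: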

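Problem 1: if the target sits right on the boundary between two cells, the classifier's response is not significant enough in either cell, which causes a miss (a false negative).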

Problem 2: if the target is too large or too small, the response within a cell is not significant enough, which again causes a miss.

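For the first problem, we can use overlapping cells. For example, with a 32x32-pixel cell, the window moves only 8 pixels at a time, so it takes four steps to move completely off the previous cell. For the second problem, we can combine cells of different sizes: after scanning with the 32x32 cell, scan again with a 64x64 cell and once more with a 16x16 cell.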
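The overlapping, multi-scale scan described above can be sketched in a few lines of plain Python. This is only an illustration of the naive strategy, not code from the project: the function names, the 128x128 image size, and the choice of moving by a quarter of the window size at every scale are assumptions made for the example.

```python
def sliding_windows(image_size, window, stride):
    """Yield (top, left, bottom, right) boxes for a square window scanned over a square image."""
    last = image_size - window  # last valid top/left coordinate
    for top in range(0, last + 1, stride):
        for left in range(0, last + 1, stride):
            yield (top, left, top + window, left + window)


def multi_scale_windows(image_size, windows=(16, 32, 64), overlap=4):
    """Scan with several window sizes; each step moves by window // overlap pixels."""
    boxes = []
    for w in windows:
        boxes.extend(sliding_windows(image_size, w, w // overlap))
    return boxes


# 32x32 window moved 8 pixels at a time, as in the text (hypothetical 128x128 image).
boxes_32 = list(sliding_windows(128, window=32, stride=8))
all_boxes = multi_scale_windows(128)
print(len(boxes_32), len(all_boxes))
```

On this 128x128 example the 32x32 scale alone yields 169 windows, compared with 16 for a non-overlapping grid, and every window would need its own classifier pass.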

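But this brings another problem: to preserve accuracy we scan the same image too many times, which severely hurts computation speed and makes real-time annotation impossible with this strategy.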

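To annotate image features quickly and in real time, many improvements to the overall detection and localization algorithm have been proposed.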

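The most basic idea is to make good use of the internal structure of a convolutional neural network and avoid repeated computation. When a CNN scans an image, the convolution outputs already store information at different grid sizes, so this process in effect already carries out the improvements proposed in the previous part. Visualizing the feature maps shows that the outputs of the first few convolutional layers focus on details, while the outputs of later layers focus more on the whole.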

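For problem 1, if an object lies between two cells, neither side may respond strongly enough, but if the basic features from both sides can be combined sensibly, we do not need to scan again. The later layers attend more and more to the whole, so for problem 2, even if the target is too large or too small, its features are still retained. In other words, because a deep network already has several convolutional layers, scanning an image with a CNN has in effect already scanned it many times, and both problems can be addressed by making good use of the network's intermediate results.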
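One way to make "early layers see details, later layers see the whole" concrete is to compute the receptive field of each layer, that is, how many input pixels one output position depends on. The sketch below uses the standard recurrence (receptive field grows by (kernel - 1) times the accumulated stride) on the VGG-16 layers up to conv4_3. The layer table and the function name are written for this illustration and are not taken from the project.

```python
# (name, kernel, stride) for the VGG-16 layers up to conv4_3.
VGG_LAYERS = [
    ("conv1_1", 3, 1), ("conv1_2", 3, 1), ("pool1", 2, 2),
    ("conv2_1", 3, 1), ("conv2_2", 3, 1), ("pool2", 2, 2),
    ("conv3_1", 3, 1), ("conv3_2", 3, 1), ("conv3_3", 3, 1), ("pool3", 2, 2),
    ("conv4_1", 3, 1), ("conv4_2", 3, 1), ("conv4_3", 3, 1),
]


def receptive_fields(layers):
    """Return {layer name: receptive field size in input pixels}."""
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    result = {}
    for name, kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
        result[name] = rf
    return result


rf = receptive_fields(VGG_LAYERS)
for name in ("conv1_2", "conv2_2", "conv3_3", "conv4_3"):
    print(name, rf[name])
```

Each position in conv1_2 depends on a 5x5 patch of the input, while each position in conv4_3 depends on a 92x92 patch, which is why deeper feature maps suit larger objects.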

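Before SSD, methods such as MultiBox and Fast R-CNN used a two-step strategy: the first step uses a deep neural network to localize potential target objects, that is, to generate boxes first; how to classify the object inside each box is computed in a second step. The first-generation YOLO algorithm does complete classification and localization in a single step, but its structure uses fully connected layers, which have many problems and are gradually being "abandoned" by deep neural network architectures.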

2. Network Structure in the TF_SSD Project

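Back to the project. Taking VGG300 (/nets/ssd_vgg_300.py) as an example, the general idea is to use the first five blocks of the VGG network, append several extra layers, then take the convolved outputs of a few selected layers and search them for target features. In the function, this maps to three main parts: the original network structure, the added network structure, and the SSD processing structure: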

def ssd_net(inputs,
            num_classes=SSDNet.default_params.num_classes,
            feat_layers=SSDNet.default_params.feat_layers,
            anchor_sizes=SSDNet.default_params.anchor_sizes,
            anchor_ratios=SSDNet.default_params.anchor_ratios,
            normalizations=SSDNet.default_params.normalizations,
            is_training=True,
            dropout_keep_prob=0.5,
            prediction_fn=slim.softmax,
            reuse=None,
            scope='ssd_300_vgg'):
    """SSD net definition.
    """
    # if data_format == 'NCHW':
    #     inputs = tf.transpose(inputs, perm=(0, 3, 1, 2))

    # End_points collect relevant activations for external use.
    """
      net = layers_lib.repeat(
          inputs, 2, layers.conv2d, 64, [3, 3], scope='conv1')
      net = layers_lib.max_pool2d(net, [2, 2], scope='pool1')
      net = layers_lib.repeat(net, 2, layers.conv2d, 128, [3, 3], scope='conv2')
      net = layers_lib.max_pool2d(net, [2, 2], scope='pool2')
      net = layers_lib.repeat(net, 3, layers.conv2d, 256, [3, 3], scope='conv3')
      net = layers_lib.max_pool2d(net, [2, 2], scope='pool3')
      net = layers_lib.repeat(net, 3, layers.conv2d, 512, [3, 3], scope='conv4')
      net = layers_lib.max_pool2d(net, [2, 2], scope='pool4')
      net = layers_lib.repeat(net, 3, layers.conv2d, 512, [3, 3], scope='conv5')
      net = layers_lib.max_pool2d(net, [2, 2], scope='pool5')
    """
    end_points = {}
    with tf.variable_scope(scope, 'ssd_300_vgg', [inputs], reuse=reuse):
        ##############################################
        # First five blocks: copy the VGG-16 layout  #
        # end_points records intermediate outputs    #
        ##############################################
        # ——————————————————Original VGG-16 blocks.———————————————————————
        net = slim.repeat(inputs, 2, slim.conv2d, 64, [3, 3], scope='conv1')
        end_points['block1'] = net
        net = slim.max_pool2d(net, [2, 2], scope='pool1')
        # Block 2.
        net = slim.repeat(net, 2, slim.conv2d, 128, [3, 3], scope='conv2')
        end_points['block2'] = net
        net = slim.max_pool2d(net, [2, 2], scope='pool2')
        # Block 3.
        net = slim.repeat(net, 3, slim.conv2d, 256, [3, 3], scope='conv3')
        end_points['block3'] = net
        net = slim.max_pool2d(net, [2, 2], scope='pool3')
        # Block 4.
        net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], scope='conv4')
        end_points['block4'] = net
        net = slim.max_pool2d(net, [2, 2], scope='pool4')
        # Block 5.
        net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], scope='conv5')
        end_points['block5'] = net
        net = slim.max_pool2d(net, [3, 3], stride=1, scope='pool5')  # pool5 changed from VGG's 2x2 / stride 2 to 3x3 / stride 1

        ##############################################
        # Last six blocks: extra convolution layers  #
        ##############################################
        # ————————————Additional SSD blocks.——————————————————————
        # Block 6: 3x3 dilated (atrous) convolution with rate 6.
        net = slim.conv2d(net, 1024, [3, 3], rate=6, scope='conv6')
        end_points['block6'] = net
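        # NOTE: `rate` in tf.layers.dropout is the fraction dropped, not kept;
        # passing the keep probability (here and below) is only equivalent at 0.5.
        net = tf.layers.dropout(net, rate=dropout_keep_prob, training=is_training)
        # Block 7: 1x1 convolution.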
        net = slim.conv2d(net, 1024, [1, 1], scope='conv7')
        end_points['block7'] = net
        net = tf.layers.dropout(net, rate=dropout_keep_prob, training=is_training)

        # Block 8/9/10/11: 1x1 and 3x3 convolutions stride 2 (except lasts).
        end_point = 'block8'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 256, [1, 1], scope='conv1x1')
            net = custom_layers.pad2d(net, pad=(1, 1))
            net = slim.conv2d(net, 512, [3, 3], stride=2, scope='conv3x3', padding='VALID')
        end_points[end_point] = net
        end_point = 'block9'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
            net = custom_layers.pad2d(net, pad=(1, 1))
            net = slim.conv2d(net, 256, [3, 3], stride=2, scope='conv3x3', padding='VALID')
        end_points[end_point] = net
        end_point = 'block10'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
            net = slim.conv2d(net, 256, [3, 3], scope='conv3x3', padding='VALID')
        end_points[end_point] = net
        end_point = 'block11'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
            net = slim.conv2d(net, 256, [3, 3], scope='conv3x3', padding='VALID')
        end_points[end_point] = net

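        ####################################################
        # Take the outputs stored in end_points for each   #
        # feature layer, collect the per-layer predictions #
        # in lists, and return them for the loss function  #
        ####################################################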
        # Prediction and localisations layers.
        predictions = []
        logits = []
        localisations = []
        # feat_layers=['block4', 'block7', 'block8', 'block9', 'block10', 'block11']
        for i, layer in enumerate(feat_layers):
            with tf.variable_scope(layer + '_box'):
                p, l = ssd_multibox_layer(end_points[layer],
                                          num_classes,
                                          anchor_sizes[i],
                                          anchor_ratios[i],
                                          normalizations[i])
                """
                The number of boxes per location equals len(anchor_sizes[i]) + len(anchor_ratios[i])
                anchor_sizes=[(21., 45.),
                              (45., 99.),
                              (99., 153.),
                              (153., 207.),
                              (207., 261.),
                              (261., 315.)]
                anchor_ratios=[[2, .5],
                               [2, .5, 3, 1./3],
                               [2, .5, 3, 1./3],
                               [2, .5, 3, 1./3],
                               [2, .5],
                               [2, .5]]
                normalizations=[20, -1, -1, -1, -1, -1]
                """
            predictions.append(prediction_fn(p))  # prediction_fn=slim.softmax
            logits.append(p)
            localisations.append(l)

        return predictions, localisations, logits, end_points
ssd_net.default_image_size = 300
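As a sanity check on the listing above, the spatial size of each block can be traced with plain arithmetic. This is a minimal sketch, assuming a 300x300 input, slim's default stride of 2 for max_pool2d, and 'SAME' padding for every layer that does not set padding='VALID' explicitly. The helper names same, valid and ssd300_feature_sizes are illustrative and not part of the project.

```python
import math


def same(size, stride=1):
    """Output size under 'SAME' padding: ceil(size / stride)."""
    return math.ceil(size / stride)


def valid(size, kernel, stride=1):
    """Output size under 'VALID' padding: floor((size - kernel) / stride) + 1."""
    return (size - kernel) // stride + 1


def ssd300_feature_sizes(size=300):
    """Trace the height/width of each end_points entry for a square input."""
    sizes = {}
    # Blocks 1-4: 'SAME' convolutions keep the size, then a stride-2 pool halves it.
    for i in range(1, 5):
        sizes[f"block{i}"] = size
        size = same(size, stride=2)
    # Block 5 convs, pool5 (3x3, stride 1), conv6 and conv7 all keep the size.
    for name in ("block5", "block6", "block7"):
        sizes[name] = size
    # Blocks 8-9: pad one pixel on each side, then 3x3 'VALID' conv with stride 2.
    for name in ("block8", "block9"):
        size = valid(size + 2, kernel=3, stride=2)
        sizes[name] = size
    # Blocks 10-11: 3x3 'VALID' conv with stride 1, no explicit padding.
    for name in ("block10", "block11"):
        size = valid(size, kernel=3)
        sizes[name] = size
    return sizes


sizes = ssd300_feature_sizes()
feat_layers = ["block4", "block7", "block8", "block9", "block10", "block11"]
print([sizes[name] for name in feat_layers])
```

The six selected layers come out as 38, 19, 10, 5, 3 and 1, matching the feat_shapes entry in the default parameters.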

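After this function, the file defines ssd_arg_scope, which constrains the hyperparameter settings used in the network. Its usage is already given in the script header: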

Usage:
  with slim.arg_scope(ssd_vgg.ssd_vgg()):
      outputs, end_points = ssd_vgg.ssd_vgg(inputs)

def ssd_arg_scope(weight_decay=0.0005, data_format='NHWC'):
    """Defines the VGG arg scope.

    Args:
      weight_decay: The l2 regularization coefficient.

    Returns:
      An arg_scope.
    """
    with slim.arg_scope([slim.conv2d, slim.fully_connected],
                        activation_fn=tf.nn.relu,
                        weights_regularizer=slim.l2_regularizer(weight_decay),
                        weights_initializer=tf.contrib.layers.xavier_initializer(),
                        biases_initializer=tf.zeros_initializer()):
        with slim.arg_scope([slim.conv2d, slim.max_pool2d],
                            padding='SAME',
                            data_format=data_format):
            with slim.arg_scope([custom_layers.pad2d,
                                 custom_layers.l2_normalization,
                                 custom_layers.channel_to_last],
                                data_format=data_format) as sc:
                return sc

a. Hyperparameter settings

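In the original program the hyperparameters are given as a class attribute. We do not care about the rest of this class for now; we extract only the part containing the hyperparameter settings to help understand the network above.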

from collections import namedtuple

SSDParams = namedtuple('SSDParameters', ['img_shape',
                                         'num_classes',
                                         'no_annotation_label',
                                         'feat_layers',
                                         'feat_shapes',
                                         'anchor_size_bounds',
                                         'anchor_sizes',
                                         'anchor_ratios',
                                         'anchor_steps',
                                         'anchor_offset',
                                         'normalizations',
                                         'prior_scaling'
                                         ])


class SSDNet(object):
    default_params = SSDParams(
        img_shape=(300, 300),
        num_classes=21,
        no_annotation_label=21,
        feat_layers=['block4', 'block7', 'block8', 'block9', 'block10', 'block11'],
        feat_shapes=[(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)],
        anchor_size_bounds=[0.15, 0.90],
        # anchor_size_bounds=[0.20, 0.90],
        anchor_sizes=[(21., 45.),
                      (45., 99.),
                      (99., 153.),
                      (153., 207.),
                      (207., 261.),
                      (261., 315.)],
        anchor_ratios=[[2, .5],
                       [2, .5, 3, 1./3],
                       [2, .5, 3, 1./3],
                       [2, .5, 3, 1./3],
                       [2, .5],
                       [2, .5]],
        anchor_steps=[8, 16, 32, 64, 100, 300],
        anchor_offset=0.5,
        normalizations=[1, -1, -1, -1, -1, -1],  # > 0: L2-normalize that feature layer along the channel axis before the SSD head
        prior_scaling=[0.1, 0.1, 0.2, 0.2]
        )
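These defaults fix how many default boxes the network predicts. The sketch below copies feat_shapes, anchor_sizes and anchor_ratios from the parameters above and applies the same rule as ssd_multibox_layer, where the number of anchors per location is len(sizes) + len(ratios). The function name boxes_per_layer is made up for this example.

```python
feat_shapes = [(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)]
anchor_sizes = [(21., 45.), (45., 99.), (99., 153.),
                (153., 207.), (207., 261.), (261., 315.)]
anchor_ratios = [[2, .5],
                 [2, .5, 3, 1. / 3],
                 [2, .5, 3, 1. / 3],
                 [2, .5, 3, 1. / 3],
                 [2, .5],
                 [2, .5]]


def boxes_per_layer(shapes, sizes, ratios):
    """Default boxes per feature layer: H * W * (len(sizes) + len(ratios))."""
    return [h * w * (len(s) + len(r))
            for (h, w), s, r in zip(shapes, sizes, ratios)]


per_layer = boxes_per_layer(feat_shapes, anchor_sizes, anchor_ratios)
print(per_layer, sum(per_layer))
```

The per-layer counts are 5776, 2166, 600, 150, 36 and 4, for a total of 8732 default boxes per image.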

b. SSD processing structure

        # Prediction and localisations layers.
        predictions = []
        logits = []
        localisations = []
        # feat_layers=['block4', 'block7', 'block8', 'block9', 'block10', 'block11']
        for i, layer in enumerate(feat_layers):
            with tf.variable_scope(layer + '_box'):
                p, l = ssd_multibox_layer(end_points[layer],  # <----- SSD processing
                                          num_classes,
                                          anchor_sizes[i],
                                          anchor_ratios[i],
                                          normalizations[i])
            predictions.append(prediction_fn(p))  # prediction_fn=slim.softmax
            logits.append(p)
            localisations.append(l)

        return predictions, localisations, logits, end_points

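At the end of the network, new convolutions are attached to the selected feature layers (the code above). The processing function is as follows: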

def tensor_shape(x, rank=3):
    """Returns the dimensions of a tensor.
    Args:
      x: An N-D Tensor.
    Returns:
      A list of dimensions. Dimensions that are statically known are python
        integers,otherwise they are integer scalar tensors.
    """
    if x.get_shape().is_fully_defined():
        return x.get_shape().as_list()
    else:
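        # get_shape returns the static shape; with_rank acts like an assert that the rank equals the given value
        static_shape = x.get_shape().with_rank(rank).as_list()
        # tf.shape returns a tensor; unstack's `num` is "The length of the dimension `axis`", and axis defaults to 0
        dynamic_shape = tf.unstack(tf.shape(x), num=rank)
        # Build a list: Python ints where statically known, tensors otherwise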
        return [s if s is not None else d
                for s, d in zip(static_shape, dynamic_shape)]


def ssd_multibox_layer(inputs,
                       num_classes,
                       sizes,
                       ratios=[1],
                       normalization=-1,
                       bn_normalization=False):
    """Construct a multibox layer, return a class and localization predictions.
    """
    net = inputs
    if normalization > 0:
        net = custom_layers.l2_normalization(net, scaling=True)
    # Number of anchors.
    num_anchors = len(sizes) + len(ratios)

    # Location.
    num_loc_pred = num_anchors * 4  # four coordinates per box
    loc_pred = slim.conv2d(net, num_loc_pred, [3, 3], activation_fn=None,
                           scope='conv_loc')  # each output channel is one coordinate of one box
    # Force NHWC layout
    loc_pred = custom_layers.channel_to_last(loc_pred)
    # [N, H, W, num_anchors * 4] -> [N, H, W, num_anchors, 4]
    loc_pred = tf.reshape(loc_pred,
                          tensor_shape(loc_pred, 4)[:-1]+[num_anchors, 4])
    # Class prediction.
    num_cls_pred = num_anchors * num_classes  # every box scores every class
    cls_pred = slim.conv2d(net, num_cls_pred, [3, 3], activation_fn=None,
                           scope='conv_cls')  # each output channel is one box's score for one class
    # Force NHWC layout
    cls_pred = custom_layers.channel_to_last(cls_pred)
    # [N, H, W, num_anchors * num_classes] -> [N, H, W, num_anchors, num_classes]
    cls_pred = tf.reshape(cls_pred,
                          tensor_shape(cls_pred, 4)[:-1]+[num_anchors, num_classes])
    return cls_pred, loc_pred

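Depending on the normalization parameter, the feature layer is L2-normalized (along the channel dimension C); the details are in the next subsection.

Then two convolutions are attached in parallel to the selected feature layer, one with num_anchors × 4 output channels and one with num_anchors × num_classes output channels.

The output of each convolution is reshaped to add one dimension, giving [N, H, W, num_anchors, 4] and [N, H, W, num_anchors, num_classes].

We can now read the return values of the network function: the per-class probabilities of each box from every selected layer, the box coordinate corrections from every selected layer, the raw class outputs (logits) from every selected layer, and the end_points of all intermediate layers.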
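The reshape step and the softmax over the class axis can be reproduced with NumPy alone, which avoids needing a TensorFlow 1.x environment. This is a sketch of the tensor bookkeeping only: the random arrays stand in for the two convolution outputs, and split_anchor_axis and softmax are helper names invented for the example.

```python
import numpy as np


def split_anchor_axis(raw, num_anchors, last_dim):
    """[N, H, W, num_anchors * last_dim] -> [N, H, W, num_anchors, last_dim]."""
    return raw.reshape(raw.shape[:-1] + (num_anchors, last_dim))


def softmax(x):
    """Numerically stable softmax over the last (class) axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


num_anchors, num_classes = 4, 21  # values for the block4 layer of SSD300
rng = np.random.default_rng(0)
# Stand-ins for the outputs of conv_loc and conv_cls on a 38x38 feature map.
raw_loc = rng.normal(size=(2, 38, 38, num_anchors * 4))
raw_cls = rng.normal(size=(2, 38, 38, num_anchors * num_classes))

loc = split_anchor_axis(raw_loc, num_anchors, 4)
logits = split_anchor_axis(raw_cls, num_anchors, num_classes)
probs = softmax(logits)
print(loc.shape, logits.shape, probs.shape)
```

After the reshape, loc[n, h, w, a] holds the four coordinate offsets of anchor a at cell (h, w), and probs[n, h, w, a] sums to 1 across the classes.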

c. custom_layers.l2_normalization: L2 normalization of the feature layer

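It first normalizes along the channel dimension of the feature layer (see nn.l2_normalize), then takes one learnable scale factor per channel to rescale each channel, and finally returns the rescaled features.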

@add_arg_scope
def l2_normalization(
        inputs,
        scaling=False,
        scale_initializer=init_ops.ones_initializer(),
        reuse=None,
        variables_collections=None,
        outputs_collections=None,
        data_format='NHWC',
        trainable=True,
        scope=None):
    """Implement L2 normalization on every feature (i.e. spatial normalization).

    Should be extended in some near future to other dimensions, providing a more
    flexible normalization framework.

    Args:
      inputs: a 4-D tensor with dimensions [batch_size, height, width, channels].
      scaling: whether or not to add a post scaling operation along the dimensions
        which have been normalized.
      scale_initializer: An initializer for the weights.
      reuse: whether or not the layer and its variables should be reused. To be
        able to reuse the layer scope must be given.
      variables_collections: optional list of collections for all the variables or
        a dictionary containing a different list of collection per variable.
      outputs_collections: collection to add the outputs.
      data_format:  NHWC or NCHW data format.
      trainable: If `True` also add variables to the graph collection
        `GraphKeys.TRAINABLE_VARIABLES` (see tf.Variable).
      scope: Optional scope for `variable_scope`.
    Returns:
      A `Tensor` representing the output of the operation.
    """

    with variable_scope.variable_scope(
            scope, 'L2Normalization', [inputs], reuse=reuse) as sc:
        inputs_shape = inputs.get_shape()
        inputs_rank = inputs_shape.ndims
        dtype = inputs.dtype.base_dtype

        # L2-normalize along the channel axis C
        if data_format == 'NHWC':
            # norm_dim = tf.range(1, inputs_rank-1)
            norm_dim = tf.range(inputs_rank-1, inputs_rank)
            params_shape = inputs_shape[-1:]
        elif data_format == 'NCHW':
            # norm_dim = tf.range(2, inputs_rank)
            norm_dim = tf.range(1, 2)
            params_shape = (inputs_shape[1])

        # Normalize along the dimension selected above (the channel axis).
        outputs = nn.l2_normalize(inputs, norm_dim, epsilon=1e-12)
        # Additional scaling.
        if scaling:
            # Get the variable collections
            scale_collections = utils.get_variable_collections(
                variables_collections, 'scale')
            # Create the variable, shape = number of channels C
            scale = variables.model_variable('gamma',
                                             shape=params_shape,
                                             dtype=dtype,
                                             initializer=scale_initializer,
                                             collections=scale_collections,
                                             trainable=trainable)
            if data_format == 'NHWC':
                outputs = tf.multiply(outputs, scale)
            elif data_format == 'NCHW':
                scale = tf.expand_dims(scale, axis=-1)
                scale = tf.expand_dims(scale, axis=-1)
                outputs = tf.multiply(outputs, scale)
                # outputs = tf.transpose(outputs, perm=(0, 2, 3, 1))

        # Give outputs an alias, add it to the collection, and return the original node
        return utils.collect_named_outputs(outputs_collections,
                                           sc.original_name_scope, outputs)
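The numerical effect of this layer for NHWC input can be checked with a few lines of NumPy. The sketch below mirrors the two steps (unit-normalize each pixel's channel vector, then multiply by a per-channel scale) but is not the project's implementation: the function name is made up, and gamma is set to a constant 20 purely for illustration, whereas in the layer above it is a trainable variable.

```python
import numpy as np


def l2_normalize_channels(x, gamma, eps=1e-12):
    """L2-normalize an NHWC array along C, then scale each channel by gamma."""
    sq_sum = np.sum(np.square(x), axis=-1, keepdims=True)
    # Same form as tf.nn.l2_normalize: x / sqrt(max(sum(x**2), epsilon))
    normalized = x / np.sqrt(np.maximum(sq_sum, eps))
    return normalized * gamma  # gamma has shape [C] and broadcasts over N, H, W


rng = np.random.default_rng(0)
features = rng.normal(size=(2, 4, 4, 8))  # small stand-in for a feature map
gamma = np.full(8, 20.0)                  # illustrative constant scale
out = l2_normalize_channels(features, gamma)
print(np.linalg.norm(out, axis=-1)[0, 0])
```

With a uniform gamma, every pixel's channel vector ends up with norm equal to that scale, so feature layers with very different activation magnitudes are brought to a comparable range before the prediction convolutions.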

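That completes the introduction to the network structure. The next part focuses on one of the key techniques of object detection models, the generation of anchor boxes, and ties it to this part to understand how the whole SSD network is built.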

Appendix: Related Implementations

custom_layers.channel_to_last: conversion to NHWC

@add_arg_scope  # lets slim.arg_scope set this layer's arguments
def channel_to_last(inputs,
                    data_format='NHWC',
                    scope=None):
    """Move the channel axis to the last dimension. Allows to
    provide a single output format whatever the input data format.

    Args:
      inputs: Input Tensor;
      data_format: NHWC or NCHW.
    Return:
      Input in NHWC format.
    """
    with tf.name_scope(scope, 'channel_to_last', [inputs]):
        if data_format == 'NHWC':
            net = inputs
        elif data_format == 'NCHW':
            net = tf.transpose(inputs, perm=(0, 2, 3, 1))
        return net
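The NCHW branch is a single axis permutation. A NumPy equivalent, given here only to show what perm=(0, 2, 3, 1) does to the indices, with array sizes chosen arbitrarily:

```python
import numpy as np

# A small NCHW array: batch 2, 3 channels, height 4, width 5.
nchw = np.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)

# Same permutation as tf.transpose(inputs, perm=(0, 2, 3, 1)).
nhwc = np.transpose(nchw, (0, 2, 3, 1))

print(nchw.shape, "->", nhwc.shape)
# The element at (n, c, h, w) moves to (n, h, w, c).
print(nchw[1, 2, 3, 4] == nhwc[1, 3, 4, 2])
```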

custom_layers.pad2d: padding a 2D tensor

@add_arg_scope  # lets slim.arg_scope set this layer's arguments
def pad2d(inputs,
          pad=(0, 0),
          mode='CONSTANT',
          data_format='NHWC',
          trainable=True,
          scope=None):
    """2D Padding layer, adding a symmetric padding to H and W dimensions.

    Aims to mimic padding in Caffe and MXNet, helping the port of models to
    TensorFlow. Tries to follow the naming convention of `tf.contrib.layers`.

    Args:
      inputs: 4D input Tensor;
      pad: 2-Tuple with padding values for H and W dimensions (the padding width);
      mode: Padding mode. C.f. `tf.pad`
      data_format:  NHWC or NCHW data format.
    """
    with tf.name_scope(scope, 'pad2d', [inputs]):
        # Padding shape.
        if data_format == 'NHWC':
            paddings = [[0, 0], [pad[0], pad[0]], [pad[1], pad[1]], [0, 0]]
        elif data_format == 'NCHW':
            paddings = [[0, 0], [0, 0], [pad[0], pad[0]], [pad[1], pad[1]]]
        net = tf.pad(inputs, paddings, mode=mode)
        return net
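The same padding list works with np.pad, which makes it easy to see why block8 pads before its stride-2 'VALID' convolution. A minimal sketch for the NHWC case, where pad2d_nhwc is a name invented for this example and the 19x19x256 input matches the size entering block8's 3x3 convolution:

```python
import numpy as np


def pad2d_nhwc(x, pad=(0, 0)):
    """Symmetric zero padding on H and W of an NHWC array."""
    paddings = [(0, 0), (pad[0], pad[0]), (pad[1], pad[1]), (0, 0)]
    return np.pad(x, paddings, mode="constant")


x = np.ones((1, 19, 19, 256))
y = pad2d_nhwc(x, pad=(1, 1))
print(x.shape, "->", y.shape)

# A 3x3 'VALID' convolution with stride 2 on the padded map gives
# floor((21 - 3) / 2) + 1 output positions per side.
out_size = (y.shape[1] - 3) // 2 + 1
print(out_size)
```

Padding explicitly and then using 'VALID' reproduces Caffe-style symmetric padding, which is the stated purpose of pad2d.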

slim's vgg_16

def vgg_16(inputs,
           num_classes=1000,
           is_training=True,
           dropout_keep_prob=0.5,
           spatial_squeeze=True,
           scope='vgg_16'):
  """Oxford Net VGG 16-Layers version D Example.
  Note: All the fully_connected layers have been transformed to conv2d layers.
        To use in classification mode, resize input to 224x224.
  Args:
    inputs: a tensor of size [batch_size, height, width, channels].
    num_classes: number of predicted classes.
    is_training: whether or not the model is being trained.
    dropout_keep_prob: the probability that activations are kept in the dropout
      layers during training.
    spatial_squeeze: whether or not should squeeze the spatial dimensions of the
      outputs. Useful to remove unnecessary dimensions for classification.
    scope: Optional scope for the variables.
  Returns:
    the last op containing the log predictions and end_points dict.
  """
  with variable_scope.variable_scope(scope, 'vgg_16', [inputs]) as sc:
    end_points_collection = sc.original_name_scope + '_end_points'
    # Collect outputs for conv2d, fully_connected and max_pool2d.
    with arg_scope(
        [layers.conv2d, layers_lib.fully_connected, layers_lib.max_pool2d],
        outputs_collections=end_points_collection):
      net = layers_lib.repeat(
          inputs, 2, layers.conv2d, 64, [3, 3], scope='conv1')
      net = layers_lib.max_pool2d(net, [2, 2], scope='pool1')
      net = layers_lib.repeat(net, 2, layers.conv2d, 128, [3, 3], scope='conv2')
      net = layers_lib.max_pool2d(net, [2, 2], scope='pool2')
      net = layers_lib.repeat(net, 3, layers.conv2d, 256, [3, 3], scope='conv3')
      net = layers_lib.max_pool2d(net, [2, 2], scope='pool3')
      net = layers_lib.repeat(net, 3, layers.conv2d, 512, [3, 3], scope='conv4')
      net = layers_lib.max_pool2d(net, [2, 2], scope='pool4')
      net = layers_lib.repeat(net, 3, layers.conv2d, 512, [3, 3], scope='conv5')
      net = layers_lib.max_pool2d(net, [2, 2], scope='pool5')
      # Use conv2d instead of fully_connected layers.
      net = layers.conv2d(net, 4096, [7, 7], padding='VALID', scope='fc6')
      net = layers_lib.dropout(
          net, dropout_keep_prob, is_training=is_training, scope='dropout6')
      net = layers.conv2d(net, 4096, [1, 1], scope='fc7')
      net = layers_lib.dropout(
          net, dropout_keep_prob, is_training=is_training, scope='dropout7')
      net = layers.conv2d(
          net,
          num_classes, [1, 1],
          activation_fn=None,
          normalizer_fn=None,
          scope='fc8')
      # Convert end_points_collection into a end_point dict.
      end_points = utils.convert_collection_to_dict(end_points_collection)
      if spatial_squeeze:
        net = array_ops.squeeze(net, [1, 2], name='fc8/squeezed')
        end_points[sc.name + '/fc8'] = net
      return net, end_points


vgg_16.default_image_size = 224