Fork project address: SSD
Adapted from the Jizhi (集智) column.
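To recognize objects on top of a classifier, the essence is to scan the whole image with the classifier and locate where the features are. The key question is which scanning algorithm to use. For example, the image can be divided into a grid and the classifier run cell by cell, but this approach has several problems:
Problem 1: if the target sits right on the boundary between two cells, the classifier response is not significant enough in either cell, causing a miss (false negative).
Problem 2: if the target is too large or too small, the response within a cell is not significant enough, again causing a miss.
For the first problem, overlapping cells can be used. For example, with a 32x32-pixel cell, the window moves only 8 pixels at a time, so it takes four steps to move completely off the previous cell. For the second problem, cells of different sizes can be combined: after scanning with 32x32 cells, scan again with 64x64 cells and once more with 16x16 cells.
This brings another problem, however: to preserve accuracy, the same image is scanned too many times, which severely slows computation and makes real-time annotation impossible with this strategy.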
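The cost of the overlapping, multi-scale scan can be made concrete with a small counting sketch. The 256x256 image size and the rule that every window moves by a quarter of its own side (8 pixels for a 32x32 window, as in the text) are assumptions chosen for illustration; the function names are hypothetical.

```python
def sliding_windows(image_size, window, stride):
    """Yield (top, left, window) for every window position that fits in the image."""
    height, width = image_size
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            yield top, left, window


def count_windows(image_size, windows, stride_fraction=4):
    """Count classifier evaluations needed for the given window sizes.

    stride_fraction=4 moves each window by a quarter of its side
    (an assumption that matches the 32x32 / 8-pixel example).
    stride_fraction=1 gives a plain non-overlapping grid.
    """
    total = 0
    for window in windows:
        stride = max(1, window // stride_fraction)
        total += sum(1 for _ in sliding_windows(image_size, window, stride))
    return total


image = (256, 256)  # hypothetical image size
non_overlapping = count_windows(image, [32], stride_fraction=1)
overlapping = count_windows(image, [32])
multi_scale = count_windows(image, [16, 32, 64])

print(non_overlapping, overlapping, multi_scale)
```

Under these assumptions the plain grid needs 64 classifier calls, the overlapping 32x32 scan needs 841, and the three-scale scan needs 4731, which is why this strategy struggles to run in real time.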
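To annotate image features quickly and in real time, many improvements to the overall detection and localization algorithm have been proposed.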
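A basic idea is to exploit the internal structure of a convolutional neural network to avoid repeated computation. When a CNN scans an image, the convolution results already store information about grids of different sizes, so the process effectively already implements the improvements proposed in the previous part. As the accompanying figure shows, the outputs of the first few convolutional layers focus on detail, while the outputs of later layers focus on the whole:
For Problem 1, if an object lies between two cells, neither side may respond strongly, but if the basic features from both sides can be combined sensibly, there is no need to scan again. The later layers attend more and more to the whole, so for Problem 2, even if the target is too large or too small, its features are still retained. In other words, because a deep network has several convolutional layers, scanning an image with it already amounts to scanning the image repeatedly, and both problems can be solved by making good use of the CNN's intermediate results.
Before SSD, MultiBox and Fast R-CNN both used a two-step strategy: first, a deep neural network localizes potential target objects, i.e. generates boxes; how to classify the object inside each box is computed in a second step. The first-generation YOLO algorithm does classification and localization in a single step, but its architecture uses fully connected layers, which have many drawbacks and are gradually being "abandoned" by deep network architectures.
Back to the project. Taking VGG300 (/nets/ssd_vgg_300.py) as an example, the general idea is to use the first five blocks of the VGG network, add a few extra layers, and finally take the convolved outputs of several of these layers to run the grid search for target features. In the function this maps to three main parts: the original network structure, the added network structure, and the SSD processing structure: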
def ssd_net(inputs,
            num_classes=SSDNet.default_params.num_classes,
            feat_layers=SSDNet.default_params.feat_layers,
            anchor_sizes=SSDNet.default_params.anchor_sizes,
            anchor_ratios=SSDNet.default_params.anchor_ratios,
            normalizations=SSDNet.default_params.normalizations,
            is_training=True,
            dropout_keep_prob=0.5,
            prediction_fn=slim.softmax,
            reuse=None,
            scope='ssd_300_vgg'):
    """SSD net definition.
    """
    # if data_format == 'NCHW':
    #     inputs = tf.transpose(inputs, perm=(0, 3, 1, 2))

    # End_points collect relevant activations for external use.
    """
    For comparison, the corresponding layers of the stock VGG-16:

    net = layers_lib.repeat(
        inputs, 2, layers.conv2d, 64, [3, 3], scope='conv1')
    net = layers_lib.max_pool2d(net, [2, 2], scope='pool1')
    net = layers_lib.repeat(net, 2, layers.conv2d, 128, [3, 3], scope='conv2')
    net = layers_lib.max_pool2d(net, [2, 2], scope='pool2')
    net = layers_lib.repeat(net, 3, layers.conv2d, 256, [3, 3], scope='conv3')
    net = layers_lib.max_pool2d(net, [2, 2], scope='pool3')
    net = layers_lib.repeat(net, 3, layers.conv2d, 512, [3, 3], scope='conv4')
    net = layers_lib.max_pool2d(net, [2, 2], scope='pool4')
    net = layers_lib.repeat(net, 3, layers.conv2d, 512, [3, 3], scope='conv5')
    net = layers_lib.max_pool2d(net, [2, 2], scope='pool5')
    """
    end_points = {}
    with tf.variable_scope(scope, 'ssd_300_vgg', [inputs], reuse=reuse):
        ######################################################
        # First five blocks: copy the VGG16 architecture.    #
        # Note that end_points records intermediate results. #
        ######################################################
        # ---------------- Original VGG-16 blocks. ----------------
        # Block 1.
        net = slim.repeat(inputs, 2, slim.conv2d, 64, [3, 3], scope='conv1')
        end_points['block1'] = net
        net = slim.max_pool2d(net, [2, 2], scope='pool1')
        # Block 2.
        net = slim.repeat(net, 2, slim.conv2d, 128, [3, 3], scope='conv2')
        end_points['block2'] = net
        net = slim.max_pool2d(net, [2, 2], scope='pool2')
        # Block 3.
        net = slim.repeat(net, 3, slim.conv2d, 256, [3, 3], scope='conv3')
        end_points['block3'] = net
        net = slim.max_pool2d(net, [2, 2], scope='pool3')
        # Block 4.
        net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], scope='conv4')
        end_points['block4'] = net
        net = slim.max_pool2d(net, [2, 2], scope='pool4')
        # Block 5.
        net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], scope='conv5')
        end_points['block5'] = net
        # pool5 differs from VGG16: kernel 2x2 -> 3x3 and stride 2 -> 1,
        # so the spatial size is not halved here.
        net = slim.max_pool2d(net, [3, 3], stride=1, scope='pool5')

        ######################################################
        # Last six blocks: additional convolutional layers.  #
        ######################################################
        # ---------------- Additional SSD blocks. ----------------
        # Block 6: let's dilate the hell out of it!
        net = slim.conv2d(net, 1024, [3, 3], rate=6, scope='conv6')
        end_points['block6'] = net
        # Note: `rate` is the drop probability; it coincides with the keep
        # probability only at the default value 0.5.
        net = tf.layers.dropout(net, rate=dropout_keep_prob, training=is_training)
        # Block 7: 1x1 conv. Because the fuck.
        net = slim.conv2d(net, 1024, [1, 1], scope='conv7')
        end_points['block7'] = net
        net = tf.layers.dropout(net, rate=dropout_keep_prob, training=is_training)

        # Block 8/9/10/11: 1x1 and 3x3 convolutions stride 2 (except lasts).
        end_point = 'block8'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 256, [1, 1], scope='conv1x1')
            net = custom_layers.pad2d(net, pad=(1, 1))
            net = slim.conv2d(net, 512, [3, 3], stride=2, scope='conv3x3', padding='VALID')
        end_points[end_point] = net
        end_point = 'block9'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
            net = custom_layers.pad2d(net, pad=(1, 1))
            net = slim.conv2d(net, 256, [3, 3], stride=2, scope='conv3x3', padding='VALID')
        end_points[end_point] = net
        end_point = 'block10'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
            net = slim.conv2d(net, 256, [3, 3], scope='conv3x3', padding='VALID')
        end_points[end_point] = net
        end_point = 'block11'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
            net = slim.conv2d(net, 256, [3, 3], scope='conv3x3', padding='VALID')
        end_points[end_point] = net

        ##########################################################
        # Each selected end_points entry yields predictions.     #
        # Per-layer predictions are collected into lists and     #
        # returned to the optimization (loss) function.          #
        ##########################################################
        # Prediction and localisations layers.
        predictions = []
        logits = []
        localisations = []
        # feat_layers=['block4', 'block7', 'block8', 'block9', 'block10', 'block11']
        for i, layer in enumerate(feat_layers):
            with tf.variable_scope(layer + '_box'):
                p, l = ssd_multibox_layer(end_points[layer],
                                          num_classes,
                                          anchor_sizes[i],
                                          anchor_ratios[i],
                                          normalizations[i])
                """
                The number of boxes per location equals
                len(anchor_sizes[i]) + len(anchor_ratios[i]).
                anchor_sizes=[(21., 45.), (45., 99.), (99., 153.),
                              (153., 207.), (207., 261.), (261., 315.)]
                anchor_ratios=[[2, .5], [2, .5, 3, 1./3], [2, .5, 3, 1./3],
                               [2, .5, 3, 1./3], [2, .5], [2, .5]]
                normalizations=[20, -1, -1, -1, -1, -1]
                """
            predictions.append(prediction_fn(p))  # prediction_fn=slim.softmax
            logits.append(p)
            localisations.append(l)

        return predictions, localisations, logits, end_points
ssd_net.default_image_size = 300
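After the network function, the script defines ssd_arg_scope, which constrains the default layer arguments (hyperparameter settings) used in the network. The script header already gives its usage: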
Usage:
    with slim.arg_scope(ssd_vgg.ssd_vgg()):
        outputs, end_points = ssd_vgg.ssd_vgg(inputs)
def ssd_arg_scope(weight_decay=0.0005, data_format='NHWC'):
    """Defines the VGG arg scope.

    Args:
      weight_decay: The l2 regularization coefficient.

    Returns:
      An arg_scope.
    """
    with slim.arg_scope([slim.conv2d, slim.fully_connected],
                        activation_fn=tf.nn.relu,
                        weights_regularizer=slim.l2_regularizer(weight_decay),
                        weights_initializer=tf.contrib.layers.xavier_initializer(),
                        biases_initializer=tf.zeros_initializer()):
        with slim.arg_scope([slim.conv2d, slim.max_pool2d],
                            padding='SAME',
                            data_format=data_format):
            with slim.arg_scope([custom_layers.pad2d,
                                 custom_layers.l2_normalization,
                                 custom_layers.channel_to_last],
                                data_format=data_format) as sc:
                return sc
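In the original program the hyperparameters are actually given as a class attribute. We are not concerned with the rest of the class here, so only the part containing the hyperparameter settings is extracted, to help understand the network above: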
SSDParams = namedtuple('SSDParameters', ['img_shape',
                                         'num_classes',
                                         'no_annotation_label',
                                         'feat_layers',
                                         'feat_shapes',
                                         'anchor_size_bounds',
                                         'anchor_sizes',
                                         'anchor_ratios',
                                         'anchor_steps',
                                         'anchor_offset',
                                         'normalizations',
                                         'prior_scaling'
                                         ])


class SSDNet(object):
    default_params = SSDParams(
        img_shape=(300, 300),
        num_classes=21,
        no_annotation_label=21,
        feat_layers=['block4', 'block7', 'block8', 'block9', 'block10', 'block11'],
        feat_shapes=[(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)],
        anchor_size_bounds=[0.15, 0.90],
        # anchor_size_bounds=[0.20, 0.90],
        anchor_sizes=[(21., 45.),
                      (45., 99.),
                      (99., 153.),
                      (153., 207.),
                      (207., 261.),
                      (261., 315.)],
        anchor_ratios=[[2, .5],
                       [2, .5, 3, 1./3],
                       [2, .5, 3, 1./3],
                       [2, .5, 3, 1./3],
                       [2, .5],
                       [2, .5]],
        anchor_steps=[8, 16, 32, 64, 100, 300],
        anchor_offset=0.5,
        # A value > 0 makes the SSD layer L2-normalize that feature layer
        # (along the channel axis) before the prediction convolutions.
        normalizations=[1, -1, -1, -1, -1, -1],
        prior_scaling=[0.1, 0.1, 0.2, 0.2]
    )
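The feat_shapes values can be cross-checked against the layer definitions in ssd_net. The sketch below traces only the spatial size through the network; it assumes the usual TensorFlow output-size rules (ceil(size / stride) for 'SAME', floor((size - kernel) / stride) + 1 for 'VALID'), and the helper names are hypothetical.

```python
import math


def same_out(size, stride):
    """Output size of a 'SAME'-padded conv/pool layer."""
    return math.ceil(size / stride)


def valid_out(size, kernel, stride=1):
    """Output size of a 'VALID' (unpadded) conv layer."""
    return (size - kernel) // stride + 1


def ssd300_feature_sizes(input_size=300):
    """Trace the spatial size of each SSD feature layer."""
    sizes = {}
    s = input_size
    s = same_out(s, 2)           # pool1: 300 -> 150
    s = same_out(s, 2)           # pool2: 150 -> 75
    s = same_out(s, 2)           # pool3: 75 -> 38 (rounded up)
    sizes['block4'] = s          # conv4 keeps the size
    s = same_out(s, 2)           # pool4: 38 -> 19
    # conv5, pool5 (3x3, stride 1), conv6 and conv7 all keep the size.
    sizes['block7'] = s
    s = valid_out(s + 2, 3, 2)   # block8: pad2d(1, 1), then 3x3 stride 2 VALID
    sizes['block8'] = s
    s = valid_out(s + 2, 3, 2)   # block9: same pattern
    sizes['block9'] = s
    s = valid_out(s, 3)          # block10: 3x3 VALID, no padding
    sizes['block10'] = s
    s = valid_out(s, 3)          # block11: 3x3 VALID, no padding
    sizes['block11'] = s
    return sizes


feature_sizes = ssd300_feature_sizes()
print(feature_sizes)
```

For a 300x300 input this reproduces 38, 19, 10, 5, 3 and 1, the same values listed in feat_shapes.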
# Prediction and localisations layers.
predictions = []
logits = []
localisations = []
# feat_layers=['block4', 'block7', 'block8', 'block9', 'block10', 'block11']
for i, layer in enumerate(feat_layers):
    with tf.variable_scope(layer + '_box'):
        p, l = ssd_multibox_layer(end_points[layer],  # <----- SSD processing
                                  num_classes,
                                  anchor_sizes[i],
                                  anchor_ratios[i],
                                  normalizations[i])
    predictions.append(prediction_fn(p))  # prediction_fn=slim.softmax
    logits.append(p)
    localisations.append(l)

return predictions, localisations, logits, end_points
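At the end of the network architecture, new convolutions are attached to the selected feature layers (the code above). The processing function is as follows: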
def tensor_shape(x, rank=3):
    """Returns the dimensions of a tensor.

    Args:
      x: A N-D Tensor of shape.

    Returns:
      A list of dimensions. Dimensions that are statically known are python
      integers, otherwise they are integer scalar tensors.
    """
    if x.get_shape().is_fully_defined():
        return x.get_shape().as_list()
    else:
        # with_rank behaves like an assert: it checks that the rank
        # equals the given value before returning the static shape.
        static_shape = x.get_shape().with_rank(rank).as_list()
        # tf.shape returns a tensor; `num` is the length of the unstacked
        # axis (axis defaults to 0).
        dynamic_shape = tf.unstack(tf.shape(x), num=rank)
        # A list: python ints where statically known, tensors otherwise.
        return [s if s is not None else d
                for s, d in zip(static_shape, dynamic_shape)]


def ssd_multibox_layer(inputs,
                       num_classes,
                       sizes,
                       ratios=[1],
                       normalization=-1,
                       bn_normalization=False):
    """Construct a multibox layer, return a class and localization predictions.
    """
    net = inputs
    if normalization > 0:
        net = custom_layers.l2_normalization(net, scaling=True)
    # Number of anchors.
    num_anchors = len(sizes) + len(ratios)

    # Location.
    num_loc_pred = num_anchors * 4  # each box has four coordinates
    # The output channels hold the coordinates of the different boxes.
    loc_pred = slim.conv2d(net, num_loc_pred, [3, 3], activation_fn=None,
                           scope='conv_loc')
    # Force NHWC layout.
    loc_pred = custom_layers.channel_to_last(loc_pred)
    # [N, H, W, num_anchors * 4] -> [N, H, W, num_anchors, 4]
    loc_pred = tf.reshape(loc_pred,
                          tensor_shape(loc_pred, 4)[:-1] + [num_anchors, 4])

    # Class prediction.
    num_cls_pred = num_anchors * num_classes  # every box scores every class
    # The output channels hold each box's prediction for each class.
    cls_pred = slim.conv2d(net, num_cls_pred, [3, 3], activation_fn=None,
                           scope='conv_cls')
    # Force NHWC layout.
    cls_pred = custom_layers.channel_to_last(cls_pred)
    # [N, H, W, num_anchors * num_classes] -> [N, H, W, num_anchors, num_classes]
    cls_pred = tf.reshape(cls_pred,
                          tensor_shape(cls_pred, 4)[:-1] + [num_anchors, num_classes])
    return cls_pred, loc_pred
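Depending on the normalization parameter, the feature layer is first L2-normalized (along the channel axis C); the detailed procedure is described in the next section.
Then two convolutions are attached in parallel after the selected feature layer, one with num_anchors×4 output channels and one with num_anchors×num_classes output channels.
The channel axis of each convolution output is split into two axes, giving the shapes [N, H, W, num_anchors, 4] and [N, H, W, num_anchors, num_classes].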
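How the channel axis is split can be illustrated without TensorFlow. Since tf.reshape keeps row-major order, channel c of the localization output ends up at anchor c // 4, coordinate c % 4. The following is a minimal pure-Python sketch of that index arithmetic for a single (n, h, w) position; the function names are hypothetical.

```python
def split_channel(channel, values_per_anchor):
    """Map a flat channel index to (anchor index, index within that anchor)."""
    return divmod(channel, values_per_anchor)


def split_last_axis(flat, num_anchors, values_per_anchor):
    """Row-major reshape of one position's channels to [num_anchors, values_per_anchor]."""
    if len(flat) != num_anchors * values_per_anchor:
        raise ValueError('channel count must equal num_anchors * values_per_anchor')
    return [flat[a * values_per_anchor:(a + 1) * values_per_anchor]
            for a in range(num_anchors)]


num_anchors = 4                                  # e.g. len(sizes) + len(ratios) for block4
loc_channels = list(range(num_anchors * 4))      # stand-in for the 16 conv_loc channels
loc_boxes = split_last_axis(loc_channels, num_anchors, 4)

print(loc_boxes[1])          # the four channels that belong to anchor 1
print(split_channel(6, 4))   # where channel 6 ends up
```

The same mapping applies to the class branch with values_per_anchor set to num_classes.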
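Now the meaning of the network function's return values is clear: the per-class probabilities of the boxes output by each selected layer after SSD processing, the box coordinate corrections output by each selected layer, the raw class outputs (logits) of the boxes output by each selected layer, and the end_points of all intermediate layers.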
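To get a feel for the size of these outputs, the number of predicted boxes can be counted from the default parameters listed earlier (feat_shapes, anchor_sizes, anchor_ratios, num_classes=21), using the num_anchors = len(sizes) + len(ratios) rule from ssd_multibox_layer. This is only a counting sketch; the helper name is hypothetical.

```python
feat_shapes = [(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)]
anchor_sizes = [(21., 45.), (45., 99.), (99., 153.),
                (153., 207.), (207., 261.), (261., 315.)]
anchor_ratios = [[2, .5], [2, .5, 3, 1. / 3], [2, .5, 3, 1. / 3],
                 [2, .5, 3, 1. / 3], [2, .5], [2, .5]]
num_classes = 21


def boxes_per_layer(shapes, sizes, ratios):
    """Boxes predicted by each feature layer: H * W * num_anchors."""
    return [h * w * (len(s) + len(r))
            for (h, w), s, r in zip(shapes, sizes, ratios)]


per_layer = boxes_per_layer(feat_shapes, anchor_sizes, anchor_ratios)
total_boxes = sum(per_layer)
total_class_scores = total_boxes * num_classes   # entries across all class outputs
total_loc_values = total_boxes * 4               # entries across all localization outputs

print(per_layer, total_boxes)
```

With the default parameters this gives 8732 boxes per image, each with 21 class scores and 4 coordinate corrections.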
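First the feature layer is normalized along the channel axis (see nn.l2_normalize); then a scale factor is created for each channel and used to rescale it (the factors are learnable); finally the rescaled features are returned.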
@add_arg_scope
def l2_normalization(
        inputs,
        scaling=False,
        scale_initializer=init_ops.ones_initializer(),
        reuse=None,
        variables_collections=None,
        outputs_collections=None,
        data_format='NHWC',
        trainable=True,
        scope=None):
    """Implement L2 normalization on every feature (i.e. spatial normalization).

    Should be extended in some near future to other dimensions, providing a more
    flexible normalization framework.

    Args:
      inputs: a 4-D tensor with dimensions [batch_size, height, width, channels].
      scaling: whether or not to add a post scaling operation along the dimensions
        which have been normalized.
      scale_initializer: An initializer for the weights.
      reuse: whether or not the layer and its variables should be reused. To be
        able to reuse the layer scope must be given.
      variables_collections: optional list of collections for all the variables or
        a dictionary containing a different list of collection per variable.
      outputs_collections: collection to add the outputs.
      data_format: NHWC or NCHW data format.
      trainable: If `True` also add variables to the graph collection
        `GraphKeys.TRAINABLE_VARIABLES` (see tf.Variable).
      scope: Optional scope for `variable_scope`.

    Returns:
      A `Tensor` representing the output of the operation.
    """
    with variable_scope.variable_scope(
            scope, 'L2Normalization', [inputs], reuse=reuse) as sc:
        inputs_shape = inputs.get_shape()
        inputs_rank = inputs_shape.ndims
        dtype = inputs.dtype.base_dtype
        # L2-normalize along the channel axis C.
        if data_format == 'NHWC':
            # norm_dim = tf.range(1, inputs_rank-1)
            norm_dim = tf.range(inputs_rank-1, inputs_rank)
            params_shape = inputs_shape[-1:]
        elif data_format == 'NCHW':
            # norm_dim = tf.range(2, inputs_rank)
            norm_dim = tf.range(1, 2)
            params_shape = (inputs_shape[1])

        # Normalize along norm_dim (the channel axis selected above).
        outputs = nn.l2_normalize(inputs, norm_dim, epsilon=1e-12)
        # Additional scaling.
        if scaling:
            # Get the variable collections.
            scale_collections = utils.get_variable_collections(
                variables_collections, 'scale')
            # Create the scale variable, one entry per channel (shape = C).
            scale = variables.model_variable('gamma',
                                             shape=params_shape,
                                             dtype=dtype,
                                             initializer=scale_initializer,
                                             collections=scale_collections,
                                             trainable=trainable)
            if data_format == 'NHWC':
                outputs = tf.multiply(outputs, scale)
            elif data_format == 'NCHW':
                # Expand [C] to [C, 1, 1] so it broadcasts over H and W.
                scale = tf.expand_dims(scale, axis=-1)
                scale = tf.expand_dims(scale, axis=-1)
                outputs = tf.multiply(outputs, scale)
                # outputs = tf.transpose(outputs, perm=(0, 2, 3, 1))

        # Give outputs an alias, add it to the collection, and return the node.
        return utils.collect_named_outputs(outputs_collections,
                                           sc.original_name_scope, outputs)
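The arithmetic performed by this layer at a single (n, h, w) position can be checked by hand. The sketch below is plain Python, not the TensorFlow implementation; it assumes the formula documented for tf.nn.l2_normalize, x / sqrt(max(sum(x**2), epsilon)), followed by a per-channel multiplication by gamma. The gamma values used are hypothetical stand-ins for learned scales.

```python
import math


def l2_normalize_channels(pixel, gamma=None, epsilon=1e-12):
    """Normalize the channel vector of one spatial position, then rescale it.

    pixel: list of channel values at a single (n, h, w) position.
    gamma: optional per-channel scale factors (learnable in the real layer).
    """
    squared_sum = sum(v * v for v in pixel)
    inv_norm = 1.0 / math.sqrt(max(squared_sum, epsilon))
    outputs = [v * inv_norm for v in pixel]
    if gamma is not None:
        outputs = [o * g for o, g in zip(outputs, gamma)]
    return outputs


unit = l2_normalize_channels([3.0, 4.0])                      # approximately [0.6, 0.8]
scaled = l2_normalize_channels([3.0, 4.0], gamma=[2.0, 0.5])  # approximately [1.2, 0.4]
zeros = l2_normalize_channels([0.0, 0.0])                     # epsilon avoids division by zero

print(unit, scaled, zeros)
```

After normalization every position has a channel vector of unit length, so gamma alone decides the magnitude of the features that the prediction convolutions see.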
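This completes the introduction of the network structure. The next section focuses on one of the key techniques of object detection models, the generation of anchor boxes, and ties it to this section to explain how the whole SSD network is built.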
@add_arg_scope  # lets slim.arg_scope set this layer's default arguments
def channel_to_last(inputs,
                    data_format='NHWC',
                    scope=None):
    """Move the channel axis to the last dimension. Allows to
    provide a single output format whatever the input data format.

    Args:
      inputs: Input Tensor;
      data_format: NHWC or NCHW.

    Return:
      Input in NHWC format.
    """
    with tf.name_scope(scope, 'channel_to_last', [inputs]):
        if data_format == 'NHWC':
            net = inputs
        elif data_format == 'NCHW':
            net = tf.transpose(inputs, perm=(0, 2, 3, 1))
        return net
@add_arg_scope  # lets slim.arg_scope set this layer's default arguments
def pad2d(inputs,
          pad=(0, 0),
          mode='CONSTANT',
          data_format='NHWC',
          trainable=True,
          scope=None):
    """2D Padding layer, adding a symmetric padding to H and W dimensions.

    Aims to mimic padding in Caffe and MXNet, helping the port of models to
    TensorFlow. Tries to follow the naming convention of `tf.contrib.layers`.

    Args:
      inputs: 4D input Tensor;
      pad: 2-Tuple with padding values for H and W dimensions (the padding width);
      mode: Padding mode. C.f. `tf.pad`
      data_format: NHWC or NCHW data format.
    """
    with tf.name_scope(scope, 'pad2d', [inputs]):
        # Padding shape.
        if data_format == 'NHWC':
            paddings = [[0, 0], [pad[0], pad[0]], [pad[1], pad[1]], [0, 0]]
        elif data_format == 'NCHW':
            paddings = [[0, 0], [0, 0], [pad[0], pad[0]], [pad[1], pad[1]]]
        net = tf.pad(inputs, paddings, mode=mode)
        return net
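Why blocks 8 and 9 use pad2d followed by a 'VALID' convolution instead of padding='SAME' can be seen from the padding arithmetic. The sketch below assumes TensorFlow's documented 'SAME' rule (total padding max((out - 1) * stride + kernel - size, 0), with the extra pixel placed at the end) and compares it with explicit symmetric padding; the function names are hypothetical.

```python
import math


def same_padding(size, kernel, stride):
    """(pad_before, pad_after, output_size) under the 'SAME' rule."""
    out = math.ceil(size / stride)
    total = max((out - 1) * stride + kernel - size, 0)
    before = total // 2
    return before, total - before, out


def explicit_padding(size, kernel, stride, pad):
    """(pad_before, pad_after, output_size) for symmetric pad followed by 'VALID'."""
    out = (size + 2 * pad - kernel) // stride + 1
    return pad, pad, out


# block8 input is 19 wide, block9 input is 10 wide; both use 3x3, stride 2.
same_19 = same_padding(19, 3, 2)
explicit_19 = explicit_padding(19, 3, 2, pad=1)
same_10 = same_padding(10, 3, 2)
explicit_10 = explicit_padding(10, 3, 2, pad=1)

print(same_19, explicit_19)
print(same_10, explicit_10)
```

Both schemes produce the same output sizes (10 and 5), but for the 10-wide input 'SAME' pads only on one side while pad2d pads one pixel on each side, which is the symmetric Caffe-style behavior the docstring refers to.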
def vgg_16(inputs,
           num_classes=1000,
           is_training=True,
           dropout_keep_prob=0.5,
           spatial_squeeze=True,
           scope='vgg_16'):
    """Oxford Net VGG 16-Layers version D Example.

    Note: All the fully_connected layers have been transformed to conv2d layers.
          To use in classification mode, resize input to 224x224.

    Args:
      inputs: a tensor of size [batch_size, height, width, channels].
      num_classes: number of predicted classes.
      is_training: whether or not the model is being trained.
      dropout_keep_prob: the probability that activations are kept in the dropout
        layers during training.
      spatial_squeeze: whether or not should squeeze the spatial dimensions of the
        outputs. Useful to remove unnecessary dimensions for classification.
      scope: Optional scope for the variables.

    Returns:
      the last op containing the log predictions and end_points dict.
    """
    with variable_scope.variable_scope(scope, 'vgg_16', [inputs]) as sc:
        end_points_collection = sc.original_name_scope + '_end_points'
        # Collect outputs for conv2d, fully_connected and max_pool2d.
        with arg_scope(
                [layers.conv2d, layers_lib.fully_connected, layers_lib.max_pool2d],
                outputs_collections=end_points_collection):
            net = layers_lib.repeat(
                inputs, 2, layers.conv2d, 64, [3, 3], scope='conv1')
            net = layers_lib.max_pool2d(net, [2, 2], scope='pool1')
            net = layers_lib.repeat(net, 2, layers.conv2d, 128, [3, 3], scope='conv2')
            net = layers_lib.max_pool2d(net, [2, 2], scope='pool2')
            net = layers_lib.repeat(net, 3, layers.conv2d, 256, [3, 3], scope='conv3')
            net = layers_lib.max_pool2d(net, [2, 2], scope='pool3')
            net = layers_lib.repeat(net, 3, layers.conv2d, 512, [3, 3], scope='conv4')
            net = layers_lib.max_pool2d(net, [2, 2], scope='pool4')
            net = layers_lib.repeat(net, 3, layers.conv2d, 512, [3, 3], scope='conv5')
            net = layers_lib.max_pool2d(net, [2, 2], scope='pool5')
            # Use conv2d instead of fully_connected layers.
            net = layers.conv2d(net, 4096, [7, 7], padding='VALID', scope='fc6')
            net = layers_lib.dropout(
                net, dropout_keep_prob, is_training=is_training, scope='dropout6')
            net = layers.conv2d(net, 4096, [1, 1], scope='fc7')
            net = layers_lib.dropout(
                net, dropout_keep_prob, is_training=is_training, scope='dropout7')
            net = layers.conv2d(
                net,
                num_classes, [1, 1],
                activation_fn=None,
                normalizer_fn=None,
                scope='fc8')
            # Convert end_points_collection into a end_point dict.
            end_points = utils.convert_collection_to_dict(end_points_collection)
            if spatial_squeeze:
                net = array_ops.squeeze(net, [1, 2], name='fc8/squeezed')
                end_points[sc.name + '/fc8'] = net
            return net, end_points
vgg_16.default_image_size = 224
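nn.l2_normalize: L2 normalization layer
slim.repeat: quickly builds a stack of repeated layers
Tensor.get_shape().with_rank(rank).as_list(): shape getter with an assertion-like rank check
tensorflow.contrib.layers.python.layers.utils.collect_named_outputs: adds the output to collections and gives it an alias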