[Computer Vision] Mask-RCNN Inference Network, Part 5: Refining the Object Detection Results

Date: 2019-11-08


1. The Detections network

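After the ROI network we have all the information about the proposals, namely:

proposal features (obtained by ROIAlign)

proposal class scores

proposal coordinate corrections (deltas)

Together with the original proposal coordinates, [IMAGES_PER_GPU, num_rois, (y1, x1, y2, x2)], these feed the last stage, which refines the object detection results.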

            # Detections
            # output is [batch, num_detections, (y1, x1, y2, x2, class_id, score)] in
            # normalized coordinates
            detections = DetectionLayer(config, name="mrcnn_detection")(
                [rpn_rois, mrcnn_class, mrcnn_bbox, input_image_meta])

(1) The "window" parameter produced when resizing the original image

Note that one of the inputs is input_image_meta. It records the original information of every image as a [batch, n] matrix, where n is fixed and is computed in config.py:

        # Image meta data length
        # See compose_image_meta() for details
        self.IMAGE_META_SIZE = 1 + 3 + 3 + 4 + 1 + self.NUM_CLASSES
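The sum can be read off field by field. The following is a minimal sketch, assuming the field layout that parse_image_meta_graph slices out of the meta tensor. The helper name image_meta_size and the value 81 for NUM_CLASSES are illustrative assumptions, not part of the source.

```python
def image_meta_size(num_classes):
    """Length of one image_meta row, field by field (hypothetical helper)."""
    fields = {
        "image_id": 1,
        "original_image_shape": 3,   # (height, width, channels)
        "image_shape": 3,            # (height, width, channels) after resizing
        "window": 4,                 # (y1, x1, y2, x2) in pixels
        "scale": 1,
        "active_class_ids": num_classes,
    }
    return sum(fields.values())


# Assumed example: 81 classes, background included.
print(image_meta_size(81))  # 93
```

Only the last field depends on the dataset, which is why n is fixed once NUM_CLASSES is known.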

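The details will be covered in a future post on image preprocessing (if there is one). This section uses two of the recorded fields: the image size and the "window" of the corresponding image. The image size is three integers, the height, width and depth (channels) of the input image, i.e. the image after preprocessing. The "window" is four integers meaning (top_pad, left_pad, h + top_pad, w + left_pad) and comes from the resizing step. The code below is from the resize_image function in utils.py: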

    if mode == "square":
        # Get new height and width
        h, w = image.shape[:2]
        top_pad = (max_dim - h) // 2
        bottom_pad = max_dim - h - top_pad
        left_pad = (max_dim - w) // 2
        right_pad = max_dim - w - left_pad
        padding = [(top_pad, bottom_pad), (left_pad, right_pad), (0, 0)]
        image = np.pad(image, padding, mode='constant', constant_values=0)
        window = (top_pad, left_pad, h + top_pad, w + left_pad)

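In other words, the original image (dark blue in the original illustration; w need not equal h) is padded out to a larger light-gray image that is fed to the network. Taking the top-left corner of the new image as the origin, "window" records the coordinates of the top-left and bottom-right corners of the original image. Because these are pixel coordinates, "window" also gives the size of the image before padding (height y2 - y1, width x2 - x1), and it marks the only region of the input image that carries meaningful content.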
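The padding arithmetic can be tried in isolation. This is a small sketch in plain Python that repeats the "square" branch above without NumPy; the function name square_pad_window and the 600 x 1024 example size are assumptions made for illustration.

```python
def square_pad_window(h, w, max_dim):
    """Return (padding, window) for centering an h x w image in a square."""
    top_pad = (max_dim - h) // 2
    bottom_pad = max_dim - h - top_pad
    left_pad = (max_dim - w) // 2
    right_pad = max_dim - w - left_pad
    padding = [(top_pad, bottom_pad), (left_pad, right_pad)]
    window = (top_pad, left_pad, h + top_pad, w + left_pad)
    return padding, window


padding, window = square_pad_window(h=600, w=1024, max_dim=1024)
print(padding)  # [(212, 212), (0, 0)]
print(window)   # (212, 0, 812, 1024)
```

When max_dim - h is odd, the extra row goes to the bottom, because top_pad is rounded down.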

(2) Recovering the original image size from "window"

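One detail to watch. Suppose top_pad = 5, meaning 5 rows were padded on top of the image. Rows 0, 1, 2, 3 and 4 are then not image content, and the image starts at row 5. Suppose the image has 3 rows (an extreme case), so rows 5, 6 and 7 are image content. However:

top_pad + h = 5 + 3 = 8

So the real image occupies rows top_pad through top_pad + h - 1 inclusive, and the bottom coordinate stored in "window" is exclusive. Columns work the same way.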
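The half-open convention matches Python slicing, so the window can be used directly as slice bounds. Here is a minimal sketch using nested lists in place of an image array; crop_to_window is a hypothetical helper, and the 13 x 4 image is made up to reproduce the top_pad = 5, h = 3 example.

```python
def crop_to_window(padded, window):
    """Slice the un-padded image out of a padded one (nested lists)."""
    y1, x1, y2, x2 = window
    # Python slices exclude the end index, like the window's (y2, x2).
    return [row[x1:x2] for row in padded[y1:y2]]


# A 13 x 4 padded image whose pixels store their own row index.
padded = [[r] * 4 for r in range(13)]
window = (5, 0, 8, 4)  # top_pad = 5, h = 3, no horizontal padding
crop = crop_to_window(padded, window)

print(len(crop))                 # 3
print([row[0] for row in crop])  # [5, 6, 7]
```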

In addition, the function used to parse the image_meta structure is as follows:

def parse_image_meta_graph(meta):
    """Parses a tensor that contains image attributes to its components.
    See compose_image_meta() for more details.

    meta: [batch, meta length] where meta length depends on NUM_CLASSES

    Returns a dict of the parsed tensors.
    """
    image_id = meta[:, 0]
    original_image_shape = meta[:, 1:4]
    image_shape = meta[:, 4:7]
    window = meta[:, 7:11]  # (y1, x1, y2, x2) window of image in pixels
    scale = meta[:, 11]
    active_class_ids = meta[:, 12:]
    return {
        "image_id": image_id,
        "original_image_shape": original_image_shape,
        "image_shape": image_shape,
        "window": window,
        "scale": scale,
        "active_class_ids": active_class_ids,
    }
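To see the slicing on concrete numbers, the same offsets can be applied to a single row held in a plain list. This is only an illustrative sketch: parse_image_meta is a hypothetical single-row counterpart of the function above, and every value in row is made up.

```python
def parse_image_meta(row):
    """Split one image_meta row (a flat list) into named fields."""
    return {
        "image_id": row[0],
        "original_image_shape": row[1:4],
        "image_shape": row[4:7],
        "window": row[7:11],
        "scale": row[11],
        "active_class_ids": row[12:],
    }


# Made-up values: a 600 x 1024 image padded to 1024 x 1024, 3 classes.
row = [7, 600, 1024, 3, 1024, 1024, 3, 212, 0, 812, 1024, 1.0, 1, 1, 1]
meta = parse_image_meta(row)

print(meta["window"])       # [212, 0, 812, 1024]
print(meta["image_shape"])  # [1024, 1024, 3]
```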

2. Source code walkthrough

First the layer receives its parameters and is initialized:

class DetectionLayer(KE.Layer):
    """Takes classified proposal boxes and their bounding box deltas and
    returns the final detection boxes.

    Returns:
    [batch, num_detections, (y1, x1, y2, x2, class_id, class_score)] where
    coordinates are normalized.
    """

    def __init__(self, config=None, **kwargs):
        super(DetectionLayer, self).__init__(**kwargs)
        self.config = config

    def call(self, inputs):
        rois = inputs[0]         # [batch, num_rois, (y1, x1, y2, x2)]
        mrcnn_class = inputs[1]  # [batch, num_rois, NUM_CLASSES]
        mrcnn_bbox = inputs[2]   # [batch, num_rois, NUM_CLASSES, (dy, dx, log(dh), log(dw))]
        image_meta = inputs[3]

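(1) Getting the original image size

Next we fetch the "window" parameter, i.e. the extent of the original image, and express it relative to the input image's image_shape, which is [h, w, channels]: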

        # Get windows of images in normalized coordinates. Windows are the area
        # in the image that excludes the padding.
        # Use the shape of the first image in the batch to normalize the window
        # because we know that all images get resized to the same size.
        m = parse_image_meta_graph(image_meta)
        image_shape = m['image_shape'][0]
        window = norm_boxes_graph(m['window'], image_shape[:2])  # (y1, x1, y2, x2)

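The parse_image_meta_graph call above (the function is already listed in section 1 of this article) parses the meta tensor and returns the shape of the input image and the extent of the original image (the "window"). The norm_boxes_graph function called on the last line is as follows: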

def norm_boxes_graph(boxes, shape):
    """Converts boxes from pixel coordinates to normalized coordinates.
    boxes: [..., (y1, x1, y2, x2)] in pixel coordinates
    shape: [..., (height, width)] in pixels

    Note: In pixel coordinates (y2, x2) is outside the box. But in normalized
    coordinates it's inside the box.

    Returns:
        [..., (y1, x1, y2, x2)] in normalized coordinates
    """
    h, w = tf.split(tf.cast(shape, tf.float32), 2)
    scale = tf.concat([h, w, h, w], axis=-1) - tf.constant(1.0)
    shift = tf.constant([0., 0., 1., 1.])
    return tf.divide(boxes - shift, scale)

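Through "window" we obtained the position of the original image inside the input image, in pixel space. Subtracting the shift (0, 0, 1, 1) and dividing by the input image's (height - 1, width - 1) then gives the normalized coordinates of the original image relative to the input image, which lie in the interval [0, 1].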
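A worked example makes the shift and scale concrete. This is a plain-Python sketch of the same formula for a single box; norm_box is a hypothetical name, and the 1024 x 1024 input size is an assumed example.

```python
def norm_box(box, shape):
    """Convert one (y1, x1, y2, x2) pixel box to normalized coordinates."""
    h, w = shape
    scale = (h - 1, w - 1, h - 1, w - 1)
    shift = (0, 0, 1, 1)  # moves the exclusive (y2, x2) onto the last pixel
    return tuple((b - s) / sc for b, s, sc in zip(box, shift, scale))


# A box covering the whole 1024 x 1024 input maps exactly onto [0, 1].
print(norm_box((0, 0, 1024, 1024), (1024, 1024)))  # (0.0, 0.0, 1.0, 1.0)

# The window of a 600 x 1024 image centered in that input.
window = norm_box((212, 0, 812, 1024), (1024, 1024))
```

The shift of 1 on (y2, x2) is what the docstring's note refers to: in pixel coordinates the bottom-right corner lies outside the box, in normalized coordinates it lies inside.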

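In fact, since the four coordinates of each generated anchor are already normalized to [0, 1], every coordinate inside the network is expressed in that range. The original-image information is a newly introduced quantity, so it has to be brought into the same normalized space.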

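For each image we now have:

the coordinates of every proposal

the coarse classification of every proposal

the coarse coordinate correction of every proposal

the coordinates of the region of the image that actually carries content

Based on this information we now carry out the refinement.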

(2) Refining the classification and regression results

        # Run detection refinement graph on each item in the batch
        detections_batch = utils.batch_slice(
            [rois, mrcnn_class, mrcnn_bbox, window],
            lambda x, y, w, z: refine_detections_graph(x, y, w, z, self.config),
            self.config.IMAGES_PER_GPU)

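Note that the function called below processes a single image at a time.

The logic is as follows:

a  Get the score of the top-scoring class of every proposal

b  Apply the coarse correction to every proposal and clip the result to the "window"

c  Drop proposals whose top-scoring class is background

d  Drop proposals whose top score does not reach the threshold

e  Apply non-maximum suppression among candidate boxes of the same class

f  From the box indices left after NMS, strip the -1 placeholders and take the top k (100)

Finally, return (y1, x1, y2, x2, class_id, score) for every box.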
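Step b in the list above relies on two helpers, apply_box_deltas_graph and clip_boxes_graph, whose bodies are not shown in this article. The sketch below is an assumption about how such helpers behave under the (dy, dx, log(dh), log(dw)) encoding named in the docstrings: the center moves by a fraction of the box size, the size is scaled exponentially, and each coordinate is clamped to the window. The names apply_box_deltas and clip_box are hypothetical, and the deltas are taken as already multiplied by config.BBOX_STD_DEV.

```python
import math


def apply_box_deltas(box, delta):
    """Shift the box center and rescale its size by (dy, dx, log dh, log dw)."""
    y1, x1, y2, x2 = box
    dy, dx, log_dh, log_dw = delta
    h, w = y2 - y1, x2 - x1
    cy = y1 + 0.5 * h + dy * h  # center moves by a fraction of the box size
    cx = x1 + 0.5 * w + dx * w
    h *= math.exp(log_dh)       # size is scaled exponentially
    w *= math.exp(log_dw)
    return (cy - 0.5 * h, cx - 0.5 * w, cy + 0.5 * h, cx + 0.5 * w)


def clip_box(box, window):
    """Clamp every coordinate of the box into the window."""
    y1, x1, y2, x2 = box
    wy1, wx1, wy2, wx2 = window
    return (max(min(y1, wy2), wy1), max(min(x1, wx2), wx1),
            max(min(y2, wy2), wy1), max(min(x2, wx2), wx1))


# Move the box down by half its height and double its width,
# then clip it to a window that ends at y = 0.75.
refined = apply_box_deltas((0.2, 0.2, 0.6, 0.6), (0.5, 0.0, 0.0, math.log(2.0)))
clipped = clip_box(refined, (0.25, 0.0, 0.75, 1.0))
```

With these inputs the refined box is approximately (0.4, 0.0, 0.8, 0.8), and clipping pulls its bottom edge back to 0.75.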

step1

The first half of the called function is as follows:

def refine_detections_graph(rois, probs, deltas, window, config):
    """Refine classified proposals and filter overlaps and return final
    detections.

    Inputs:
        rois: [N, (y1, x1, y2, x2)] in normalized coordinates
        probs: [N, num_classes]. Class probabilities.
        deltas: [N, num_classes, (dy, dx, log(dh), log(dw))]. Class-specific
                bounding box deltas.
        window: (y1, x1, y2, x2) in normalized coordinates. The part of the image
            that contains the image excluding the padding.

    Returns detections shaped: [num_detections, (y1, x1, y2, x2, class_id, score)] where
        coordinates are normalized.
    """
    # Class IDs per ROI
    class_ids = tf.argmax(probs, axis=1, output_type=tf.int32)  # [N], top-scoring class of each ROI
    # Class probability of the top class of each ROI
    indices = tf.stack([tf.range(probs.shape[0]), class_ids], axis=1)  # [N, (ROI index, top class index)]
    class_scores = tf.gather_nd(probs, indices)  # [N], score of the top-scoring class of each ROI

    # Class-specific bounding box deltas
    deltas_specific = tf.gather_nd(deltas, indices)  # [N, 4]
    # Apply bounding box deltas
    # Shape: [boxes, (y1, x1, y2, x2)] in normalized coordinates
    refined_rois = apply_box_deltas_graph(
        rois, deltas_specific * config.BBOX_STD_DEV)  # [N, 4]
    # Clip boxes to image window
    refined_rois = clip_boxes_graph(refined_rois, window)

    # TODO: Filter out boxes with zero area

    # Filter out background boxes
    # class_ids: [N]; tf.where(class_ids > 0): [M, 1], i.e. where adds a dimension
    keep = tf.where(class_ids > 0)[:, 0]

    # Filter out low confidence boxes
    if config.DETECTION_MIN_CONFIDENCE:  # 0.7
        conf_keep = tf.where(class_scores >= config.DETECTION_MIN_CONFIDENCE)[:, 0]
        # Set intersection; returns a sparse tensor. a and b must have the same shape
        # except in the last dimension, and the sets along the last dimension are
        # intersected pairwise:
        # a   = np.array([[{1, 2}, {3}], [{4}, {5, 6}]])
        # b   = np.array([[{1}   , {}] , [{4}, {5, 6, 7, 8}]])
        # res = np.array([[{1}   , {}] , [{4}, {5, 6}]])
        keep = tf.sets.set_intersection(tf.expand_dims(keep, 0),
                                        tf.expand_dims(conf_keep, 0))
        keep = tf.sparse_tensor_to_dense(keep)[0]

    # Apply per-class NMS
    # 1. Prepare variables
    pre_nms_class_ids = tf.gather(class_ids, keep)  # [n]
    pre_nms_scores = tf.gather(class_scores, keep)  # [n]
    pre_nms_rois = tf.gather(refined_rois,   keep)  # [n, 4]
    unique_pre_nms_class_ids = tf.unique(pre_nms_class_ids)[0]  # class ids with duplicates removed
    '''
    # tensor 'x' is [1, 1, 2, 4, 4, 4, 7, 8, 8]
    y, idx = unique(x)
    y ==> [1, 2, 4, 7, 8]
    idx ==> [0, 0, 1, 2, 2, 2, 3, 4, 4]
    '''

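This part of the code mainly organizes the available information in preparation for the refinement, and the flow is clear:

a  Get the score of the top-scoring class of every proposal

b  Apply the coarse correction to every proposal and clip the result to the "window"

c  Drop proposals whose top-scoring class is background

d  Drop proposals whose top score does not reach the threshold

At this point the tensor keep holds the indices of the proposals that qualify. It is a one-dimensional array in which every value is the index of one box, and the following steps filter these indices further.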
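Steps a, c and d can be sketched without TensorFlow, with Python sets standing in for tf.sets.set_intersection. This is a minimal illustration, not the source's implementation: select_candidates is a hypothetical helper and the probabilities are made up.

```python
def select_candidates(probs, min_confidence):
    """Return (class_ids, class_scores, keep) for a list of per-ROI rows."""
    # a: argmax over classes, then the score of that class
    class_ids = [max(range(len(p)), key=p.__getitem__) for p in probs]
    class_scores = [p[c] for p, c in zip(probs, class_ids)]
    # c: indices whose top class is not background (class 0)
    not_background = {i for i, c in enumerate(class_ids) if c > 0}
    # d: indices whose top score reaches the threshold
    confident = {i for i, s in enumerate(class_scores) if s >= min_confidence}
    keep = sorted(not_background & confident)
    return class_ids, class_scores, keep


probs = [
    [0.90, 0.05, 0.05],  # background wins: dropped
    [0.10, 0.80, 0.10],  # class 1, confident: kept
    [0.30, 0.20, 0.50],  # class 2, below the threshold: dropped
    [0.05, 0.05, 0.90],  # class 2, confident: kept
]
class_ids, class_scores, keep = select_candidates(probs, min_confidence=0.7)

print(class_ids)  # [0, 1, 2, 2]
print(keep)       # [1, 3]
```

As in the source, keep stores indices into the full list of N proposals rather than the boxes themselves, which is why later steps keep gathering through it.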

step2

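e  Apply non-maximum suppression among candidate boxes of the same class.

Note that the nested function below uses several variables from the enclosing scope: keep (the box indices retained in step 1), pre_nms_class_ids (the classes of those boxes), pre_nms_scores (their scores) and pre_nms_rois (their coordinates).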

    def nms_keep_map(class_id):
        """Apply Non-Maximum Suppression on ROIs of the given class."""
        # Uses the outer variables pre_nms_class_ids, pre_nms_scores, pre_nms_rois and keep

        # Indices of ROIs of the given class
        # class_id is the class currently being processed; pre_nms_class_ids holds the classes of all candidates
        ixs = tf.where(tf.equal(pre_nms_class_ids, class_id))[:, 0]
        # Apply NMS
        class_keep = tf.image.non_max_suppression(
                tf.gather(pre_nms_rois, ixs),  # coordinates of all candidates of this class
                tf.gather(pre_nms_scores, ixs),  # scores of all candidates of this class
                max_output_size=config.DETECTION_MAX_INSTANCES,  # 100
                iou_threshold=config.DETECTION_NMS_THRESHOLD)  # 0.3
        # Map indices
        # class_keep indexes into ixs, and ixs indexes into keep
        class_keep = tf.gather(keep, tf.gather(ixs, class_keep))  # index through an index
        # Pad with -1 so returned tensors have the same shape
        gap = config.DETECTION_MAX_INSTANCES - tf.shape(class_keep)[0]
        class_keep = tf.pad(class_keep, [(0, gap)],
                            mode='CONSTANT', constant_values=-1)
        # Set shape so map_fn() can infer result shape
        class_keep.set_shape([config.DETECTION_MAX_INSTANCES])
        # The returned length must be fixed, otherwise tf.map_fn cannot run
        return class_keep

    # 2. Map over class IDs
    nms_keep = tf.map_fn(nms_keep_map, unique_pre_nms_class_ids,
                         dtype=tf.int64)  # [?, 100 by default]: one row per class, holding the box indices kept for that class

This step outputs nms_keep with shape [?, 100], where ? is the number of classes retained in this image (note: not the number of instances).
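The double index mapping and the -1 padding are easier to follow on a tiny example. The sketch below uses a simple greedy NMS in plain Python as a stand-in for tf.image.non_max_suppression; it is an assumption-laden illustration, and iou, nms and per_class_nms are hypothetical names.

```python
def iou(a, b):
    """Intersection over union of two (y1, x1, y2, x2) boxes."""
    ih = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iw = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ih * iw
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, max_output, iou_threshold):
    """Greedy NMS: returns positions into boxes, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    picked = []
    for i in order:
        if len(picked) == max_output:
            break
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in picked):
            picked.append(i)
    return picked


def per_class_nms(keep, class_ids, scores, boxes, max_output, iou_threshold):
    """Run NMS per class; return one fixed-length, -1 padded row per class."""
    rows = []
    # dict.fromkeys removes duplicates while preserving first-seen order.
    for class_id in dict.fromkeys(class_ids[k] for k in keep):
        # ixs holds positions into keep, not into the full proposal list.
        ixs = [pos for pos, k in enumerate(keep) if class_ids[k] == class_id]
        picked = nms([boxes[keep[p]] for p in ixs],
                     [scores[keep[p]] for p in ixs],
                     max_output, iou_threshold)
        # picked indexes ixs, ixs indexes keep: map back to proposal indices.
        mapped = [keep[ixs[p]] for p in picked]
        rows.append(mapped + [-1] * (max_output - len(mapped)))
    return rows


boxes = [(0.0, 0.0, 1.0, 1.0), (0.1, 0.1, 0.5, 0.5), (0.12, 0.1, 0.5, 0.5),
         (0.1, 0.1, 0.5, 0.5), (0.6, 0.6, 0.9, 0.9)]
class_ids = [0, 1, 1, 2, 1]
scores = [0.99, 0.90, 0.80, 0.95, 0.75]
keep = [1, 2, 3, 4]  # proposal 0 is background and was dropped earlier

nms_keep = per_class_nms(keep, class_ids, scores, boxes,
                         max_output=3, iou_threshold=0.3)
print(nms_keep)  # [[1, 4, -1], [3, -1, -1]]
```

Proposal 2 is suppressed by proposal 1 of the same class, while proposal 3 survives even though it coincides with proposal 1, because NMS only compares boxes within one class.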

step3

f  From the box indices left after NMS, strip the -1 placeholders and take the top k (100), then return (y1, x1, y2, x2, class_id, score) for every box.

    # 3. Merge results into one list, and remove -1 padding
    nms_keep = tf.reshape(nms_keep, [-1])  # all box indices
    nms_keep = tf.gather(nms_keep, tf.where(nms_keep > -1)[:, 0])  # drop the -1 entries
    # 4. Compute intersection between keep and nms_keep
    # nms_keep was itself taken from keep, so this step looks redundant
    keep = tf.sets.set_intersection(tf.expand_dims(keep, 0),
                                    tf.expand_dims(nms_keep, 0))
    keep = tf.sparse_tensor_to_dense(keep)[0]
    # Keep top detections
    roi_count = config.DETECTION_MAX_INSTANCES
    class_scores_keep = tf.gather(class_scores, keep)  # scores of the kept boxes
    num_keep = tf.minimum(tf.shape(class_scores_keep)[0], roi_count)
    top_ids = tf.nn.top_k(class_scores_keep, k=num_keep, sorted=True)[1]
    keep = tf.gather(keep, top_ids)  # index through an index

    # Arrange output as [N, (y1, x1, y2, x2, class_id, score)]
    # Coordinates are normalized.
    detections = tf.concat([
        tf.gather(refined_rois, keep),  # gathered coordinates, [?, 4]
        tf.to_float(tf.gather(class_ids, keep))[..., tf.newaxis],  # gathered class ids with an added axis, [?, 1]
        tf.gather(class_scores, keep)[..., tf.newaxis]  # gathered scores with an added axis, [?, 1]
        ], axis=1)

    # If there are fewer detections than DETECTION_MAX_INSTANCES, pad with zeros
    gap = config.DETECTION_MAX_INSTANCES - tf.shape(detections)[0]
    detections = tf.pad(detections, [(0, gap), (0, 0)], "CONSTANT")
    return detections
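The final merge can also be sketched in plain Python: flatten the per-class rows, drop the -1 padding, sort by score, truncate, and zero-pad to a fixed number of rows. The helper name finalize and all values below are illustrative assumptions.

```python
def finalize(nms_keep, boxes, class_ids, scores, max_instances):
    """Flatten NMS rows, drop -1, keep the best scores, zero-pad the result."""
    flat = [i for row in nms_keep for i in row if i > -1]
    top = sorted(flat, key=lambda i: scores[i], reverse=True)[:max_instances]
    detections = [list(boxes[i]) + [float(class_ids[i]), scores[i]] for i in top]
    # Pad with all-zero rows so the output always has max_instances rows.
    detections += [[0.0] * 6 for _ in range(max_instances - len(detections))]
    return detections


boxes = [(0.0, 0.0, 1.0, 1.0), (0.1, 0.1, 0.5, 0.5), (0.12, 0.1, 0.5, 0.5),
         (0.1, 0.1, 0.5, 0.5), (0.6, 0.6, 0.9, 0.9)]
class_ids = [0, 1, 1, 2, 1]
scores = [0.99, 0.90, 0.80, 0.95, 0.75]
nms_keep = [[1, 4, -1], [3, -1, -1]]  # one -1 padded row per class

detections = finalize(nms_keep, boxes, class_ids, scores, max_instances=4)
print(detections[0])   # [0.1, 0.1, 0.5, 0.5, 2.0, 0.95]
print(detections[-1])  # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

Since a padding row has class_id 0 (background), downstream code can tell real detections from padding by checking for a non-zero class_id.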

At this point we have object detection results that are ready to be output. The next step is generating the mask information.