『Computer Vision』Mask-RCNN Inference Network, Final Part: Running Inference with the detect Method

Date: 2019-11-08



1. detect and build

In the previous sections we spent a lot of ink on the inference branch of the build method; in this section we look at how it is actually invoked.

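In demo.ipynb, the operations involving model can be summarized briefly. First the graph is built and the pretrained weights are loaded,

then the ordered list of class names is defined,

and finally comes the code block that actually runs the detection.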

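The model.detect method runs the prediction on the graph that model.build (the method we have been explaining over the previous sections) constructed when the model was created.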

2. The detect method

First, look at the opening lines of the detect method (like build, it lives in model.py):

    def detect(self, images, verbose=0):
        """Runs the detection pipeline.

        images: List of images, potentially of different sizes.

        Returns a list of dicts, one dict per image. The dict contains:
        rois: [N, (y1, x1, y2, x2)] detection bounding boxes
        class_ids: [N] int class IDs
        scores: [N] float probability scores for the class IDs
        masks: [H, W, N] instance binary masks
        """
        assert self.mode == "inference", "Create model in inference mode."
        assert len(
            images) == self.config.BATCH_SIZE, "len(images) must be equal to BATCH_SIZE"

        # Logging
        if verbose:
            log("Processing {} images".format(len(images)))
            for image in images:
                log("image", image)

2.1 Preprocessing the images to be detected

        # Mold inputs to format expected by the neural network
        molded_images, image_metas, windows = self.mold_inputs(images)

        # Validate image sizes
        # All images in a batch MUST be of the same size
        image_shape = molded_images[0].shape
        for g in molded_images[1:]:
            assert g.shape == image_shape,\
                "After resizing, all images must have the same size. Check IMAGE_RESIZE_MODE and image sizes."

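After some simple sanity checks and logging, detect calls mold_inputs to resize the input images and record information about them.

The self.mold_inputs method is as follows: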

    def mold_inputs(self, images):
        """Takes a list of images and modifies them to the format expected
        as an input to the neural network.
        images: List of image matrices [height,width,depth]. Images can have
            different sizes.

        Returns 3 Numpy matrices:
        molded_images: [N, h, w, 3]. Images resized and normalized.
        image_metas: [N, length of meta data]. Details about each image.
        windows: [N, (y1, x1, y2, x2)]. The portion of the image that has the
            original image (padding excluded).
        """
        molded_images = []
        image_metas = []
        windows = []
        for image in images:
            # Resize image
            # TODO: move resizing to mold_image()
            molded_image, window, scale, padding, crop = utils.resize_image(
                image,
                min_dim=self.config.IMAGE_MIN_DIM,      # 800
                min_scale=self.config.IMAGE_MIN_SCALE,  # 0
                max_dim=self.config.IMAGE_MAX_DIM,      # 1024
                mode=self.config.IMAGE_RESIZE_MODE)     # square
            molded_image = mold_image(molded_image, self.config)  # subtract the mean pixel
            # Build image_meta (a NumPy array)
            image_meta = compose_image_meta(
                0, image.shape, molded_image.shape, window, scale,
                np.zeros([self.config.NUM_CLASSES], dtype=np.int32))
            # Append
            molded_images.append(molded_image)
            windows.append(window)
            image_metas.append(image_meta)
        # Pack into arrays
        molded_images = np.stack(molded_images)
        image_metas = np.stack(image_metas)
        windows = np.stack(windows)
        return molded_images, image_metas, windows

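utils.resize_image rescales the original image. It computes a scale factor so that the output size equals the input size × scale, while guaranteeing that

the shortest side equals min_dim and the longest side is no larger than max_dim;
if the longest side would exceed max_dim, the longest side is set to max_dim and the shortest side is no longer constrained.

Finally, the image is padded to max_dim × max_dim (so the size of molded_images is in fact fixed). Its return values are: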

image.astype(image_dtype), window, scale, padding, crop

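These are: the resized image; window, the position of the original image inside the resized image (see 『Computer Vision』Mask-RCNN Inference Network, Part 5: Refining the Detection Results); the scale factor; the padding information (four integers); and the crop information (four integers or None).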
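
The scale and window rules described above can be sketched without any image library. This is a simplified illustration that follows only the two rules stated in the text: the function name square_resize_geometry is hypothetical, min_scale and the other resize modes are ignored, and the centred padding is an assumption, so the real utils.resize_image may differ in details.

```python
def square_resize_geometry(h, w, min_dim=800, max_dim=1024):
    """Return (scale, window) for an h x w image resized in 'square' mode."""
    # Rule 1: bring the shortest side to min_dim.
    scale = min_dim / min(h, w)
    # Rule 2: if the longest side would exceed max_dim, cap it at max_dim.
    if round(max(h, w) * scale) > max_dim:
        scale = max_dim / max(h, w)
    new_h, new_w = round(h * scale), round(w * scale)
    # Assumed: the resized image is centred on a max_dim x max_dim canvas.
    top = (max_dim - new_h) // 2
    left = (max_dim - new_w) // 2
    window = (top, left, top + new_h, left + new_w)  # (y1, x1, y2, x2)
    return scale, window


scale, window = square_resize_geometry(480, 640)
print(scale, window)  # 1.6 (128, 0, 896, 1024)
```

For a 480 × 640 image the long side hits the max_dim cap first, so the short side ends up at 768 rather than 800 and the window is padded by 128 pixels above and below.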

mold_image is even simpler: it subtracts a mean value from the image pixels, MEAN_PIXEL = [123.7, 116.8, 103.9].
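
As a minimal sketch of that subtraction (the name mold_image_sketch is hypothetical, and the cast to float32 is an assumption made here so that the subtraction does not wrap around in uint8):

```python
import numpy as np

MEAN_PIXEL = np.array([123.7, 116.8, 103.9])  # value quoted in the text


def mold_image_sketch(image, mean_pixel=MEAN_PIXEL):
    """Subtract the per-channel mean; broadcasting applies it to every pixel."""
    return image.astype(np.float32) - mean_pixel


image = np.full((2, 2, 3), 128, dtype=np.uint8)  # toy grey RGB image
molded = mold_image_sketch(image)
print(molded.shape)  # (2, 2, 3)
```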

compose_image_meta records all of the original information; note that crop is not included:

def compose_image_meta(image_id, original_image_shape, image_shape,
                       window, scale, active_class_ids):
    """Takes attributes of an image and puts them in one 1D array.

    image_id: An int ID of the image. Useful for debugging.
    original_image_shape: [H, W, C] before resizing or padding.
    image_shape: [H, W, C] after resizing and padding
    window: (y1, x1, y2, x2) in pixels. The area of the image where the real
            image is (excluding the padding)
    scale: The scaling factor applied to the original image (float32)
    active_class_ids: List of class_ids available in the dataset from which
        the image came. Useful if training on images from multiple datasets
        where not all classes are present in all datasets.
    """
    meta = np.array(
        [image_id] +                  # size=1
        list(original_image_shape) +  # size=3
        list(image_shape) +           # size=3
        list(window) +                # size=4 (y1, x1, y2, x2) in image coordinates
        [scale] +                     # size=1
        list(active_class_ids)        # size=num_classes
    )
    return meta

Finally, the per-image results are stacked and returned.
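
Because compose_image_meta packs everything into one flat vector, reading a field back means slicing at fixed offsets. The following sketch shows those offsets as implied by the size comments in the code above; parse_image_meta_sketch is a hypothetical helper written for this illustration, and the sample values are made up.

```python
import numpy as np


def parse_image_meta_sketch(meta):
    """Split a 1D meta vector back into its named fields."""
    return {
        "image_id": int(meta[0]),                       # size=1
        "original_image_shape": meta[1:4].astype(int),  # size=3
        "image_shape": meta[4:7].astype(int),           # size=3
        "window": meta[7:11].astype(int),               # size=4
        "scale": float(meta[11]),                       # size=1
        "active_class_ids": meta[12:].astype(int),      # size=num_classes
    }


num_classes = 3  # toy value
meta = np.array(
    [0] + [480, 640, 3] + [1024, 1024, 3] + [128, 0, 896, 1024] + [1.6]
    + [0] * num_classes
)
fields = parse_image_meta_sketch(meta)
print(meta.shape, fields["window"].tolist())  # (15,) [128, 0, 896, 1024]
```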

2.2 Anchor generation

First, get_anchors is called to generate the anchors (see 『Computer Vision』Mask-RCNN Anchor Generation), with shape [anchor_count, (y1, x1, y2, x2)]:

        # Anchors
        anchors = self.get_anchors(image_shape)
        # Duplicate across the batch dimension because Keras requires it
        # TODO: can this be optimized to avoid duplicating the anchors?
        # [anchor_count, (y1, x1, y2, x2)] --> [batch, anchor_count, (y1, x1, y2, x2)]
        anchors = np.broadcast_to(anchors, (self.config.BATCH_SIZE,) + anchors.shape)

A batch dimension is then added, giving a final shape of [batch, anchor_count, (y1, x1, y2, x2)].
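
A small NumPy sketch of what np.broadcast_to does here, using a toy anchor array (the values are made up): the result is a read-only view that shares memory with the original, so the batch dimension is added without copying the anchors in NumPy itself.

```python
import numpy as np

anchors = np.arange(12, dtype=np.float32).reshape(3, 4)  # toy [anchor_count, 4]
batch_size = 2
batched = np.broadcast_to(anchors, (batch_size,) + anchors.shape)

print(batched.shape)                       # (2, 3, 4)
print(batched.flags.writeable)             # False
print(np.shares_memory(batched, anchors))  # True
```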

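2.3 Prediction with the inference network

Keras's predict method is called to run the forward pass. For prediction we only care about two of the outputs, detections and mrcnn_mask.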

        # Run object detection
        # Defined in __init__: self.keras_model = self.build(mode=mode, config=config)
        # Returns a list: [detections, mrcnn_class, mrcnn_bbox,
        #                  mrcnn_mask, rpn_rois, rpn_class, rpn_bbox]
        # detections,  [batch, num_detections, (y1, x1, y2, x2, class_id, score)]
        # mrcnn_mask,  [batch, num_detections, MASK_POOL_SIZE, MASK_POOL_SIZE, NUM_CLASSES]
        detections, _, _, mrcnn_mask, _, _, _ =\
            self.keras_model.predict([molded_images, image_metas, anchors], verbose=0)
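
Since mrcnn_mask carries one mask per class for every detection, the mask of the predicted class has to be picked out afterwards (unmold_detections does this further down). A minimal sketch with random toy data, for a single image and with made-up sizes:

```python
import numpy as np

num_detections, pool, num_classes = 3, 2, 4  # toy sizes
rng = np.random.default_rng(0)
mrcnn_mask = rng.random((num_detections, pool, pool, num_classes))
class_ids = np.array([1, 3, 2])  # predicted class of each detection

# Pair detection i with class_ids[i]; the two index arrays are matched
# element-wise, so one [pool, pool] mask is selected per detection.
masks = mrcnn_mask[np.arange(num_detections), :, :, class_ids]
print(masks.shape)  # (3, 2, 2)
```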

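2.4 Remapping the bounding boxes

All of our coordinate operations so far work in positions relative to the input image, with its height and width as the unit length; at the end these must be converted back to pixel coordinates.

In addition, the input list does not require the images to share the same size, so the restoration has to be done one image at a time.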

        # Process detections
        results = []
        for i, image in enumerate(images):
            # Handle one image at a time: the original images are not guaranteed to share a size
            final_rois, final_class_ids, final_scores, final_masks =\
                self.unmold_detections(detections[i], mrcnn_mask[i],
                                       image.shape, molded_images[i].shape,
                                       windows[i])

Remapping the detection boxes: the unmold_detections function

    def unmold_detections(self, detections, mrcnn_mask, original_image_shape,
                          image_shape, window):
        """Reformats the detections of one image from the format of the neural
        network output to a format suitable for use in the rest of the
        application.

        detections: [N, (y1, x1, y2, x2, class_id, score)] in normalized coordinates
        mrcnn_mask: [N, height, width, num_classes]
        original_image_shape: [H, W, C] Original image shape before resizing
        image_shape: [H, W, C] Shape of the image after resizing and padding
        window: [y1, x1, y2, x2] Pixel coordinates of box in the image where the real
                image is excluding the padding.

        Returns:
        boxes: [N, (y1, x1, y2, x2)] Bounding boxes in pixels
        class_ids: [N] Integer class IDs for each bounding box
        scores: [N] Float probability scores of the class_id
        masks: [height, width, num_instances] Instance masks
        """
        # How many detections do we have?
        # Detections array is padded with zeros. Find the first class_id == 0.
        zero_ix = np.where(detections[:, 4] == 0)[0]  # DetectionLayer zero-pads the end of its output
        N = zero_ix[0] if zero_ix.shape[0] > 0 else detections.shape[0]  # number of valid detections

        # Extract boxes, class_ids, scores, and class-specific masks
        boxes = detections[:N, :4]                         # [N, (y1, x1, y2, x2)]
        class_ids = detections[:N, 4].astype(np.int32)     # [N] class IDs
        scores = detections[:N, 5]                         # [N] scores
        masks = mrcnn_mask[np.arange(N), :, :, class_ids]  # [N, height, width]

        # Translate normalized coordinates in the resized image to pixel
        # coordinates in the original image before resizing
        window = utils.norm_boxes(window, image_shape[:2])  # normalize window relative to the input image

        wy1, wx1, wy2, wx2 = window
        shift = np.array([wy1, wx1, wy1, wx1])
        wh = wy2 - wy1  # window height
        ww = wx2 - wx1  # window width
        scale = np.array([wh, ww, wh, ww])
        # Convert boxes to normalized coordinates on the window
        boxes = np.divide(boxes - shift, scale)  # normalize boxes relative to the window
        # Convert boxes to pixel coordinates on the original image
        boxes = utils.denorm_boxes(boxes, original_image_shape[:2])  # denormalize boxes to original-image pixels

        # Filter out detections with zero area. Happens in early training when
        # network weights are still random
        exclude_ix = np.where(
            (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]) <= 0)[0]
        if exclude_ix.shape[0] > 0:
            boxes = np.delete(boxes, exclude_ix, axis=0)
            class_ids = np.delete(class_ids, exclude_ix, axis=0)
            scores = np.delete(scores, exclude_ix, axis=0)
            masks = np.delete(masks, exclude_ix, axis=0)
            N = class_ids.shape[0]

        # Resize masks to original image size and set boundary threshold.
        full_masks = []
        for i in range(N):  # one box at a time
            # Convert neural network mask to full size mask
            full_mask = utils.unmold_mask(masks[i], boxes[i], original_image_shape)
            full_masks.append(full_mask)
        full_masks = np.stack(full_masks, axis=-1)\
            if full_masks else np.empty(original_image_shape[:2] + (0,))

        # boxes:      [n, (y1, x1, y2, x2)]
        # class_ids:  [n]
        # scores:     [n]
        # full_masks: [h, w, n]
        return boxes, class_ids, scores, full_masks

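To restore the output to a usable format, we go through the following steps:

Remove the all-zero detections that were padded in to reach DETECTION_MAX_INSTANCES

Rescale the boxes to the size of the original image

Remove boxes whose area is 0

Restore the mask output to the original size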
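
The first and third steps can be sketched on a toy detections array (values made up). The logic mirrors the np.where calls in unmold_detections, except that the zero-area filter is applied here directly to the normalized boxes for brevity, whereas the source applies it after converting to pixels.

```python
import numpy as np

# Toy output for one image: [y1, x1, y2, x2, class_id, score], zero-padded.
detections = np.array([
    [0.1, 0.1, 0.5, 0.5, 3, 0.9],
    [0.2, 0.2, 0.2, 0.6, 1, 0.8],  # zero height, so zero area
    [0.0, 0.0, 0.0, 0.0, 0, 0.0],  # padding
    [0.0, 0.0, 0.0, 0.0, 0, 0.0],  # padding
])

# Step 1: the first row with class_id == 0 marks the start of the padding.
zero_ix = np.where(detections[:, 4] == 0)[0]
n = zero_ix[0] if zero_ix.shape[0] > 0 else detections.shape[0]
valid = detections[:n]

# Step 3: drop boxes whose area is not positive.
areas = (valid[:, 2] - valid[:, 0]) * (valid[:, 3] - valid[:, 1])
kept = valid[areas > 0]
print(n, kept.shape)  # 2 (1, 6)
```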

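Inside the network, boxes are expressed in normalized coordinates relative to the input image, whereas window is in pixel coordinates. So we first normalize window relative to the input image, which puts window and the boxes in the same coordinate system and lets them interact directly. We then normalize the boxes relative to window, so that box positions and sizes become fractions of the window. Since the window maps directly onto the original image, the boxes follow the same mapping and only need to be scaled back to the original pixel size.

With the boxes mapped, we move on to the masks.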
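
A worked example of this chain of coordinate changes, as a simplified sketch: remap_boxes_sketch is a hypothetical function, and it normalizes by plain division by height and width, which is an assumption; the library's utils.norm_boxes and utils.denorm_boxes may apply additional pixel-offset conventions.

```python
import numpy as np


def remap_boxes_sketch(boxes, window, molded_shape, original_shape):
    """Map normalized boxes on the molded image to pixels on the original."""
    mh, mw = molded_shape[:2]
    oh, ow = original_shape[:2]
    # 1) Normalize the pixel window relative to the molded image.
    window = np.asarray(window, dtype=np.float64) / np.array([mh, mw, mh, mw])
    wy1, wx1, wy2, wx2 = window
    shift = np.array([wy1, wx1, wy1, wx1])
    scale = np.array([wy2 - wy1, wx2 - wx1, wy2 - wy1, wx2 - wx1])
    # 2) Express the boxes as fractions of the window.
    boxes = (boxes - shift) / scale
    # 3) The window corresponds to the whole original image, so scale to pixels.
    return np.around(boxes * np.array([oh, ow, oh, ow])).astype(np.int32)


boxes = np.array([[0.25, 0.25, 0.75, 0.75]])
pixels = remap_boxes_sketch(
    boxes, (128, 0, 896, 1024), (1024, 1024, 3), (480, 640, 3))
print(pixels.tolist())  # [[80, 160, 400, 480]]
```

The numbers are consistent with a 480 × 640 image scaled by 1.6 and padded by 128 rows at the top: row 256 of the molded image corresponds to (256 - 128) / 1.6 = 80 in the original.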

Remapping the mask information: the utils.unmold_mask function

utils.unmold_mask is called at the end of unmold_detections:

        # Resize masks to original image size and set boundary threshold.
        full_masks = []
        for i in range(N):  # one box at a time
            # Convert neural network mask to full size mask
            full_mask = utils.unmold_mask(masks[i], boxes[i], original_image_shape)
            full_masks.append(full_mask)
        full_masks = np.stack(full_masks, axis=-1)\
            if full_masks else np.empty(original_image_shape[:2] + (0,))

To restate: unmold_detections processes a single image, and the mask processing goes one step further and handles each detection box individually:

def unmold_mask(mask, bbox, image_shape):
    """Converts a mask generated by the neural network to a format similar
    to its original shape.
    mask: [height, width] of type float. A small, typically 28x28 mask.
    bbox: [y1, x1, y2, x2]. The box to fit the mask in.

    Returns a binary mask with the same size as the original image.
    """
    threshold = 0.5
    y1, x1, y2, x2 = bbox
    mask = resize(mask, (y2 - y1, x2 - x1))
    mask = np.where(mask >= threshold, 1, 0).astype(bool)  # np.bool was removed in NumPy 1.24

    # Put the mask in the right location.
    full_mask = np.zeros(image_shape[:2], dtype=bool)
    full_mask[y1:y2, x1:x2] = mask
    return full_mask

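The mask produced at inference is just the raw floating-point output of the network, so a threshold is needed to turn it into a binary mask. With that in mind, the rest is simple: we rescale the mask output to the size of its box (by now the box has been mapped to the original image and is in pixel coordinates), then paste the rescaled mask, at the box's position, onto a blank image of the same size as the original.

Doing this for every detected object yields as many original-size mask images as there are detections (each containing a single mask object). Stacking them along axis=-1 gives the final [h, w, n] output, where h and w are the original image size and n is the number of objects finally detected.

Finally, the results are returned and the function exits.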
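
The paste-and-stack procedure can be tried out on toy data. In this sketch a nearest-neighbour resize written in plain NumPy stands in for the resize call used by the source (a deliberate substitution to keep the example dependency-free); resize_nearest and unmold_mask_sketch are hypothetical names.

```python
import numpy as np


def resize_nearest(mask, out_h, out_w):
    """Nearest-neighbour resize (stand-in for the library's resize)."""
    rows = (np.arange(out_h) * mask.shape[0] / out_h).astype(int)
    cols = (np.arange(out_w) * mask.shape[1] / out_w).astype(int)
    return mask[rows][:, cols]


def unmold_mask_sketch(mask, bbox, image_shape, threshold=0.5):
    """Resize a small soft mask to its box and paste it on a blank canvas."""
    y1, x1, y2, x2 = bbox
    resized = resize_nearest(mask, y2 - y1, x2 - x1)
    full_mask = np.zeros(image_shape[:2], dtype=bool)
    full_mask[y1:y2, x1:x2] = resized >= threshold  # binarize, then paste
    return full_mask


small = np.array([[0.9, 0.1],
                  [0.2, 0.8]])  # toy 2x2 soft mask
per_box = [
    unmold_mask_sketch(small, (2, 2, 6, 6), (8, 8, 3)),
    unmold_mask_sketch(small, (0, 0, 4, 4), (8, 8, 3)),
]
full_masks = np.stack(per_box, axis=-1)
print(full_masks.shape)  # (8, 8, 2)
```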

        # boxes:      [n, (y1, x1, y2, x2)]
        # class_ids:  [n]
        # scores:     [n]
        # full_masks: [h, w, n]
        return boxes, class_ids, scores, full_masks

The actual call is as follows: