Faster R-CNN: Study and Implementation

Date: 2019-11-24

Tags: faster rcnn, study, implementation

Faster R-CNN consists of two main parts:

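RPN (Region Proposal Network) generates high-quality region proposals;
Fast R-CNN uses those region proposals to perform detection.

In the paper, the authors compare the RPN to an "attention" mechanism for the neural network: it tells the network where to look. To make this easier to follow, the key points of the paper are summarized briefly below.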

RPN

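Input: an image of arbitrary size
Output: a set of rectangular object proposals, each with an objectness score

To generate region proposals, a small network is slid over the output x of the last convolutional layer of the base network. This small network consists of a \(3\times 3\) convolution conv1 followed by two sibling (parallel) \(1\times 1\) convolutions, loc and score. conv1 uses padding=1 and stride=1 so that it does not change the spatial size of the feature map. loc acts as the box-regression layer and encodes the box coordinates; score acts as the box-classification layer and encodes the probability that each proposal is an object. For details, see my blog post "My Object Detection Notes". The paper calls the \(k\) reference boxes (parameterized proposals) of different scales and aspect ratios anchors; each anchor is centered on the sliding window.

To make anchors easier to understand, the rest of this section illustrates them in Python.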

Anchors

First, fetch one image and its annotations from the COCO dataset, using the API introduced in "Using the COCO Dataset".

Load the required packages first:

import cv2
from matplotlib import pyplot as plt
import numpy as np

# load the COCO API
import sys
sys.path.append(r'D:\API\cocoapi\PythonAPI')
from pycocotools.dataset import Loader
%matplotlib inline

Use Loader to load the val2017 dataset and select the images that contain 'cat', 'dog' and 'person':

dataType = 'val2017'
root = 'E:/Data/coco'
catNms = ['cat', 'dog', 'person']
annType = 'annotations_trainval2017'
loader = Loader(dataType, catNms, root, annType)

Output:

Loading json in memory ...
used time: 0.762376 s
Loading json in memory ...
creating index...
index created!
used time: 0.401951 s

As the timings show, Loader loads the data quickly. To inspect loader in more detail, print some information about it:

print(f'Number of images: {len(loader)}')
for i, ann in enumerate(loader.images):
    h, w = ann['height'], ann['width']
    print(f'Height and width of image {i+1}: {h, w}')

Output:

Number of images: 2
Height and width of image 1: (612, 612)
Height and width of image 2: (500, 333)

The first image is used below to explore anchors. Visualize it first:

img, labels = loader[0]
plt.imshow(img);

Output:

To get a somewhat larger feature map, resize the image to (800, 800, 3):

img = cv2.resize(img, (800, 800))
print(img.shape)

Output:

(800, 800, 3)

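MXNet is used for the rest of the code. Here the image is converted from (h, w, 3) to (3, w, h). MXNet's convolution layers expect channel-first input, conventionally (3, h, w); for this square image both orders give the same shape, and getX is rewritten later in the article to use (h, w) order.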

img = img.transpose(2, 1, 0)
print(img.shape)

Output:

(3, 800, 800)

Because a convolutional neural network takes four-dimensional input, a batch axis must also be added:

img = np.expand_dims(img, 0)
print(img.shape)

Output:

(1, 3, 800, 800)

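To stay consistent with the paper, VGG16 is used here as well (with the pretrained weights from gluoncv):

from gluoncv.model_zoo import vgg16
net = vgg16(pretrained=True)  # load the pretrained weights

Only the network up to the last convolutional layer is considered (the final pooling layer is dropped). The output shape of each convolutional layer is printed below:

from mxnet import nd
imgs = nd.array(img)  # convert to an MXNet NDArray
x = imgs
for layer in net.features[:29]:
    x = layer(x)
    if "conv" in layer.name:
        print(layer.name, x.shape)  # output shape of this convolutional layer

Result: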

vgg0_conv0 (1, 64, 800, 800)
vgg0_conv1 (1, 64, 800, 800)
vgg0_conv2 (1, 128, 400, 400)
vgg0_conv3 (1, 128, 400, 400)
vgg0_conv4 (1, 256, 200, 200)
vgg0_conv5 (1, 256, 200, 200)
vgg0_conv6 (1, 256, 200, 200)
vgg0_conv7 (1, 512, 100, 100)
vgg0_conv8 (1, 512, 100, 100)
vgg0_conv9 (1, 512, 100, 100)
vgg0_conv10 (1, 512, 50, 50)
vgg0_conv11 (1, 512, 50, 50)
vgg0_conv12 (1, 512, 50, 50)

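This shows that the (800, 800) input image becomes a (50, 50) feature map, i.e. it is downsampled by a factor of 16.

Receptive field

The factor 16 is not specific to (800, 800) inputs; it holds for images of any size, because under the size recursion below each feature-map pixel corresponds to 16 pixels of the input along each axis. This article calls that extent the receptive field of a feature-map pixel. (The theoretical receptive field, which ignores padding, is larger.)

How is it computed? Recalling how a convolution works, the computation is simply the inverse of the convolution's output-size calculation (see "Receptive Field Calculation" ¹).

Let \(F_k, S_k, P_k\) denote the kernel height (or width), the stride, and the padding of layer \(k\), and let \(i_k\) denote the height (or width) of the output feature map of layer \(k\). The following recursion follows immediately:

\[ i_{k} = \left\lfloor \frac{i_{k-1}-F_{k}+2P_{k}}{S_{k}}\right\rfloor + 1 \]

where \(k \in \{1, 2, \cdots\}\) and \(i_0\) is the height or width of the input image. Setting \(t_k = \frac{F_k - 1}{2} - P_k\) and ignoring the floor, this can be rewritten as

\[ (i_{k-1} - 1) = (i_{k} - 1) S_k + 2t_k \]

Unrolling this recursion over the layers \(1\leq k \leq L\) gives

\[ i_0 = (i_L - 1)\alpha_L + \beta_L \]

where \(\alpha_L = \prod_{p=1}^{L}S_p\) and

\[ \beta_L = 1 + 2\sum_{p=1}^L \left(\prod_{q=1}^{p-1}S_q\right) t_p \]

Setting \(i_L = 1\) gives the input extent \(i_0 = \beta_L\) that corresponds to a single feature-map pixel. Every convolution in VGG16 uses kernel_size=(3, 3), padding=(1, 1) and stride 1, so \(t_k = 0\) for those layers; only the pooling layers have \(S_k = 2\), and for a \(2\times 2\) pooling layer without padding \(t_k = \frac{1}{2}\). With the four pooling layers that precede the last convolution, \(\alpha_L = 2^4 = 16\) and \(\beta_L = 1 + (1 + 2 + 4 + 8) = 16\), hence \(i_0 = 16\, i_L\).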
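The recursion above can be checked numerically. The following is a minimal sketch, not part of the original code: the layer list (thirteen 3x3 convolutions and the four 2x2 pooling layers that precede the last convolution) is written out by hand as an assumption about VGG16, and `alpha_beta` is a hypothetical helper name.

```python
# (kernel, stride, padding) of each layer; assumed VGG16 configuration
# up to and including the last convolutional layer.
CONV = (3, 1, 1)
POOL = (2, 2, 0)
VGG16_LAYERS = (
    [CONV] * 2 + [POOL] +
    [CONV] * 2 + [POOL] +
    [CONV] * 3 + [POOL] +
    [CONV] * 3 + [POOL] +
    [CONV] * 3
)


def alpha_beta(layers):
    """Return (alpha_L, beta_L) such that i_0 = (i_L - 1) * alpha_L + beta_L."""
    alpha = 1      # running product of the strides S_1 ... S_{p-1}
    beta = 1.0
    for kernel, stride, padding in layers:
        t = (kernel - 1) / 2 - padding   # t_p in the text
        beta += 2 * alpha * t
        alpha *= stride
    return alpha, beta


alpha, beta = alpha_beta(VGG16_LAYERS)
print(alpha, beta)               # 16 16.0
print((50 - 1) * alpha + beta)   # 800.0, the input size for a 50 x 50 feature map
```

With these assumptions a 50 x 50 feature map maps back to an 800 x 800 input, which agrees with the shapes printed earlier.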

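Computing the anchors

In the implementation, the receptive-field size is denoted base_size. How are the anchor boxes generated? For convenience, first define a Box class: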

import numpy as np


class Box:
    '''
    corner: Numpy, List, Tuple, MXNet.nd, torch.tensor
    '''

    def __init__(self, corner):
        self._corner = corner

    @property
    def corner(self):
        return self._corner

    @corner.setter
    def corner(self, new_corner):
        self._corner = new_corner

    @property
    def w(self):
        '''
        Width of the bbox
        '''
        return self.corner[2] - self.corner[0] + 1

    @property
    def h(self):
        '''
        Height of the bbox
        '''
        return self.corner[3] - self.corner[1] + 1

    @property
    def area(self):
        '''
        Area of the bbox
        '''
        return self.w * self.h

    @property
    def whctrs(self):
        '''
        Center coordinates of the bbox
        '''
        assert isinstance(self.w, (int, float)), 'need int or float'
        xctr = self.corner[0] + (self.w - 1) * .5
        yctr = self.corner[1] + (self.h - 1) * .5
        return xctr, yctr

    def __and__(self, other):
        '''
        Operator &: area of the intersection of two boxes
        '''
        xmin = max(self.corner[0], other.corner[0])  # the larger xmin
        xmax = min(self.corner[2], other.corner[2])  # the smaller xmax
        ymin = max(self.corner[1], other.corner[1])  # the larger ymin
        ymax = min(self.corner[3], other.corner[3])  # the smaller ymax
        w = xmax - xmin
        h = ymax - ymin
        if w < 0 or h < 0:  # the two boxes do not intersect
            return 0
        else:
            return w * h

    def __or__(self, other):
        '''
        Operator |: area of the union of two boxes
        '''
        I = self & other
        # for disjoint boxes I == 0 and the union is the sum of both areas
        return self.area + other.area - I

    def IoU(self, other):
        '''
        Compute the IoU
        '''
        I = self & other
        if I == 0:
            return 0
        else:
            U = self | other
            return I / U

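The Box class implements the intersection and union of bboxes and the IoU computation. Note that w, h and area use the inclusive pixel convention (+1), whereas & measures the overlap as (xmax - xmin) * (ymax - ymin) without the +1, so the values below mix the two conventions. An example:

bbox = [0, 0, 15, 15]  # a bounding box
bbox1 = [5, 5, 12, 12] # another bounding box
A = Box(bbox)  # a Box instance
B = Box(bbox1) # a Box instance

The intersection, union and IoU of A and B, as well as the center, height, width and area of A, can now be printed:

print('Intersection of A and B:', str(A & B))
print('Union of A and B:', str(A | B))
print('IoU of A and B:', str(A.IoU(B)))
print('Center, height, width and area of A:', str(A.whctrs), A.h, A.w, A.area)

Output:

Intersection of A and B: 49
Union of A and B: 271
IoU of A and B: 0.18081180811808117
Center, height, width and area of A: (7.5, 7.5) 16 16 256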

Now return to loader. First define a conversion function:

def getX(img):
    # convert img from (h, w, 3) to (1, 3, w, h)
    img = img.transpose((2, 1, 0))
    return np.expand_dims(img, 0)

getX converts an image from (h, w, 3) to (1, 3, w, h):

img, label = loader[0]
img = cv2.resize(img, (800, 800)) # resize to 800 x 800
X = getX(img)     # convert to (1, 3, w, h)
img.shape, X.shape

Output:

((800, 800, 3), (1, 3, 800, 800))

Next, compute the feature map:

features = net.features[:29]
F = features(nd.array(X))
F.shape

Output:

(1, 512, 50, 50)

The next question is how to map the feature map F back to the original image.

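Fully convolutional network (FCN): mapping anchors back to the original image

The FCN in Faster R-CNN only shares some properties of an FCN; it is not a fully convolutional network in the strict sense. Faster R-CNN merely borrows the FCN idea to map the feature map back to the original image, producing many anchor boxes along the way.

One pixel of the feature map has a receptive field of \(16\times 16\); in other words, an anchor point on the feature map maps back to a \(16 \times 16\) region of the original image, which the paper calls the reference box. Anchor boxes of different scales and aspect ratios are then generated relative to this reference box.

base_size = 2**4  # receptive-field size of one feature-map pixel
scales = [8, 16, 32]  # scales of the anchor boxes relative to the reference box
ratios = [0.5, 1, 2]  # aspect ratios (height / width) of the anchor boxes relative to the reference box

The reference box also corresponds to the window (sliding window) described in the paper; this is explained later. First, look at what scales and ratios mean precisely.

For generality, let the height and width of the reference box be \(h, w\) and those of the anchor box be \(h_1, w_1\). The scale \(s\) and the ratio \(r\) are then formalized as formula 1: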

\[ \begin{cases} \frac{w_1 h_1}{wh} = s^2\\ \frac{h_1}{w_1} = \frac{h}{w} r \Rightarrow \frac{h_1}{h} = \frac{w_1}{w} r \end{cases} \]

This can be rewritten as formula 2:

\[ \begin{cases} \frac{w_1}{w} = \frac{s}{\sqrt{r}}\\ \frac{h_1}{h} = \frac{w_1}{w} r = s \sqrt{r} \end{cases} \]

or equivalently as formula 3:

\[ \begin{cases} w_s = \frac{w_1}{s} = \frac{w}{\sqrt{r}}\\ h_s = \frac{h_1}{s} = h \sqrt{r} \end{cases} \]

Either formula 2 or formula 3 makes it easy to compute \(w_1,h_1\). In the common case \(w=h\), formula 3 can also be written as formula 4:

\[ \begin{cases} w_s = \sqrt{\frac{wh}{r}}\\ h_s = w_s r \end{cases} \]
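As a numeric check of formula 2, the sketch below computes the size of one anchor box from a 16 x 16 reference box. `anchor_size` is a hypothetical helper introduced only for this illustration, and no rounding is applied.

```python
import math


def anchor_size(base_w, base_h, scale, ratio):
    """Width and height of an anchor box derived from a reference box (formula 2)."""
    w1 = base_w * scale / math.sqrt(ratio)
    h1 = base_h * scale * math.sqrt(ratio)
    return w1, h1


w1, h1 = anchor_size(16, 16, scale=8, ratio=0.5)
print(w1 * h1 / (16 * 16))  # area ratio, equal to scale**2 up to rounding error
print(h1 / w1)              # aspect ratio, equal to ratio up to rounding error
```

The area of the anchor is s^2 times that of the reference box and its height-to-width ratio is r, as formula 1 requires.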

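gluoncv implements formula 4, while this article follows formula 3. Regardless of the size of the original image, the reference box obtained by mapping the top-left anchor point of the feature map back to the original image is always bbox = (xmin, ymin, xmax, ymax) = (0, 0, base_size-1, base_size-1); for convenience it is called the base_reference box. Combining different values of s and r with the base_reference box yields anchor boxes of different scales and aspect ratios, called base_anchors. The implementation: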

class MultiBox(Box):
    def __init__(self, base_size, ratios, scales):
        if not base_size:
            raise ValueError("Invalid base_size: {}.".format(base_size))
        if not isinstance(ratios, (tuple, list)):
            ratios = [ratios]
        if not isinstance(scales, (tuple, list)):
            scales = [scales]
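        super().__init__([0]*2+[base_size-1]*2)  # the receptive field of one feature-map pixel has size base_size
        # aspect ratios (height / width) of the anchor boxes relative to the reference box
        self._ratios = np.array(ratios)[:, None]
        self._scales = np.array(scales)     # scales of the anchor boxes relative to the reference box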

    @property
    def base_anchors(self):
        ws = np.round(self.w / np.sqrt(self._ratios))
        w = ws * self._scales
        h = w * self._ratios
        wh = np.stack([w.flatten(), h.flatten()], axis=1)
        wh = (wh - 1) * .5
        return np.concatenate([self.whctrs - wh, self.whctrs + wh], axis=1)

    def _generate_anchors(self, stride, alloc_size):
        # propagate to all locations by shifting offsets
        height, width = alloc_size  # size of the feature map
        offset_x = np.arange(0, width * stride, stride)
        offset_y = np.arange(0, height * stride, stride)
        offset_x, offset_y = np.meshgrid(offset_x, offset_y)
        offsets = np.stack((offset_x.ravel(), offset_y.ravel(),
                            offset_x.ravel(), offset_y.ravel()), axis=1)
        # broadcast_add (1, N, 4) + (M, 1, 4)
        anchors = (self.base_anchors.reshape(
            (1, -1, 4)) + offsets.reshape((-1, 1, 4)))
        anchors = anchors.reshape((1, 1, height, width, -1)).astype(np.float32)
        return anchors

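Here is the result:

base_size = 2**4  # receptive-field size of one feature-map pixel
scales = [8, 16, 32]  # scales of the anchor boxes relative to the reference box
ratios = [0.5, 1, 2]  # aspect ratios (height / width) of the anchor boxes relative to the reference box
A = MultiBox(base_size,ratios, scales)
A.base_anchors

Output: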

array([[ -84.,  -38.,   99.,   53.],
       [-176.,  -84.,  191.,   99.],
       [-360., -176.,  375.,  191.],
       [ -56.,  -56.,   71.,   71.],
       [-120., -120.,  135.,  135.],
       [-248., -248.,  263.,  263.],
       [ -36.,  -80.,   51.,   95.],
       [ -80., -168.,   95.,  183.],
       [-168., -344.,  183.,  359.]])

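Next, slide base_anchors over the whole original image. For example, if the feature map has size (5, 5) and the receptive field has size 50, the base_reference box slides over the original image (with stride 50) as shown in the figure below:

x, y = np.mgrid[0:300:50, 0:300:50]
plt.pcolor(x, y, x+y);  # x and y form the grid; x+y is the color value at (x, y)

Output:

The original image is divided into 25 blocks, each of which represents one reference box. With 9 base_anchors, sliding with stride = 50 yields all the anchor boxes of these 25 blocks (5x5x9=225 in total). For the feature map F computed earlier: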

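stride = 16  # sliding stride
alloc_size = F.shape[2:]  # size of the feature map
A._generate_anchors(stride, alloc_size).shape

Output:

(1, 1, 50, 50, 36)

That is \(50\times 50 \times 9=22500\) anchor boxes in total (a very large number, many of which necessarily overlap heavily). This completes the generation of the initial anchor boxes. Note that the anchors are generated with NumPy alone, which makes the code easy to port to frameworks such as PyTorch and TensorFlow that accept NumPy input. Only MXNet is considered below; other frameworks are left for later. First, see how the design of MultiBox helps subsequent development with MXNet.

Once the structure of the base network is chosen it is fixed. If A._generate_anchors() had to be called again for every image size, much of the computation would be redundant. gluoncv offers a neat alternative: first generate base_anchors, then choose a fairly large alloc_size (for example \(128\times 128\)) to build a template of candidate anchor boxes; the actual feature map is then passed to RPNAnchorGenerator, whose forward pass returns the anchor boxes of that feature map. The details are in the following code: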

from mxnet import gluon


class RPNAnchorGenerator(gluon.HybridBlock):
    r"""Generate the anchor boxes of the RPN.

    Parameters
    ----------
    alloc_size : tuple of int
        Size (H, W) of the pre-allocated anchor map, usually chosen large (e.g. 128x128).
        At inference time the input size can vary; the anchors for each input are then
        cropped directly from this large anchor map, which avoids regenerating them for
        every input.
    base_size : int
        Width or height of the reference anchor box. It is also used as the stride of the
        feature map with respect to the original image.
    ratios : iterable of float
        Aspect ratios (height / width) of the anchor boxes. A tuple or list is expected.
    scales : iterable of float
        Scales of the anchor boxes relative to the reference anchor box.
        The width and height of an anchor box are computed as:

        .. math::

            width_{anchor} = size_{base} \times scale \times \sqrt{ 1 / ratio}
            height_{anchor} = width_{anchor} \times ratio

    """

    def __init__(self, alloc_size, base_size, ratios, scales, **kwargs):
        super().__init__(**kwargs)
        # build the anchor template; the anchors of an actual feature map are obtained later by slicing
        anchors = MultiBox(base_size, ratios, scales)._generate_anchors(
            base_size, alloc_size)
        self.anchors = self.params.get_constant('anchor_', anchors)

    # pylint: disable=arguments-differ
    def hybrid_forward(self, F, x, anchors):
        """Slice anchors given the input image shape.

        Inputs:
            - **x**: input tensor with (1 x C x H x W) shape.
        Outputs:
            - **out**: output anchor with (1, N, 4) shape. N is the number of anchors.

        """
        a = F.slice_like(anchors, x * 0, axes=(2, 3))
        return a.reshape((1, -1, 4))

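The RPNAnchorGenerator above is adapted directly from gluoncv. Here is what it offers:

base_size = 2**4  # receptive-field size of one feature-map pixel
scales = [8, 16, 32]  # scales of the anchor boxes relative to the reference box
ratios = [0.5, 1, 2]  # aspect ratios (height / width) of the anchor boxes relative to the reference box
stride = base_size  # sliding stride on the original image
alloc_size = (128, 128)  # anchor template for a fairly large feature map
# generate anchors with RPNAnchorGenerator
A = RPNAnchorGenerator(alloc_size, base_size, ratios, scales)
A.initialize()
A(F)  # pass the feature map F directly to get its anchors

Output: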

[[[ -84.  -38.   99.   53.]
  [-176.  -84.  191.   99.]
  [-360. -176.  375.  191.]
  ...
  [ 748.  704.  835.  879.]
  [ 704.  616.  879.  967.]
  [ 616.  440.  967. 1143.]]]
<NDArray 1x22500x4 @cpu(0)>

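The shape 1x22500x4 is as expected. If the size of the feature map changes:

x = nd.zeros((1, 3, 75, 45))
A(x).shape

Output:

(1, 30375, 4)

Here \(30375 = 75 \times 45 \times 9\), again as expected.

This completes the task of mapping every anchor point of the feature map back to the original image and generating the anchor boxes.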
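The reason slicing a large template works can be illustrated with NumPy alone. This is a minimal sketch under stated assumptions: `shift_anchors` is a hypothetical helper (not the gluoncv implementation), and the two base anchors are copied from the base_anchors output shown earlier.

```python
import numpy as np


def shift_anchors(base_anchors, stride, height, width):
    """Place every base anchor at each cell of a (height, width) grid."""
    xs = np.arange(width) * stride
    ys = np.arange(height) * stride
    xs, ys = np.meshgrid(xs, ys)  # both of shape (height, width)
    offsets = np.stack([xs.ravel(), ys.ravel(), xs.ravel(), ys.ravel()], axis=1)
    # broadcast (1, K, 4) + (height * width, 1, 4)
    anchors = base_anchors[None, :, :] + offsets[:, None, :]
    return anchors.reshape(height, width, -1, 4)


# two of the nine base anchors, for brevity
base = np.array([[-84., -38., 99., 53.], [-56., -56., 71., 71.]])
template = shift_anchors(base, 16, 128, 128)  # generated once
direct = shift_anchors(base, 16, 50, 37)      # generated for one specific feature map
sliced = template[:50, :37]                   # cropped from the template instead
print(np.array_equal(direct, sliced))         # True
```

Because the offset of a cell depends only on its row and column, the anchors of any smaller feature map are a top-left crop of the template.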

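Translation-invariant anchors

Looking back at the implementation above, the translation invariance of the anchors mentioned in the paper is easy to understand: both the anchor points and the anchor boxes are obtained from the base_reference box by translation and convolution (which can also be viewed as a linear transformation). For convenience, let \(A\) denote the anchor boxes generated by RPNAnchorGenerator (placed in app/detection/anchor.py) in corner format (xmin,ymin,xmax,ymax), and let \(B\) denote the same anchor boxes converted to center format (xctr,yctr,w,h). Here (xmin,ymin) and (xmax,ymax) are the minimum and maximum corner coordinates of an anchor box, (xctr,yctr) is its center, and w,h are its width and height. Write \(a = (x_a,y_a,w_a,h_a) \in B\); that is, the subscript \(a\) marks an anchor box. \(A\) and \(B\) are two representations of the same anchor boxes.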
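The two representations can be sketched in plain Python. This is only an illustration: `corner_to_center` and `center_to_corner` are hypothetical helpers, and the width is assumed to be xmax - xmin (without the +1 used by the Box class).

```python
def corner_to_center(box):
    """(xmin, ymin, xmax, ymax) -> (xctr, yctr, w, h), assuming w = xmax - xmin."""
    xmin, ymin, xmax, ymax = box
    w = xmax - xmin
    h = ymax - ymin
    return (xmin + w / 2, ymin + h / 2, w, h)


def center_to_corner(box):
    """(xctr, yctr, w, h) -> (xmin, ymin, xmax, ymax), the inverse of corner_to_center."""
    xctr, yctr, w, h = box
    return (xctr - w / 2, yctr - h / 2, xctr + w / 2, yctr + h / 2)


a_corner = (-84.0, -38.0, 99.0, 53.0)   # first base anchor from the output above
a_center = corner_to_center(a_corner)
print(a_center)                          # (7.5, 7.5, 183.0, 91.0)
print(center_to_corner(a_center))        # (-84.0, -38.0, 99.0, 53.0)
```

The center (7.5, 7.5) is the center of the 16 x 16 base_reference box, as expected for an anchor generated from it.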

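gluoncv.nn.bbox provides a module that converts \(A\) to \(B\): BBoxCornerToCenter. It is used in the code below. First set up the environment:

cd ../app/

Then load the packages needed in this subsection:

from mxnet import init, gluon, autograd
from mxnet.gluon import nn
from gluoncv.nn.bbox import BBoxCornerToCenter
# custom packages
from detection.bbox import MultiBox
from detection.anchor import RPNAnchorGenerator

To make the handling of \(A\) and \(B\) easier to follow, first define a throwaway class (it will be discarded later):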

class RPNProposal(nn.HybridBlock):
    def __init__(self, channels, stride, base_size, ratios, scales, alloc_size, **kwargs):
        super().__init__(**kwargs)
        weight_initializer = init.Normal(0.01)

        with self.name_scope():
            self.anchor_generator = RPNAnchorGenerator(
                stride, base_size, ratios, scales, alloc_size)
            anchor_depth = self.anchor_generator.num_depth
            # create conv1
            self.conv1 = nn.HybridSequential()
            self.conv1.add(nn.Conv2D(channels, 3, 1, 1,
                                     weight_initializer=weight_initializer))
            self.conv1.add(nn.Activation('relu'))
            # create score
            # use sigmoid instead of softmax, reduce channel numbers
            self.score = nn.Conv2D(anchor_depth, 1, 1, 0,
                                   weight_initializer=weight_initializer)
            # create loc
            self.loc = nn.Conv2D(anchor_depth * 4, 1, 1, 0,
                                 weight_initializer=weight_initializer)
# concrete usage
channels = 256
base_size = 2**4  # receptive-field size of one feature-map pixel
scales = [8, 16, 32]  # scales of the anchor boxes relative to the reference box
ratios = [0.5, 1, 2]  # aspect ratios (height / width) of the anchor boxes relative to the reference box
stride = base_size  # sliding stride on the original image
alloc_size = (128, 128)  # anchor template for a fairly large feature map
self = RPNProposal(channels, stride, base_size, ratios, scales, alloc_size)
self.initialize()

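Now \(A\) can be converted to \(B\):

img, label = loader[0]  # load the image and its annotations
img = cv2.resize(img, (800, 800))  # resize to (800, 800)
imgs = nd.array(getX(img))  # convert to MXNet input format
xs = features(imgs)        # feature-map tensor
F = nd
A = self.anchor_generator(xs)    # anchor boxes in (xmin,ymin,xmax,ymax) format
box_to_center = BBoxCornerToCenter()
B = box_to_center(A)      # anchor boxes in (x,y,w,h) format

Bounding-box regression

The hand-designed anchor boxes \(B\) are not good enough for the subsequent Fast R-CNN detection; the three convolutional layers described in the paper, conv1, score and loc, are also needed. My understanding of the \(3 \times 3\) convolution conv1 is that it imitates the anchor-generation process: a convolution that preserves the spatial size reduces the dimensionality and at the same time plays the role of sliding a reference box of size base_size over the original image. In other words, conv1 imitates the generation of anchor points. Let \(G=\{p:(x,y,w,h)\}\) denote the bounding boxes produced by the RPN. The \(1\times 1\) convolution loc predicts the offsets of the center coordinates, width and height of \(p\) relative to the \(k\) anchor boxes generated at each pixel (anchor point), and the \(1\times 1\) convolution score decides whether an anchor box is an object (objectness, foreground) or background. Let \(G^* = \{p^*:(x^*,y^*,w^*,h^*)\}\) denote the set of ground-truth bounding boxes. Here \((x,y)\) and \((x^*,y^*)\) are the centers of the predicted and ground-truth boxes, and \((w, h)\) and \((w^*, h^*)\) are their widths and heights. In its "Training RPNs" section the paper states that, for training, conv1, loc and score are randomly initialized from a Gaussian distribution with mean \(0\) and standard deviation \(0.01\).
Next, see how conv1, loc and score are used: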

x = self.conv1(xs)
# output of score
raw_rpn_scores = self.score(x).transpose(axes=(0, 2, 3, 1)).reshape((0, -1,1))
rpn_scores = F.sigmoid(F.stop_gradient(raw_rpn_scores)) # convert to probabilities
# output of loc
rpn_box_pred = self.loc(x).transpose(axes=(0, 2, 3, 1)).reshape((0, -1, 4))

The loc convolution learns the offsets. The paper gives the following formulas:

\[ \begin{cases} t_x = \frac{x - x_a}{w_a} & t_y = \frac{y - y_a}{h_a}\\ t_x^* = \frac{x^* - x_a}{w_a} & t_y^* = \frac{y^* - y_a}{h_a}\\ t_w = \log(\frac{w}{w_a}) & t_h = \log(\frac{h}{h_a})\\ t_w^* = \log(\frac{w^*}{w_a}) & t_h^* = \log(\frac{h^*}{h_a}) \end{cases} \]

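This removes much of the effect of differing image sizes. So that the corrected anchor boxes G have the same mean and standard deviation as the ground-truth boxes \(G^*\), also set \(\sigma = (\sigma_x, \sigma_y, \sigma_w, \sigma_h), \mu = (\mu_x,\mu_y,\mu_w,\mu_h)\), the standard deviations and means corresponding to (x, y, w, h) of G. To make the distribution of the predicted offsets more uniform, the coordinates are therefore transformed once more: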

\[ \begin{cases} t_x = \frac{\frac{x - x_a}{w_a} - \mu_x}{\sigma_x}\\ t_y = \frac{\frac{y - y_a}{h_a} - \mu_y}{\sigma_y}\\ t_w = \frac{\log(\frac{w}{w_a}) - \mu_w}{\sigma_w}\\ t_h = \frac{\log(\frac{h}{h_a}) - \mu_h}{\sigma_h} \end{cases} \]

The same is done for \(G^*\) (omitted). Typically \(\sigma = (0.1, 0.1, 0.2, 0.2), \mu = (0, 0, 0, 0)\).

Since the output of loc is exactly \(\{(t_x, t_y, t_w, t_h)\}\), \((x, y, w, h)\) has to be recovered from it:

\[ \begin{cases} x = (t_x\sigma_x +\mu_x)w_a + x_a\\ y = (t_y\sigma_y +\mu_y)h_a + y_a\\ w = w_a e^{t_w\sigma_w + \mu_w}\\ h = h_a e^{t_h\sigma_h + \mu_h} \end{cases} \]
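The normalized offsets and their inversion can be checked with a round trip in plain Python. This is a minimal sketch under stated assumptions: `encode` and `decode` are hypothetical helpers that work on a single box in center format (x, y, w, h), and `decode` returns center format rather than corner format.

```python
import math

STDS = (0.1, 0.1, 0.2, 0.2)
MEANS = (0.0, 0.0, 0.0, 0.0)


def encode(box, anchor, stds=STDS, means=MEANS):
    """Normalized offsets (tx, ty, tw, th) of a center-format box relative to an anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    raw = ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))
    return tuple((t - m) / s for t, m, s in zip(raw, means, stds))


def decode(offsets, anchor, stds=STDS, means=MEANS):
    """Invert encode: recover (x, y, w, h) from normalized offsets and the anchor."""
    tx, ty, tw, th = (t * s + m for t, s, m in zip(offsets, stds, means))
    xa, ya, wa, ha = anchor
    return (tx * wa + xa, ty * ha + ya, wa * math.exp(tw), ha * math.exp(th))


anchor = (7.5, 7.5, 183.0, 91.0)   # an anchor in center format
box = (20.0, 12.0, 150.0, 120.0)   # an arbitrary box in center format
offsets = encode(box, anchor)
restored = decode(offsets, anchor)
print(restored)  # equals box up to floating-point error
```

Decoding the encoded offsets returns the original box, which confirms that the two sets of formulas are inverses of each other.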

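Converting bounding boxes from the \(A\) format to the \(B\) format is usually called encoding, and the reverse is called decoding. The NormalizedBoxCenterDecoder class in gluoncv.nn.coder implements the transformation above and also decodes \(G\).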

from gluoncv.nn.coder import NormalizedBoxCenterDecoder
stds = (0.1, 0.1, 0.2, 0.2)
means = (0., 0., 0., 0.)
box_decoder = NormalizedBoxCenterDecoder(stds, means)
roi = box_decoder(rpn_box_pred, B)  # decoded G

Clipping predicted boxes that extend beyond the image

For consistency, getX needs to be rewritten:

def getX(img):
    # convert img from (h, w, 3) to (1, 3, h, w)
    img = img.transpose((2, 0, 1))
    return np.expand_dims(img, 0)

The input of the RPN can be a batch of data:

imgs = []
labels = []
for img, label in loader:
    img = cv2.resize(img, (600, 800))  # resize to (width, height) = (600, 800)
    imgs.append(getX(img))
    labels.append(label)
imgs = nd.array(np.concatenate(imgs)) # a batch of images
labels = nd.array(np.stack(labels))  # a batch of annotations

This gives:

from gluoncv.nn.coder import NormalizedBoxCenterDecoder
from gluoncv.nn.bbox import BBoxCornerToCenter
stds = (0.1, 0.1, 0.2, 0.2)
means = (0., 0., 0., 0.)

fs = features(imgs)        # feature-map tensor
F = nd
A = self.anchor_generator(fs)    # anchor boxes in (xmin,ymin,xmax,ymax) format
box_to_center = BBoxCornerToCenter()
B = box_to_center(A)      # anchor boxes in (x,y,w,h) format
x = self.conv1(fs)    # conv1 convolution
raw_rpn_scores = self.score(x).transpose(axes=(0, 2, 3, 1)).reshape((0, -1,1))  # score before activation
rpn_scores = F.sigmoid(F.stop_gradient(raw_rpn_scores))    # score after activation
rpn_box_pred = self.loc(x).transpose(axes=(0, 2, 3, 1)).reshape((0, -1, 4))  # offsets (tx,ty,tw,th) predicted by loc
box_decoder = NormalizedBoxCenterDecoder(stds, means)
roi = box_decoder(rpn_box_pred, B)  # decoded G
print(roi.shape)

This yields the predicted G for the two images:

(2, 16650, 4)

Many of the RoIs generated at this point extend beyond the image, so they must be clipped. First clip every coordinate that is below \(0\):

x = F.maximum(roi, 0.0)

F.maximum(roi, 0.0) computes \(\max\{0, x\}\) elementwise. Next, clip the coordinates that exceed the image boundary:

shape = F.shape_array(imgs)  # shape of imgs
size = shape.slice_axis(axis=0, begin=2, end=None)  # spatial size of imgs
window = size.expand_dims(0)
window

Output:

[[800 600]]
<NDArray 1x2 @cpu(0)>

This is in (height, width) order, whereas the anchor boxes are generated in (width, height) order, so the order has to be reversed:

F.reverse(window, axis=1)

Result:

[[600 800]]
<NDArray 1x2 @cpu(0)>

The following m can therefore be used to test whether a box extends beyond the boundary:

m = F.tile(F.reverse(window, axis=1), reps=(2,)).reshape((0, -4, 1, -1))
m

Result:

[[[600 800 600 800]]]
<NDArray 1x1x4 @cpu(0)>

The clipped RoIs are then:

roi = F.broadcast_minimum(x, F.cast(m, dtype='float32'))

The whole clipping step can be done simply as follows:

from gluoncv.nn.bbox import BBoxClipToImage
clipper = BBoxClipToImage()
roi = clipper(roi, imgs)  # clip boxes that extend beyond the image
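The same two-step clipping can be written with NumPy for comparison. This is a minimal sketch: `clip_to_image` is a hypothetical helper, and the sample boxes are made up for illustration.

```python
import numpy as np


def clip_to_image(boxes, height, width):
    """Clip corner-format boxes of shape (..., 4) to the range [0, width] x [0, height]."""
    upper = np.array([width, height, width, height], dtype=boxes.dtype)
    return np.minimum(np.maximum(boxes, 0), upper)


boxes = np.array([[-10., 20., 650., 790.],
                  [5., -3., 100., 900.]])
clipped = clip_to_image(boxes, height=800, width=600)
print(clipped)
# [[  0.  20. 600. 790.]
#  [  5.   0. 100. 800.]]
```

The upper bound is ordered (width, height, width, height), the same order as the tensor m above.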

Removing boxes smaller than min_size

Boxes smaller than min_size are removed to filter the bounding boxes further:

min_size = 5  # minimum size of a box
xmin, ymin, xmax, ymax = roi.split(axis=-1, num_outputs=4) # split the coordinates
width = xmax - xmin  # widths of the boxes
height = ymax - ymin  # heights of the boxes
invalid = (width < min_size) + (height < min_size) # nonzero where the width or the height is below min_size

The < operator on tensors returns 1 where the condition holds and 0 otherwise. The sum of two such results is therefore nonzero for every box whose width or height is too small:

cond = invalid[0,:10]
cond.T

Result:

[[1. 0. 0. 0. 1. 0. 1. 1. 0. 2.]]
<NDArray 1x10 @cpu(0)>

The value 2 means that both conditions hold. The boxes can be filtered as follows:

F.where(cond, F.ones_like(cond)* -1, rpn_scores[0,:10]).T

Result:

[[-1. 0.999997 0.0511509 0.9994136  -1.  0.00826993 -1.   -1. 0.99783903 -1.]]
<NDArray 1x10 @cpu(0)>

This marks every invalid box among the first ten. Applying the same filter to all boxes:

score = F.where(invalid, F.ones_like(invalid) * -1, rpn_scores)  # filter score
invalid = F.repeat(invalid, axis=-1, repeats=4)
roi = F.where(invalid, F.ones_like(invalid) * -1, roi)  # filter RoI
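For comparison, the same masking can be sketched with NumPy boolean arrays. `mask_small_boxes` is a hypothetical helper and the sample data is made up for illustration.

```python
import numpy as np


def mask_small_boxes(rois, scores, min_size):
    """Set boxes narrower or shorter than min_size, and their scores, to -1."""
    width = rois[:, 2] - rois[:, 0]
    height = rois[:, 3] - rois[:, 1]
    invalid = (width < min_size) | (height < min_size)
    rois = np.where(invalid[:, None], -1.0, rois)
    scores = np.where(invalid, -1.0, scores)
    return rois, scores


rois_in = np.array([[0., 0., 10., 10.],
                    [0., 0., 3., 10.],
                    [0., 0., 2., 2.]])
scores_in = np.array([0.9, 0.8, 0.7])
rois_out, scores_out = mask_small_boxes(rois_in, scores_in, min_size=5)
print(scores_out)  # [ 0.9 -1.  -1. ]
```

Using a logical OR makes the intent explicit: a box is discarded as soon as either its width or its height is below min_size.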

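NMS (Non-maximum suppression)

First, a summary of the proposal-generation stage of the RPN so far:

1. Use base_net to obtain the feature map \(X\) of the original image \(I\);
2. From the structure of base_net, compute the receptive-field size of the feature map, base_size;
3. Using the different scales and aspect ratios, compute with MultiBox the \(k\) anchor boxes base_anchors of the anchor point at position \((0,0)\) of \(X\);
4. Use RPNAnchorGenerator to map \(X\) back to the original image and generate the anchor boxes \(A\) in corner format (xmin,ymin,xmax,ymax);
5. Convert the anchor boxes \(A\) to center format \((x,y,w,h)\), denoted \(B\);
6. Pass \(X\) through the convolution conv1 to obtain the imitated anchor points, also called rpn_features;
7. Pass rpn_features through the convolutional layer score to obtain the scores rpn_score;
8. In parallel with step 7, pass rpn_features through the convolutional layer loc to obtain the bounding-box regression offsets rpn_box_pred;
9. Correct the anchor boxes \(B\) with rpn_box_pred and decode them into \(G\);
10. Clip the parts of \(G\) that extend beyond the original image and remove the boxes smaller than min_size.

Although the steps above remove many invalid boxes and clip those extending beyond the image, \(G\) still contains many heavily overlapping boxes. Feeding \(G\) directly to the RoI Pooling layer as region proposals would impose a very heavy computational load, and the background boxes in \(G\) far outnumber the object boxes, which is very unfavorable for training. The paper uses NMS to address this: based on the predicted rpn_score, non-maximum suppression is applied to G to remove RoIs with low scores and RoIs that duplicate one another. MXNet provides nd.contrib.box_nms for this purpose: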

nms_thresh = 0.7
n_train_pre_nms = 12000  # number of bboxes before NMS during training
n_train_post_nms = 2000  # number of bboxes after NMS during training
n_test_pre_nms = 6000   # number of bboxes before NMS during testing
n_test_post_nms = 300   # number of bboxes after NMS during testing

scores, rois = score, roi  # filtered scores and RoIs from the previous step
pre = F.concat(scores, rois, dim=-1)  # concatenate score and roi
# non-maximum suppression
tmp = F.contrib.box_nms(pre, overlap_thresh=nms_thresh, topk=n_train_pre_nms,
                        coord_start=1, score_index=0, id_index=-1, force_suppress=True)

# slice post_nms number of boxes
result = F.slice_axis(tmp, axis=1, begin=0, end=n_train_post_nms)
rpn_scores = F.slice_axis(result, axis=-1, begin=0, end=1)
rpn_bbox = F.slice_axis(result, axis=-1, begin=1, end=None)

This wrapper hides everything, so the concrete implementation is not visible, which does not help in understanding how NMS actually works. To understand NMS better, it is well worth implementing it oneself.
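As a reference point, greedy NMS can be sketched in a few lines of NumPy. This is a minimal illustration, not the algorithm used inside nd.contrib.box_nms: `nms` is a hypothetical helper, boxes are in corner format with area (xmax - xmin) * (ymax - ymin), and every box is assumed to have positive area.

```python
import numpy as np


def nms(boxes, scores, iou_thresh=0.7, top_k=None):
    """Greedy NMS: return the indices of the kept boxes, highest score first."""
    order = np.argsort(-scores)          # indices sorted by descending score
    if top_k is not None:
        order = order[:top_k]
    keep = []
    while order.size > 0:
        i = order[0]                     # best remaining box
        keep.append(int(i))
        rest = order[1:]
        # intersection of box i with every remaining box
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_rest - inter)
        # drop every box that overlaps box i too much
        order = rest[iou <= iou_thresh]
    return keep


boxes = np.array([[0., 0., 10., 10.],
                  [1., 1., 11., 11.],
                  [20., 20., 30., 30.]])
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores, iou_thresh=0.5))  # [0, 2]
```

The first two boxes overlap with an IoU of 81 / 119, about 0.68, so the lower-scoring one is suppressed at a threshold of 0.5 but would be kept at 0.7.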

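Sort scores in descending order and return the indices:

scores = scores.flatten()  # drop the trailing singleton axis
# sort scores in descending order and return the indices
orders = scores.argsort()[:,::-1]

Because loc produces far too many boxes, only the n_train_pre_nms boxes with the highest scores are considered:

keep = orders[:,:n_train_pre_nms]

Consider a single image first; training on several images at once is considered later:

order = keep[0]   # descending-score indices of the first image
score = scores[0][order]  # predicted scores of the first image in descending order
roi = rois[0][order]     # predicted boxes of the first image, sorted by descending score
label = labels[0]     # ground-truth boxes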

Although Box accepts nd input, computing many IoUs this way is inefficient:

%%timeit
GT = [Box(corner) for corner in label]  # instantiate the ground-truth boxes
G = [Box(corner) for corner in roi]   # instantiate the predicted boxes
ious = nd.zeros((len(G), len(GT)))    # initialize the IoU matrix
for i, A in enumerate(G):
    for j, B in enumerate(GT):
        iou = A.IoU(B)
        ious[i, j] = iou

Timing result:

1min 10s ± 6.08 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Converting to NumPy first makes the IoU computation faster:

%%timeit
GT = [Box(corner) for corner in label.asnumpy()]  # instantiate the ground-truth boxes
G = [Box(corner) for corner in roi.asnumpy()]   # instantiate the predicted boxes
ious = nd.zeros((len(G), len(GT)))    # initialize the IoU matrix
for i, A in enumerate(G):
    for j, B in enumerate(GT):
        iou = A.IoU(B)
        ious[i, j] = iou

Timing result:

6.88 s ± 410 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

That is about 10 times faster than with nd. What happens if NumPy is used throughout?

%%timeit
GT = [Box(corner) for corner in label.asnumpy()]  # instantiate the ground-truth boxes
G = [Box(corner) for corner in roi.asnumpy()]   # instantiate the predicted boxes
ious = np.zeros((len(G), len(GT)))    # initialize the IoU matrix
for i, A in enumerate(G):
    for j, B in enumerate(GT):
        iou = A.IoU(B)
        ious[i, j] = iou

Timing result:

796 ms ± 30.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

That is roughly 9 times faster again. For this reason, all later IoU-related computations use NumPy only. Wrap this in a function group_ious:

def group_ious(pred_bbox, true_bbox):
    # compute the IoU of every (pred_bbox, true_bbox) pair
    GT = [Box(corner) for corner in true_bbox]  # instantiate the ground-truth boxes
    G = [Box(corner) for corner in pred_bbox]   # instantiate the predicted boxes
    ious = np.zeros((len(G), len(GT)))    # initialize the IoU matrix
    for i, A in enumerate(G):
        for j, B in enumerate(GT):
            iou = A.IoU(B)
            ious[i, j] = iou
    return ious
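The double loop can also be replaced by NumPy broadcasting. The following is a minimal sketch, not the article's implementation: `pairwise_iou` is a hypothetical helper that mirrors the conventions of the Box class above (areas with +1, intersection without +1).

```python
import numpy as np


def pairwise_iou(pred_bbox, true_bbox):
    """IoU matrix of shape (N, M) between N predicted and M ground-truth corner boxes."""
    pred = np.asarray(pred_bbox, dtype=float)[:, None, :]  # (N, 1, 4)
    true = np.asarray(true_bbox, dtype=float)[None, :, :]  # (1, M, 4)
    xmin = np.maximum(pred[..., 0], true[..., 0])
    ymin = np.maximum(pred[..., 1], true[..., 1])
    xmax = np.minimum(pred[..., 2], true[..., 2])
    ymax = np.minimum(pred[..., 3], true[..., 3])
    # negative extents mean no overlap, so they are clipped to zero
    inter = np.clip(xmax - xmin, 0, None) * np.clip(ymax - ymin, 0, None)
    area_p = (pred[..., 2] - pred[..., 0] + 1) * (pred[..., 3] - pred[..., 1] + 1)
    area_t = (true[..., 2] - true[..., 0] + 1) * (true[..., 3] - true[..., 1] + 1)
    return inter / (area_p + area_t - inter)


ious = pairwise_iou([[0, 0, 15, 15]], [[5, 5, 12, 12], [100, 100, 110, 110]])
print(ious.shape)  # (1, 2)
print(ious[0, 0])  # 49 / 271, the same value as A.IoU(B) in the Box example
```

The first entry reproduces the IoU of the earlier Box example, and the second is 0 because the boxes do not overlap.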

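Refactoring the code

The code so far is somewhat messy. To make later development easier, it is reorganized here. Consider a single image first and set up the input and part of the output of the RPN:

# initial RPN settings
channels = 256   # number of output channels of conv1
base_size = 2**4  # receptive-field size of one feature-map pixel
scales = [8, 16, 32]  # scales of the anchor boxes relative to the reference box
ratios = [0.5, 1, 2]  # aspect ratios (height / width) of the anchor boxes relative to the reference box
stride = base_size  # sliding stride on the original image
alloc_size = (128, 128)  # anchor template for a fairly large feature map
# helper class for understanding the RPN
self = RPNProposal(channels, stride, base_size, ratios, scales, alloc_size)
self.initialize()   # initialize the convolutional layers conv1, loc, score
stds = (0.1, 0.1, 0.2, 0.2)  # standard deviations of the offsets
means = (0., 0., 0., 0.)    # means of the offsets
# encoding of the anchor boxes
box_to_center = BBoxCornerToCenter()  # convert (xmin,ymin,xmax,ymax) to (x,y,w,h)
# correct the anchor boxes with the offsets and decode to (xmin,ymin,xmax,ymax)
box_decoder = NormalizedBoxCenterDecoder(stds, means)
clipper = BBoxClipToImage()  # clip boxes that extend beyond the image
# take one COCO image for the experiment
img, label = loader[0]   # get one image
img = cv2.resize(img, (800, 800))  # resize to (800, 800)
imgs = nd.array(getX(img))      # convert to MXNet input format
# extract the features of the last convolutional layer
net = vgg16(pretrained=True)    # load the weights of the base network
features = net.features[:29]    # convolutional feature extractor
fs = features(imgs)  # feature-map tensor
A = self.anchor_generator(fs)    # anchor boxes in (xmin,ymin,xmax,ymax) format
B = box_to_center(A)      # encode as anchor boxes in (x,y,w,h) format
x = self.conv1(fs)    # conv1 convolution
# score before the sigmoid activation
raw_rpn_scores = self.score(x).transpose(axes=(0, 2, 3, 1)).reshape((0, -1, 1))
rpn_scores = nd.sigmoid(nd.stop_gradient(raw_rpn_scores))    # score after activation
# offsets (tx,ty,tw,th) predicted by loc
rpn_box_pred = self.loc(x).transpose(axes=(0, 2, 3, 1)).reshape((0, -1, 4))
# correct the anchor box coordinates
roi = box_decoder(rpn_box_pred, B)  # decoded predicted boxes G (RoIs)
print(roi.shape)   # before clipping
roi = clipper(roi, imgs)  # clip boxes that extend beyond the image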

Although roi has been clipped to the image, some of the boxes are simply too small to be useful for training and must be discarded. They are discarded by setting both their coordinates and their scores to \(-1\):

def size_control(F, min_size, rois, scores):
    # split the coordinates
    xmin, ymin, xmax, ymax = rois.split(axis=-1, num_outputs=4)
    width = xmax - xmin  # widths of the boxes
    height = ymax - ymin  # heights of the boxes
    # nonzero where the width or the height is below min_size
    invalid = (width < min_size) + (height < min_size)
    # set the coordinates and scores of the invalid boxes to -1
    scores = F.where(invalid, F.ones_like(invalid) * -1, scores)
    invalid = F.repeat(invalid, axis=-1, repeats=4)
    rois = F.where(invalid, F.ones_like(invalid) * -1, rois)
    return rois, scores
min_size = 16  # minimum box size
pre_nms = 12000  # number of bboxes before NMS
post_nms = 2000  # number of bboxes after NMS
rois, scores = size_control(nd, min_size, roi, rpn_scores)

So that Box can compute the IoU of many bboxes at once, it is rewritten as follows:

class BoxTransform(Box):
    '''
    Operations on a group of bboxes
    '''
    def __init__(self, F, corners):
        '''
        F can be mxnet.nd, numpy, torch.tensor
        '''
        super().__init__(corners)
        self.corner = corners.T
        self.F = F

    def __and__(self, other):
        r'''
        Operator `&`: intersection
        '''
        xmin = self.F.maximum(self.corner[0].expand_dims(
            0), other.corner[0].expand_dims(1))  # the larger xmin
        xmax = self.F.minimum(self.corner[2].expand_dims(
            0), other.corner[2].expand_dims(1))  # the smaller xmax
        ymin = self.F.maximum(self.corner[1].expand_dims(
            0), other.corner[1].expand_dims(1))  # the larger ymin
        ymax = self.F.minimum(self.corner[3].expand_dims(
            0), other.corner[3].expand_dims(1))  # the smaller ymax
        w = xmax - xmin
        h = ymax - ymin
        cond = (w <= 0) + (h <= 0)
        I = self.F.where(cond, self.F.zeros_like(cond), w * h)
        return I

    def __or__(self, other):
        r'''
        Operator `|`: union
        '''
        I = self & other
        U = self.area.expand_dims(0) + other.area.expand_dims(1) - I
        return self.F.where(U < 0, self.F.zeros_like(I), U)

    def IoU(self, other):
        '''
        Intersection over union
        '''
        I = self & other
        U = self | other
        return self.F.where(U == 0, self.F.zeros_like(I), I / U)

A quick test first:

a = BoxTransform(nd, nd.array([[[0, 0, 15, 15]]]))
b = BoxTransform(nd, 1+nd.array([[[0, 0, 15, 15]]]))
c = BoxTransform(nd, nd.array([[[-1, -1, -1, -1]]]))

This creates two simple valid bboxes (a, b) and one invalid bbox (c). Their operations give:

a & b, a | b, a.IoU(b)

Output:

(
 [[[196.]]] <NDArray 1x1x1 @cpu(0)>, [[[316.]]] <NDArray 1x1x1 @cpu(0)>, [[[0.62025315]]] <NDArray 1x1x1 @cpu(0)>)

With the invalid box, the results are:

a & c, a | c, a.IoU(c)

Output:

(
 [[[0.]]]
 <NDArray 1x1x1 @cpu(0)>,
 [[[257.]]]
 <NDArray 1x1x1 @cpu(0)>,
 [[[0.]]]
 <NDArray 1x1x1 @cpu(0)>)

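If the invalid box is regarded as the empty set, these results agree with intuition. The next question is how to label the training set.

According to "MSCOCO Annotation Details", COCO bboxes are in (x,y,w,h) format, where (x,y) is the top-left corner rather than the center. For consistency they are converted to (xmin,ymin,xmax,ymax) format:

labels = nd.array(label)  # (x,y,w,h), where (x,y) is the top-left corner
# xmax = x + w - 1 and ymax = y + h - 1 under the inclusive pixel convention used by Box
labels[:,2:4] = labels[:,:2] + labels[:,2:4] - 1 # convert to (xmin,ymin,xmax,ymax)

Next, compute the ious between the ground-truth boxes and the predicted boxes:

# sort scores in descending order and return the indices
orders = scores.reshape((0, -1)).argsort()[:, ::-1][:,:pre_nms]
scores = scores[0][orders[0]]  # scores in descending order
rois = rois[0][orders[0]]  # rois sorted by descending score
# predicted boxes
G = BoxTransform(nd, rois.expand_dims(0))
# ground-truth boxes
GT = BoxTransform(nd, labels[:,:4].expand_dims(0))
ious = G.IoU(GT).T
ious.shape

Output: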