Faster R-CNN Paper Reading Summary

  Paper link: https://arxiv.org/pdf/1506.01497.pdf

  Code: https://github.com/ShaoqingRen/faster_rcnn (MATLAB)
        https://github.com/rbgirshick/py-faster-rcnn (Python)

  • Abstract
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [1] and Fast R-CNN [2] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck.
# State-of-the-art object detection networks rely on region proposal algorithms to hypothesize object locations. Advances such as SPPnet [1] and Fast R-CNN [2] have cut the running time of the detection networks themselves, exposing region proposal computation as the bottleneck.

In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position.
# In this work we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, so region proposals cost almost nothing to compute. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position.

The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features. Using the recently popular terminology of neural networks with "attention" mechanisms, the RPN component tells the unified network where to look.
# The RPN is trained end-to-end to generate high-quality region proposals, which Fast R-CNN then uses for detection. Borrowing the now-popular "attention" terminology, we merge RPN and Fast R-CNN into a single network that shares convolutional features, with the RPN telling the unified network where to look.

For the very deep VGG-16 model [3], our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been made publicly available.
# With the very deep VGG-16 model, our detection system runs at 5 fps on a GPU (including all steps) using only 300 proposals per image, while achieving state-of-the-art detection accuracy on PASCAL VOC 2007, 2012, and MS COCO. In the ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN formed the basis of the first-place entries in several tracks. The source code is publicly available.
 
  • Introduction
Recent advances in object detection are driven by the success of region proposal methods (e.g., [4]) and region-based convolutional neural networks (R-CNNs) [5]. Now, proposals are the test-time computational bottleneck in state-of-the-art detection systems.
# Recent progress in object detection has been driven by the success of region proposal methods (e.g., [4]) and region-based convolutional neural networks (R-CNNs) [5]. Proposals are now the test-time computational bottleneck of state-of-the-art detection systems.

Region proposal methods typically rely on inexpensive features and economical inference schemes. Selective Search [4], one of the most popular methods, greedily merges superpixels based on engineered low-level features. Yet when compared to efficient detection networks [2], Selective Search is an order of magnitude slower, at 2 seconds per image in a CPU implementation. EdgeBoxes [6] currently provides the best tradeoff between proposal quality and speed, at 0.2 seconds per image. Nevertheless, the region proposal step still consumes as much running time as the detection network.
# Region proposal methods typically rely on inexpensive features and economical inference schemes. Selective Search [4], one of the most popular, greedily merges superpixels based on engineered low-level features, yet it is an order of magnitude slower than efficient detection networks [2], taking 2 seconds per image on a CPU. EdgeBoxes [6] currently offers the best quality/speed tradeoff at 0.2 seconds per image. Even so, the region proposal step still consumes as much running time as the detection network itself.

One may note that fast region-based CNNs take advantage of GPUs, while the region proposal methods used in research are implemented on the CPU, making such runtime comparisons inequitable. An obvious way to accelerate proposal computation is to reimplement it for the GPU.
# One may note that fast region-based CNNs take advantage of the GPU, while the region proposal methods used in research are implemented on the CPU, making such runtime comparisons unfair. An obvious way to accelerate proposal computation is to reimplement it on the GPU.

In this paper, we show that an algorithmic change, computing proposals with a deep convolutional neural network, leads to an elegant and effective solution where proposal computation is nearly cost-free given the detection network's computation.
# In this paper we show that an algorithmic change, computing proposals with a deep convolutional neural network, leads to an elegant and effective solution in which proposal computation is nearly cost-free given the detection network's computation.

Our observation is that the convolutional feature maps used by region-based detectors, like Fast R-CNN, can also be used for generating region proposals. On top of these convolutional features, we construct an RPN by adding a few additional convolutional layers that simultaneously regress region bounds and objectness scores at each location on a regular grid.
# Our observation is that the convolutional feature maps used by region-based detectors such as Fast R-CNN can also be used to generate region proposals. On top of these convolutional features we build an RPN by adding a few extra convolutional layers that simultaneously regress region bounds and objectness scores at each location on a regular grid.

RPNs are designed to efficiently predict region proposals with a wide range of scales and aspect ratios. In contrast to prevalent methods [8], [9], [1], [2] that use pyramids of images (Figure 1, a) or pyramids of filters (Figure 1, b), we introduce novel "anchor" boxes that serve as references at multiple scales and aspect ratios. Our scheme can be thought of as a pyramid of regression references (Figure 1, c), which avoids enumerating images or filters of multiple scales or aspect ratios.
# RPNs are designed to efficiently predict region proposals over a wide range of scales and aspect ratios. In contrast to prevalent methods that use pyramids of images (Figure 1, a) or pyramids of filters (Figure 1, b), we introduce "anchor" boxes that serve as references at multiple scales and aspect ratios. The scheme can be viewed as a pyramid of regression references (Figure 1, c), which avoids enumerating images or filters at multiple scales or aspect ratios.

To unify RPNs with Fast R-CNN [2] object detection networks, we propose a training scheme that alternates between fine-tuning for the region proposal task and then fine-tuning for object detection, while keeping the proposals fixed.
# To unify RPNs with Fast R-CNN [2], we propose a training scheme that alternates between fine-tuning for the region proposal task and fine-tuning for object detection while keeping the proposals fixed.

We comprehensively evaluate our method on the PASCAL VOC detection benchmarks [11] where RPNs with Fast R-CNNs produce detection accuracy better than the strong baseline of Selective Search with Fast R-CNNs.
# We comprehensively evaluate the method on the PASCAL VOC detection benchmarks [11], where RPNs with Fast R-CNN produce better detection accuracy than the strong baseline of Selective Search with Fast R-CNN.

A preliminary version of this manuscript was published previously [10].
# A preliminary version of this manuscript was published previously [10].
  • Related Work

   a) Object Proposals

There is a large literature on object proposal methods. Comprehensive surveys and comparisons of object proposal methods can be found in [19], [20], [21].
# There is a large body of literature on object proposal methods; [19], [20], [21] provide comprehensive surveys and comparisons.

   b) Deep Networks for Object Detection

The R-CNN method [5] trains CNNs end-to-end to classify the proposal regions into object categories or background. R-CNN mainly plays as a classifier, and it does not predict object bounds (except for refining by bounding box regression). Its accuracy depends on the performance of the region proposal module (see comparisons in [20]). Several papers have proposed ways of using deep networks for predicting object bounding boxes [25], [9], [26], [27].
# The R-CNN method [5] trains CNNs end-to-end to classify proposal regions into object categories or background. R-CNN mainly serves as a classifier; it does not predict object bounds (except for refinement by bounding-box regression), so its accuracy depends on the performance of the region proposal module (see comparisons in [20]). Several papers have proposed ways of using deep networks to predict object bounding boxes [25], [9], [26], [27].

Shared computation of convolutions [9], [1], [29],[7], [2] has been attracting increasing attention for efficient, yet accurate, visual recognition.
# Shared computation of convolutions [9], [1], [29], [7], [2] has been attracting increasing attention for efficient yet accurate visual recognition.
  • Faster R-CNN
Our object detection system, called Faster R-CNN, is composed of two modules. The first module is a deep fully convolutional network that proposes regions, and the second module is the Fast R-CNN detector [2] that uses the proposed regions. In Section 3.1 we introduce the designs and properties of the network for region proposal. In Section 3.2 we develop algorithms for training both modules with features shared.
# Our object detection system, called Faster R-CNN, consists of two modules: a deep fully convolutional network that proposes regions, and the Fast R-CNN detector [2] that uses those proposals. Section 3.1 introduces the design and properties of the region proposal network; Section 3.2 develops algorithms for training both modules with shared features.

   3.1 Region Proposal Networks

A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score. We model this process with a fully convolutional network [7], which we describe in this section. Because our ultimate goal is to share computation with a Fast R-CNN object detection network [2], we assume that both nets share a common set of convolutional layers. In our experiments, we investigate the Zeiler and Fergus model [32] (ZF), which has 5 shareable convolutional layers and the Simonyan and Zisserman model [3] (VGG-16), which has 13 shareable convolutional layers.
# A Region Proposal Network (RPN) takes an image of any size as input and outputs a set of rectangular object proposals, each with an objectness score. We model this process with a fully convolutional network [7], described in this section. Because the ultimate goal is to share computation with a Fast R-CNN detection network [2], we assume both nets share a common set of convolutional layers. Our experiments investigate the Zeiler and Fergus model [32] (ZF), which has 5 shareable convolutional layers, and the Simonyan and Zisserman model [3] (VGG-16), which has 13.

To generate region proposals, we slide a small network over the convolutional feature map output by the last shared convolutional layer.
# To generate region proposals, we slide a small network over the convolutional feature map output by the last shared convolutional layer.

   a). Anchors

At each sliding-window location, we simultaneously predict multiple region proposals, where the number of maximum possible proposals for each location is denoted as k. So the reg layer has 4k outputs encoding the coordinates of k boxes, and the cls layer outputs 2k scores that estimate probability of object or not object for each proposal. The k proposals are parameterized relative to k reference boxes, which we call anchors. An anchor is centered at the sliding window in question, and is associated with a scale and aspect ratio (Figure 3, left).
# At each sliding-window position we simultaneously predict multiple region proposals, with the maximum number per location denoted k. The reg layer therefore has 4k outputs encoding the coordinates of k boxes, and the cls layer has 2k outputs estimating the probability of object vs. not-object for each proposal. The k proposals are parameterized relative to k reference boxes, which we call anchors. An anchor is centered at the sliding window in question and is associated with a scale and an aspect ratio (Figure 3, left).
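As a concrete illustration, here is a minimal PyTorch sketch of such a head: a 3x3 convolution slides over the shared feature map, and two sibling 1x1 convolutions emit the 4k regression outputs and 2k objectness scores per location (the layer names and channel count are ours, not the released code's):

import torch
import torch.nn as nn

k = 9                    # anchors per location (3 scales x 3 aspect ratios)
in_channels = 512        # e.g. the last shared conv layer of VGG-16

rpn_conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
reg_head = nn.Conv2d(512, 4 * k, kernel_size=1)   # 4k box coordinates
cls_head = nn.Conv2d(512, 2 * k, kernel_size=1)   # 2k object/not-object scores

feat = torch.randn(1, in_channels, 38, 63)        # ~600x1000 image at stride 16
hidden = torch.relu(rpn_conv(feat))
print(reg_head(hidden).shape)                     # torch.Size([1, 36, 38, 63])
print(cls_head(hidden).shape)                     # torch.Size([1, 18, 38, 63])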

    Translation-Invariant Anchors

An important property of our approach is that it is translation invariant, both in terms of the anchors and the functions that compute proposals relative to the anchors. If one translates an object in an image,the proposal should translate and the same function should be able to predict the proposal in either location. This translation-invariant property is guaranteed by our method.
# An important property of our approach is translation invariance, both of the anchors themselves and of the functions that compute proposals relative to them. If an object in an image is translated, the proposal should translate with it, and the same function should predict the proposal at either location. Our method guarantees this property.

The translation-invariant property also reduces the model size.
# The translation-invariant property also reduces the model size.

    Multi-Scale Anchors as Regression References

Our design of anchors presents a novel scheme for addressing multiple scales (and aspect ratios). As shown in Figure 1, there have been two popular ways for multi-scale predictions. The first way is based on image/feature pyramids, e.g., in DPM [8] and CNN-based methods [9], [1], [2]. The second way is to use sliding windows of multiple scales (and/or aspect ratios) on the feature maps.
# Our anchor design presents a novel scheme for addressing multiple scales (and aspect ratios). As Figure 1 shows, there have been two popular approaches to multi-scale prediction: the first builds image/feature pyramids, e.g. DPM [8] and CNN-based methods [9], [1], [2]; the second slides windows of multiple scales (and/or aspect ratios) over the feature maps.

As a comparison, our anchor-based method is built on a pyramid of anchors, which is more cost-efficient.
# By comparison, our anchor-based method is built on a pyramid of anchors, which is more cost-efficient.

The design of multi-scale anchors is a key component for sharing features without extra cost for addressing scales.
# The design of multi-scale anchors is a key component for sharing features without extra cost for addressing scales.

   b). Loss Function

For training RPNs, we assign a binary class label (of being an object or not) to each anchor. We assign a positive label to two kinds of anchors: (i) the anchor/anchors with the highest Intersection-over-Union (IoU) overlap with a ground-truth box, or (ii) an anchor that has an IoU overlap higher than 0.7 with any ground-truth box. Note that a single ground-truth box may assign positive labels to multiple anchors. Usually the second condition is sufficient to determine the positive samples; but we still adopt the first condition for the reason that in some rare cases the second condition may find no positive sample. We assign a negative label to a non-positive anchor if its IoU ratio is lower than 0.3 for all ground-truth boxes. Anchors that are neither positive nor negative do not contribute to the training objective.
# For training RPNs we assign each anchor a binary class label (object or not). Two kinds of anchors get a positive label: (i) the anchor(s) with the highest IoU overlap with a ground-truth box, and (ii) any anchor whose IoU with some ground-truth box exceeds 0.7. Note that a single ground-truth box may assign positive labels to multiple anchors. Condition (ii) is usually sufficient, but we keep condition (i) because in rare cases (ii) finds no positive sample. A non-positive anchor gets a negative label if its IoU is below 0.3 for every ground-truth box. Anchors that are neither positive nor negative do not contribute to the training objective.
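A minimal NumPy sketch of this labeling rule (the function name is ours; handling of cross-boundary anchors and mini-batch sampling is omitted):

import numpy as np

def assign_anchor_labels(ious, hi=0.7, lo=0.3):
    # ious: (num_anchors, num_gt) IoU matrix.
    # Returns 1 (positive), 0 (negative), or -1 (ignored) per anchor.
    labels = -np.ones(ious.shape[0], dtype=int)   # default: ignore
    best_per_anchor = ious.max(axis=1)
    labels[best_per_anchor < lo] = 0              # negative: IoU < 0.3 for all GT
    labels[best_per_anchor >= hi] = 1             # positive rule (ii): IoU > 0.7
    # positive rule (i): the anchor(s) with the highest IoU for each GT box
    best_per_gt = ious.max(axis=0)
    for j in range(ious.shape[1]):
        if best_per_gt[j] > 0:
            labels[ious[:, j] == best_per_gt[j]] = 1
    return labels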

Our loss function for an image is defined as:
# Our loss function for an image is defined as:
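L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)

Here i indexes anchors in a mini-batch, p_i is the predicted probability that anchor i is an object, p_i^* is the ground-truth label (1 for positive anchors, 0 for negative), t_i is the vector of 4 predicted box coordinates, t_i^* is that of the ground-truth box associated with a positive anchor, L_cls is log loss over the two classes, and L_reg is the smooth L1 loss, activated only for positive anchors by the term p_i^* L_{reg}.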


The two terms are normalized by Ncls and Nreg and weighted by a balancing parameter λ.
# The two terms are normalized by N_cls and N_reg and weighted by a balancing parameter λ.
For bounding box regression, we adopt the parameterizations of the 4 coordinates following [5]:
# For bounding-box regression, we adopt the parameterizations of the 4 coordinates following [5]:
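t_x = (x - x_a)/w_a, \quad t_y = (y - y_a)/h_a, \quad t_w = \log(w/w_a), \quad t_h = \log(h/h_a)
t_x^* = (x^* - x_a)/w_a, \quad t_y^* = (y^* - y_a)/h_a, \quad t_w^* = \log(w^*/w_a), \quad t_h^* = \log(h^*/h_a)

(x, x_a, and x^* refer to the predicted box, the anchor box, and the ground-truth box respectively, and likewise for y, w, and h.)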
where x, y, w, and h denote the box’s center coordinates and its width and height.
# where x, y, w, and h denote the box's center coordinates and its width and height.

   c). Training RPNs

The RPN can be trained end-to-end by backpropagation and stochastic gradient descent (SGD) [35]. We follow the "image-centric" sampling strategy from [2] to train this network. Each mini-batch arises from a single image that contains many positive and negative example anchors.
# The RPN can be trained end-to-end by backpropagation and stochastic gradient descent (SGD) [35]. We follow the "image-centric" sampling strategy of [2]: each mini-batch arises from a single image that contains many positive and negative example anchors.

We randomly initialize all new layers by drawing weights from a zero-mean Gaussian distribution with standard deviation 0.01. All other layers (i.e., the shared convolutional layers) are initialized by pre-training a model for ImageNet classification [36], as is standard practice [5]. We tune all layers of the ZF net, and conv3_1 and up for the VGG net to conserve memory [2]. We use a learning rate of 0.001 for 60k mini-batches, and 0.0001 for the next 20k mini-batches on the PASCAL VOC dataset. We use a momentum of 0.9 and a weight decay of 0.0005 [37]. Our implementation uses Caffe [38].
# We randomly initialize all new layers with weights drawn from a zero-mean Gaussian with standard deviation 0.01; all other layers (i.e., the shared convolutional layers) are initialized from an ImageNet-pretrained classification model [36], the standard practice [5]. We tune all layers of the ZF net, and conv3_1 and up for VGG, to conserve memory [2]. On PASCAL VOC we use a learning rate of 0.001 for the first 60k mini-batches and 0.0001 for the next 20k, with momentum 0.9 and weight decay 0.0005 [37]. Our implementation uses Caffe [38].
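For reference, this schedule maps directly onto a stock SGD setup; a PyTorch sketch (the tiny stand-in module is ours, any nn.Module would do):

import torch

model = torch.nn.Conv2d(512, 512, 3, padding=1)   # stand-in for the RPN layers
optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=0.0005)
# Drop the learning rate to 0.0001 after 60k mini-batches (then run 20k more).
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[60000], gamma=0.1)
# Call optimizer.step() and scheduler.step() once per mini-batch.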

   3.2 Sharing Features for RPN and Fast R-CNN

Thus far we have described how to train a network for region proposal generation, without considering the region-based object detection CNN that will utilize these proposals. For the detection network, we adopt Fast R-CNN [2]. Next we describe algorithms that learn a unified network composed of RPN and Fast R-CNN with shared convolutional layers (Figure 2).
# So far we have described how to train a network for region proposal generation, without considering the region-based object detection CNN that will consume those proposals. For detection we adopt Fast R-CNN [2]. Next we describe algorithms that learn a unified network composed of RPN and Fast R-CNN with shared convolutional layers (Figure 2).

Both RPN and Fast R-CNN, trained independently, will modify their convolutional layers in different ways. We therefore need to develop a technique that allows for sharing convolutional layers between the two networks, rather than learning two separate networks. We discuss three ways for training networks with features shared:
# Trained independently, RPN and Fast R-CNN will each modify the convolutional layers in different ways, so we need a technique that lets the two networks share convolutional layers rather than learning two separate networks. We discuss three ways of training networks with shared features:

(i) Alternating training. In this solution, we first train RPN, and use the proposals to train Fast R-CNN. The network tuned by Fast R-CNN is then used to initialize RPN, and this process is iterated. This is the solution that is used in all experiments in this paper.
# (i) Alternating training. We first train the RPN and use its proposals to train Fast R-CNN; the network tuned by Fast R-CNN then initializes the RPN, and the process iterates. This is the solution used in all experiments in this paper.

(ii) Approximate joint training. In this solution, the RPN and Fast R-CNN networks are merged into one network during training as in Figure 2.
# (ii) Approximate joint training. Here the RPN and Fast R-CNN networks are merged into one network during training, as in Figure 2.
(iii) Non-approximate joint training. As discussed above, the bounding boxes predicted by RPN are also functions of the input.
# (iii) Non-approximate joint training. As discussed above, the bounding boxes predicted by the RPN are themselves functions of the input.

   a). 4-Step Alternating Training

In this paper, we adopt a pragmatic 4-step training algorithm to learn shared features via alternating optimization. In the first step, we train the RPN as described in Section 3.1.3. This network is initialized with an ImageNet-pre-trained model and fine-tuned end-to-end for the region proposal task.
# In this paper we adopt a pragmatic 4-step training algorithm that learns shared features via alternating optimization. Step 1: train the RPN as described in Section 3.1.3, initialized with an ImageNet-pretrained model and fine-tuned end-to-end for the region proposal task.

In the second step, we train a separate detection network by Fast R-CNN using the proposals generated by the step-1 RPN. This detection network is also initialized by the ImageNet-pre-trained model. At this point the two networks do not share convolutional layers.
# Step 2: train a separate Fast R-CNN detection network using the proposals generated by the step-1 RPN, also initialized from the ImageNet-pretrained model. At this point the two networks do not share convolutional layers.

In the third step, we use the detector network to initialize RPN training, but we fix the shared convolutional layers and only fine-tune the layers unique to RPN. Now the two networks share convolutional layers.
# Step 3: use the detector network to initialize RPN training, but fix the shared convolutional layers and fine-tune only the layers unique to the RPN. The two networks now share convolutional layers.

Finally, keeping the shared convolutional layers fixed, we fine-tune the unique layers of Fast R-CNN. As such, both networks share the same convolutional layers and form a unified network.
# Step 4: keeping the shared convolutional layers fixed, fine-tune the layers unique to Fast R-CNN. Both networks now share the same convolutional layers and form a unified network.
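Put schematically, the four steps look as follows; every helper below is a hypothetical placeholder standing in for a full training run, not an API from the released code:

def train_rpn(init_weights, freeze_shared=False): ...                # Section 3.1.3
def train_fast_rcnn(init_weights, proposals, freeze_shared=False): ...
def generate_proposals(rpn): ...

imagenet = "ImageNet-pretrained weights"

rpn1 = train_rpn(imagenet)                                           # step 1
frcnn1 = train_fast_rcnn(imagenet, generate_proposals(rpn1))         # step 2: no sharing yet
rpn2 = train_rpn(frcnn1, freeze_shared=True)                         # step 3: tune RPN-only layers
frcnn2 = train_fast_rcnn(frcnn1, generate_proposals(rpn2),
                         freeze_shared=True)                         # step 4: unified network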

   3.3 Implementation Details

We train and test both region proposal and object detection networks on images of a single scale [1], [2]. We re-scale the images such that their shorter side is s = 600 pixels [2]. Multi-scale feature extraction (using an image pyramid) may improve accuracy but does not exhibit a good speed-accuracy trade-off [2].
# We train and test both the region proposal and object detection networks on single-scale images [1], [2], re-scaled so that the shorter side is s = 600 pixels [2]. Multi-scale feature extraction (an image pyramid) may improve accuracy but does not offer a good speed-accuracy trade-off.

On the re-scaled images, the total stride for both ZF and VGG nets on the last convolutional layer is 16 pixels, and thus is ~10 pixels on a typical PASCAL image before resizing (~500×375). Even such a large stride provides good results, though accuracy may be further improved with a smaller stride.
# On the re-scaled images, the total stride at the last convolutional layer of both the ZF and VGG nets is 16 pixels, which corresponds to ~10 pixels on a typical PASCAL image before resizing (~500×375). Even this large stride gives good results, though a smaller stride might improve accuracy further.

For anchors, we use 3 scales with box areas of 128^2, 256^2, and 512^2 pixels, and 3 aspect ratios of 1:1, 1:2, and 2:1. These hyper-parameters are not carefully chosen for a particular dataset, and we provide ablation experiments on their effects in the next section.
# For anchors we use 3 scales with box areas of 128^2, 256^2, and 512^2 pixels, and 3 aspect ratios of 1:1, 1:2, and 2:1. These hyper-parameters were not carefully chosen for any particular dataset; the next section provides ablation experiments on their effects.
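A small NumPy sketch enumerating the 9 (width, height) pairs implied by these scales and ratios (the area-preserving formula is the usual convention; the released code's exact rounding may differ):

import numpy as np

scales = [128, 256, 512]          # sqrt of the box areas 128^2, 256^2, 512^2
ratios = [1.0, 0.5, 2.0]          # height/width for 1:1, 2:1, 1:2 boxes

anchors = []
for s in scales:
    for r in ratios:
        w = s / np.sqrt(r)        # keep area ~ s^2 while varying the ratio
        h = s * np.sqrt(r)
        anchors.append((int(round(w)), int(round(h))))
print(anchors)                    # 9 pairs, e.g. (128, 128), (181, 91), (91, 181)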

As discussed, our solution does not need an image pyramid or filter pyramid to predict regions of multiple scales, saving considerable running time. Figure 3 (right) shows the capability of our method for a wide range of scales and aspect ratios. Table 1 shows the learned average proposal size for each anchor using the ZF net.
# As discussed, our solution needs neither an image pyramid nor a filter pyramid to predict regions at multiple scales, saving considerable running time. Figure 3 (right) shows the method's coverage of a wide range of scales and aspect ratios, and Table 1 shows the learned average proposal size for each anchor using the ZF net.

We note that our algorithm allows predictions that are larger than the underlying receptive field. Such predictions are not impossible: one may still roughly infer the extent of an object if only the middle of the object is visible.
# Note that our algorithm allows predictions larger than the underlying receptive field. Such predictions are not impossible: if only the middle of an object is visible, one can still roughly infer its extent.

The anchor boxes that cross image boundaries need to be handled with care. During training, we ignore all cross-boundary anchors so they do not contribute to the loss.
# Anchor boxes that cross image boundaries need careful handling. During training we ignore all cross-boundary anchors, so they do not contribute to the loss.

For a typical 1000 × 600 image, there will be roughly 20000 (≈ 60 × 40 × 9) anchors in total. With the cross-boundary anchors ignored, there are about 6000 anchors per image for training.
# For a typical 1000×600 image there are roughly 20000 (≈ 60×40×9) anchors in total; with cross-boundary anchors ignored, about 6000 anchors per image remain for training.

If the boundary-crossing outliers are not ignored in training, they introduce large, difficult to correct error terms in the objective, and training does not converge. During testing, however, we still apply the fully convolutional RPN to the entire image. This may generate cross-boundary proposal boxes, which we clip to the image boundary.
# If boundary-crossing outliers are not ignored during training, they introduce large, hard-to-correct error terms into the objective and training does not converge. At test time, however, we still apply the fully convolutional RPN to the whole image; this may generate cross-boundary proposal boxes, which we clip to the image boundary.



Some RPN proposals highly overlap with each other. To reduce redundancy, we adopt non-maximum suppression (NMS) on the proposal regions based on their cls scores. We fix the IoU threshold for NMS at 0.7, which leaves us about 2000 proposal regions per image.
# Some RPN proposals overlap heavily with one another. To reduce redundancy we adopt non-maximum suppression (NMS) on the proposal regions based on their cls scores, fixing the NMS IoU threshold at 0.7, which leaves about 2000 proposal regions per image.
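For concreteness, a standard greedy NMS of the kind applied here, sketched in NumPy (not the paper's released implementation):

import numpy as np

def nms(boxes, scores, iou_thresh=0.7):
    # boxes: (N, 4) as (x1, y1, x2, y2); scores: (N,) cls scores.
    order = scores.argsort()[::-1]            # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # IoU of the top box with all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]       # drop boxes overlapping above 0.7
    return keep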

As we will show, NMS does not harm the ultimate detection accuracy, but substantially reduces the number of proposals. After NMS, we use the top-N ranked proposal regions for detection. In the following, we train Fast R-CNN using 2000 RPN proposals, but evaluate different numbers of proposals at test-time.
# As we will show, NMS does not harm the ultimate detection accuracy but substantially reduces the number of proposals. After NMS we use the top-N ranked proposal regions for detection. In what follows we train Fast R-CNN with 2000 RPN proposals but evaluate different numbers of proposals at test time.
  • Experiments

   4.1 Experiments on PASCAL VOC

   4.2 Experiments on MS COCO

   4.3 From MS COCO to PASCAL VOC

  • Conclusion
We have presented RPNs for efficient and accurate region proposal generation. By sharing convolutional features with the downstream detection network, the region proposal step is nearly cost-free.
# We have presented RPNs for efficient and accurate region proposal generation. By sharing convolutional features with the downstream detection network, the region proposal step becomes nearly cost-free.

Our method enables a unified, deep-learning-based object detection system to run at near real-time frame rates. The learned RPN also improves region proposal quality and thus the overall object detection accuracy.
# Our method enables a unified, deep-learning-based object detection system to run at near real-time frame rates. The learned RPN also improves region proposal quality and thus overall object detection accuracy.