目標檢測經典論文——R-CNN論文翻譯:Rich feature hierarchies for accurate object detection and semantic segmentation

Rich feature hierarchies for accurate object detection and semantic segmentation——Tech report (v5)

用於精確物體檢測和語義分割的豐富特徵層次結構——技術報告(第5版)

Ross Girshick   Jeff Donahue    Trevor Darrell    Jitendra Malik

 UC Berkeley

加州大學伯克利分校

{rbg,jdonahue,trevor,malik}@eecs.berkeley.edu

Abstract

Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012—achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks

(CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also compare R-CNN to OverFeat, a recently proposed sliding-window detector based on a similar CNN architecture. We find that R-CNN outperforms OverFeat by a large margin on the 200-class ILSVRC2013 detection dataset. Source code for the complete system is available at http://www.cs.berkeley.edu/˜rbg/rcnn.

摘要

過去幾年,在經典數據集PASCAL VOC上,物體檢測的性能已經進入平臺期。表現最好的方法是把多種低層圖像特徵與高層上下文信息相結合的複雜集成系統。在這篇論文裏,我們提出了一種簡單並且可擴展的檢測算法,相對於VOC 2012上此前的最好結果,mAP提升了30%以上——達到了53.3%。我們的方法結合了兩個關鍵思想:(1)將大容量的卷積神經網絡(CNNs)應用到自下而上的候選區域(region proposals)上,用以定位和分割物體;(2)當帶標籤的訓練數據不足時,先針對輔助任務進行有監督預訓練,再進行特定領域的調優(fine-tuning),就可以帶來明顯的性能提升。

因爲我們把region proposals和CNNs結合起來,所以該方法被稱爲R-CNN:Regions with CNN features。我們也把R-CNN與OverFeat做了比較,OverFeat是最近提出的一種基於類似CNN架構、採用滑動窗口進行目標檢測的方法。結果發現,R-CNN在200個類別的ILSVRC2013檢測數據集上的性能明顯優於OverFeat。系統完整的源代碼見:http://www.cs.berkeley.edu/˜rbg/rcnn。(譯者注:原網址已失效,github新地址:https://github.com/rbgirshick/rcnn)

1. Introduction

Features matter. The last decade of progress on various visual recognition tasks has been based considerably on the use of SIFT [29] and HOG [7]. But if we look at performance on the canonical visual recognition task, PASCAL VOC object detection [15], it is generally acknowledged that progress has been slow during 2010-2012, with small gains obtained by building ensemble systems and employing minor variants of successful methods.

1. 前言

特徵很重要。過去十年間,各類視覺識別任務的進展在很大程度上建立在對SIFT[29]和HOG[7]特徵的使用之上。但如果我們關注一下PASCAL VOC物體檢測[15]這個經典的視覺識別任務,就會發現2010-2012年間進展緩慢,僅有的微小進步也是通過構建集成系統和對已有成功方法做小的變種才取得的。

SIFT and HOG are blockwise orientation histograms, a representation we could associate roughly with complex cells in V1, the first cortical area in the primate visual pathway. But we also know that recognition occurs several stages downstream, which suggests that there might be hierarchical, multi-stage processes for computing features that are even more informative for visual recognition.

SIFT和HOG是塊方向直方圖(blockwise orientation histograms),一種可以大致與靈長類視覺通路第一個皮層區V1中的複雜細胞相對應的表示。但我們也知道,識別發生在其後的多個下游階段,(我們是先看到了一些特徵,然後才意識到這是什麼東西)這意味着可能存在層次化、多階段的特徵計算過程,能爲視覺識別提供更有價值的信息。

Fukushima’s 「neocognitron」 [19], a biologically-inspired hierarchical and shift-invariant model for pattern recognition, was an early attempt at just such a process. The neocognitron, however, lacked a supervised training algorithm. Building on Rumelhart et al. [33], LeCun et al. [26] showed that stochastic gradient descent via backpropagation was effective for training convolutional neural networks (CNNs), a class of models that extend the neocognitron.

Fukushima的「neocognitron」[19],一種受生物學啓發、用於模式識別的層次化且具有移位不變性的模型,是這方面較早的嘗試。然而,neocognitron缺乏有監督的訓練算法。在Rumelhart等人[33]工作的基礎上,LeCun等人[26]證明了基於反向傳播的隨機梯度下降(SGD)對訓練卷積神經網絡(CNNs)非常有效,而CNNs可以看作是neocognitron的一種擴展。

CNNs saw heavy use in the 1990s (e.g., [27]), but then fell out of fashion with the rise of support vector machines. In 2012, Krizhevsky et al. [25] rekindled interest in CNNs by showing substantially higher image classification accuracy on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [9, 10]. Their success resulted from training a large CNN on 1.2 million labeled images, together with a few twists on LeCun’s CNN (e.g., max(x, 0) rectifying non-linearities and 「dropout」 regularization).

CNNs在1990年代曾被廣泛使用(如[27]),但隨着支持向量機的崛起便淡出了研究主流。2012年,Krizhevsky等人[25]在ImageNet大規模視覺識別挑戰賽(ILSVRC)[9, 10]上展示了大幅領先的圖像分類準確率,重新燃起了人們對CNNs的興趣。他們的成功源於在120萬張帶標註的圖像上訓練了一個大型CNN,並對LeCun的CNN做了一些改造(比如max(x, 0)的ReLU非線性激活和「dropout」正則化)。

The significance of the ImageNet result was vigorously debated during the ILSVRC 2012 workshop. The central issue can be distilled to the following: To what extent do the CNN classification results on ImageNet generalize to object detection results on the PASCAL VOC Challenge?

ImageNet上這一結果的重要性在ILSVRC 2012 workshop上得到了熱烈的討論。提煉出來的核心問題是:ImageNet上的CNN分類結果在多大程度上能夠推廣到PASCAL VOC挑戰賽的物體檢測任務上?

We answer this question by bridging the gap between image classification and object detection. This paper is the first to show that a CNN can lead to dramatically higher object detection performance on PASCAL VOC as compared to systems based on simpler HOG-like features. To achieve this result, we focused on two problems: localizing objects with a deep network and training a high-capacity model with only a small quantity of annotated detection data.

我們通過彌合圖像分類和目標檢測之間的差距回答了這個問題。本文首次表明,與基於更簡單的類HOG特徵的系統相比,CNN可以在PASCAL VOC物體檢測任務上帶來大幅的性能提升。爲了取得這一結果,我們主要關注兩個問題:用深度網絡定位目標,以及只用少量標註的檢測數據訓練大容量模型。

Unlike image classification, detection requires localizing (likely many) objects within an image. One approach frames localization as a regression problem. However, work from Szegedy et al. [38], concurrent with our own, indicates that this strategy may not fare well in practice (they report a mAP of 30.5% on VOC 2007 compared to the 58.5% achieved by our method). An alternative is to build a sliding-window detector. CNNs have been used in this way for at least two decades, typically on constrained object categories, such as faces [32, 40] and pedestrians [35]. In order to maintain high spatial resolution, these CNNs typically only have two convolutional and pooling layers. We also considered adopting a sliding-window approach. However, units high up in our network, which has five convolutional layers, have very large receptive fields (195*195 pixels) and strides (32*32 pixels) in the input image, which makes precise localization within the sliding-window paradigm an open technical challenge.

與圖像分類不同,目標檢測需要定位圖像中的目標(往往有多個)。一種方法是把定位看作迴歸問題。但Szegedy等人[38]與我們同期的研究表明,這種策略在實際中的效果可能並不好(在VOC 2007上他們的mAP是30.5%,而我們的方法達到了58.5%)。另一種方法是構建滑動窗口檢測器。以這種方式使用CNNs至少已有20年,通常用於一些受限的目標類別,例如人臉[32, 40]和行人[35]。爲了保持較高的空間分辨率,這些CNNs通常只有兩個卷積層和池化層。我們也考慮過採用滑動窗口的方法。但是我們的網絡有5個卷積層,其高層單元在輸入圖像上具有非常大的感受野(195×195像素)和步長(32×32像素),這使得在滑動窗口範式下進行精確定位成爲一個開放的技術難題。

Instead, we solve the CNN localization problem by operating within the 「recognition using regions」 paradigm [21], which has been successful for both object detection [39] and semantic segmentation [5]. At test time, our method generates around 2000 category-independent region proposals for the input image, extracts a fixed-length feature vector from each proposal using a CNN, and then classifies each region with category-specific linear SVMs. We use a simple technique (affine image warping) to compute a fixed-size CNN input from each region proposal, regardless of the region’s shape. Figure 1 presents an overview of our method and highlights some of our results. Since our system combines region proposals with CNNs, we dub the method R-CNN: Regions with CNN features.

與之相反,我們通過在「recognition using regions」範式[21]下操作來解決CNN的定位問題,該範式已經在目標檢測[39]和語義分割[5]上取得了成功。測試時,我們的方法對輸入圖像生成大約2000個與類別無關的region proposals,用CNN從每個proposal中提取一個固定長度的特徵向量,再用特定類別的線性SVM對每個region進行分類。無論region的形狀如何,我們都用一種簡單的方法(仿射圖像變形)把每個region proposal變換成固定尺寸的CNN輸入。圖1展示了我們方法的概覽並突出了一些實驗結果。由於我們的系統把region proposals和CNNs結合在一起,所以把這種方法稱爲R-CNN:Regions with CNN features。

Figure 1: Object detection system overview. Our system (1) takes an input image, (2) extracts around 2000 bottom-up region proposals, (3) computes features for each proposal using a large convolutional neural network (CNN), and then (4) classifies each region using class-specific linear SVMs. R-CNN achieves a mean average precision (mAP) of 53.7% on PASCAL VOC 2010. For comparison, [39] reports 35.1% mAP using the same region proposals, but with a spatial pyramid and bag-of-visual-words approach. The popular deformable part models perform at 33.4%. On the 200-class ILSVRC2013 detection dataset, R-CNN’s mAP is 31.4%, a large improvement over OverFeat [34], which had the previous best result at 24.3%.

圖1:物體檢測系統概述。我們的系統(1)輸入一張圖像、(2)提取大約2000個自下而上的region proposals、(3)使用大型卷積神經網絡(CNN)計算每個region proposal的特徵、(4)使用特定類別的線性SVM對每個region進行分類。R-CNN在PASCAL VOC 2010上的mAP爲53.7%。作爲對比,文獻[39]使用相同的region proposals,但採用空間金字塔和bag-of-visual-words方法,mAP只有35.1%。流行的可變形部件模型的性能也只有33.4%。在200個類別的ILSVRC2013檢測數據集上,R-CNN的mAP爲31.4%,遠超OverFeat[34]此前24.3%的最佳結果。
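
(譯者注:下面用一段簡化的Python僞代碼勾勒圖1所示的測試流程,幫助理解各模塊的銜接。其中propose、warp、extract、svm_W等名稱均爲示意性的假設接口,並非論文的官方實現。)

```python
import numpy as np

def rcnn_detect(image, propose, warp, extract, svm_W, svm_b, nms_per_class):
    """R-CNN 測試流程示意(非官方實現):
    propose(image)   -> 約2000個候選框,形如 (R, 4) 的 [x1, y1, x2, y2]
    warp(image, box) -> 227x227x3 的變形圖像塊
    extract(patch)   -> 4096 維 CNN 特徵
    svm_W, svm_b     -> (4096, N) 權重與 (N,) 偏置,N 爲類別數
    nms_per_class    -> 貪婪非極大值抑制,返回保留框的下標"""
    boxes = propose(image)
    feats = np.stack([extract(warp(image, b)) for b in boxes])   # (R, 4096)
    scores = feats @ svm_W + svm_b                               # (R, N)
    results = []
    for c in range(scores.shape[1]):
        for i in nms_per_class(boxes, scores[:, c]):
            results.append((boxes[i], c, float(scores[i, c])))
    return results
```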

In this updated version of this paper, we provide a head-to-head comparison of R-CNN and the recently proposed OverFeat [34] detection system by running R-CNN on the 200-class ILSVRC2013 detection dataset. OverFeat uses a sliding-window CNN for detection and until now was the best performing method on ILSVRC2013 detection. We show that R-CNN significantly outperforms OverFeat, with a mAP of 31.4% versus 24.3%.

在本篇論文的更新版本中,我們在200類的ILSVRC2013檢測數據集上運行R-CNN,給出了R-CNN與最近提出的OverFeat[34]檢測系統的正面對比。OverFeat使用滑動窗口CNN做檢測,在此之前一直是ILSVRC2013檢測任務上表現最好的方法。結果顯示,R-CNN的mAP達到31.4%,顯著超越了OverFeat的24.3%。

A second challenge faced in detection is that labeled data is scarce and the amount currently available is insufficient for training a large CNN. The conventional solution to this problem is to use unsupervised pre-training, followed by supervised fine-tuning (e.g., [35]). The second principle contribution of this paper is to show that supervised pre-training on a large auxiliary dataset (ILSVRC), followed by domain-specific fine-tuning on a small dataset (PASCAL), is an effective paradigm for learning high-capacity CNNs when data is scarce. In our experiments, fine-tuning for detection improves mAP performance by 8 percentage points. After fine-tuning, our system achieves a mAP of 54% on VOC 2010 compared to 33% for the highly-tuned, HOG-based deformable part model (DPM) [17, 20]. We also point readers to contemporaneous work by Donahue et al. [12], who show that Krizhevsky’s CNN can be used (without finetuning) as a blackbox feature extractor, yielding excellent performance on several recognition tasks including scene classification, fine-grained sub-categorization, and domain adaptation.

檢測任務面臨的第二個挑戰是標註數據太少,目前可用的數據量不足以訓練一個大型CNN。解決這個問題的傳統方法是先進行無監督預訓練,再進行有監督調優(如[35])。本文的第二個核心貢獻是證明:在大型輔助數據集(ILSVRC)上進行有監督預訓練,再在小數據集(PASCAL)上針對特定領域進行調優,是在數據稀缺時訓練大容量CNN的有效範式。在我們的實驗中,針對檢測的調優將mAP提高了8個百分點。調優後,我們的系統在VOC 2010上達到了54%的mAP,遠超高度調優的、基於HOG的可變形部件模型(deformable part model,DPM)[17, 20]的33%。我們也提醒讀者關注Donahue等人[12]同時期的工作,他們的研究表明Krizhevsky的CNN(譯者注:即AlexNet網絡)可以(在不調優的情況下)作爲一個黑盒特徵提取器使用,在包括場景分類、細粒度子類別分類和領域自適應(domain adaptation)在內的多個識別任務上表現出色。

Our system is also quite efficient. The only class-specific computations are a reasonably small matrix-vector product and greedy non-maximum suppression. This computational property follows from features that are shared across all categories and that are also two orders of magnitude lower-dimensional than previously used region features (cf. [39]).

同時,我們的系統也非常高效。唯一與類別相關的計算只是一個相當小的矩陣-向量乘積和貪婪非極大值抑制。這一計算特性源自所有類別共享的特徵,而且這些特徵的維度比之前使用的區域特徵(參見[39])低兩個數量級。

Understanding the failure modes of our approach is also critical for improving it, and so we report results from the detection analysis tool of Hoiem et al. [23]. As an immediate consequence of this analysis, we demonstrate that a simple bounding-box regression method significantly reduces mislocalizations, which are the dominant error mode.

理解我們方法的失敗模式對於進一步改進也很關鍵,因此我們藉助Hoiem等人的檢測分析工具[23]給出結果分析。作爲這一分析的直接結果,我們證明了一個簡單的邊界框迴歸方法可以明顯減少錯誤定位,而錯誤定位正是我們系統的主要錯誤模式。

Before developing technical details, we note that because R-CNN operates on regions it is natural to extend it to the task of semantic segmentation. With minor modifications, we also achieve competitive results on the PASCAL VOC segmentation task, with an average segmentation accuracy of 47.9% on the VOC 2011 test set.

在展開技術細節之前,我們要指出:由於R-CNN是在區域上進行操作的,因此可以很自然地擴展到語義分割任務。只需很小的改動,我們就在PASCAL VOC語義分割任務上取得了有競爭力的結果,在VOC 2011測試集上的平均分割精度達到了47.9%。

2. Object detection with R-CNN

Our object detection system consists of three modules. The first generates category-independent region proposals. These proposals define the set of candidate detections available to our detector. The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region. The third module is a set of class-specific linear SVMs. In this section, we present our design decisions for each module, describe their test-time usage, detail how their parameters are learned, and show detection results on PASCAL VOC 2010-12 and on ILSVRC2013.

2. 使用R-CNN做物體檢測

我們的物體檢測系統有三個模塊構成。第一個模塊產生類別無關的region proposals。這些proposals組成了一個模型可用的候選檢測區域的集合。第二個模塊是一個大型卷積神經網絡,從每個region提取固定長度的特徵向量。第三個模塊是特定類別線性SVM的集合。這一節將展示每個模塊的設計,並介紹它們的測試階段的用法,以及一些參數學習的細節,並得出在PASCAL VOC 2010-12和ILSVRC2013上的檢測結果。

2.1. Module design

Region proposals. A variety of recent papers offer methods for generating category-independent region proposals. Examples include: objectness [1], selective search [39], category-independent object proposals [14], constrained parametric min-cuts (CPMC) [5], multi-scale combinatorial grouping [3], and Cireşan et al. [6], who detect mitotic cells by applying a CNN to regularly-spaced square crops, which are a special case of region proposals. While R-CNN is agnostic to the particular region proposal method, we use selective search to enable a controlled comparison with prior detection work (e.g., [39, 41]).

2.1 模塊設計

區域推薦(Region Proposals)。近來有很多研究都提出了生成類別無關的區域推薦的方法,比如:objectness [1]、selective search [39]、category-independent object proposals [14]、constrained parametric min-cuts(CPMC)[5]、multi-scale combinatorial grouping [3],以及Cireşan等人[6]把CNN應用在規則間隔的方形裁剪塊上以檢測有絲分裂細胞的方法,這也可以看作region proposals的一種特例。由於R-CNN並不依賴於特定的區域推薦方法,我們採用selective search,以便與之前的檢測工作(如[39, 41])進行可控的比較。

Feature extraction. We extract a 4096-dimensional feature vector from each region proposal using the Caffe [24] implementation of the CNN described by Krizhevsky et al. [25]. Features are computed by forward propagating a mean-subtracted 227*227 RGB image through five convolutional layers and two fully connected layers. We refer readers to [24, 25] for more network architecture details.

特徵提取。我們使用Krizhevsky等人[25]所描述的CNN(譯者注:AlexNet)的Caffe[24]實現版本,對每個推薦區域提取一個4096維的特徵向量。具體做法是把減去像素均值的227×227大小的RGB圖像前向傳播通過五個卷積層和兩個全連接層,從而計算得到特徵。更多網絡架構細節請讀者參考[24, 25]。

In order to compute features for a region proposal, we must first convert the image data in that region into a form that is compatible with the CNN (its architecture requires inputs of a fixed 227*227 pixel size). Of the many possible transformations of our arbitrary-shaped regions, we opt for the simplest. Regardless of the size or aspect ratio of the candidate region, we warp all pixels in a tight bounding box around it to the required size. Prior to warping, we dilate the tight bounding box so that at the warped size there are exactly p pixels of warped image context around the original box (we use p = 16). Figure 2 shows a random sampling of warped training regions. Alternatives to warping are discussed in Appendix A.

Figure 2: Warped training samples from VOC 2007 train.

爲了計算推薦區域的特徵,首先需要把該區域內的圖像數據轉換成CNN可以接受的形式(我們所用架構要求輸入是固定的227×227像素大小)。對於任意形狀的區域,有很多種可能的變換方式,我們選擇了最簡單的一種:無論候選區域的尺寸或長寬比如何,我們都把包圍它的緊湊邊界框內的所有像素變形(warp)到所需的尺寸。變形之前,我們先把緊邊框向外擴張,使得在變形後的尺寸上,原始框周圍恰好有p個像素的變形圖像上下文(我們使用p=16)。圖2展示了一些隨機採樣的變形訓練區域。其它的變形方法參見附錄A。

圖2:來自VOC 2007訓練集的變形訓練樣本。
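
(譯者注:下面給出上述「緊邊框外擴p個像素再變形到227×227」的一個簡化示意實現。這裏按變形後尺寸等比折算上下文寬度,像素均值採用按通道的近似值;cv2依賴與具體數值均爲譯者假設,並非論文官方代碼。)

```python
import numpy as np
import cv2

def warp_region(image, box, out_size=227, p=16, mean_bgr=(104.0, 117.0, 123.0)):
    """把候選框變形爲 CNN 輸入的簡化示意(非論文官方實現)。
    box: [x1, y1, x2, y2];p: 變形後圖像中圍繞原始框的上下文像素寬度;
    mean_bgr 爲假設的按通道像素均值,僅作演示。"""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    # 把「變形後 p 個像素的上下文」按比例折算回原圖座標(近似)
    pad_x = p * w / float(out_size - 2 * p)
    pad_y = p * h / float(out_size - 2 * p)
    x1, y1 = int(max(0, x1 - pad_x)), int(max(0, y1 - pad_y))
    x2 = int(min(image.shape[1], x2 + pad_x))
    y2 = int(min(image.shape[0], y2 + pad_y))
    crop = image[y1:y2, x1:x2]
    warped = cv2.resize(crop, (out_size, out_size)).astype(np.float32)
    return warped - np.array(mean_bgr, dtype=np.float32)   # 減去像素均值
```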

2.2. Test-time detection

At test time, we run selective search on the test image to extract around 2000 region proposals (we use selective search’s 「fast mode」 in all experiments). We warp each proposal and forward propagate it through the CNN in order to compute features. Then, for each class, we score each extracted feature vector using the SVM trained for that class. Given all scored regions in an image, we apply a greedy non-maximum suppression (for each class independently) that rejects a region if it has an intersection-over-union (IoU) overlap with a higher scoring selected region larger than a learned threshold.

2.2 測試階段的物體檢測

在測試階段,我們在測試圖像上運行selective search提取大約2000個推薦區域(所有實驗中我們使用selective search的「fast mode」)。對每一個推薦區域做變形,然後通過CNN前向傳播計算出特徵。接着,對於每個類別,我們用爲該類別訓練的SVM給每個提取出的特徵向量打分。給定一張圖像中所有打過分的區域,我們(對每個類別獨立地)應用貪婪非極大值抑制:如果一個區域與某個得分更高的已保留區域的IoU(intersection-over-union)重疊大於學習到的閾值,就捨棄該區域。
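
(譯者注:下面是貪婪非極大值抑制的一個簡化示意實現,對應上文「捨棄與得分更高的已保留區域IoU重疊超過閾值的區域」;其中默認閾值0.3僅爲演示用的假設值。)

```python
import numpy as np

def greedy_nms(boxes, scores, iou_threshold=0.3):
    """對單個類別做貪婪非極大值抑制的示意實現:
    按得分從高到低保留檢測框,丟棄與已保留框 IoU 重疊超過閾值的框。"""
    boxes = np.asarray(boxes, dtype=np.float32)   # (R, 4): x1, y1, x2, y2
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = np.argsort(scores)[::-1]              # 得分降序的下標
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # 計算當前框與其餘框的交併比 (IoU)
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_threshold]   # 只保留重疊不超過閾值的框
    return keep
```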

Run-time analysis. Two properties make detection efficient. First, all CNN parameters are shared across all categories. Second, the feature vectors computed by the CNN are low-dimensional when compared to other common approaches, such as spatial pyramids with bag-of-visual-word encodings. The features used in the UVA detection system [39], for example, are two orders of magnitude larger than ours (360k vs. 4k-dimensional).

運行時分析。兩個特性讓檢測變得很高效。首先,所有的CNN參數都是跨類別共享的。其次,通過CNN計算的特徵向量維度相比其他常見方法(比如spatial pyramids with bag-of-visual-word encodings)計算特徵的維度是很低的。例如,UVA檢測系統[39]中使用的特徵比我們的要多兩個數量級(360k維相比於4k維)。

The result of such sharing is that the time spent computing region proposals and features (13s/image on a GPU or 53s/image on a CPU) is amortized over all classes. The only class-specific computations are dot products between features and SVM weights and non-maximum suppression. In practice, all dot products for an image are batched into a single matrix-matrix product. The feature matrix is typically 2000*4096 and the SVM weight matrix is 4096*N, where N is the number of classes.

這種共享的結果就是:計算推薦區域和特徵的耗時(GPU上每張圖13秒,CPU上每張圖53秒)可以被分攤到所有類別上。唯一與類別相關的計算是特徵與SVM權重之間的點積,以及非極大值抑制。實際應用中,一張圖像的所有點積都可以批量化成一次矩陣-矩陣乘法。特徵矩陣通常是2000×4096(譯者注:通常產生2000個左右的推薦區域,每個推薦區域經過CNN產生4096維的向量),SVM權重矩陣是4096×N,其中N是類別數目。
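
(譯者注:上述批量打分可以用一次矩陣乘法表示,下面用numpy做一個維度示意,數值本身是隨機佔位。)

```python
import numpy as np

# 批量打分示意:2000 個候選區域的 4096 維特徵與 N 個類別的 SVM 權重做一次矩陣乘法。
R, D, N = 2000, 4096, 20                                  # 候選區域數、特徵維度、類別數(VOC 爲 20)
features = np.random.randn(R, D).astype(np.float32)       # CNN 提取的特徵矩陣 (2000, 4096)
svm_W = np.random.randn(D, N).astype(np.float32)          # SVM 權重矩陣 (4096, N)
svm_b = np.zeros(N, dtype=np.float32)
scores = features @ svm_W + svm_b                         # (2000, N):每個區域在每個類別上的得分
```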

This analysis shows that R-CNN can scale to thousands of object classes without resorting to approximate techniques, such as hashing. Even if there were 100k classes, the resulting matrix multiplication takes only 10 seconds on a modern multi-core CPU. This efficiency is not merely the result of using region proposals and shared features. The UVA system, due to its high-dimensional features, would be two orders of magnitude slower while requiring 134GB of memory just to store 100k linear predictors, compared to just 1.5GB for our lower-dimensional features.

分析表明R-CNN可以擴展到上千個類別,而不需要藉助諸如hashing之類的近似技術。即使有10萬個類別,相應的矩陣乘法在現代多核CPU上也只需要10秒。這種高效並不僅僅是使用了區域推薦和共享特徵的結果。UVA系統由於特徵維度高,僅存儲10萬個線性預測器就需要134GB的內存,而我們的低維特徵只需要1.5GB,因此UVA系統會比我們慢兩個數量級。
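
(譯者注:正文中1.5GB與134GB的數字可以按「每個權重4字節的float32」粗略覈算,如下。)

```python
# 存儲 10 萬個線性預測器所需內存的粗略覈算(假設每個權重爲 4 字節的 float32)
n_classes, ours_dim, uva_dim = 100_000, 4096, 360_000
ours_gb = n_classes * ours_dim * 4 / 1024**3    # ≈ 1.5 GB
uva_gb  = n_classes * uva_dim  * 4 / 1024**3    # ≈ 134 GB
print(f"ours: {ours_gb:.1f} GB, UVA: {uva_gb:.1f} GB")
```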

It is also interesting to contrast R-CNN with the recent work from Dean et al. on scalable detection using DPMs and hashing [8]. They report a mAP of around 16% on VOC 2007 at a run-time of 5 minutes per image when introducing 10k distractor classes. With our approach, 10k detectors can run in about a minute on a CPU, and because no approximations are made mAP would remain at 59% (Section 3.2).

將R-CNN與Dean等人最近使用DPM和hashing[8]做大規模檢測的工作對比也很有意思。在引入1萬個干擾類別時,他們在VOC 2007上的mAP約爲16%,每張圖片需要5分鐘的運行時間。採用我們的方法,1萬個檢測器在CPU上大約一分鐘就能跑完;並且由於沒有做任何近似,mAP仍能保持在59%(見3.2節)。

2.3. Training

Supervised pre-training. We discriminatively pre-trained the CNN on a large auxiliary dataset (ILSVRC2012 classification) using image-level annotations only (bounding-box labels are not available for this data). Pre-training was performed using the open source Caffe CNN library [24]. In brief, our CNN nearly matches the performance of Krizhevsky et al. [25], obtaining a top-1 error rate 2.2 percentage points higher on the ILSVRC2012 classification validation set. This discrepancy is due to simplifications in the training process.

2.3 訓練

有監督預訓練。我們在一個大型輔助數據集(ILSVRC2012分類任務)上,僅使用圖像級標註對CNN進行了判別式預訓練(該數據集沒有邊界框標籤)。預訓練採用了開源的Caffe CNN庫[24]。簡單地說,我們的CNN性能十分接近Krizhevsky等人的網絡,在ILSVRC2012分類驗證集上top-1錯誤率比他們高2.2個百分點。這一差異主要來自訓練過程的簡化。

Domain-specific fine-tuning. To adapt our CNN to the new task (detection) and the new domain (warped proposal windows), we continue stochastic gradient descent (SGD) training of the CNN parameters using only warped region proposals. Aside from replacing the CNN’s ImageNet-specific 1000-way classification layer with a randomly initialized (N+1)-way classification layer (where N is the number of object classes, plus 1 for background), the CNN architecture is unchanged. For VOC, N = 20 and for ILSVRC2013, N = 200. We treat all region proposals with >= 0.5 IoU overlap with a ground-truth box as positives for that box’s class and the rest as negatives. We start SGD at a learning rate of 0.001 (1/10th of the initial pre-training rate), which allows fine-tuning to make progress while not clobbering the initialization. In each SGD iteration, we uniformly sample 32 positive windows (over all classes) and 96 background windows to construct a mini-batch of size 128. We bias the sampling towards positive windows because they are extremely rare compared to background.

特定領域的參數調優。爲了讓CNN適應新的任務(檢測)和新的領域(變形後的推薦窗口),我們只使用變形後的推薦區域繼續對CNN參數進行SGD訓練。除了把CNN中ImageNet特定的1000路分類層替換成隨機初始化的(N+1)路分類層(其中N是目標類別數目,1代表背景)之外,網絡架構沒有改變。對於VOC,N=20;對於ILSVRC2013,N=200。對於所有的推薦區域,如果與某個真實標註框的IoU重疊大於等於0.5,就把它當作該框所屬類別的正例,否則就是負例。SGD初始學習率爲0.001(初始預訓練學習率的十分之一),這使得調優得以推進而不會破壞初始化的成果。每輪SGD迭代中,我們均勻採樣32個正例窗口(跨所有類別)和96個背景窗口,組成大小爲128的mini-batch。我們在採樣上偏向正例窗口,因爲和背景相比它們極其稀少。
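
(譯者注:下面示意fine-tuning階段mini-batch的構造方式:每次迭代採樣32個正例窗口和96個背景窗口。正負例下標的組織形式爲譯者假設。)

```python
import numpy as np

def sample_finetune_minibatch(pos_indices, bg_indices, rng=np.random):
    """fine-tuning 階段 mini-batch 構造示意:每次迭代採樣 32 個正例窗口
    (跨所有類別)和 96 個背景窗口,組成大小爲 128 的 mini-batch。
    pos_indices / bg_indices 爲事先按 IoU>=0.5 劃分好的窗口下標(假設的輸入形式)。"""
    pos = rng.choice(pos_indices, size=32, replace=len(pos_indices) < 32)
    bg = rng.choice(bg_indices, size=96, replace=len(bg_indices) < 96)
    batch = np.concatenate([pos, bg])
    rng.shuffle(batch)
    return batch          # 128 個窗口下標,交給一次 SGD 迭代使用
```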

Object category classifiers. Consider training a binary classifier to detect cars. It’s clear that an image region tightly enclosing a car should be a positive example. Similarly, it’s clear that a background region, which has nothing to do with cars, should be a negative example. Less clear is how to label a region that partially overlaps a car. We resolve this issue with an IoU overlap threshold, below which regions are defined as negatives. The overlap threshold, 0.3, was selected by a grid search over {0, 0.1, ..., 0.5} on a validation set. We found that selecting this threshold carefully is important. Setting it to 0.5, as in [39], decreased mAP by 5 points. Similarly, setting it to 0 decreased mAP by 4 points. Positive examples are defined simply to be the ground-truth bounding boxes for each class.

目標類別分類器。考慮訓練一個檢測汽車的二分類器。很顯然,緊緊包裹着一輛汽車的圖像區域應該是正例;同樣,和汽車毫無關係的背景區域應該是負例。不太明確的是,如何標註那些與汽車部分重疊的區域。我們用一個IoU重疊閾值來解決這個問題:低於該閾值的區域定義爲負例。這個閾值取0.3,是在驗證集上對{0, 0.1, …, 0.5}進行網格搜索選出來的。我們發現仔細選擇這個閾值很重要:如果像[39]那樣把它設爲0.5,mAP會下降5個點;設爲0,mAP會下降4個點。正例則簡單地定義爲每個類別的真實標註框。
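
(譯者注:下面示意SVM訓練階段的標籤規則:正例只取真實標註框本身,候選框與該類真實框的最大IoU低於0.3記爲負例,其餘忽略。函數與返回值的約定爲譯者假設。)

```python
import numpy as np

def box_iou(a, b):
    """計算兩個框 [x1, y1, x2, y2] 的交併比。"""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def label_for_svm(proposal, gt_boxes_of_class, neg_threshold=0.3):
    """SVM 訓練標籤示意:正例只取真實標註框本身;候選框與該類所有真實框的
    最大 IoU 低於 0.3 則作爲負例;介於兩者之間的候選框在訓練時忽略。"""
    if not gt_boxes_of_class:
        return -1                                   # 圖中沒有該類目標:負例
    max_iou = max(box_iou(proposal, g) for g in gt_boxes_of_class)
    if max_iou < neg_threshold:
        return -1                                   # 負例
    return 0                                        # 既非正例也非負例:忽略
```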

Once features are extracted and training labels are applied, we optimize one linear SVM per class. Since the training data is too large to fit in memory, we adopt the standard hard negative mining method [17, 37]. Hard negative mining converges quickly and in practice mAP stops increasing after only a single pass over all images.

一旦特徵提取出來並且貼好了訓練標籤,我們就爲每個類別優化一個線性SVM。由於訓練數據太大,難以裝進內存,我們採用了標準的難負例挖掘方法(hard negative mining)[17, 37](譯者注:用於處理正負例數量極不均衡、而負例又分散、代表性不足的問題)。難負例挖掘收斂很快,實踐中只需遍歷所有圖像一輪,mAP就基本不再提升。
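
(譯者注:下面是標準難負例挖掘流程的一個非常簡化的示意,train_svm、score等接口均爲譯者假設;僅用於說明「用當前模型找出高分負例並加入訓練集重新訓練」這一思路。)

```python
def hard_negative_mining(train_svm, score, feats, labels, rounds=2):
    """難負例挖掘示意(接口均爲假設):先用一小部分負例訓練 SVM,
    再掃描全部樣本,把被當前模型打成高分的負例(假陽性)加入緩存並重新訓練。
    正文提到實踐中只需掃描一輪,mAP 即基本收斂。"""
    cache_pos = [f for f, l in zip(feats, labels) if l == 1]
    cache_neg = [f for f, l in zip(feats, labels) if l == -1][:1000]  # 初始隨便取一部分負例
    model = train_svm(cache_pos, cache_neg)
    for _ in range(rounds):
        # 挖掘:收集當前模型給出高分(例如落在 SVM 間隔內)的負例
        hard = [f for f, l in zip(feats, labels)
                if l == -1 and score(model, f) > -1.0]
        cache_neg.extend(hard)
        model = train_svm(cache_pos, cache_neg)
    return model
```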

In Appendix B we discuss why the positive and negative examples are defined differently in fine-tuning versus SVM training. We also discuss the trade-offs involved in training detection SVMs rather than simply using the outputs from the final softmax layer of the fine-tuned CNN.

在附錄B中,我們討論了爲什麼正例和負例在調優和SVM訓練這兩個階段的定義如此不同。我們也討論了訓練檢測SVM、而不是簡單地使用調優後CNN最後softmax層輸出時所涉及的權衡。

2.4. Results on PASCAL VOC 2010-12

Following the PASCAL VOC best practices [15], we validated all design decisions and hyperparameters on the VOC 2007 dataset (Section 3.2). For final results on the VOC 2010-12 datasets, we fine-tuned the CNN on VOC 2012 train and optimized our detection SVMs on VOC 2012 trainval. We submitted test results to the evaluation server only once for each of the two major algorithm variants (with and without bounding-box regression).

2.4. 在PASCAL VOC 2010-12上的結果

按照PASCAL VOC的最佳實踐[15],我們在VOC 2007數據集上驗證了所有的設計決策和超參數(見3.2節)。對於VOC 2010-12數據集上的最終結果,我們在VOC 2012 train上對CNN進行了fine-tuning,並在VOC 2012 trainval上優化了檢測SVM。對於兩種主要的算法變體(帶與不帶bounding-box迴歸),我們各只向評估服務器提交了一次測試結果。

Table 1 shows complete results on VOC 2010. We compare our method against four strong baselines, including SegDPM [18], which combines DPM detectors with the output of a semantic segmentation system [4] and uses additional inter-detector context and image-classifier rescoring. The most germane comparison is to the UVA system from Uijlings et al. [39], since our systems use the same region proposal algorithm. To classify regions, their method builds a four-level spatial pyramid and populates it with densely sampled SIFT, Extended OpponentSIFT, and RGBSIFT descriptors, each vector quantized with 4000-word codebooks. Classification is performed with a histogram intersection kernel SVM. Compared to their multi-feature, non-linear kernel SVM approach, we achieve a large improvement in mAP, from 35.1% to 53.7% mAP, while also being much faster (Section 2.2). Our method achieves similar performance (53.3% mAP) on VOC 2011/12 test.

Table 1: Detection average precision (%) on VOC 2010 test. R-CNN is most directly comparable to UVA and Regionlets since all methods use selective search region proposals. Bounding-box regression (BB) is described in Section C. At publication time, SegDPM was the top-performer on the PASCAL VOC leaderboard. †DPM and SegDPM use context rescoring not used by the other methods.

表1展示了本方法在VOC 2010上的完整結果。我們將自己的方法與四種強基準方法進行對比,其中包括SegDPM[18],該方法把DPM檢測器與語義分割系統[4]的輸出相結合,並使用了額外的檢測器間上下文和圖像分類器重打分。最相關的比較對象是Uijlings等人的UVA系統[39],因爲我們的系統使用了相同的region proposal算法。爲了對區域進行分類,他們的方法構建了一個四層空間金字塔,並用密集採樣的SIFT、Extended OpponentSIFT和RGB-SIFT描述子填充,每個向量都用4000詞的codebook做了量化,然後用直方圖交叉核SVM進行分類。與他們的多特徵、非線性核SVM方法相比,我們的mAP獲得了大幅提升,從35.1%提高到53.7%,同時速度也快得多(2.2節)。我們的方法在VOC 2011/12測試集上也達到了相近的性能(53.3% mAP)。

表1:VOC 2010測試集上的檢測平均精度(%)。R-CNN與UVA和Regionlets最具直接可比性,因爲這些方法都使用selective search的region proposals。Bounding-box迴歸(BB)在C節中進行了描述。本文發佈時,SegDPM是PASCAL VOC排行榜上性能最優的算法。†DPM和SegDPM使用了其他方法未使用的context rescoring。

2.5. Results on ILSVRC2013 detection

We ran R-CNN on the 200-class ILSVRC2013 detection dataset using the same system hyperparameters that we used for PASCAL VOC. We followed the same protocol of submitting test results to the ILSVRC2013 evaluation server only twice, once with and once without bounding-box regression.

2.5. 在ILSVRC2013上檢測任務結果

我們使用與PASCAL VOC上相同的系統超參數,在200類的ILSVRC2013檢測數據集上運行R-CNN。我們遵循相同的協議,僅向ILSVRC2013評估服務器提交了兩次測試結果,一次帶邊界框迴歸,一次不帶邊界框迴歸。

Figure 3 compares R-CNN to the entries in the ILSVRC 2013 competition and to the post-competition OverFeat result [34]. R-CNN achieves a mAP of 31.4%, which is significantly ahead of the second-best result of 24.3% from OverFeat. To give a sense of the AP distribution over classes, box plots are also presented and a table of per-class APs follows at the end of the paper in Table 8. Most of the competing submissions (OverFeat, NEC-MU, UvA-Euvision, Toronto A, and UIUC-IFP) used convolutional neural networks, indicating that there is significant nuance in how CNNs can be applied to object detection, leading to greatly varying outcomes. In Section 4, we give an overview of the ILSVRC2013 detection dataset and provide details about choices that we made when running R-CNN on it.

Figure 3: (Left) Mean average precision on the ILSVRC2013 detection test set. Methods preceeded by * use outside training data (images and labels from the ILSVRC classification dataset in all cases). (Right) Box plots for the 200 average precision values per method. A box plot for the post-competition OverFeat result is not shown because per-class APs are not yet available (per-class APs for R-CNN are in Table 8 and also included in the tech report source uploaded to arXiv.org; see R-CNN-ILSVRC2013-APs.txt). The red line marks the median AP, the box bottom and top are the 25th and 75th percentiles. The whiskers extend to the min and max AP of each method. Each AP is plotted as a green dot over the whiskers (best viewed digitally with zoom).

Table 8: Per-class average precision (%) on the ILSVRC2013 detection test set.

 

圖3將R-CNN與ILSVRC 2013競賽中的參賽方法以及賽後的OverFeat結果[34]進行了比較。R-CNN的mAP達到31.4%,大幅領先於OverFeat 24.3%的第二佳結果。爲了直觀展示AP在各個類別上的分佈,文中還給出了箱形圖,並在論文末尾的表8中列出了每個類別的AP。大多數參賽方法(OverFeat、NEC-MU、UvA-Euvision、Toronto A和UIUC-IFP)都使用了卷積神經網絡,這表明CNN應用於目標檢測的方式存在很多細微差別,會導致差異很大的結果。在第4節中,我們概述了ILSVRC2013檢測數據集,並給出了在其上運行R-CNN時所做選擇的詳細信息。

圖3:(左圖)ILSVRC2013檢測測試集上的mAP。帶*的方法使用外部訓練數據(所有情況下均爲ILSVRC分類數據集中的圖像和標籤)。(右圖)每種方法200個平均精度值的箱形圖。競賽後的OverFeat結果的箱形圖未顯示,因爲無法獲得其按類別的AP(R-CNN按類別的AP在表8中,並且也包含在上傳到arXiv.org的技術報告源文件中;詳見R-CNN-ILSVRC2013-APs.txt)。紅線標記AP的中位數,方框的底部和頂部分別是第25和第75百分位數。鬚線(whiskers)延伸到每種方法AP的最小值和最大值。每個AP繪製爲鬚線上的綠點(譯者注:需要將右圖放大看,否則可能看不清楚)。

表8:ILSVRC2013檢測測試集上的每類平均精度(%)。

3. Visualization, ablation, and modes of error

3.1. Visualizing learned features

First-layer filters can be visualized directly and are easy to understand [25]. They capture oriented edges and opponent colors. Understanding the subsequent layers is more challenging. Zeiler and Fergus present a visually attractive deconvolutional approach in [42]. We propose a simple (and complementary) non-parametric method that directly shows what the network learned.

3. 可視化、消融研究和模型誤差

3.1. 可視化學習到的特徵

直接可視化第一層的濾波器非常容易理解[25],它們主要捕獲有方向的邊緣和對比色。理解後面的層則更具挑戰性。Zeiler和Fergus在[42]中提出了一種視覺上很吸引人的反捲積可視化方法。我們則提出一種簡單的(互補的)非參數化方法,直接展示網絡學到了什麼。

The idea is to single out a particular unit (feature) in the network and use it as if it were an object detector in its own right. That is, we compute the unit’s activations on a large set of held-out region proposals (about 10 million), sort the proposals from highest to lowest activation, perform nonmaximum suppression, and then display the top-scoring regions. Our method lets the selected unit 「speak for itself」 by showing exactly which inputs it fires on. We avoid averaging in order to see different visual modes and gain insight into the invariances computed by the unit.

我們的想法是挑選出網絡中的某個特定單元(特徵),把它當作一個獨立的物體檢測器來使用。具體做法是:先在大量保留的推薦區域(大約1000萬個)上計算該單元的激活值,按激活值從高到低對這些區域排序,進行非極大值抑制,然後展示得分最高的若干區域。這個方法讓被選中的單元在恰好遇到能使它激活的輸入時「自己說話」。我們避免做平均,是爲了看到不同的視覺模式,並深入觀察該單元計算出的不變性。
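
(譯者注:下面把這一可視化過程寫成簡短的示意代碼:取某個pool5單元在所有候選區域上的激活值,經非極大值抑制後取激活最高的16個區域。接口均爲譯者假設。)

```python
import numpy as np

def top_activating_regions(unit_index, pool5_feats, boxes, nms, top_k=16):
    """3.1 節可視化方法的示意(非官方實現):對大量保留的候選區域計算某個
    pool5 單元的激活值,做非極大值抑制後返回激活值最高的 top_k 個區域下標。
    pool5_feats: (R, 9216) 的 pool5 特徵;unit_index: 要觀察的單元下標。"""
    activations = pool5_feats[:, unit_index]
    keep = nms(boxes, activations)                  # 去掉重疊嚴重的高分區域
    keep = sorted(keep, key=lambda i: -activations[i])
    return keep[:top_k]
```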

We visualize units from layer pool5, which is the max-pooled output of the network’s fifth and final convolutional layer. The pool5 feature map is 6 × 6 × 256 = 9216-dimensional. Ignoring boundary effects, each pool5 unit has a receptive field of 195×195 pixels in the original 227×227 pixel input. A central pool5 unit has a nearly global view, while one near the edge has a smaller, clipped support.

我們可視化的是pool5層的單元,即網絡第五個也是最後一個卷積層之後做最大池化的輸出。pool5的特徵圖是6×6×256=9216維。忽略邊界效應,每個pool5單元在原始227×227的輸入圖像上擁有195×195的感受野。位於中間的pool5單元擁有接近全局的視野,而靠近邊緣的單元則只有較小的、被裁剪的支撐區域。

Each row in Figure 4 displays the top 16 activations for a pool5 unit from a CNN that we fine-tuned on VOC 2007 trainval. Six of the 256 functionally unique units are visualized (Appendix D includes more). These units were selected to show a representative sample of what the network learns. In the second row, we see a unit that fires on dog faces and dot arrays. The unit corresponding to the third row is a red blob detector. There are also detectors for human faces and more abstract patterns such as text and triangular structures with windows. The network appears to learn a representation that combines a small number of class-tuned features together with a distributed representation of shape, texture, color, and material properties. The subsequent fully connected layer fc6 has the ability to model a large set of compositions of these rich features.

Figure 4: Top regions for six pool5 units. Receptive fields and activation values are drawn in white. Some units are aligned to concepts, such as people (row 1) or text (4). Other units capture texture and material properties, such as dot arrays (2) and specular reflections (6).

圖4的每一行顯示了某個pool5單元激活值最高的16個區域,該CNN是在VOC 2007 trainval上調優得到的。256個功能上各不相同的單元中只展示了6個(附錄D包含更多)。挑選這些單元是爲了展示網絡所學內容的代表性樣本。第二行的單元在看到狗臉和點陣時會激活;第三行對應的單元是一個紅色斑塊檢測器;還有檢測人臉的單元,以及一些更抽象的模式,比如文字和帶窗戶的三角形結構。這個網絡學到的表示似乎把少量類別相關的特徵,與形狀、紋理、顏色和材質屬性的分佈式表示結合在一起。後續的全連接層fc6則能夠對這些豐富特徵的大量組合進行建模。

圖4:六個pool5單元激活值最高的區域。感受野和激活值以白色繪出。有些單元與具體概念對應,例如人(第1行)或文字(第4行);其他單元捕獲紋理和材質屬性,例如點陣(第2行)和鏡面反射(第6行)。

3.2. Ablation studies

Performance layer-by-layer, without fine-tuning. To understand which layers are critical for detection performance, we analyzed results on the VOC 2007 dataset for each of the CNN’s last three layers. Layer pool5 was briefly described in Section 3.1. The final two layers are summarized below.

3.2. 消融研究

沒有調優的各層性能。爲了理解哪些層對檢測性能至關重要,我們分析了CNN最後三層各自在VOC 2007數據集上的結果。pool5層在3.1節中已做過簡短的描述。最後兩層總結如下。

Layer fc6 is fully connected to pool5. To compute features, it multiplies a 4096×9216 weight matrix by the pool5 feature map (reshaped as a 9216-dimensional vector) and then adds a vector of biases. This intermediate vector is component-wise half-wave rectified (x ← max(0, x)).

fc6是一個與pool5全連接的層。爲了計算特徵,它用一個4096×9216的權重矩陣乘以pool5的特徵圖(reshape成一個9216維向量),再加上一個偏置向量。這個中間向量隨後逐元素做半波整流(x ← max(0, x))。

Layer fc7 is the final layer of the network. It is implemented by multiplying the features computed by fc6 by a 4096×4096 weight matrix, and similarly adding a vector of biases and applying half-wave rectification.

fc7是網絡的最後一層。它把fc6計算出的特徵與一個4096×4096的權重矩陣相乘,同樣加上偏置向量並做半波整流。
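
(譯者注:fc6和fc7的計算可以概括爲「矩陣乘法 + 偏置 + 半波整流」,下面用numpy寫一個維度示意。)

```python
import numpy as np

def relu(x):
    """半波整流:x <- max(0, x)。"""
    return np.maximum(0.0, x)

def forward_fc(pool5, W6, b6, W7, b7):
    """fc6 / fc7 前向計算示意:pool5 特徵圖展平爲 9216 維向量;
    fc6 將其乘以 4096x9216 的權重矩陣並加偏置後做半波整流,
    fc7 再乘以 4096x4096 的權重矩陣並做同樣處理。"""
    x = pool5.reshape(-1)            # (6, 6, 256) -> (9216,)
    fc6 = relu(W6 @ x + b6)          # W6: (4096, 9216), b6: (4096,)
    fc7 = relu(W7 @ fc6 + b7)        # W7: (4096, 4096), b7: (4096,)
    return fc6, fc7
```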

We start by looking at results from the CNN without fine-tuning on PASCAL, i.e. all CNN parameters were pre-trained on ILSVRC 2012 only. Analyzing performance layer-by-layer (Table 2 rows 1-3) reveals that features from fc7 generalize worse than features from fc6. This means that 29%, or about 16.8 million, of the CNN’s parameters can be removed without degrading mAP. More surprising is that removing both fc7 and fc6 produces quite good results even though pool5 features are computed using only 6% of the CNN’s parameters. Much of the CNN’s representational power comes from its convolutional layers, rather than from the much larger densely connected layers. This finding suggests potential utility in computing a dense feature map, in the sense of HOG, of an arbitrary-sized image by using only the convolutional layers of the CNN. This representation would enable experimentation with sliding-window detectors, including DPM, on top of pool5 features.

我們先來看看沒有在PASCAL上調優的CNN的結果,即所有CNN參數只在ILSVRC 2012上預訓練過。逐層分析性能(表2第1-3行)顯示,fc7特徵的泛化能力不如fc6。這意味着CNN中29%(約1680萬)的參數可以移除而不降低mAP。更令人驚訝的是,即使同時移除fc7和fc6也能得到相當好的結果,儘管pool5特徵只用了CNN中6%的參數。可見CNN的表達能力主要來自卷積層,而不是參數量大得多的全連接層。這一發現表明,僅使用CNN的卷積層,就有可能爲任意尺寸的圖像計算類似HOG意義上的稠密特徵圖(dense feature map)。這種表示使得可以在pool5特徵之上試驗包括DPM在內的滑動窗口檢測器。

Performance layer-by-layer, with fine-tuning. We now look at results from our CNN after having fine-tuned its parameters on VOC 2007 trainval. The improvement is striking (Table 2 rows 4-6): fine-tuning increases mAP by 8.0 percentage points to 54.2%. The boost from fine-tuning is much larger for fc6 and fc7 than for pool5, which suggests that the pool5 features learned from ImageNet are general and that most of the improvement is gained from learning domain-specific non-linear classifiers on top of them.

調優後的各層性能。我們來看看調優後在VOC2007上的結果表現。提升非常明顯,mAP提升了8個百分點,達到了54.2%。fc6和fc7的提升明顯優於pool5,這說明pool5從ImageNet學習的特徵通用性很強,在它之上層的大部分提升主要是在學習領域相關的非線性分類器。

Comparison to recent feature learning methods. Relatively few feature learning methods have been tried on PASCAL VOC detection. We look at two recent approaches that build on deformable part models. For reference, we also include results for the standard HOG-based DPM [20].

與近期特徵學習方法的比較。在PASCAL VOC檢測任務上嘗試過的特徵學習方法相對較少。我們考察了兩種最近的、基於可變形部件模型的方法。作爲參照,我們也加入了標準的基於HOG的DPM[20]的結果。

The first DPM feature learning method, DPM ST [28], augments HOG features with histograms of 「sketch token」 probabilities. Intuitively, a sketch token is a tight distribution of contours passing through the center of an image patch. Sketch token probabilities are computed at each pixel by a random forest that was trained to classify 35×35 pixel patches into one of 150 sketch tokens or background.

第一種DPM特徵學習方法DPM ST[28],用「sketch token」概率直方圖來增強HOG特徵。直觀地說,一個sketch token就是穿過圖像塊中心的輪廓的一種緊湊分佈。每個像素處的sketch token概率由一個隨機森林計算,該隨機森林被訓練用來把35×35像素的圖像塊分類爲150種sketch token之一或背景。

The second method, DPM HSC [31], replaces HOG with histograms of sparse codes (HSC). To compute an HSC, sparse code activations are solved for at each pixel using a learned dictionary of 100 7×7 pixel (grayscale) atoms. The resulting activations are rectified in three ways (full and both half-waves), spatially pooled, unit ℓ2 normalized, and then power transformed (x ← sign(x)|x|^α).

第二種方法DPM HSC[31],用稀疏編碼直方圖(HSC)替換了HOG。爲了計算HSC,需要在每個像素處用一個由100個7×7像素(灰度)原子組成的學習字典求解稀疏編碼激活值。得到的激活值經過三種方式的整流(全波和兩個半波)、空間池化、單位ℓ2歸一化,最後做冪變換(x ← sign(x)|x|^α)。

All R-CNN variants strongly outperform the three DPM baselines (Table 2 rows 8-10), including the two that use feature learning. Compared to the latest version of DPM, which uses only HOG features, our mAP is more than 20 percentage points higher: 54.2% vs. 33.7%—a 61% relative improvement. The combination of HOG and sketch tokens yields 2.5 mAP points over HOG alone, while HSC improves over HOG by 4 mAP points (when compared internally to their private DPM baselines—both use nonpublic implementations of DPM that underperform the open source version [20]). These methods achieve mAPs of 29.1% and 34.3%, respectively.

Table 2: Detection average precision (%) on VOC 2007 test. Rows 1-3 show R-CNN performance without fine-tuning. Rows 4-6 show results for the CNN pre-trained on ILSVRC 2012 and then fine-tuned (FT) on VOC 2007 trainval. Row 7 includes a simple bounding-box regression (BB) stage that reduces localization errors (Section C). Rows 8-10 present DPM methods as a strong baseline. The first uses only HOG, while the next two use different feature learning approaches to augment or replace HOG.

所有的R-CNN變種算法都大幅優於這三個DPM基準(表2第8-10行),包括兩種使用特徵學習的方法。與只使用HOG特徵的最新版本DPM相比,我們的mAP高出20多個百分點:54.2%對33.7%,相對提升61%。HOG與sketch token結合的方法比單純HOG高出2.5個mAP點,HSC則比HOG提升了4個mAP點(這些都是與他們各自內部的DPM基準相比;二者都使用了性能低於開源版本[20]的非公開DPM實現)。這兩種方法分別達到了29.1%和34.3%的mAP。

表2:VOC 2007測試集上的檢測平均精度(%)。第1-3行顯示了沒有進行fine-tuning的R-CNN性能。第4-6行顯示了在ILSVRC 2012上預訓練、然後在VOC 2007 trainval上fine-tuning(FT)的CNN的結果。第7行包括一個簡單的bounding-box迴歸(BB)階段,可減少定位錯誤(詳見C節)。第8-10行給出作爲強baseline的DPM方法:第一種僅使用HOG,後兩種使用不同的特徵學習方法來增強或替代HOG。

3.3. Network architectures

Most results in this paper use the network architecture from Krizhevsky et al. [25]. However, we have found that the choice of architecture has a large effect on R-CNN detection performance. In Table 3 we show results on VOC 2007 test using the 16-layer deep network recently proposed by Simonyan and Zisserman [43]. This network was one of the top performers in the recent ILSVRC 2014 classification challenge. The network has a homogeneous structure consisting of 13 layers of 3×3 convolution kernels, with five max pooling layers interspersed, and topped with three fully-connected layers. We refer to this network as 「O-Net」 for OxfordNet and the baseline as 「T-Net」 for TorontoNet.

Table 3: Detection average precision (%) on VOC 2007 test for two different CNN architectures. The first two rows are results from Table 2 using Krizhevsky et al.’s architecture (T-Net). Rows three and four use the recently proposed 16-layer architecture from Simonyan and Zisserman (O-Net) [43].

3.3. 網絡架構

本文中的大部分結果所採用的網絡架構都來自Krizhevsky等人的工作[25]。然而我們發現,架構的選擇對R-CNN的檢測性能有很大的影響。表3中我們展示了使用Simonyan和Zisserman[43]最近提出的16層深度網絡在VOC 2007測試集上的結果。該網絡是最近的ILSVRC 2014分類挑戰賽中表現最好的網絡之一。這個網絡結構同構,由13層3×3卷積核組成,其間穿插5個最大池化層,頂部是3個全連接層。我們把這個網絡稱爲O-Net(OxfordNet),把我們的基準網絡稱爲T-Net(TorontoNet)。

表3:兩種不同CNN架構在VOC 2007測試集上的檢測平均精度(%)。前兩行是表2中使用Krizhevsky等人架構(T-Net)的結果。第三和第四行使用Simonyan和Zisserman最近提出的16層架構(O-Net)[43]。

To use O-Net in R-CNN, we downloaded the publicly available pre-trained network weights for the VGG ILSVRC 16 layers model from the Caffe Model Zoo. We then fine-tuned the network using the same protocol as we used for T-Net. The only difference was to use smaller minibatches (24 examples) as required in order to fit within GPU memory. The results in Table 3 show that R-CNN with O-Net substantially outperforms R-CNN with T-Net, increasing mAP from 58.5% to 66.0%. However there is a considerable drawback in terms of compute time, with the forward pass of O-Net taking roughly 7 times longer than T-Net.

爲了在R-CNN中使用O-Net,我們從Caffe Model Zoo下載了公開的VGG ILSVRC 16 layers模型的預訓練權重,然後使用與T-Net相同的流程對網絡進行調優。唯一的不同是,受GPU內存限制,使用了更小的mini-batch(24個樣本)。表3中的結果顯示,使用O-Net的R-CNN顯著優於使用T-Net的R-CNN,mAP從58.5%提升到了66.0%。然而它在計算耗時方面有相當大的缺陷:O-Net的前向傳播耗時大約是T-Net的7倍。

3.4. Detection error analysis

We applied the excellent detection analysis tool from Hoiem et al. [23] in order to reveal our method’s error modes, understand how fine-tuning changes them, and to see how our error types compare with DPM. A full summary of the analysis tool is beyond the scope of this paper and we encourage readers to consult [23] to understand some finer details (such as 「normalized AP」). Since the analysis is best absorbed in the context of the associated plots, we present the discussion within the captions of Figure 5 and Figure 6.

Figure 5: Distribution of top-ranked false positive (FP) types. Each plot shows the evolving distribution of FP types as more FPs are considered in order of decreasing score. Each FP is categorized into 1 of 4 types: Loc—poor localization (a detection with an IoU overlap with the correct class between 0.1 and 0.5, or a duplicate); Sim—confusion with a similar category; Oth—confusion with a dissimilar object category; BG—a FP that fired on background. Compared with DPM (see [23]), significantly more of our errors result from poor localization, rather than confusion with background or other object classes, indicating that the CNN features are much more discriminative than HOG. Loose localization likely results from our use of bottom-up region proposals and the positional invariance learned from pre-training the CNN for whole-image classification. Column three shows how our simple bounding-box regression method fixes many localization errors.

Figure 6: Sensitivity to object characteristics. Each plot shows the mean (over classes) normalized AP (see [23]) for the highest and lowest performing subsets within six different object characteristics (occlusion, truncation, bounding-box area, aspect ratio, viewpoint, part visibility). We show plots for our method (R-CNN) with and without fine-tuning (FT) and bounding-box regression (BB) as well as for DPM voc-release5. Overall, fine-tuning does not reduce sensitivity (the difference between max and min), but does substantially improve both the highest and lowest performing subsets for nearly all characteristics. This indicates that fine-tuning does more than simply improve the lowest performing subsets for aspect ratio and bounding-box area, as one might conjecture based on how we warp network inputs. Instead, fine-tuning improves robustness for all characteristics including occlusion, truncation, viewpoint, and part visibility.

3.4. 檢測誤差分析

爲了揭示我們方法的錯誤模式,我們使用了Hoiem等人出色的檢測分析工具[23],以理解fine-tuning如何改變這些錯誤模式,並觀察我們的錯誤類型與DPM相比有何不同。該分析工具的完整介紹超出了本文的範圍,我們建議讀者查閱[23]來了解更多細節(例如「歸一化AP」的定義)。由於這些分析結合相應的圖表最容易理解,我們把討論放在圖5和圖6的標題中。

圖5:排名最高的假陽性(FP)類型分佈。每個圖都顯示了隨着按得分遞減的順序考慮越來越多的FP時,FP類型分佈的演變。每個FP分爲以下4種類型:Loc——定位較差(檢測結果與正確類別的IoU重疊在0.1和0.5之間,或重複檢測);Sim——與相似類別混淆;Oth——與非相似類別混淆;BG——將背景當作檢測目標的假陽性。與DPM相比(參見[23]),我們的錯誤明顯更多是由錯誤定位所致,而不是與背景或其他目標類別的混淆,這表明CNN特徵比HOG具有更強的判別力。寬鬆的定位很可能是由於我們使用了自下而上的region proposals,以及在針對整圖分類的CNN預訓練中學到的位置不變性。第三列顯示了我們的簡單bounding-box迴歸方法如何修正許多定位錯誤。

圖6:對目標特性的敏感性。每個圖都顯示了在六種不同目標特性(遮擋、截斷、bounding-box面積、長寬比、視角、部分可見性)中,表現最好和最差的子集的(按類別平均的)歸一化AP(詳見[23])。我們展示了我們的方法(R-CNN)在有無fine-tuning(FT)和bounding-box迴歸(BB)情況下的結果,以及DPM voc-release5的結果。總體而言,fine-tuning沒有降低敏感性(最大值和最小值之間的差異),但確實顯著改善了幾乎所有特性上表現最好和最差的子集。這表明fine-tuning所做的,不只是像人們根據我們對網絡輸入做變形的方式可能猜測的那樣,僅僅改善長寬比和bounding-box面積上表現最差的子集;相反,fine-tuning提高了對所有特性的魯棒性,包括遮擋、截斷、視角和部分可見性。

3.5. Bounding-box regression

Based on the error analysis, we implemented a simple method to reduce localization errors. Inspired by the bounding-box regression employed in DPM [17], we train a linear regression model to predict a new detection window given the pool5 features for a selective search region proposal. Full details are given in Appendix C. Results in Table 1, Table 2, and Figure 5 show that this simple approach fixes a large number of mislocalized detections, boosting mAP by 3 to 4 points.

3.5. Bounding-box迴歸

基於對誤差的分析,我們實現了一種簡單的方法來減小定位誤差。受DPM[17]中使用的bounding-box迴歸的啓發,我們訓練了一個線性迴歸模型,在給定selective search推薦區域的pool5特徵時預測一個新的檢測窗口。完整細節見附錄C。表1、表2和圖5的結果表明,這個簡單的方法修正了大量錯誤定位的檢測結果,使mAP提升了3到4個百分點。
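
(譯者注:附錄C未包含在本節中。下面按論文附錄C思路的常見中心/寬高參數化,給出bounding-box迴歸目標與推理階段應用迴歸量的簡化示意;變量組織方式爲譯者假設,嶺迴歸的訓練細節從略。)

```python
import numpy as np

def bbox_regression_targets(P, G):
    """bounding-box 迴歸目標的常見參數化(示意):
    P、G 均爲 [x, y, w, h] 形式的候選框與真實框(中心座標加寬高)。"""
    tx = (G[0] - P[0]) / P[2]
    ty = (G[1] - P[1]) / P[3]
    tw = np.log(G[2] / P[2])
    th = np.log(G[3] / P[3])
    return np.array([tx, ty, tw, th])

def apply_bbox_regression(P, pool5_feat, W):
    """用學到的線性迴歸器 W(形狀 (9216, 4),假設已在 pool5 特徵上訓練好)
    預測修正量,並把候選框 P 變換爲新的檢測窗口。"""
    dx, dy, dw, dh = pool5_feat @ W
    x = P[0] + P[2] * dx
    y = P[1] + P[3] * dy
    w = P[2] * np.exp(dw)
    h = P[3] * np.exp(dh)
    return np.array([x, y, w, h])
```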

3.6. Qualitative results

Qualitative detection results on ILSVRC2013 are presented in Figure 8 and Figure 9 at the end of the paper. Each image was sampled randomly from the val2 set and all detections from all detectors with a precision greater than 0.5 are shown. Note that these are not curated and give a realistic impression of the detectors in action. More qualitative results are presented in Figure 10 and Figure 11, but these have been curated. We selected each image because it contained interesting, surprising, or amusing results. Here, also, all detections at precision greater than 0.5 are shown.

Figure 8: Example detections on the val2 set from the configuration that achieved 31.0% mAP on val2. Each image was sampled randomly (these are not curated). All detections at precision greater than 0.5 are shown. Each detection is labeled with the predicted class and the precision value of that detection from the detector’s precision-recall curve. Viewing digitally with zoom is recommended.

Figure 9: More randomly selected examples. See Figure 8 caption for details. Viewing digitally with zoom is recommended.

Figure 10: Curated examples. Each image was selected because we found it impressive, surprising, interesting, or amusing. Viewing digitally with zoom is recommended.

Figure 11: More curated examples. See Figure 10 caption for details. Viewing digitally with zoom is recommended.

3.6. 定性結果

ILSVRC2013上的定性檢測結果見文章末尾的圖8和圖9。每張圖像都是從val2集中隨機採樣的,並顯示了所有檢測器中精度大於0.5的全部檢測結果。請注意,這些結果沒有經過人工挑選,它們給出的是檢測器實際工作效果的真實印象。圖10和圖11給出了更多的定性結果,但這些是經過人工挑選的:我們選擇每張圖像,是因爲它包含有趣、令人驚訝或好玩的結果。這裏同樣只顯示精度大於0.5的檢測結果。

圖8:使用在val2上達到31.0% mAP的配置,在val2集上的檢測示例。每張圖像都是隨機採樣的(未經人工挑選)。顯示了所有精度大於0.5的檢測結果。每個檢測都標註了預測類別,以及該檢測在檢測器precision-recall曲線上對應的精度值。建議通過電子版放大查看。

圖9:更多隨機選擇的示例。詳細信息見圖8標題。建議通過電子版放大查看。

圖10:精心挑選的示例。選擇每張圖片是因爲我們發現它令人印象深刻、令人驚訝、有趣或好玩。建議通過電子版放大查看。

圖11:更多精心挑選的示例。詳細信息見圖10標題。建議通過電子版放大查看。
