MyDLNote-Detection: DETR: End-to-End Object Detection with Transformers

End-to-End Object Detection with Transformers

[paper] https://arxiv.org/pdf/2005.12872.pdf

[github] https://github.com/facebookresearch/detr

 

 

 


Abstract

We present a new method that views object detection as a direct set prediction problem.

What the paper does: it proposes a new method that treats object detection as a direct set prediction problem.

Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components like a non-maximum suppression procedure or anchor generation that explicitly encode our prior knowledge about the task.

Highlight of this work: it streamlines the detection pipeline, effectively removing the need for many hand-designed components, such as non-maximum suppression or anchor generation, that explicitly encode prior knowledge about the task.

The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture. Given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel. The new model is conceptually simple and does not require a specialized library, unlike many other modern detectors.

Method specifics:

The main ingredients of the new framework, called DEtection TRansformer or DETR, are twofold:

1. Loss: a set-based global loss that forces unique predictions via bipartite matching;

2. Architecture: a transformer encoder-decoder.

DETR demonstrates accuracy and run-time performance on par with the well-established and highly-optimized Faster RCNN baseline on the challenging COCO object detection dataset. Moreover, DETR can be easily generalized to produce panoptic segmentation in a unified manner. We show that it significantly outperforms competitive baselines.


How it works: given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel. Unlike many other modern detectors, the new model is conceptually simple and does not require a specialized library.

Results: DETR matches the accuracy and run-time performance of Faster R-CNN on the COCO object detection dataset. Moreover, DETR generalizes easily to produce panoptic segmentation in a unified manner.

 


Introduction

The goal of object detection is to predict a set of bounding boxes and category labels for each object of interest. Modern detectors address this set prediction task in an indirect way, by defining surrogate regression and classification problems on a large set of proposals [37,5], anchors [23], or window centers [53,46]. Their performances are significantly influenced by post-processing steps to collapse near-duplicate predictions, by the design of the anchor sets and by the heuristics that assign target boxes to anchors [52]. To simplify these pipelines, we propose a direct set prediction approach to bypass the surrogate tasks. This end-to-end philosophy has led to significant advances in complex structured prediction tasks such as machine translation or speech recognition, but not yet in object detection: previous attempts [43,16,4,39] either add other forms of prior knowledge, or have not proven to be competitive with strong baselines on challenging benchmarks. This paper aims to bridge this gap.

Motivation:

Previous deep-learning detectors are not end-to-end: they rely on many hand-designed components and post-processing steps, e.g. designing anchor sets, heuristics that assign target boxes to anchors, and procedures that collapse near-duplicate predictions. Earlier attempts at end-to-end detection [43,16,4,39] either add other forms of prior knowledge or have not proven competitive with strong baselines on challenging benchmarks.

The goal of this paper is an end-to-end object detector with strong performance.

 

The next three paragraphs describe the characteristics of DETR.

We streamline the training pipeline by viewing object detection as a direct set prediction problem. We adopt an encoder-decoder architecture based on transformers [47], a popular architecture for sequence prediction. The self-attention mechanisms of transformers, which explicitly model all pairwise interactions between elements in a sequence, make these architectures particularly suitable for specific constraints of set prediction such as removing duplicate predictions.

Architecture: the paper views object detection as a direct set prediction problem, which simplifies the training pipeline, and adopts a transformer encoder-decoder, a popular architecture for sequence prediction. The self-attention mechanisms of transformers, which explicitly model all pairwise interactions between elements in a sequence, make these architectures particularly suitable for the specific constraints of set prediction, such as removing duplicate predictions.

Our DEtection TRansformer (DETR, see Figure 1) predicts all objects at once, and is trained end-to-end with a set loss function which performs bipartite matching between predicted and ground-truth objects. DETR simplifies the detection pipeline by dropping multiple hand-designed components that encode prior knowledge, like spatial anchors or non-maximal suppression. Unlike most existing detection methods, DETR doesn’t require any customized layers, and thus can be reproduced easily in any framework that contains standard CNN and transformer classes.

Loss: DETR predicts all objects at once and is trained end-to-end with a set loss function that performs bipartite matching between predicted and ground-truth (GT) objects. DETR simplifies the detection pipeline by dropping multiple hand-designed components that encode prior knowledge, such as spatial anchors or non-maximum suppression. Unlike most existing detection methods, DETR does not require any customized layers and can therefore be reproduced easily in any framework that provides standard CNN and transformer classes.

Compared to most previous work on direct set prediction, the main features of DETR are the conjunction of the bipartite matching loss and transformers with (non-autoregressive) parallel decoding [29,12,10,8]. In contrast, previous work focused on autoregressive decoding with RNNs [43,41,30,36,42]. Our matching loss function uniquely assigns a prediction to a ground truth object, and is invariant to a permutation of predicted objects, so we can emit them in parallel.

Comparison with previous set-prediction approaches: compared with most prior work on direct set prediction, the main feature of DETR is the combination of a bipartite matching loss with transformers using (non-autoregressive) parallel decoding. In contrast, previous work focused on autoregressive decoding with RNNs. The matching loss function uniquely assigns a prediction to each GT object and is invariant to permutations of the predicted objects, so they can be emitted in parallel.

 

We evaluate DETR on one of the most popular object detection datasets, COCO [24], against a very competitive Faster R-CNN baseline [37]. Faster RCNN has undergone many design iterations and its performance was greatly improved since the original publication. Our experiments show that our new model achieves comparable performances. More precisely, DETR demonstrates significantly better performance on large objects, a result likely enabled by the non-local computations of the transformer. It obtains, however, lower performances on small objects. We expect that future work will improve this aspect in the same way the development of FPN [22] did for Faster R-CNN.

Experimental results on the COCO dataset: overall accuracy comparable to Faster R-CNN, significantly better on large objects (likely thanks to the transformer's non-local computations), but lower on small objects.

Training settings for DETR differ from standard object detectors in multiple ways. The new model requires extra-long training schedule and benefits from auxiliary decoding losses in the transformer. We thoroughly explore what components are crucial for the demonstrated performance.

Some caveats of the method: the new model requires an extra-long training schedule and benefits from auxiliary decoding losses in the transformer. The paper thoroughly explores which components are crucial for the demonstrated performance.

The design ethos of DETR easily extends to more complex tasks. In our experiments, we show that a simple segmentation head trained on top of a pretrained DETR outperforms competitive baselines on Panoptic Segmentation [19], a challenging pixel-level recognition task that has recently gained popularity.

Some strengths of the method: the design extends easily to more complex tasks such as panoptic segmentation.

 


Related work

Set Prediction

There is no canonical deep learning model to directly predict sets. The basic set prediction task is multilabel classification (see e.g., [40,33] for references in the context of computer vision) for which the baseline approach, one-vs-rest, does not apply to problems such as detection where there is an underlying structure between elements (i.e., near-identical boxes). The first difficulty in these tasks is to avoid near-duplicates. Most current detectors use postprocessings such as non-maximal suppression to address this issue, but direct set prediction is postprocessing-free. They need global inference schemes that model interactions between all predicted elements to avoid redundancy. For constant-size set prediction, dense fully connected networks [9] are sufficient but costly. A general approach is to use auto-regressive sequence models such as recurrent neural networks [48]. In all cases, the loss function should be invariant under a permutation of the predictions. The usual solution is to design a loss based on the Hungarian algorithm [20], to find a bipartite matching between ground-truth and prediction. This enforces permutation-invariance, and guarantees that each target element has a unique match. We follow the bipartite matching loss approach. In contrast to most prior work however, we step away from autoregressive models and use transformers with parallel decoding, which we describe below.

There is no canonical deep learning model that directly predicts sets.

The basic set prediction task is multilabel classification (see e.g. [40,33] for references in computer vision), but its baseline approach, one-vs-rest, does not apply to problems such as detection where there is an underlying structure between elements (i.e., near-identical boxes).

The first difficulty in these tasks is avoiding near-duplicates. Most current detectors use postprocessing such as non-maximum suppression to address this issue, whereas direct set prediction is postprocessing-free: it needs a global inference scheme that models interactions between all predicted elements to avoid redundancy. For constant-size set prediction, dense fully connected networks [9] are sufficient but costly; a more general approach is to use auto-regressive sequence models such as recurrent neural networks [48]. In all cases, the loss function should be invariant under a permutation of the predictions. The usual solution is a loss based on the Hungarian algorithm [20], which finds a bipartite matching between ground truth and predictions; this enforces permutation-invariance and guarantees that each target element has a unique match. This paper follows the bipartite matching loss approach but, unlike most prior work, steps away from autoregressive models and uses transformers with parallel decoding.

 

The DETR model

Two ingredients are essential for direct set predictions in detection: (1) a set prediction loss that forces unique matching between predicted and ground truth boxes; (2) an architecture that predicts (in a single pass) a set of objects and models their relation. We describe our architecture in detail in Figure 2.

Two ingredients are essential for direct set prediction in detection:

(1) a set prediction loss that forces unique matching between predicted and GT boxes;

(2) an architecture that predicts (in a single pass) a set of objects and models their relations.

The architecture is described in detail in Figure 2.

 

Object detection set prediction loss

DETR infers a fixed-size set of N predictions, in a single pass through the decoder, where N is set to be significantly larger than the typical number of objects in an image. One of the main difficulties of training is to score predicted objects (class, position, size) with respect to the ground truth. Our loss produces an optimal bipartite matching between predicted and ground truth objects, and then optimizes object-specific (bounding box) losses.

DETR infers a fixed-size set of N predictions in a single pass through the decoder, where N is set significantly larger than the typical number of objects in an image. One of the main training difficulties is scoring the predicted objects (class, position, size) with respect to the GT. The loss first finds an optimal bipartite matching between predicted and GT objects, and then optimizes the object-specific (bounding box) losses.

Let us denote by y the ground truth set of objects, and \widehat{y} = \{\widehat{y}_i\}^N_{i=1} the set of N predictions. Assuming N is larger than the number of objects in the image, we consider y also as a set of size N padded with ∅ (no object). To find a bipartite matching between these two sets we search for a permutation of N elements \sigma \in \mathfrak{S}_N with the lowest cost:

            \hat{\sigma} = \underset{\sigma \in \mathfrak{S}_N}{\arg\min} \sum_{i}^{N} \mathcal{L}_{match}(y_i, \hat{y}_{\sigma(i)})            (1)

where \mathcal{L}_{match}(y_i , \hat{y}_{\sigma(i) }) is a pair-wise matching cost between ground truth y_i and a prediction with index \sigma (i). This optimal assignment is computed efficiently with the Hungarian algorithm, following prior work (e.g. [43]).

Explanation of the matching step:

y denotes the ground truth set of objects and \widehat{y} = \{\widehat{y}_i\}^N_{i=1} the set of N predictions. Assuming N is larger than the number of objects in the image, y is also considered a set of size N padded with ∅ (no object). To find a bipartite matching between these two sets, we search for the permutation of N elements \sigma \in \mathfrak{S}_N with the lowest total cost.

This optimal assignment is computed efficiently with the Hungarian algorithm [43].

[43] CVPR 2015 : End-to-end people detection in crowded scenes

The matching cost takes into account both the class prediction and the similarity of predicted and ground truth boxes. Each element i of the ground truth set can be seen as y_i = (c_i, b_i) where c_i is the target class label (which may be ∅) and b_i \in [0, 1]^4 is a vector that defines the ground truth box center coordinates and its height and width relative to the image size. For the prediction with index \sigma(i) we define the probability of class c_i as \widehat{p}_{\sigma(i)}(c_i) and the predicted box as \widehat{b}_{\sigma(i)}. With these notations we define \mathcal{L}_{match}(y_i, \hat{y}_{\sigma(i)}) as -\mathbb{1}_{\{c_i \neq ∅\}}\, \widehat{p}_{\sigma(i)}(c_i) + \mathbb{1}_{\{c_i \neq ∅\}}\, \mathcal{L}_{box}(b_i, \widehat{b}_{\sigma(i)}).

The matching cost takes into account both the class prediction and the similarity of predicted and GT boxes.

c_i is the GT class label (which may be ∅); b_i \in [0, 1]^4 gives the GT box center coordinates and its height and width relative to the image size.

\widehat{p}_{\sigma(i)}(c_i) is the predicted probability of class c_i; \widehat{b}_{\sigma(i)} is the predicted box. A simplified sketch of the matching step is given below.
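To make the matching concrete, here is a minimal, hypothetical Python sketch of Eq. (1), assuming single-image inputs and omitting the generalized IoU term; the official repo implements the full version in models/matcher.py. The function name and signature are illustrative, not the repo's API; scipy provides the Hungarian solver.

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """Match N predictions to M ground-truth objects (N >= M).

    pred_logits: (N, num_classes + 1) raw scores; pred_boxes: (N, 4) in [0, 1];
    gt_labels: (M,) long tensor; gt_boxes: (M, 4). Returns matched index pairs.
    """
    prob = pred_logits.softmax(-1)                       # (N, num_classes + 1)
    cost_class = -prob[:, gt_labels]                     # (N, M): -p_sigma(i)(c_i)
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)   # (N, M): L1 box distance
    cost = cost_bbox + cost_class                        # full L_match adds a GIoU term
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().numpy())
    return pred_idx, gt_idx
```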

 

This procedure of finding matching plays the same role as the heuristic assignment rules used to match proposals [37] or anchors [22] to ground truth objects in modern detectors. The main difference is that we need to find one-to-one matching for direct set prediction without duplicates.

This matching procedure plays the same role as the heuristic assignment rules used in modern detectors to match proposals [37] or anchors [22] to GT objects. The main difference is that direct set prediction requires one-to-one matching, with no duplicates.

The second step is to compute the loss function, the Hungarian loss for all pairs matched in the previous step. We define the loss similarly to the losses of common object detectors, i.e. a linear combination of a negative log-likelihood for class prediction and a box loss defined later:

     \mathcal{L}_{Hungarian}(y, \hat{y}) = \sum_{i=1}^{N} \left[ -\log \widehat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq ∅\}} \mathcal{L}_{box}(b_i, \widehat{b}_{\hat{\sigma}(i)}) \right]      (2)

where \widehat{\sigma } is the optimal assignment computed in the first step (1). In practice, we down-weight the log-probability term when c_i = ∅ by a factor 10 to account for class imbalance. This is analogous to how Faster R-CNN training procedure balances positive/negative proposals by subsampling [37]. Notice that the matching cost between an object and ∅ doesn’t depend on the prediction, which means that in that case the cost is a constant. In the matching cost we use probabilities \widehat{p}_{\sigma(i) }(ci) instead of log-probabilities. This makes the class prediction term commensurable to \mathcal{L}_{box}(\cdot, \cdot) (described below), and we observed better empirical performances.

The second step computes the loss function, the Hungarian loss, over all pairs matched in the previous step. The loss is defined similarly to the losses of common object detectors, i.e. a linear combination of a negative log-likelihood for class prediction and a box loss defined later.

Here \widehat{\sigma} is the optimal assignment computed in the first step (1). In practice, the log-probability term is down-weighted by a factor of 10 when c_i = ∅ to account for class imbalance, analogous to how the Faster R-CNN training procedure balances positive/negative proposals by subsampling [37]. Note that the matching cost between an object and ∅ does not depend on the prediction, i.e. in that case the cost is a constant. In the matching cost, the probabilities \widehat{p}_{\sigma(i)}(c_i) are used instead of log-probabilities; this makes the class prediction term commensurable with \mathcal{L}_{box}(\cdot, \cdot) (described below) and gives better empirical performance.
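One way to realize the factor-10 down-weighting of the ∅ class is a class-weighted cross-entropy, as sketched below; this mirrors how the official code handles the imbalance, though the variable names here are illustrative.

```python
import torch
from torch import nn

num_classes = 91                         # e.g. COCO; index num_classes is the ∅ "no object" class
class_weights = torch.ones(num_classes + 1)
class_weights[-1] = 0.1                  # down-weight the log-probability of ∅ by a factor 10
criterion = nn.CrossEntropyLoss(weight=class_weights)
```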

 

  • Bounding box loss.

The second part of the matching cost and the Hungarian loss is \mathcal{L}_{box}(\cdot, \cdot) that scores the bounding boxes. Unlike many detectors that make box predictions as a ∆ w.r.t. some initial guesses, we make box predictions directly. While such an approach simplifies the implementation, it poses an issue with relative scaling of the loss. The most commonly-used \ell_1 loss will have different scales for small and large boxes even if their relative errors are similar. To mitigate this issue we use a linear combination of the \ell_1 loss and the generalized IoU loss [38] \mathcal{L}_{iou}(\cdot, \cdot), which is scale-invariant. Overall, our box loss is defined as \mathcal{L}_{box}(b_i, \widehat{b}_{\sigma(i)}) = \lambda_{iou}\mathcal{L}_{iou}(b_i, \widehat{b}_{\sigma(i)}) + \lambda_{L1}\, \| b_i - \widehat{b}_{\sigma(i)} \|_1, where \lambda_{iou}, \lambda_{L1} \in \mathbb{R} are hyperparameters. These two losses are normalized by the number of objects inside the batch.
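A minimal sketch of this box loss, assuming the matched pairs have already been converted to corner (x1, y1, x2, y2) format (the model actually predicts center/height/width; the conversion is omitted). The defaults \lambda_{iou} = 2 and \lambda_{L1} = 5 follow the values reported in the paper's experimental settings.

```python
import torch

def box_loss(pred, target, lambda_iou=2.0, lambda_l1=5.0):
    """Sketch of L_box for matched box pairs in (x1, y1, x2, y2) format."""
    l1 = (pred - target).abs().sum(-1)                   # l1 term per pair
    # intersection / union of each matched pair
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(-1)
    area_p = (pred[:, 2:] - pred[:, :2]).prod(-1)
    area_t = (target[:, 2:] - target[:, :2]).prod(-1)
    union = area_p + area_t - inter
    iou = inter / union
    # smallest enclosing box C: GIoU = IoU - |C \ (A u B)| / |C|
    c_lt = torch.min(pred[:, :2], target[:, :2])
    c_rb = torch.max(pred[:, 2:], target[:, 2:])
    area_c = (c_rb - c_lt).prod(-1)
    giou = iou - (area_c - union) / area_c
    # per-pair loss; normalize by the number of objects in the batch afterwards
    return lambda_iou * (1 - giou) + lambda_l1 * l1
```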

 

DETR architecture

The overall DETR architecture is surprisingly simple and depicted in Figure 2. It contains three main components, which we describe below: a CNN backbone to extract a compact feature representation, an encoder-decoder transformer, and a simple feed forward network (FFN) that makes the final detection prediction. Unlike many modern detectors, DETR can be implemented in any deep learning framework that provides a common CNN backbone and a transformer architecture implementation with just a few hundred lines. Inference code for DETR can be implemented in less than 50 lines in PyTorch [32]. We hope that the simplicity of our method will attract new researchers to the detection community.

The overall DETR architecture is surprisingly simple, as depicted in Figure 2. It contains three main components:

1. a CNN backbone that extracts a compact feature representation;

2. an encoder-decoder transformer;

3. a simple feed-forward network (FFN) that makes the final detection prediction.

DETR can be implemented in any deep learning framework that provides a common CNN backbone and a transformer implementation; inference code for DETR can be written in fewer than 50 lines of PyTorch. A condensed sketch follows.
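The paper's supplementary material backs this claim with a minimal PyTorch listing; the following condensed sketch is written in that spirit, with learned 2D positional encodings and a vanilla nn.Transformer standing in for the full implementation (the class name MinimalDETR is illustrative).

```python
import torch
from torch import nn
from torchvision.models import resnet50

class MinimalDETR(nn.Module):
    """Condensed sketch of DETR: CNN backbone + transformer + prediction heads."""

    def __init__(self, num_classes, d_model=256, nheads=8, num_queries=100):
        super().__init__()
        # ResNet-50 trunk without the average pool and classification head
        self.backbone = nn.Sequential(*list(resnet50(weights=None).children())[:-2])
        self.conv = nn.Conv2d(2048, d_model, 1)            # 1x1 projection: C -> d
        self.transformer = nn.Transformer(d_model, nheads, 6, 6)
        self.query_pos = nn.Parameter(torch.rand(num_queries, d_model))  # object queries
        self.row_embed = nn.Parameter(torch.rand(50, d_model // 2))      # learned 2D
        self.col_embed = nn.Parameter(torch.rand(50, d_model // 2))      # pos. encodings
        self.linear_class = nn.Linear(d_model, num_classes + 1)  # +1 for "no object"
        self.linear_bbox = nn.Linear(d_model, 4)

    def forward(self, x):
        h = self.conv(self.backbone(x))                    # B x d x H x W
        B, _, H, W = h.shape
        pos = torch.cat([
            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).flatten(0, 1).unsqueeze(1)              # HW x 1 x d
        src = pos + h.flatten(2).permute(2, 0, 1)          # HW x B x d token sequence
        tgt = self.query_pos.unsqueeze(1).repeat(1, B, 1)  # N x B x d object queries
        hs = self.transformer(src, tgt)                    # N x B x d output embeddings
        return self.linear_class(hs), self.linear_bbox(hs).sigmoid()
```

For example, MinimalDETR(num_classes=91)(torch.randn(1, 3, 800, 1066)) returns class logits of shape (100, 1, 92) and normalized boxes of shape (100, 1, 4).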

  • Backbone.

Starting from the initial image x_{img }\in \mathbb{R}^{3\times H_0 \times W_0} (with 3 color channels), a conventional CNN backbone generates a lower-resolution activation map f\in \mathbb{R}^{C\times H \times W}. Typical values we use are C = 2048 and H, W = H_0/32 , W_0/32 .

The backbone is a conventional CNN that converts the input color image into a lower-resolution activation map with C = 2048 channels and spatial size H = H_0/32, W = W_0/32, as sketched below.
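A quick sketch of the backbone shapes, assuming a ResNet-50 trunk as in the paper's main experiments (the input resolution is illustrative):

```python
import torch
from torchvision.models import resnet50

# ResNet-50 without the average pool and classification head
backbone = torch.nn.Sequential(*list(resnet50(weights=None).children())[:-2])
x = torch.randn(1, 3, 800, 1066)   # B x 3 x H0 x W0 input image
f = backbone(x)                    # lower-resolution activation map
print(f.shape)                     # torch.Size([1, 2048, 25, 34]): C = 2048, ~H0/32 x W0/32
```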

 

  • Transformer encoder.

First, a 1\times 1 convolution reduces the channel dimension of the high-level activation map f from C to a smaller dimension d, creating a new feature map z_0\in \mathbb{R}^{d\times H \times W}. The encoder expects a sequence as input, hence we collapse the spatial dimensions of z_0 into one dimension, resulting in a d\times HW feature map. Each encoder layer has a standard architecture and consists of a multi-head self-attention module and a feed forward network (FFN). Since the transformer architecture is permutation-invariant, we supplement it with fixed positional encodings [31,3] that are added to the input of each attention layer. We defer to the supplementary material the detailed definition of the architecture, which follows the one described in [Attention is all you need].

Transformer encoder:

1. A 1\times 1 convolution reduces the channel dimension of the C = 2048 activation map to d, producing z_0\in \mathbb{R}^{d\times H \times W}; z_0 is then flattened spatially into a d\times HW sequence.

2. Each encoder layer has the standard architecture: a multi-head self-attention module and a feed-forward network (FFN).

3. Since the transformer architecture is permutation-invariant, fixed positional encodings are added to the input of each attention layer.

See the Appendix for more details; a sketch of the encoder input preparation follows the references below.

[3]   ICCV 2019 : Attention augmented convolutional networks

[31] ICML 2018 : Image transformer
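A hypothetical sketch of the encoder input preparation, assuming d = 256 as in the paper's experiments; the fixed sine positional encodings that DETR adds to the queries/keys of every attention layer are omitted for brevity.

```python
import torch
from torch import nn

d = 256
proj = nn.Conv2d(2048, d, kernel_size=1)     # 1x1 conv: C = 2048 -> d
f = torch.randn(1, 2048, 25, 34)             # backbone activation map, B x C x H x W
z0 = proj(f)                                 # B x d x H x W
src = z0.flatten(2).permute(2, 0, 1)         # collapse H, W: sequence of HW tokens, HW x B x d

# Standard encoder layers: multi-head self-attention + FFN
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model=d, nhead=8), num_layers=6)
memory = encoder(src)                        # HW x B x d
print(memory.shape)                          # torch.Size([850, 1, 256])
```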

  • Transformer decoder.

The decoder follows the standard architecture of the transformer, transforming N embeddings of size d using multi-headed self- and encoder-decoder attention mechanisms. The difference with the original transformer is that our model decodes the N objects in parallel at each decoder layer, while Vaswani et al. [47] use an autoregressive model that predicts the output sequence one element at a time. We refer the reader unfamiliar with the concepts to the supplementary material. Since the decoder is also permutation-invariant, the N input embeddings must be different to produce different results. These input embeddings are learnt positional encodings that we refer to as object queries, and similarly to the encoder, we add them to the input of each attention layer. The N object queries are transformed into an output embedding by the decoder. They are then independently decoded into box coordinates and class labels by a feed forward network (described in the next subsection), resulting in N final predictions. Using self- and encoder-decoder attention over these embeddings, the model globally reasons about all objects together using pair-wise relations between them, while being able to use the whole image as context.

該解碼器遵循 transformer 的標準架構,利用多頭自/編/解碼器注意機制轉換尺寸爲 dN 個嵌入件。與原始 transformer  不同的是,本文的模型在每個解碼器層並行解碼 N 個對象,而 Vaswani 等人 [47] 使用自迴歸模型,一次預測一個元素的輸出序列。由於解碼器也是置換不變的,N 個輸入嵌入必須是不同的,以產生不同的結果。這些輸入嵌入是學習得的位置編碼,我們稱爲對象查詢 object queries,類似於編碼器,把它們添加到每個注意層的輸入。N 個對象查詢被譯碼器轉換成一個嵌入的輸出。然後,它們被一個前饋網絡 (在下一小節中描述) 獨立解碼成 box 座標和類標籤,從而產生 N 個最終預測。利用自身和編碼器-解碼器對這些嵌入的關注,模型通過它們之間的成對關係對所有對象進行全局推理,同時能夠使用整個圖像作爲上下文。
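A simplified sketch of the decoder with object queries; in full DETR the queries are added to the input of every attention layer, whereas here they are added once to the zero target, which is an approximation.

```python
import torch
from torch import nn

d, num_queries = 256, 100                        # the paper uses N = 100 object queries
query_embed = nn.Embedding(num_queries, d)       # learned positional encodings (object queries)
decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d_model=d, nhead=8), num_layers=6)

memory = torch.randn(850, 1, d)                  # HW x B x d encoder output
tgt = torch.zeros(num_queries, 1, d)             # decoder input, initially set to zero
queries = query_embed.weight.unsqueeze(1)        # N x 1 x d
hs = decoder(tgt + queries, memory)              # N x B x d: one output embedding per query
```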


Appendix

Detailed architecture

The detailed description of the transformer used in DETR, with positional encodings passed at every attention layer, is given in Fig. 10. Image features from the CNN backbone are passed through the transformer encoder, together with spatial positional encoding that are added to queries and keys at every multihead self-attention layer. Then, the decoder receives queries (initially set to zero), output positional encoding (object queries), and encoder memory, and produces the final set of predicted class labels and bounding boxes through multiple multihead self-attention and decoder-encoder attention. The first self-attention layer in the first decoder layer can be skipped.

Figure 10 gives a detailed description of the transformer used in DETR, with positional encodings passed at every attention layer. Image features from the CNN backbone pass through the transformer encoder together with the spatial positional encodings, which are added to the queries Q and keys K at every multi-head self-attention layer. The decoder then receives the queries (initially set to zero), the output positional encodings (object queries), and the encoder memory, and produces the final set of predicted class labels and bounding boxes through multiple multi-head self-attention and decoder-encoder attention layers. The first self-attention layer in the first decoder layer can be skipped.


 

  • Prediction feed-forward networks (FFNs).

The final prediction is computed by a 3-layer perceptron with ReLU activation function and hidden dimension d, and a linear projection layer. The FFN predicts the normalized center coordinates, height and width of the box w.r.t. the input image, and the linear layer predicts the class label using a softmax function. Since we predict a fixed-size set of N bounding boxes, where N is usually much larger than the actual number of objects of interest in an image, an additional special class label ∅ is used to represent that no object is detected within a slot. This class plays a similar role to the "background" class in the standard object detection approaches.

FFN structure: a 3-layer perceptron with ReLU activations and hidden dimension d predicts the box (normalized center coordinates, height and width);

a linear projection layer predicts the class with a softmax. A sketch of the two heads follows.
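A sketch of the two prediction heads, assuming hidden dimension d = 256 and COCO's 91 categories (both values from the paper's setup):

```python
import torch
from torch import nn

d, num_classes = 256, 91

# 3-layer perceptron with ReLU: predicts normalized (cx, cy, w, h) via a sigmoid
bbox_head = nn.Sequential(
    nn.Linear(d, d), nn.ReLU(),
    nn.Linear(d, d), nn.ReLU(),
    nn.Linear(d, 4),
)
class_head = nn.Linear(d, num_classes + 1)   # linear projection; +1 for the ∅ class

hs = torch.randn(100, 1, d)                  # decoder output embeddings, N x B x d
boxes = bbox_head(hs).sigmoid()              # N x B x 4
logits = class_head(hs)                      # N x B x (num_classes + 1); softmax at inference
```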

 

  • Auxiliary decoding losses.

We found it helpful to use auxiliary losses [1] in the decoder during training, especially to help the model output the correct number of objects of each class. We add prediction FFNs and Hungarian loss after each decoder layer. All prediction FFNs share their parameters. We use an additional shared layer-norm to normalize the input to the prediction FFNs from different decoder layers.

During training it proved helpful to use auxiliary losses in the decoder, especially to help the model output the correct number of objects of each class.

Prediction FFNs and a Hungarian loss are added after each decoder layer; all prediction FFNs share their parameters.

An additional shared layer-norm normalizes the inputs to the prediction FFNs from the different decoder layers, as sketched below.
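A sketch of the auxiliary supervision, assuming the stacked per-layer decoder outputs are available (shapes and the linear bbox_head are illustrative; the full model uses the 3-layer MLP above):

```python
import torch
from torch import nn

num_layers, N, B, d, num_classes = 6, 100, 2, 256, 91
hs = torch.randn(num_layers, N, B, d)        # output of every decoder layer

shared_norm = nn.LayerNorm(d)                # shared layer-norm before the heads
class_head = nn.Linear(d, num_classes + 1)   # shared prediction FFN parameters
bbox_head = nn.Linear(d, 4)

aux_outputs = []
for layer_out in hs:                         # one prediction set per decoder layer
    h = shared_norm(layer_out)
    aux_outputs.append((class_head(h), bbox_head(h).sigmoid()))
# The Hungarian loss is then computed on each entry of aux_outputs
# (the last entry being the final prediction) and the losses are summed.
```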