Paper notes + source code: DETR: End-to-End Object Detection with Transformers

Part 0: Background knowledge needed for this paper

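  • Object detection: be familiar with the basic technical routes of traditional object detection (e.g. anchor-based methods, non-maximum suppression, one-stage and two-stage detectors), and have a rough idea of the mainstream detectors DETR is compared against (e.g. Faster R-CNN)
  • Transformer: understand how the Transformer works and know the self-attention mechanism
  • Bipartite matching: understand bipartite matching from graph theory and know the Hungarian algorithm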


Part 1: Key points of the abstract

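1. Compared with the traditional pipeline: many hand-designed components are removed, such as non-maximum suppression and anchor design.
These hand-designed components all amount to a human encoding, to some degree, of prior knowledge about the task.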

2. Core ingredients of DETR
(a) a set-based global loss that forces unique predictions via bipartite matching, and
(b) a transformer encoder-decoder architecture. (The Transformer used in the paper is non-autoregressive.)

For an introduction to non-autoregressive models, see https://zhuanlan.zhihu.com/p/82892975

3. What DETR does
· Input: a fixed small set of learned object queries
· DETR reasons about:

(a) the relations of the objects
(b) the global image context
and from these directly outputs the final set of predictions in parallel.

4. Pipeline architecture diagram:

A more detailed pipeline architecture diagram ↓:

 

Part 2: Main text

1. First, object detection is framed as a set prediction problem.

2. The whole network is designed end to end and is trained with a set loss function, which performs bipartite matching between predicted and ground-truth objects.

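3. DETR's innovation is purely architectural; it introduces no new layer of its own (ResNet, for example, introduced the skip connection, whereas DETR does not innovate at the layer level).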

4. The matching loss function used by DETR uniquely assigns a prediction to a ground truth object (this one-to-one assignment is exactly what bipartite matching means), and the loss is invariant to a permutation of the predictions (this is why the problem is modeled as bipartite matching, here specifically on an undirected bipartite graph) → this is one reason the predictions can be produced in parallel.

"Matching" here is a graph-theory concept; see https://www.renfei.org/blog/bipartite-matching.html

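5. Why the task is modeled as set prediction:

Set prediction is usually treated as a multi-label classification problem. The usual way to solve multi-label classification is one-vs-rest (also called one-vs-all: for each label, that class is the "one" and all remaining classes are lumped together as the "rest", and a classifier is trained on that split). This approach does not apply to problems such as detection where there is an underlying structure between elements (i.e., near-identical boxes). It produces large numbers of near-duplicates. Traditional detectors handle these piles of nearly identical predictions with postprocessing such as non-maximum suppression, but once the task is modeled as set prediction this postprocessing is no longer needed. Set prediction instead needs a global strategy that models the relations among the predicted elements, so that the model does not emit many useless, duplicated results.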
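To make concrete what kind of postprocessing the set prediction formulation removes, here is a minimal sketch of greedy non-maximum suppression in plain Python. It is an illustration only: the function names, the (x1, y1, x2, y2) box format and the 0.5 IoU threshold are assumptions made for this example and are not taken from the DETR code.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes that overlap a kept box too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two near-identical boxes on one object plus one box on a second object.
boxes = [(10, 10, 50, 50), (12, 11, 51, 49), (100, 100, 140, 150)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)  # indices of the boxes that survive
```

The near-duplicate (index 1) is removed only after prediction, by a hand-tuned threshold; this is the step DETR aims to make unnecessary.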

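6. Why bipartite matching is used as the loss between predictions and ground truth:

In a set prediction problem the loss must be invariant by a permutation of the predictions, i.e. the order of the predicted values/boxes must not affect the loss value. Bipartite matching, here specifically undirected bipartite matching, models the relation between predictions and ground truth as an undirected bipartite graph, and a matching on such a graph has no notion of order. Specifically, the Hungarian algorithm is used to solve the bipartite matching problem.

· Bipartite matching (1) guarantees invariance to the order of the predictions; (2) guarantees a one-to-one match between the two sides.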
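The two properties above can be checked on a toy example. The sketch below is not the Hungarian algorithm: it brute-forces all assignments with itertools, which gives the same minimum-cost matching on tiny inputs. The L1 cost between 2-D points and the names `match` and `set_loss` are assumptions made for illustration, not DETR's actual matching cost.

```python
from itertools import permutations


def l1(p, q):
    """L1 distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))


def match(cost):
    """Brute-force minimum-cost one-to-one assignment for a square cost matrix.

    cost[i][j] is the cost of assigning prediction i to ground truth j.
    Stand-in for the Hungarian algorithm; only usable for very small n.
    """
    n = len(cost)
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total


def set_loss(preds, targets):
    """Match predictions to targets, then return (assignment, matched cost)."""
    cost = [[l1(p, t) for t in targets] for p in preds]
    return match(cost)


targets = [(0.2, 0.2), (0.8, 0.8), (0.5, 0.1)]
preds = [(0.75, 0.8), (0.5, 0.15), (0.25, 0.2)]

assignment, loss = set_loss(preds, targets)

# Reordering the predictions changes the assignment indices but not the loss.
shuffled = [preds[2], preds[0], preds[1]]
assignment_shuffled, loss_shuffled = set_loss(shuffled, targets)
```

Each ground-truth index appears exactly once in `assignment` (one-to-one), and `loss` equals `loss_shuffled` up to floating-point rounding (permutation invariance).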

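7. More accurate predictions on large objects:

The paper says "a result likely enabled by the non-local computations of the transformer". The "non-local computations" here refer to the non-local concept from the paper Non-local Neural Networks (https://arxiv.org/pdf/1711.07971.pdf).

Non-local computations aggregate information over a non-local receptive field, i.e. beyond a local neighborhood; see https://zhuanlan.zhihu.com/p/33345791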
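As a rough illustration of why self-attention counts as a non-local computation, the sketch below computes scaled dot-product self-attention over a handful of positions in plain Python. It assumes identity query/key/value projections and a single head, which a real Transformer layer does not use; the function names are made up for this example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def self_attention(x):
    """x: list of feature vectors, one per position.

    Returns (outputs, weights); weights[i][j] is how much position i
    attends to position j. Q, K and V are all x itself (assumption).
    """
    d = len(x[0])
    outputs, weights = [], []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * x[j][c] for j in range(len(x)))
                        for c in range(d)])
    return outputs, weights


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
out, weights = self_attention(features)
```

Every row of `weights` is strictly positive over all positions, so each output mixes information from the whole input rather than from a fixed local window as a convolution would.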

Part 3: Results

 

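Part 4: Source code discussion

In case the repository changes later, the excerpts below are taken from the latest commit at the time of writing (2020.06.18), 1fcfc65.

(1) Overview of the DETR network structure: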

class DETR(nn.Module):
    """ This is the DETR module that performs object detection """
    def __init__(self, backbone, transformer, num_classes, num_queries, aux_loss=False):
        """ Initializes the model.
        Parameters:
            backbone: torch module of the backbone to be used. See backbone.py
            transformer: torch module of the transformer architecture. See transformer.py
            num_classes: number of object classes
            num_queries: number of object queries, ie detection slot. This is the maximal number of objects
                         DETR can detect in a single image. For COCO, we recommend 100 queries.
            aux_loss: True if auxiliary decoding losses (loss at each decoder layer) are to be used.
        """
        super().__init__()
        self.num_queries = num_queries
        self.transformer = transformer
        hidden_dim = transformer.d_model
        self.class_embed = nn.Linear(hidden_dim, num_classes + 1)
        self.bbox_embed = MLP(hidden_dim, hidden_dim, 4, 3)
        self.query_embed = nn.Embedding(num_queries, hidden_dim)
        self.input_proj = nn.Conv2d(backbone.num_channels, hidden_dim, kernel_size=1)
        self.backbone = backbone
        self.aux_loss = aux_loss

    def forward(self, samples: NestedTensor):
        """ The forward expects a NestedTensor, which consists of:
               - samples.tensor: batched images, of shape [batch_size x 3 x H x W]
               - samples.mask: a binary mask of shape [batch_size x H x W], containing 1 on padded pixels
            It returns a dict with the following elements:
               - "pred_logits": the classification logits (including no-object) for all queries.
                                Shape= [batch_size x num_queries x (num_classes + 1)]
               - "pred_boxes": The normalized boxes coordinates for all queries, represented as
                               (center_x, center_y, height, width). These values are normalized in [0, 1],
                               relative to the size of each individual image (disregarding possible padding).
                               See PostProcess for information on how to retrieve the unnormalized bounding box.
               - "aux_outputs": Optional, only returned when auxilary losses are activated. It is a list of
                                dictionnaries containing the two above keys for each decoder layer.
        """
        if not isinstance(samples, NestedTensor):
            samples = nested_tensor_from_tensor_list(samples)
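        features, pos = self.backbone(samples)  # the backbone is a CNN used for feature extraction

        # features[-1] is the last feature map; decompose() splits the NestedTensor into its tensor and padding mask
        src, mask = features[-1].decompose()
        assert mask is not None
        # Only the last feature map is passed to the Transformer as src. input_proj is a
        # convolution layer that reduces the channel dimension to d_model (the model dimension).
        hs = self.transformer(self.input_proj(src), mask, self.query_embed.weight, pos[-1])[0]

        # To apply the Transformer to object detection, the authors add a class embedding and a box embedding network.
        outputs_class = self.class_embed(hs)
        # A sigmoid after the box embedding yields normalized box values (the paper describes
        # center coordinates, height and width relative to the image size).
        outputs_coord = self.bbox_embed(hs).sigmoid()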
        out = {'pred_logits': outputs_class[-1], 'pred_boxes': outputs_coord[-1]}
        if self.aux_loss:
            out['aux_outputs'] = self._set_aux_loss(outputs_class, outputs_coord)
        return out

    @torch.jit.unused
    def _set_aux_loss(self, outputs_class, outputs_coord):
        # this is a workaround to make torchscript happy, as torchscript
        # doesn't support dictionary with non-homogeneous values, such
        # as a dict having both a Tensor and a list.
        return [{'pred_logits': a, 'pred_boxes': b}
                for a, b in zip(outputs_class[:-1], outputs_coord[:-1])]

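As the code shows, the main structure of DETR is: CNN backbone → input_proj (1x1 convolution) → Transformer → two prediction heads (class_embed for classes, bbox_embed for boxes).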
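The forward docstring says the predicted boxes are normalized to [0, 1] and points to PostProcess for recovering pixel coordinates. The sketch below shows what such a conversion could look like in plain Python. It assumes the value order (center_x, center_y, width, height); the docstring quoted above lists height before width, so the order should be checked against the repository. The function names are hypothetical and this is not the repository's PostProcess.

```python
def cxcywh_to_xyxy(box):
    """Convert (center_x, center_y, width, height) to corner format.

    The (cx, cy, w, h) order is an assumption made for this sketch.
    """
    cx, cy, w, h = box
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)


def to_pixels(box, img_w, img_h):
    """Scale a normalized box to pixel coordinates of an img_w x img_h image."""
    x1, y1, x2, y2 = cxcywh_to_xyxy(box)
    return (x1 * img_w, y1 * img_h, x2 * img_w, y2 * img_h)


# A box centered in the image, 20% of the width and 40% of the height.
pixel_box = to_pixels((0.5, 0.5, 0.2, 0.4), img_w=640, img_h=480)
```

Because the sigmoid keeps every value in [0, 1], the scaling has to use the size of each individual image, not the padded batch size.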

(2) The backbone source code is as follows:

BackboneBase:

class BackboneBase(nn.Module):

    def __init__(self, backbone: nn.Module, train_backbone: bool, num_channels: int, return_interm_layers: bool):
        super().__init__()
        for name, parameter in backbone.named_parameters():
            if not train_backbone or 'layer2' not in name and 'layer3' not in name and 'layer4' not in name:
                parameter.requires_grad_(False)
        if return_interm_layers:
            return_layers = {"layer1": "0", "layer2": "1", "layer3": "2", "layer4": "3"}
        else:
            return_layers = {'layer4': "0"}
        self.body = IntermediateLayerGetter(backbone, return_layers=return_layers)
        self.num_channels = num_channels

    def forward(self, tensor_list: NestedTensor):
        xs = self.body(tensor_list.tensors)
        out: Dict[str, NestedTensor] = {}
        for name, x in xs.items():
            m = tensor_list.mask
            assert m is not None
            mask = F.interpolate(m[None].float(), size=x.shape[-2:]).to(torch.bool)[0]
            out[name] = NestedTensor(x, mask)
        return out
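The freezing condition in `__init__` mixes `or` and `and`, so it is easy to misread. In Python `and` binds tighter than `or`, so a parameter is frozen when the backbone is not trained at all, or when its name contains none of layer2, layer3, layer4. The sketch below restates that rule as a standalone function; `is_trainable` and the sample parameter names are made up for illustration.

```python
def is_trainable(name, train_backbone):
    """Mirror the requires_grad rule of BackboneBase with explicit parentheses."""
    frozen = (not train_backbone) or (
        'layer2' not in name and 'layer3' not in name and 'layer4' not in name
    )
    return not frozen


# Hypothetical parameter names in the style of a torchvision ResNet.
names = [
    'conv1.weight',
    'layer1.0.conv1.weight',
    'layer2.0.conv1.weight',
    'layer4.1.conv2.weight',
]
trainable = [n for n in names if is_trainable(n, train_backbone=True)]
```

With `train_backbone=True` only the layer2 to layer4 parameters remain trainable; the stem and layer1 stay frozen either way.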

Backbone:

class Backbone(BackboneBase):
    """ResNet backbone with frozen BatchNorm."""
    def __init__(self, name: str,
                 train_backbone: bool,
                 return_interm_layers: bool,
                 dilation: bool):
        backbone = getattr(torchvision.models, name)(
            replace_stride_with_dilation=[False, False, dilation],
            pretrained=is_main_process(), norm_layer=FrozenBatchNorm2d)
        num_channels = 512 if name in ('resnet18', 'resnet34') else 2048
        super().__init__(backbone, train_backbone, num_channels, return_interm_layers)
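The docstring above mentions a frozen BatchNorm. As a minimal sketch of what "frozen" means, assuming the usual batch-norm affine form applied with fixed statistics: the mean, variance, weight and bias are constants, so the layer reduces to a per-channel scale and shift. The function below handles one channel in plain Python; its name and the default `eps` are assumptions, not the repository's FrozenBatchNorm2d.

```python
import math


def frozen_batch_norm(x, weight, bias, running_mean, running_var, eps=1e-5):
    """Apply batch norm to one channel using fixed statistics.

    Nothing is updated here: no running statistics, no learned parameters.
    """
    scale = weight / math.sqrt(running_var + eps)
    shift = bias - running_mean * scale
    return [v * scale + shift for v in x]


# eps=0.0 only to keep the arithmetic of this toy example exact.
y = frozen_batch_norm([3.0, 5.0], weight=2.0, bias=1.0,
                      running_mean=3.0, running_var=4.0, eps=0.0)
```

Since the output depends only on the input and the stored constants, it does not change with the batch composition.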

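The Backbone is in fact an adapted ResNet. (1) It uses FrozenBatchNorm2d, which freezes part of the parameters. This is not original to DETR: maskrcnn-benchmark, open-sourced by the authors' colleagues (also at Facebook), already used FrozenBatchNorm2d.

(2) The ResNet is wrapped as the backbone inside another sub-network. This sub-network feeds the tensors of the incoming tensor list through ResNet, then takes each returned feature map (a Tensor) in turn, resamples the input mask to that feature map's spatial size, and packs the pair into the custom NestedTensor, which is stored in the output dict out as name: NestedTensor. (A NestedTensor bundles two variables: x and mask.)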
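The mask resampling step can be pictured with a small pure-Python example. It assumes nearest-neighbor sampling with the index rule src = floor(dst * in_size / out_size), which is my understanding of the default mode of F.interpolate; treat it as an illustration rather than a specification. The function name `downsample_mask` is hypothetical.

```python
def downsample_mask(mask, out_h, out_w):
    """Nearest-neighbor resize of a 2-D boolean mask (list of lists)."""
    in_h, in_w = len(mask), len(mask[0])
    return [
        [bool(mask[(i * in_h) // out_h][(j * in_w) // out_w])
         for j in range(out_w)]
        for i in range(out_h)
    ]


# 4 x 8 input mask: True marks padded pixels (the last 3 columns here).
mask = [[False] * 5 + [True] * 3 for _ in range(4)]

# Feature map with half the resolution in each dimension.
small = downsample_mask(mask, out_h=2, out_w=4)
```

The downsampled mask still marks the right edge as padding, so the Transformer can ignore those feature-map positions.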

 

TBC. (The unfinished parts will be added soon; after all, I am learning as I read and writing things down along the way...)