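1. Compared with the traditional pipeline, DETR removes many hand-designed components, such as non-maximum suppression and anchor design.
These hand-designed components all explicitly encode, to some extent, human prior knowledge about the task.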
2. Core ingredients of DETR:
(a) a set-based global loss that forces unique predictions via bipartite matching, and
(b) a transformer encoder-decoder architecture. (The Transformer used in the paper is non-autoregressive.)
For an introduction to non-autoregressive decoding, see https://zhuanlan.zhihu.com/p/82892975
3. What DETR does:
· Input: a fixed small set of learned object queries
· Output: DETR reasons about
(a) the relations of the objects and
(b) the global image context
to directly output the final set of predictions in parallel.
4. Architecture diagram:
A more detailed architecture diagram ↓:
1. First, object detection is framed as a set prediction problem.
2. The whole network is designed end-to-end and trained with a set loss function, which performs bipartite matching between predicted and ground-truth objects.
3. DETR's novelty is purely architectural; it introduces no new kind of layer (unlike ResNet, which introduced skip connections, DETR does not innovate at the layer level).
4. DETR's matching loss function uniquely assigns a prediction to a ground truth object (this one-to-one assignment is exactly what a matching in a bipartite graph means), and it is invariant to a permutation of the predicted objects (which is also why the problem is modeled as bipartite matching; here this means an undirected bipartite graph) → this is one reason the predictions can be emitted in parallel.
"Matching" here is a graph-theory concept; see https://www.renfei.org/blog/bipartite-matching.html
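5. Why detection is modeled as "Set Prediction":
Set prediction is typically treated as a multi-label classification problem. The usual baseline for multi-label classification is "one-vs-rest" (also called one-vs-all: each label's class is treated as the "one" and all remaining classes are lumped together as the "rest" during training). This approach does not apply to problems such as detection where there is an underlying structure between elements (i.e., near-identical boxes). It produces many near-duplicates, which traditional detectors clean up with post-processing such as non-maximum suppression; modeling detection as set prediction makes that post-processing unnecessary. Set prediction instead needs a global strategy that models the relations between all predicted elements, so that redundant duplicate predictions are avoided.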
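To make the post-processing that DETR drops concrete, here is a minimal pure-Python sketch of greedy non-maximum suppression. The function names `iou` and `nms`, the `(x1, y1, x2, y2)` box format, and the 0.5 threshold are assumptions chosen for illustration; this is not code from the DETR repository.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap an already-kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two near-identical boxes around one object, plus one distinct box.
boxes = [(10, 10, 50, 50), (12, 11, 51, 49), (100, 100, 140, 150)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # indices of the surviving boxes
```

The threshold is exactly the kind of hand-tuned prior that items 1 and 5 refer to: it encodes how much overlap two distinct objects are allowed to have.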
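6. Why "Bipartite Matching" is used as the loss between predictions and ground truth:
In a set prediction problem the loss must be invariant by a permutation of the predictions (the order of the predicted boxes must not change the loss value). Bipartite matching, here meaning matching in an undirected bipartite graph, models the "prediction → ground truth" relation as such a graph, and a matching in it has no notion of order. Specifically, the matching problem is solved with the Hungarian algorithm.
· "Bipartite Matching" (1) guarantees permutation invariance of the predictions; (2) guarantees a one-to-one match between the two sets.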
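The two properties above can be checked on a toy example. The sketch below is an illustration under simplifying assumptions: it finds the optimal one-to-one assignment by brute force over permutations instead of the Hungarian algorithm (same result, only feasible for tiny sets), it uses a plain L1 box distance as the matching cost, and it assumes the prediction set and the ground-truth set have the same size. The helper names `l1`, `cost_matrix` and `match` are hypothetical.

```python
from itertools import permutations


def l1(a, b):
    """L1 distance between two boxes given as equal-length tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cost_matrix(preds, gts):
    """cost[i][j] = cost of assigning prediction i to ground truth j."""
    return [[l1(p, g) for g in gts] for p in preds]


def match(cost):
    """Minimum-cost one-to-one assignment by exhaustive search.
    Returns (assignment, total) where assignment[i] is the ground-truth
    index matched to prediction i."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    total = sum(cost[i][best[i]] for i in range(n))
    return list(best), total


preds = [(0.5, 0.5, 0.2, 0.2), (0.1, 0.1, 0.1, 0.1), (0.8, 0.3, 0.3, 0.4)]
gts = [(0.1, 0.12, 0.1, 0.1), (0.82, 0.3, 0.3, 0.4), (0.5, 0.48, 0.2, 0.2)]

assignment, total = match(cost_matrix(preds, gts))
pairs = {(preds[i], gts[j]) for i, j in enumerate(assignment)}

# Feed the same predictions in a different order: the matched pairs and the
# total cost stay the same, which is the permutation invariance.
shuffled = preds[::-1]
assignment_s, total_s = match(cost_matrix(shuffled, gts))
pairs_s = {(shuffled[i], gts[j]) for i, j in enumerate(assignment_s)}

print(assignment, assignment_s, pairs == pairs_s)
```

Every ground-truth index appears exactly once in `assignment`, which is the one-to-one property; for realistic set sizes an O(n^3) Hungarian solver replaces the factorial-time search.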
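7. More accurate predictions for large objects:
The paper says "a result likely enabled by the non-local computations of the transformer". "Non-local computations" here refers to the Non-local concept from Non-local Neural Networks (https://arxiv.org/pdf/1711.07971.pdf).
Non-local computations aggregate information over a "non-local" receptive field, i.e. from all positions rather than from a local neighborhood; see https://zhuanlan.zhihu.com/p/33345791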
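As a rough illustration of what "non-local" means in practice, the following pure-Python sketch computes single-head dot-product self-attention over a handful of positions. It is deliberately simplified: the query, key and value projections are taken to be the identity, and the function name `self_attention` is an assumption, not the transformer code used by DETR.

```python
import math


def self_attention(x):
    """Self-attention with identity projections.
    x: list of n feature vectors, one per spatial position.
    Returns (outputs, weights); weights[i][j] is how much position j
    contributes to the output at position i."""
    d = len(x[0])
    outputs, weights = [], []
    for q in x:
        # Scaled dot product of this position with every position.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        m = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        w = [e / z for e in exps]
        weights.append(w)
        outputs.append([sum(w[j] * x[j][c] for j in range(len(x)))
                        for c in range(d)])
    return outputs, weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = self_attention(x)
for row in weights:
    print([round(v, 3) for v in row])
```

Every weight is strictly positive, so each output position mixes in information from all positions at once, whereas a convolution only sees a fixed local window per layer.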
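3. Results
In case the code base changes later, I pinned the latest commit at the time of writing (2020.06.18), 1fcfc65, for the partial source walkthrough below.
(1) Overview of the DETR network structure: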
    import torch
    from torch import nn
    # In the DETR repo, NestedTensor and nested_tensor_from_tensor_list live in util/misc.py;
    # MLP is defined next to DETR in models/detr.py.
    from util.misc import NestedTensor, nested_tensor_from_tensor_list


    class DETR(nn.Module):
        """ This is the DETR module that performs object detection """
        def __init__(self, backbone, transformer, num_classes, num_queries, aux_loss=False):
            """ Initializes the model.
            Parameters:
                backbone: torch module of the backbone to be used. See backbone.py
                transformer: torch module of the transformer architecture. See transformer.py
                num_classes: number of object classes
                num_queries: number of object queries, ie detection slot. This is the maximal number of objects
                             DETR can detect in a single image. For COCO, we recommend 100 queries.
                aux_loss: True if auxiliary decoding losses (loss at each decoder layer) are to be used.
            """
            super().__init__()
            self.num_queries = num_queries
            self.transformer = transformer
            hidden_dim = transformer.d_model
            self.class_embed = nn.Linear(hidden_dim, num_classes + 1)
            self.bbox_embed = MLP(hidden_dim, hidden_dim, 4, 3)
            self.query_embed = nn.Embedding(num_queries, hidden_dim)
            self.input_proj = nn.Conv2d(backbone.num_channels, hidden_dim, kernel_size=1)
            self.backbone = backbone
            self.aux_loss = aux_loss

        def forward(self, samples: NestedTensor):
            """ The forward expects a NestedTensor, which consists of:
                   - samples.tensor: batched images, of shape [batch_size x 3 x H x W]
                   - samples.mask: a binary mask of shape [batch_size x H x W], containing 1 on padded pixels

                It returns a dict with the following elements:
                   - "pred_logits": the classification logits (including no-object) for all queries.
                                    Shape= [batch_size x num_queries x (num_classes + 1)]
                   - "pred_boxes": The normalized boxes coordinates for all queries, represented as
                                   (center_x, center_y, height, width). These values are normalized in [0, 1],
                                   relative to the size of each individual image (disregarding possible padding).
                                   See PostProcess for information on how to retrieve the unnormalized bounding box.
                   - "aux_outputs": Optional, only returned when auxilary losses are activated. It is a list of
                                    dictionnaries containing the two above keys for each decoder layer.
            """
            if not isinstance(samples, NestedTensor):
                samples = nested_tensor_from_tensor_list(samples)
            # The backbone is a CNN used for feature extraction.
            features, pos = self.backbone(samples)

            # decompose() splits the last feature level's NestedTensor into (tensors, mask).
            src, mask = features[-1].decompose()
            assert mask is not None
            # Only the last feature map is passed to the Transformer as src. input_proj is a
            # 1x1 convolution that shrinks the channel dimension to d_model (the model dimension).
            hs = self.transformer(self.input_proj(src), mask, self.query_embed.weight, pos[-1])[0]

            # To apply the Transformer to object detection, the authors add a
            # "class embedding" head and a "box embedding" head.
            outputs_class = self.class_embed(hs)
            # A sigmoid after the box head yields the box coordinates (the paper describes them as
            # normalized center coordinates, height and width relative to the input image).
            outputs_coord = self.bbox_embed(hs).sigmoid()
            out = {'pred_logits': outputs_class[-1], 'pred_boxes': outputs_coord[-1]}
            if self.aux_loss:
                out['aux_outputs'] = self._set_aux_loss(outputs_class, outputs_coord)
            return out

        @torch.jit.unused
        def _set_aux_loss(self, outputs_class, outputs_coord):
            # this is a workaround to make torchscript happy, as torchscript
            # doesn't support dictionary with non-homogeneous values, such
            # as a dict having both a Tensor and a list.
            return [{'pred_logits': a, 'pred_boxes': b}
                    for a, b in zip(outputs_class[:-1], outputs_coord[:-1])]
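As the code shows, DETR's main pipeline is: CNN backbone → input_proj (1×1 convolution) → transformer → two prediction heads, class_embed and bbox_embed.
(2) The Backbone source code is as follows: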
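The "pred_boxes" returned by forward are normalized to [0, 1], so they have to be scaled back before drawing or evaluation. The sketch below shows one way this conversion could look in pure Python. It assumes the four values are ordered (center_x, center_y, width, height); note that the docstring above lists height before width, so check the order against the repo's box utilities before relying on it. The function names and the 640×480 image size are made up for the example.

```python
def cxcywh_to_xyxy(box):
    """Convert (center_x, center_y, width, height) to corner format (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)


def to_pixels(box, img_w, img_h):
    """Scale a normalized (cx, cy, w, h) box to absolute pixel corners."""
    x1, y1, x2, y2 = cxcywh_to_xyxy(box)
    return (x1 * img_w, y1 * img_h, x2 * img_w, y2 * img_h)


# A box centered in the image, a quarter of its width and half of its height.
pixel_box = to_pixels((0.5, 0.5, 0.25, 0.5), img_w=640, img_h=480)
print(pixel_box)
```

Because the values are relative to each individual image, the same normalized box maps to different pixel coordinates for images of different sizes in one batch.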
    from typing import Dict

    import torch
    import torch.nn.functional as F
    import torchvision
    from torch import nn
    from torchvision.models._utils import IntermediateLayerGetter

    # In the DETR repo these helpers live in util/misc.py;
    # FrozenBatchNorm2d is defined in the same file as Backbone (models/backbone.py).
    from util.misc import NestedTensor, is_main_process


    class BackboneBase(nn.Module):

        def __init__(self, backbone: nn.Module, train_backbone: bool, num_channels: int, return_interm_layers: bool):
            super().__init__()
            # Freeze everything when train_backbone is False; otherwise train only layer2-layer4.
            for name, parameter in backbone.named_parameters():
                if not train_backbone or 'layer2' not in name and 'layer3' not in name and 'layer4' not in name:
                    parameter.requires_grad_(False)
            if return_interm_layers:
                return_layers = {"layer1": "0", "layer2": "1", "layer3": "2", "layer4": "3"}
            else:
                return_layers = {'layer4': "0"}
            self.body = IntermediateLayerGetter(backbone, return_layers=return_layers)
            self.num_channels = num_channels

        def forward(self, tensor_list: NestedTensor):
            xs = self.body(tensor_list.tensors)
            out: Dict[str, NestedTensor] = {}
            for name, x in xs.items():
                m = tensor_list.mask
                assert m is not None
                # Resize the padding mask to the spatial size of this feature map.
                mask = F.interpolate(m[None].float(), size=x.shape[-2:]).to(torch.bool)[0]
                out[name] = NestedTensor(x, mask)
            return out


    class Backbone(BackboneBase):
        """ResNet backbone with frozen BatchNorm."""
        def __init__(self, name: str,
                     train_backbone: bool,
                     return_interm_layers: bool,
                     dilation: bool):
            backbone = getattr(torchvision.models, name)(
                replace_stride_with_dilation=[False, False, dilation],
                pretrained=is_main_process(), norm_layer=FrozenBatchNorm2d)
            num_channels = 512 if name in ('resnet18', 'resnet34') else 2048
            super().__init__(backbone, train_backbone, num_channels, return_interm_layers)
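Backbone is essentially an adapted ResNet. (1) It uses FrozenBatchNorm2d, which freezes the batch-norm statistics and affine parameters. This is not original to DETR: maskrcnn-benchmark, open-sourced by the authors' colleagues (also at Facebook), already used FrozenBatchNorm2d.
(2) ResNet is wrapped in another sub-module, BackboneBase, as its backbone. This sub-module feeds the tensors of the incoming NestedTensor through ResNet, pulls out the requested intermediate layer outputs one by one, resizes the padding mask to each output's spatial size, packs the two into the custom NestedTensor, and stores the result in the output dict out as name: NestedTensor pairs. (A NestedTensor bundles two things: x and mask.)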
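The mask handling described in (2) can be imitated without PyTorch. The sketch below is a simplified stand-in under stated assumptions: `NestedTensorSketch` is a hypothetical two-field container rather than the repo's NestedTensor, the feature map is a single 2-D grid, and `downsample_mask` mimics nearest-neighbour resizing by picking source index floor(i * in_size / out_size).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NestedTensorSketch:
    """Hypothetical stand-in: a feature map plus its padding mask."""
    tensors: List[List[float]]  # one H x W feature map, for simplicity
    mask: List[List[bool]]      # True on padded positions


def downsample_mask(mask, out_h, out_w):
    """Nearest-neighbour resize of a 2-D boolean mask to out_h x out_w."""
    in_h, in_w = len(mask), len(mask[0])
    return [[mask[(i * in_h) // out_h][(j * in_w) // out_w]
             for j in range(out_w)]
            for i in range(out_h)]


# A 4 x 4 image-level mask whose right half is padding.
mask = [[False, False, True, True] for _ in range(4)]

# Pretend the backbone reduced the spatial size from 4 x 4 to 2 x 2.
features = [[0.0, 0.0], [0.0, 0.0]]
small = downsample_mask(mask, out_h=2, out_w=2)
out = {"0": NestedTensorSketch(features, small)}
print(out["0"].mask)
```

Keeping the mask at the same resolution as each feature map lets later stages ignore positions that only exist because of batch padding.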
TBC. (I will add the unfinished parts soon; after all, I am learning as I read and writing things down as I go...)