Paper notes: Decoders Matter for Semantic Segmentation: Data-Dependent Decoding Enables Flexible Feature Agg

Date: 2019-12-05

Decoders Matter for Semantic Segmentation: Data-Dependent Decoding Enables Flexible Feature Aggregation

2019-04-24 16:53:25

Paper: https://arxiv.org/pdf/1903.02120.pdf

Code (unofficial PyTorch implementation): https://github.com/LinZhuoChen/DUpsampling

1. Background and Motivation:

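In a conventional encoder-decoder model, the decoder raises the resolution with bilinear interpolation. But is such a crude method really suited to segmentation? The authors propose a new module to replace bilinear interpolation: data-dependent upsampling (DUpsampling). Its benefit is that it fully exploits the redundancy in the label space of semantic segmentation and can still recover a pixel-wise prediction. So how is this done? In DeepLabv3+, the decoder is defined as shown in the figure below:

[Figure: the decoder of DeepLabv3+]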

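This framework has the following problems:

1). The overall stride of the encoder has to be reduced with multiple dilated (atrous) convolutions, which is computationally expensive.

2). The decoder usually has to fuse low-level features. Because of bilinear upsampling, the fineness of the final prediction is determined by the resolution of the fused low-level features. As a result, to obtain a high-resolution prediction the decoder must fuse high-resolution low-level features. This constraint narrows the design space of feature aggregation and leads to a suboptimal combination of features. In the paper's experiments, the authors find that a better feature aggregation strategy can be designed once aggregation is no longer constrained by resolution.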

2. Our Approach:

2.1 Beyond Bilinear: Data-dependent Upsampling:

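Let F denote the output features produced by the encoder's convolutions on the input image, and Y the corresponding ground truth. In a conventional segmentation setup, the loss function is:

L(F, Y) = Loss(softmax(bilinear(F)), Y)    (1)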

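Here the loss is usually the cross-entropy loss, and bilinear upsamples F to the same resolution as Y. The authors argue that bilinear interpolation is not the best choice for this upsampling. So instead of computing the error between bilinear(F) and Y, they compute the error between F and a resolution-reduced version of Y. Note that F and the reduced Y have the same resolution.

To compress Y, the authors look for a transformation that, under some metric, minimizes the reconstruction error between Y and its low-resolution version Ỹ. Specifically, Y is first partitioned into sub-windows, and each sub-window S is reshaped into a {0, 1} vector v. Each v is then compressed into a low-dimensional vector x, and all the x are stacked horizontally and vertically to form the final Ỹ.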
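The partition-and-reshape step can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the function name `label_windows`, the window size `r` and the toy label map are assumptions made for the example, and the one-hot encoding is one natural way to obtain a {0, 1} vector per sub-window.

```python
import numpy as np

def label_windows(labels, num_classes, r):
    """Split an (H, W) integer label map into r x r sub-windows and return
    one {0, 1} vector of length r*r*num_classes per window.
    (Hypothetical helper; names and layout are assumptions.)"""
    H, W = labels.shape
    assert H % r == 0 and W % r == 0, "H and W must be divisible by r"
    one_hot = np.eye(num_classes, dtype=np.float32)[labels]   # (H, W, C)
    # Split both spatial axes into (window index, offset inside window).
    blocks = one_hot.reshape(H // r, r, W // r, r, num_classes)
    # Bring the two window indices to the front: (H/r, W/r, r, r, C).
    blocks = blocks.transpose(0, 2, 1, 3, 4)
    # Flatten each window into a single vector v.
    return blocks.reshape(H // r, W // r, r * r * num_classes)

labels = np.array([[0, 0, 1, 1],
                   [0, 2, 1, 1],
                   [2, 2, 0, 0],
                   [2, 2, 0, 1]])
V = label_windows(labels, num_classes=3, r=2)
print(V.shape)  # (2, 2, 12)
```

Each window vector contains exactly one 1 per pixel, so it sums to r*r.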

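This transformation can therefore be expressed with two matrices P and W, where P compresses v into x and W reconstructs v from x:

x = Pv,  ṽ = Wx    (2)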

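P and W can be learned on the training set by minimizing the reconstruction error:

P*, W* = argmin_{P,W} Σ_v ||v − ṽ||² = argmin_{P,W} Σ_v ||v − WPv||²    (3)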

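Using PCA, the authors obtain a closed-form solution to this objective, which in turn gives a compressed version Ỹ of the ground truth Y. With Ỹ as the learning target, we can pre-train a network with a regression loss:

L(F, Y) = ||F − Ỹ||²    (4)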
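To make the closed-form step concrete, here is a small NumPy sketch that fits P and W with an SVD. It is an illustration under stated assumptions rather than the paper's implementation: it uses uncentered PCA so that x = Pv has no bias term, sets W to the transpose of P, and runs on synthetic toy windows. The name `fit_projection` and the compressed dimension `k` are hypothetical.

```python
import numpy as np

def fit_projection(V, k):
    """V: (num_windows, N) matrix whose rows are {0, 1} window vectors.
    Returns P with shape (k, N) and W with shape (N, k).
    (Hypothetical helper; uncentered PCA is an assumption.)"""
    # Rows of Vt are the principal directions of the uncentered data.
    _, _, Vt = np.linalg.svd(V, full_matrices=False)
    P = Vt[:k]   # compress:    x  = P v
    W = P.T      # reconstruct: v~ = W x
    return P, W

rng = np.random.default_rng(0)
# Toy data: 200 windows of 2x2 pixels with 3 classes, where every pixel in a
# window shares one class. Real label maps are less extreme but similarly
# redundant, which is what makes the compression work.
classes = rng.integers(0, 3, size=200)
V = np.tile(np.eye(3)[classes], (1, 4))   # (200, 12)

P, W = fit_projection(V, k=3)
recon = V @ P.T @ W.T                     # each row is W P v
err = np.mean((V - recon) ** 2)
print(P.shape, W.shape)                   # (3, 12) (12, 3)
```

Because the toy data has rank 3, three components already reconstruct it almost exactly, while a smaller `k` leaves a visible error.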

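Any regression loss, the L2 loss being one example, can be used in Eq. 4. However, the authors argue that a more direct approach is to compute the loss in the space of Y. So they use the learned reconstruction matrix W to upsample F and compute the error between the decompressed F and Y, rather than compressing Y:

L(F, Y) = Loss(softmax(DUpsample(F)), Y)    (5)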

The DUpsample(F) process is shown in the figure below:

[Figure 3: the DUpsampling process]

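With this linear transformation, DUpsample(F) linearly upsamples each feature vector f by computing Wf. Compared with Eq. 1, bilinear upsampling has been replaced by data-dependent upsampling, whose transformation matrix is learned from the ground-truth labels. This upsampling is equivalent to applying a 1×1 convolution along the spatial dimensions, with the convolution kernels stored in W. The decompression process is illustrated in Figure 3 above.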
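The "Wf at every location, then reshape" idea can be sketched in NumPy as below. This is only a sketch under assumptions: the function name `dupsample`, the upsampling ratio `r` and the channel layout of each window are illustrative choices, and W is random here, standing in for a learned reconstruction matrix.

```python
import numpy as np

def dupsample(F, W, r, num_classes):
    """F: (h, w, k) low-resolution features; W: (r*r*num_classes, k).
    Returns (h*r, w*r, num_classes) class scores.
    (Hypothetical helper; names and layout are assumptions.)"""
    h, w, k = F.shape
    # The same W is applied at every spatial location, which is exactly
    # what a 1x1 convolution computes.
    out = F @ W.T                                  # (h, w, r*r*C)
    out = out.reshape(h, w, r, r, num_classes)     # un-flatten each window
    out = out.transpose(0, 2, 1, 3, 4)             # (h, r, w, r, C)
    return out.reshape(h * r, w * r, num_classes)  # stitch windows together

rng = np.random.default_rng(0)
r, C, k = 2, 5, 4
F = rng.standard_normal((2, 3, k))
W = rng.standard_normal((r * r * C, k))   # placeholder for a learned W
up = dupsample(F, W, r, C)
print(up.shape)  # (4, 6, 5)
```

Each feature vector F[i, j] expands into the r × r block of the output whose top-left corner is at (i*r, j*r).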

2.2 Incorporating DUpsampling with Adaptive-temperature Softmax:

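So far we have described how DUpsampling replaces bilinear upsampling in the decoder; next comes how to incorporate it into the encoder-decoder framework. Although DUpsampling can be implemented as a 1×1 convolution, plugging it directly into the framework runs into an optimization problem: convergence is very slow. To address this, the authors use a softmax function with temperature, that is, they add a temperature T to the original softmax to soften or sharpen its activation:

softmax(z_i) = exp(z_i / T) / Σ_j exp(z_j / T)    (6)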

The authors find that T can be learned automatically through backpropagation, with no manual tuning required.
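A small NumPy sketch shows the effect of the temperature. Here T is a fixed argument chosen by hand purely for illustration; learning T by backpropagation, as described above, would require an autodiff framework and is not shown. The function name and the example logits are assumptions.

```python
import numpy as np

def softmax_with_temperature(z, T=1.0):
    """Softmax over the last axis of z / T (T is fixed in this sketch)."""
    z = np.asarray(z, dtype=np.float64) / T
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

z = [2.0, 1.0, 0.0]
p_sharp = softmax_with_temperature(z, T=0.5)   # T < 1 sharpens
p_plain = softmax_with_temperature(z, T=1.0)   # ordinary softmax
p_soft = softmax_with_temperature(z, T=2.0)    # T > 1 softens
print(p_sharp, p_plain, p_soft)
```

A smaller T concentrates more probability on the largest logit, and a larger T spreads it out.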

2.3 Flexible Aggregation of Convolutional Features:

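The original feature aggregation scheme can be written as follows, where F_last is the last-layer encoder feature, F_i is a low-level feature, and f is a CNN:

F = f(concat(F_i, upsample(F_last)))    (7)

Its main problems are: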

1). f is applied after upsampling. Since f is a CNN, whose amount of computation depends on the spatial size of inputs, this arrangement would render the decoder inefficient computationally. Moreover, the computational overhead prevents the decoder from exploiting features at a very low level.

2). The resolution of fused low-level features Fi is equivalent to that of F, which is typically around 1/4 resolution of the final prediction due to the incapable bilinear used to produce the final pixel-wise prediction. In order to obtain high-resolution prediction, the decoder can only choose the feature aggregation with high-resolution low-level features.