Paper Notes: Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation

Date: 2019-11-18

Originally posted: 2019-03-18 14:45:44

Paper: https://arxiv.org/pdf/1901.02985

Official TensorFlow code: https://github.com/tensorflow/models/blob/master/research/deeplab/core/nas_network.py

PyTorch code: https://github.com/Dawars/auto_deeplab-pytorch

Video tutorial (Korean): https://www.youtube.com/watch?v=ltlhQXHGzgE

Author homepage (Liang-Chieh Chen): http://liangchiehchen.com/

Another work on NAS for semantic segmentation: Nekrasov, Vladimir, Hao Chen, Chunhua Shen, and Ian Reid. "Fast Neural Architecture Search of Compact Semantic Segmentation Models via Auxiliary Cells." arXiv preprint arXiv:1810.10804 (2018).

This paper is among the first to bring Neural Architecture Search (NAS) to semantic segmentation: the network architecture used for segmentation is searched automatically.

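3. Architecture Search Space:

This section describes the two-level hierarchical architecture search space. For the inner cell level, the authors reuse the design from prior work to stay consistent with it. For the outer network level, they propose a new search space based on observing and summarizing many popular network designs.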

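3.1 Cell Level Search Space:

The authors define a cell as a small fully convolutional module, typically repeated many times to form the whole neural network. Specifically, a cell is a directed acyclic graph consisting of B blocks.

Each block is a two-branch structure that maps 2 input tensors to 1 output tensor. Block i in cell l can be specified by a 5-tuple $(I_1, I_2, O_1, O_2, C)$, where $I_1, I_2$ are the selections of input tensors, $O_1, O_2$ are the selections of layer types applied to those inputs, and $C$ is the method used to combine the individual outputs of the two branches into the block's output tensor $H_i^l$. The cell's output tensor $H^l$ is simply the concatenation of the blocks' output tensors $H_1^l, ..., H_B^l$.

The set of possible input tensors, $I_i^l$, consists of the output of the previous cell $H^{l-1}$, the output of the previous-previous cell $H^{l-2}$, and the outputs of the previous blocks in the current cell $\{H_1^l, ..., H_i^l\}$. Therefore, the more blocks are added to a cell, the more input sources the next block can choose from.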

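The set of possible layer types, $O$, consists of the following 8 operators, all prevalent in modern CNNs: 3×3 depthwise-separable convolution, 5×5 depthwise-separable convolution, 3×3 atrous convolution with rate 2, 5×5 atrous convolution with rate 2, 3×3 average pooling, 3×3 max pooling, skip connection, and no connection (zero).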

For the set of possible combination operators, $C$, the authors use only element-wise addition.
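The block definition above can be written down as a small data structure. This is only an illustrative sketch: the `Block` class, the string labels, and the `candidate_inputs` helper are hypothetical names, and the sketch assumes that block i reads only from the two preceding cells and from blocks 1 to i-1 of the current cell, so that the graph stays acyclic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """One two-branch block, i.e. the 5-tuple (I1, I2, O1, O2, C)."""
    input1: str
    input2: str
    op1: str
    op2: str
    combine: str = "add"  # the notes state that only element-wise addition is used


def candidate_inputs(block_index):
    """Tensors that block `block_index` (1-based) may select as inputs.

    Assumption: outputs of the two previous cells plus outputs of the
    earlier blocks in the same cell.
    """
    previous_cells = ["H^{l-1}", "H^{l-2}"]
    earlier_blocks = [f"H_{j}^l" for j in range(1, block_index)]
    return previous_cells + earlier_blocks


# The pool of inputs grows by one with every block added to the cell.
for i in range(1, 6):
    print(i, candidate_inputs(i))
```

Under this assumption block i has i + 1 candidate inputs, which is the sense in which later blocks have more input sources.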

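3.2 Network Level Search Space:

In NAS frameworks for image classification, once a cell structure is found, the entire network is built from it using a predefined pattern. The network level is therefore not part of the architecture search, and its search space has never been explored.

This predefined pattern is simple and straightforward: a number of "normal cells" (cells that keep the spatial resolution of the feature tensor) are separated equally by inserting "reduction cells" (cells that divide the spatial resolution by 2 and multiply the number of filters by 2). This keep-downsampling strategy is reasonable for image classification. In dense image prediction, however, keeping high spatial resolution is just as important, and as a result there are many more network-level variations.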

Among the many network architectures for dense image prediction, the authors notice two principles that hold consistently:

1. The spatial resolution of the next layer is either twice as large, or twice as small, or remains the same.

2. The smallest spatial resolution is downsampled by 32.

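Following these common practices, the authors propose the following network-level search space. The network begins with a two-layer "stem" structure, each layer of which reduces the spatial resolution by a factor of 2. After that come a total of L layers with unknown spatial resolutions, the largest being downsampled by 4 and the smallest being downsampled by 32. Since consecutive layers may differ in spatial resolution by at most a factor of 2, the first layer after the stem can only be downsampled by 4 or 8. Figure 1 illustrates this network-level search space. The goal is to find a good path through this L-layer trellis.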
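The two rules can be checked with a few lines of code. The sketch below is only an illustration of the trellis described here; the function names are hypothetical, and it assumes the stem output sits at downsample rate 4.

```python
STRIDES = (4, 8, 16, 32)  # downsample rates allowed after the stem


def next_strides(stride):
    """Downsample rates reachable in one layer: halve, keep, or double."""
    return [s for s in (stride // 2, stride, stride * 2) if s in STRIDES]


def reachable_strides(layer, stem_stride=4):
    """Downsample rates that layer `layer` (1-based) can take."""
    current = {stem_stride}
    for _ in range(layer):
        current = {t for s in current for t in next_strides(s)}
    return sorted(current)


for layer in (1, 2, 3):
    print(layer, reachable_strides(layer))
```

The first layer can only be at rate 4 or 8, and every one of the four rates becomes reachable from the third layer onward.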

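Figure 2 shows that the proposed search space is general enough to cover many popular network designs. In future work, the authors plan to relax the search space further so that it also includes U-Net architectures.

Because the paper searches the network-level architecture in addition to the cell-level architecture, the search task is both more challenging and more general than in prior work.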

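4. Methods:

The authors first introduce a continuous relaxation of the discrete architectures, then describe how to perform the architecture search via optimization, and finally explain how to decode a discrete architecture once the search has finished.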

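4.1 Continuous Relaxation of Architecture:

4.1.1 Cell Architecture:

The authors reuse the continuous relaxation proposed in prior work. The output tensor $H_i^l$ of every block is connected to all hidden states in $I_i^l$:

$$H_i^l = \sum_{H_j^l \in I_i^l} O_{j \to i}(H_j^l) \qquad (1)$$

In addition, each $O_{j \to i}$ is approximated by its continuous relaxation $\hat{O}_{j \to i}$, defined as:

$$\hat{O}_{j \to i}(H_j^l) = \sum_{O^k \in O} \alpha_{j \to i}^k \, O^k(H_j^l) \qquad (2)$$

where

$$\sum_{k=1}^{|O|} \alpha_{j \to i}^k = 1 \quad \forall i, j \qquad (3)$$

$$\alpha_{j \to i}^k \ge 0 \quad \forall i, j, k \qquad (4)$$

In other words, the $\alpha_{j \to i}^k$ are normalized scalars associated with each operator $O^k$, which is easy to implement with a softmax.

Recalling Section 3.1, $H^{l-1}$ and $H^{l-2}$ are always among the candidate inputs, and $H^l$ is the concatenation of $H_1^l, ..., H_B^l$. Together with Eqs. (1) and (2), the cell-level update becomes:

$$H^l = \mathrm{Cell}(H^{l-1}, H^{l-2}; \alpha) \qquad (5)$$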
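The relaxation of a single edge can be sketched in a few lines. This is a toy illustration, not the authors' implementation: plain floats stand in for tensors, the three candidate operators are hypothetical, and `alpha_logits` denotes unnormalized scores that a softmax turns into the normalized weights.

```python
import math


def softmax(logits):
    """Normalize raw scores into non-negative weights that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_op(x, ops, alpha_logits):
    """Relaxed edge: softmax-weighted sum of every candidate operator applied to x."""
    weights = softmax(alpha_logits)
    return sum(w * op(x) for w, op in zip(weights, ops))


# Hypothetical candidates acting on a scalar: identity (skip), scaling, and zero.
ops = [lambda x: x, lambda x: 2.0 * x, lambda x: 0.0]
print(mixed_op(3.0, ops, [0.0, 0.0, 0.0]))  # equal weights: the mean of 3, 6 and 0
```

Because the weights come from a softmax, both the sum-to-one and the non-negativity conditions hold by construction, and the weights stay differentiable with respect to the raw scores.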

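4.1.2 Network Architecture:

Within a cell, all tensors have the same spatial size, which is what makes the weighted sums in Eqs. (1) and (2) possible. As Figure 1 shows, however, tensors may have different sizes at the network level. To set up the continuous relaxation, each layer l therefore has at most four hidden states $\{{}^{4}H^l, {}^{8}H^l, {}^{16}H^l, {}^{32}H^l\}$, where the upper-left superscript indicates the spatial resolution.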

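The network-level continuous relaxation is designed to match the search space exactly. A scalar is associated with each gray arrow in Figure 1, and the network-level update is defined as:

$${}^{s}H^{l} = \beta_{s/2 \to s}^{l} \, \mathrm{Cell}({}^{s/2}H^{l-1}, {}^{s}H^{l-2}; \alpha) + \beta_{s \to s}^{l} \, \mathrm{Cell}({}^{s}H^{l-1}, {}^{s}H^{l-2}; \alpha) + \beta_{2s \to s}^{l} \, \mathrm{Cell}({}^{2s}H^{l-1}, {}^{s}H^{l-2}; \alpha) \qquad (6)$$

where s = 4, 8, 16, 32 and l = 1, 2, ..., L. The scalars $\beta$ are normalized such that:

$$\beta_{s \to s/2}^{l} + \beta_{s \to s}^{l} + \beta_{s \to 2s}^{l} = 1 \quad \forall s, l \qquad (7)$$

$$\beta_{s \to s/2}^{l} \ge 0, \quad \beta_{s \to s}^{l} \ge 0, \quad \beta_{s \to 2s}^{l} \ge 0 \quad \forall s, l \qquad (8)$$

This normalization is also implemented with a softmax.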
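As a rough illustration of the network-level update, the sketch below normalizes the outgoing arrows of one node with a softmax and then forms the weighted sum arriving at one resolution. Scalars stand in for the cell outputs, and both function names and all numbers are hypothetical.

```python
import math


def outgoing_betas(logits_by_target):
    """Softmax over the raw scores of one node's outgoing arrows."""
    peak = max(logits_by_target.values())
    exps = {t: math.exp(v - peak) for t, v in logits_by_target.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}


def network_update(stride, cell_outputs, beta_in):
    """Weighted sum of the cell outputs arriving at `stride` from s/2, s and 2s.

    cell_outputs[src]: scalar stand-in for the cell applied to the state at rate src
    beta_in[src]:      normalized weight of the arrow src -> stride
    """
    sources = [s for s in (stride // 2, stride, stride * 2) if s in cell_outputs]
    return sum(beta_in[s] * cell_outputs[s] for s in sources)


# A node at rate 4 has only two outgoing arrows (stay at 4, or go down to 8).
print(outgoing_betas({4: 0.0, 8: 0.0}))
# State at rate 8 built from the neighbours at rates 4, 8 and 16.
print(network_update(8, {4: 1.0, 8: 2.0, 16: 4.0}, {4: 0.5, 8: 0.25, 16: 0.25}))
```

Nodes on the boundary of the trellis simply have fewer arrows, so the softmax runs over two entries instead of three.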

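Eq. (6) shows how the continuous relaxations of the two-level hierarchy are woven together. In particular, $\beta$ controls the outer network level and therefore depends on both spatial size and layer index. Each scalar in $\beta$ governs an entire set of $\alpha$, whereas $\alpha$ specifies the same architecture regardless of spatial size and layer index.

As Figure 1 shows, Atrous Spatial Pyramid Pooling (ASPP) modules are attached to every spatial resolution at the L-th layer, with atrous rates adjusted accordingly. Their outputs are bilinearly upsampled to the original resolution before being summed to produce the prediction.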

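4.2 Optimization:

The advantage of introducing the continuous relaxation is that the scalars controlling the connection strength between different hidden states are now part of the differentiable computation graph, so they can be optimized efficiently with gradient descent. The authors adopt the first-order approximation and split the training data into two disjoint sets, trainA and trainB. The optimization alternates between two steps:

1. Update the network weights $w$ by $\nabla_w L_{trainA}(w, \alpha, \beta)$.

2. Update the architecture parameters $\alpha, \beta$ by $\nabla_{\alpha, \beta} L_{trainB}(w, \alpha, \beta)$.

Here the loss function $L$ is the cross entropy computed on the semantic segmentation mini-batch.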
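The alternating scheme can be illustrated on a deliberately tiny problem. Everything here is an assumption made for the sake of a runnable sketch, not the authors' code: one scalar weight `w`, one architecture score `alpha` whose sigmoid gates a single operator, a squared-error loss instead of cross entropy, hand-derived gradients, and made-up data following y = 2x.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_and_grads(w, alpha, batch):
    """Mean squared error of p = sigmoid(alpha) * w * x, with its two gradients."""
    a = sigmoid(alpha)
    loss = grad_w = grad_alpha = 0.0
    for x, y in batch:
        err = a * w * x - y
        loss += err ** 2
        grad_w += 2.0 * err * a * x                       # d(loss)/d(w)
        grad_alpha += 2.0 * err * w * x * a * (1.0 - a)   # d(loss)/d(alpha)
    n = len(batch)
    return loss / n, grad_w / n, grad_alpha / n


train_a = [(1.0, 2.0), (2.0, 4.0)]   # split used for the weights
train_b = [(3.0, 6.0), (0.5, 1.0)]   # split used for the architecture score
w, alpha, lr = 0.5, 0.0, 0.05
initial_loss, _, _ = loss_and_grads(w, alpha, train_b)

for _ in range(200):
    # Step 1: update the weights on trainA with the architecture score frozen.
    _, grad_w, _ = loss_and_grads(w, alpha, train_a)
    w -= lr * grad_w
    # Step 2: update the architecture score on trainB with the weights frozen.
    _, _, grad_alpha = loss_and_grads(w, alpha, train_b)
    alpha -= lr * grad_alpha

final_loss, _, _ = loss_and_grads(w, alpha, train_b)
print(initial_loss, final_loss)
```

Each step treats the other group of parameters as constant, which is what the first-order approximation amounts to: no gradient is propagated through the weight update when the architecture parameters are updated.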

4.3 Decoding Discrete Architecture:

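Cell Architecture: To decode the discrete cell architecture, the authors first retain the 2 strongest predecessors for each block, where the strength from hidden state j to hidden state i is $\max_{k, O^k \ne zero} \alpha_{j \to i}^k$ ("zero" meaning no connection), and then choose the most likely operator by taking the argmax.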
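A minimal sketch of this decoding rule for one block follows. The operator names, the predecessor labels, and the weights are hypothetical; the sketch also assumes that the "zero" operator is excluded both when ranking predecessors and when taking the argmax.

```python
OPS = ["conv3x3", "conv5x5", "skip", "zero"]  # hypothetical candidate operators


def decode_block(alpha, keep=2):
    """Pick the `keep` strongest predecessors and the best operator on each.

    alpha[j] holds one normalized weight per entry of OPS for the edge j -> block.
    """
    non_zero = [k for k, name in enumerate(OPS) if name != "zero"]

    def strength(j):
        return max(alpha[j][k] for k in non_zero)

    strongest = sorted(alpha, key=strength, reverse=True)[:keep]
    return [(j, OPS[max(non_zero, key=lambda k: alpha[j][k])]) for j in strongest]


alpha = {
    "H_prev":      [0.10, 0.20, 0.30, 0.40],
    "H_prev_prev": [0.50, 0.10, 0.10, 0.30],
    "H_1":         [0.05, 0.05, 0.10, 0.80],
}
print(decode_block(alpha))
```

Here `H_1` is dropped even though its largest weight is 0.80, because that weight belongs to "zero" and its strongest real operator only reaches 0.10.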

Network Architecture: Eq. (7) essentially states that the "outgoing probability" at each of the blue nodes in Figure 1 sums to 1. In fact, the $\beta$ values can be interpreted as "transition probabilities" between different "states" (spatial resolutions) across different "time steps" (layer numbers). Intuitively, the goal is to find the path with the maximum probability from start to end. This path can be decoded efficiently with the classic Viterbi algorithm.
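The decoding step can be sketched with a textbook Viterbi pass over the resolution trellis. The sketch assumes the path starts at the stem output (downsample rate 4), and the transition matrix below is made-up data reused for every layer purely for illustration.

```python
import math

STRIDES = (4, 8, 16, 32)


def viterbi_path(betas, start_state=0):
    """Most probable sequence of states; betas[l][i][j] is P(state i -> state j) at layer l."""
    num_states = len(STRIDES)
    score = [-math.inf] * num_states
    score[start_state] = 0.0          # log-probability of the start state
    back_pointers = []
    for layer in betas:
        new_score = [-math.inf] * num_states
        pointer = [None] * num_states
        for i in range(num_states):
            if score[i] == -math.inf:
                continue              # state i is not reachable yet
            for j in range(num_states):
                if layer[i][j] <= 0.0:
                    continue          # no arrow from i to j
                candidate = score[i] + math.log(layer[i][j])
                if candidate > new_score[j]:
                    new_score[j], pointer[j] = candidate, i
        back_pointers.append(pointer)
        score = new_score
    # Backtrack from the best final state.
    state = max(range(num_states), key=lambda j: score[j])
    path = [state]
    for pointer in reversed(back_pointers):
        state = pointer[state]
        path.append(state)
    path.reverse()
    return [STRIDES[s] for s in path[1:]]  # drop the start state


T = [
    [0.3, 0.7, 0.0, 0.0],
    [0.2, 0.3, 0.5, 0.0],
    [0.0, 0.3, 0.3, 0.4],
    [0.0, 0.0, 0.5, 0.5],
]
print(viterbi_path([T, T, T]))  # [8, 16, 32], with probability 0.7 * 0.5 * 0.4
```

Working in log space keeps the products of many small probabilities numerically stable, and the cost grows linearly with the number of layers rather than with the number of paths.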

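5. Experimental Results:

This section first gives the implementation details of the architecture search and reports the search results, then presents semantic segmentation results on several benchmark datasets.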

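5.1 Architecture Search Implementation Details:

The authors consider a network of L = 12 layers with B = 5 blocks per cell. The network-level search space has $2.9 \times 10^4$ unique paths, and the number of cell structures is $5.6 \times 10^{14}$, so the joint hierarchical search space is on the order of $10^{19}$.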
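These figures can be roughly reproduced with a short count. The notes do not spell out the counting conventions, so the ones below are assumptions chosen because they match the reported numbers: a path is counted as 11 resolution transitions starting from downsample rate 4, and each block independently picks two inputs from its i + 1 candidates and two operators from 8.

```python
STRIDES = (4, 8, 16, 32)


def count_paths(num_transitions, start_stride=4):
    """Number of trellis paths with the given number of resolution transitions."""
    counts = {s: int(s == start_stride) for s in STRIDES}
    for _ in range(num_transitions):
        counts = {
            s: sum(counts.get(t, 0) for t in (s // 2, s, s * 2))
            for s in STRIDES
        }
    return sum(counts.values())


def count_cells(num_blocks=5, num_ops=8):
    """Number of cell structures under the independent-choice assumption."""
    total = 1
    for i in range(1, num_blocks + 1):
        total *= (i + 1) ** 2 * num_ops ** 2
    return total


print(count_paths(11))                   # 28657, about 2.9e4
print(count_cells())                     # about 5.6e14
print(count_paths(11) * count_cells())   # about 1.6e19
```

Counting 12 transitions from the stem output instead gives 75025 paths, so the reported $2.9 \times 10^4$ appears to correspond to the 11-transition convention.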

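The authors follow the common practice of doubling the number of filters whenever the height and width of the feature tensor are halved. Every blue node in Figure 1 with downsample rate s has $B \times F \times s/4$ output filters, where F is the filter multiplier that controls model capacity; F = 8 during the architecture search. A stride-2 convolution is used for every s/2 -> s connection, both to reduce the spatial size and to double the number of filters. Bilinear upsampling followed by a 1×1 convolution is used for every 2s -> s connection, both to increase the spatial size and to halve the number of filters.

The ASPP module has 5 branches: one 1×1 convolution, three 3×3 convolutions with different atrous rates, and pooled image features. During the search, ASPP is simplified to 3 branches by using only one 3×3 convolution, with atrous rate 96/s. Each ASPP branch still produces $B \times F \times s/4$ filters.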
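For concreteness, the atrous rate 96/s at each of the four resolutions works out as follows. The helper name is hypothetical and the snippet is plain arithmetic on the rule stated above.

```python
STRIDES = (4, 8, 16, 32)


def aspp_atrous_rate(stride, base=96):
    """Atrous rate of the single 3x3 ASPP branch used during search (96 / s)."""
    return base // stride


rates = {s: aspp_atrous_rate(s) for s in STRIDES}
print(rates)  # {4: 24, 8: 12, 16: 6, 32: 3}
```

Since rate × s equals 96 at every resolution, the dilated kernel covers the same extent when measured in input-image pixels, whichever resolution the branch is attached to.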

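The architecture search is run on the Cityscapes dataset for semantic segmentation. Specifically, the authors take random 321×321 crops from half-resolution (512×1024) images. Half of the train_fine images are randomly selected as trainA and the other half serve as trainB. A highlight of the paper is search cost: the whole architecture search takes about 3 days on a single P100 GPU. The authors also tried searching for longer but saw no significant improvement. Figure 4 shows the validation accuracy improving steadily over the course of the search.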

5.2 Semantic Segmentation Results:
