Scene Segmentation: MIT Scene Parsing and DilatedNet (Dilated Convolution Networks)

Date: 2019-12-14



Introduction to the MIT Scene Parsing Benchmark

Scene parsing is to segment and parse an image into different image regions associated with semantic categories, such as sky, road, person, and bed. MIT Scene Parsing Benchmark (SceneParse150) provides a standard training and evaluation platform for scene parsing algorithms. The data for this benchmark comes from the ADE20K Dataset, which contains more than 20K scene-centric images exhaustively annotated with objects and object parts. Specifically, the benchmark is divided into 20K images for training, 2K images for validation, and another batch of held-out images for testing. In total, 150 semantic categories are included for evaluation, covering stuff such as sky, road, and grass, and discrete objects such as person, car, and bed. Note that the distribution of objects occurring in the images is non-uniform, mimicking a more natural object occurrence in daily scenes.

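The scene benchmark contains 150 object categories. These include typically amorphous regions such as wall, water, floor, and road, as well as common indoor objects such as window, table, chair, bed, and cup, covering both attached and free-standing objects. It includes most of the categories in the COCO dataset.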

Homepage: http://sceneparsing.csail.mit.edu/

Pretrained models: http://sceneparsing.csail.mit.edu/model/

Model Zoo: https://github.com/CSAILVision/sceneparsing/wiki/Model-Zoo

Some state-of-the-art results: https://drive.google.com/drive/folders/0B9CKOTmy0DyaQ2oxUHdtYUd2Mm8?usp=sharing

Challenge results: http://placeschallenge.csail.mit.edu/results_challenge.html (Face++ is ranked first for now)

1. FCN and Deconvolution Networks

One use of deconv is upsampling, that is, increasing the image size. Dilated conv, by contrast, does not upsample; it enlarges the receptive field.

Reference: How to understand deconvolution layers in deep learning

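Depending on the stride s, a convolution falls into one of three cases:

(1) s>1: the convolution also downsamples, so the output is smaller than the input;

(2) s=1: an ordinary stride-1 convolution. For example, with padding=SAME in TensorFlow, the input and output have the same spatial size;

(3) 0<s<1: a fractionally strided convolution, which is equivalent to upsampling the image. For example, s=0.5 means inserting one blank (zero) pixel between every pair of adjacent pixels and then convolving with stride 1, so the resulting feature map is roughly twice as large.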
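The three stride cases can be illustrated with a minimal 1D sketch in plain Python. The helper names (`conv1d_same`, `strided_conv1d`, `insert_zeros`, `fractionally_strided_conv1d`) and the interpolation kernel `[0.5, 1.0, 0.5]` are assumptions chosen for illustration, not taken from any library or from the referenced article.

```python
def conv1d_same(x, kernel):
    """Stride-1 1D correlation, zero-padded so the output length equals the input length.
    Assumes an odd-length kernel."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(k)) for i in range(len(x))]


def strided_conv1d(x, kernel, stride):
    """Case s > 1: keep every `stride`-th output, so the signal shrinks."""
    return conv1d_same(x, kernel)[::stride]


def insert_zeros(x):
    """Insert one zero between every pair of adjacent samples (length n -> 2n - 1)."""
    out = []
    for i, v in enumerate(x):
        out.append(v)
        if i < len(x) - 1:
            out.append(0.0)
    return out


def fractionally_strided_conv1d(x, kernel):
    """Case s = 0.5: zero insertion followed by an ordinary stride-1 convolution."""
    return conv1d_same(insert_zeros(x), kernel)


x = [1, 2, 3, 4]
kernel = [0.5, 1.0, 0.5]  # hypothetical kernel; it happens to perform linear interpolation

print(len(strided_conv1d(x, kernel, 2)))       # s = 2: shorter than the input
print(len(conv1d_same(x, kernel)))             # s = 1: same length as the input
print(fractionally_strided_conv1d(x, kernel))  # s = 0.5: roughly twice as long
```

With this particular kernel the inserted zeros are filled with the average of their neighbours, which is why the s=0.5 case behaves like upsampling.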

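Dilated conv, on the other hand, does not pad blank pixels between the existing pixels. Instead, it skips some of the existing pixels. Equivalently, the input is left unchanged and zero weights are inserted between the parameters of the conv kernel, so that a single convolution sees a larger spatial extent.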
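The two equivalent views (skipping input pixels versus inserting zero weights into the kernel) can be checked with a small 1D sketch. The function names `dilated_conv1d` and `dilate_kernel` are hypothetical helpers, and the sketch uses a "valid" convolution without padding to keep the indexing simple.

```python
def dilated_conv1d(x, kernel, dilation):
    """Valid 1D dilated correlation: kernel taps are `dilation` samples apart."""
    k = len(kernel)
    span = (k - 1) * dilation + 1  # input extent covered by one output value
    return [
        sum(kernel[j] * x[i + j * dilation] for j in range(k))
        for i in range(len(x) - span + 1)
    ]


def dilate_kernel(kernel, dilation):
    """Insert (dilation - 1) zero weights between adjacent kernel weights."""
    out = []
    for j, w in enumerate(kernel):
        out.append(w)
        if j < len(kernel) - 1:
            out.extend([0] * (dilation - 1))
    return out


x = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 2, 3]

# View 1: keep the 3-tap kernel and skip every other input sample.
skipping = dilated_conv1d(x, kernel, dilation=2)

# View 2: keep the input, use a 5-tap kernel with zeros inserted, no dilation.
zero_filled = dilated_conv1d(x, dilate_kernel(kernel, 2), dilation=1)

print(skipping)
print(zero_filled)
```

Both calls produce the same values, while the first touches only three input samples per output. This is why dilation enlarges the field of view without adding parameters.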

2. The So-Called Atrous ("Hole") Convolution

Dilated conv is known in Chinese as 空洞卷積 ("hole" or atrous convolution) or 擴張卷積 (dilated convolution).

Reference: How to understand dilated convolution networks? (the passage below is excerpted from that article)

Reference: Multi-scale context aggregation by dilated convolutions

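Background: in image segmentation, an image is fed into a CNN (a typical network is FCN [3]). Like a conventional CNN, FCN first convolves the image and then pools it, reducing the image size while enlarging the receptive field. Because segmentation prediction is a pixel-wise output, the smaller feature map after pooling has to be upsampled back to the original image size for prediction. Upsampling is usually done with a deconv operation; see the Zhihu answer "How to understand deconvolution networks in deep learning?". The earlier pooling lets each pixel prediction see a larger receptive field. FCN-style segmentation therefore has two key steps: pooling, which shrinks the image and enlarges the receptive field, and upsampling, which restores the image size. Some information is inevitably lost when the image is first shrunk and then enlarged. Is it possible to design a new operation that achieves a large receptive field, and so sees more information, without pooling? The answer is dilated conv.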

Consider the illustration from the original dilated conv paper [4]:

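Panel (a) corresponds to a 3x3 1-dilated conv, which is the same as an ordinary convolution.

Panel (b) corresponds to a 3x3 2-dilated conv. The actual kernel still has 3x3 weights, but with a hole of 1 between them: within the 7x7 image patch shown, only the 9 red points are convolved with the 3x3 kernel and the remaining points are skipped. Equivalently, it can be viewed as a 5x5 kernel in which only those 9 weights are nonzero and all others are 0. Although the kernel has only 3x3 weights, the receptive field grows to 7x7 when this 2-dilated conv follows a 1-dilated conv: each red point is then the output of a 1-dilated convolution with a 3x3 receptive field, so the 1-dilated and 2-dilated layers together reach the extent of a 7x7 conv.

Panel (c) is a 4-dilated conv. Likewise, placed after the 1-dilated and 2-dilated convs, it reaches a 15x15 receptive field.

By comparison, three stacked ordinary 3x3 convolutions with stride 1 only reach a receptive field of (kernel-1)*layer+1=7, which grows linearly with the number of layers, whereas the receptive field of stacked dilated convs with doubling dilation grows exponentially.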
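The receptive-field figures quoted above can be reproduced with a short sketch. It assumes stride-1 layers that all share one kernel size; the function name `receptive_field` is a hypothetical helper, not something defined in the paper.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (along one axis) of stacked stride-1 convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    """
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf


# Dilated stack from the figure: dilations 1, 2, 4 with 3x3 kernels.
print(receptive_field(3, [1]))        # panel (a)
print(receptive_field(3, [1, 2]))     # panel (b)
print(receptive_field(3, [1, 2, 4]))  # panel (c)

# Three ordinary (1-dilated) 3x3 layers for comparison.
print(receptive_field(3, [1, 1, 1]))
```

For dilations 1, 2, 4 this gives 3, 7, and 15, while three ordinary 3x3 layers give 7, matching the (kernel-1)*layer+1 formula.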

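The benefit of dilated conv is that it enlarges the receptive field without the information loss caused by pooling, so every convolution output covers a larger range of information. It applies well to problems where images need global context, or where speech and text need long-range sequence dependencies, for example image segmentation [3], speech synthesis with WaveNet [2], and machine translation with ByteNet [1].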

The network can be thought of as an interpolated version of a network with pooling layers.

Reference: Yu, Fisher, and Vladlen Koltun. "Multi-scale context aggregation by dilated convolutions." arXiv preprint arXiv:1511.07122 (2015).

Some results obtained with the pretrained model:

The pretrained model does not perform very well; one of the top-ranked models from the challenge should be used instead.