The Evolution of CNN Architectures: From LeNet to DenseNet

Date: 2019-11-18

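Convolutional neural networks (CNNs) are among the most popular architectures in deep learning today, and they dominate computer vision in particular. The story begins with LeNet in the 1990s, goes quiet for about a decade in the early 2000s, and picks up again in 2012 with AlexNet. From ZF-Net to VGG, and from GoogLeNet to ResNet and the more recent DenseNet, networks have grown deeper, architectures more complex, and the techniques for countering vanishing gradients during backpropagation more ingenious. With the New Year holiday at hand, this is a good time to summarize the classic CNN architectures and appreciate the clash of ideas among the researchers who shaped them.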

The figure above shows the ILSVRC top-5 error rate by year. The classic networks are introduced below in the order in which they appeared.

This article covers the following classic convolutional neural networks:

LeNet
AlexNet
ZF
VGG
GoogLeNet
ResNet
DenseNet

The Pioneer: LeNet

Highlight: it defined the basic building blocks of a CNN and is the ancestor of all later CNNs.

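LeNet was proposed in 1998 by Yann LeCun, a founding figure of convolutional neural networks, to solve the vision task of handwritten digit recognition. It fixed the basic CNN recipe that is still in use: convolutional layers, pooling layers, and fully connected layers. The LeNet shipped with today's deep learning frameworks is a simplified and modified LeNet-5 (the "-5" indicates five layers). It differs slightly from the original, for example by replacing the activation function with the now-common ReLU.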

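Unlike the conv->pool->ReLU pattern common today, LeNet-5 uses conv1->pool1->conv2->pool2 followed by fully connected layers. What has not changed is that each convolutional layer is immediately followed by a pooling layer.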

Taking the figure above as an example, here is a layer-by-layer walk through the classic LeNet-5:

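The input is a single-channel 28×28 image, which as a tensor is [1,28,28].
The first convolutional layer, conv1, uses 5×5 kernels with stride 1 and 20 filters. The spatial size becomes 28-5+1=24, so the output is [20,24,24].
The first pooling layer, pool1, uses a 2×2 window with stride 2, i.e. non-overlapping max pooling. It halves the spatial size to 12×12, so the output is [20,12,12].
The second convolutional layer, conv2, uses 5×5 kernels with stride 1 and 50 filters. The spatial size becomes 12-5+1=8, so the output is [50,8,8].
The second pooling layer, pool2, uses a 2×2 window with stride 2, again non-overlapping max pooling. It halves the spatial size to 4×4, so the output is [50,4,4].
pool2 is followed by the fully connected layer fc1 with 500 neurons and a ReLU activation.
Then comes fc2 with 10 neurons, producing a 10-dimensional vector for the 10 digit classes. It is fed to softmax, which outputs the class probabilities.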
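
The shape arithmetic in the walkthrough above can be checked with a few lines of plain Python. This is only a sketch: the helper names `conv_out` and `trace_lenet` are made up for illustration, and shapes are written as [channels, height, width] as in the text.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_lenet(size=28):
    """Return (layer name, [channels, height, width]) for each layer above."""
    shapes = [("input", [1, size, size])]
    size = conv_out(size, kernel=5)             # conv1: 5x5, stride 1, 20 filters
    shapes.append(("conv1", [20, size, size]))
    size = conv_out(size, kernel=2, stride=2)   # pool1: 2x2, stride 2
    shapes.append(("pool1", [20, size, size]))
    size = conv_out(size, kernel=5)             # conv2: 5x5, stride 1, 50 filters
    shapes.append(("conv2", [50, size, size]))
    size = conv_out(size, kernel=2, stride=2)   # pool2: 2x2, stride 2
    shapes.append(("pool2", [50, size, size]))
    return shapes


for name, shape in trace_lenet():
    print(name, shape)
```

Flattening the last shape gives 50×4×4 = 800 values, which is what fc1 receives.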

Keras implementation of LeNet:

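from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

def LeNet():
    # Filter and unit counts follow the walkthrough above (20, 50, 500); Keras uses channels-last shapes
    model = Sequential()
    # conv1: 20 filters of 5x5, 28x28x1 -> 24x24x20
    model.add(Conv2D(20,(5,5),strides=(1,1),input_shape=(28,28,1),padding='valid',activation='relu',kernel_initializer='uniform'))
    # pool1: 2x2 max pooling -> 12x12x20
    model.add(MaxPooling2D(pool_size=(2,2)))
    # conv2: 50 filters of 5x5 -> 8x8x50
    model.add(Conv2D(50,(5,5),strides=(1,1),padding='valid',activation='relu',kernel_initializer='uniform'))
    # pool2: 2x2 max pooling -> 4x4x50
    model.add(MaxPooling2D(pool_size=(2,2)))
    model.add(Flatten())
    # fc1 with 500 units, then a 10-way softmax
    model.add(Dense(500,activation='relu'))
    model.add(Dense(10,activation='softmax'))
    return model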

Return of the King: AlexNet

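AlexNet won the 2012 ImageNet competition with an absolute lead of 10.9 percentage points over the runner-up. From then on deep learning and convolutional neural networks rose to fame, and deep learning research mushroomed. AlexNet's arrival can fairly be called the return of the king for CNNs.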

Highlights:

A deeper network
Data augmentation
ReLU
dropout
LRN

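Taking the AlexNet architecture in the figure above as an example: the first five layers are convolutional, the last three are fully connected, and the final softmax outputs 1000 classes. The first layers are examined in detail below.

AlexNet has five convolutional layers and three fully connected layers, considerably more than LeNet, but the overall CNN pipeline is unchanged; the network is simply much deeper.
AlexNet targets 1000-class classification. Input images are fixed at 256×256 with three color channels. To improve generalization and avoid overfitting, the authors randomly crop each 256×256 image to a 3×224×224 patch, which is fed to the network for training.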

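Because training is split across multiple GPUs, the figure shows two identical branches after the first convolutional layer, which speeds up training.
Consider one branch. The first convolutional layer, conv1, uses 11×11 kernels with stride 4 and 48 filters, giving an output of [48,55,55]. The 55 is puzzling and the authors do not explain it: with a 224×224 input, (224-11)/4+1 is not even an integer, let alone 55. The usual reading is that the image is padded to 227×227 before the convolution, so that (227-11)/4+1 = 55. The activations then pass through relu1, leaving two groups of 48×55×55 feature maps.
Next comes max pooling on the [48,55,55] input with a 3×3 window and stride 2, so the spatial size becomes (55-3)/2+1=27 and the output is [48,27,27]. Local response normalization with a local size of 5 is also applied at this stage and does not change the shape, so the first convolutional stage ends with 48×27×27 feature maps per branch. The remaining layers follow the same pattern and are not repeated here.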
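
The 224 versus 227 puzzle can be reproduced directly. The sketch below uses a hypothetical helper, `window_out`, that returns `None` whenever the window does not tile the input exactly, which is what happens with a 224×224 input.

```python
def window_out(size, kernel, stride):
    """Output size of a conv/pool window, or None if it does not fit exactly."""
    span = size - kernel
    if span % stride != 0:
        return None
    return span // stride + 1


conv1_from_224 = window_out(224, kernel=11, stride=4)   # does not divide evenly
conv1_from_227 = window_out(227, kernel=11, stride=4)   # (227-11)/4+1
pool1 = window_out(conv1_from_227, kernel=3, stride=2)  # (55-3)/2+1

print(conv1_from_224, conv1_from_227, pool1)
```

Only the 227×227 input yields the 55×55 feature maps reported for conv1, and overlapping 3×3 pooling with stride 2 then gives 27×27.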

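Training techniques used by AlexNet:

Data augmentation, to improve generalization.
ReLU instead of sigmoid, to speed up SGD convergence.
Dropout: the principle resembles ensemble methods in classical (shallow) machine learning. Neurons in the fully connected layers (the model applies dropout to the first two fully connected layers) are deactivated with some probability, for example 0.5. Deactivated neurons take no part in the forward or backward pass, so roughly half of the neurons are switched off in each training step. At test time all neurons are kept and their outputs are multiplied by 0.5. Dropout effectively mitigates overfitting.
Local Response Normalization (LRN): take one block of the network, say 13×13×256. LRN picks a spatial position, looks through the channels at that position (all 256 values in this simplified description; AlexNet itself normalizes over a small window of neighbouring channels), and normalizes them. The motivation is that, for each position in the 13×13 map, we probably do not want too many highly activated neurons. Later, however, many researchers found that LRN contributes little, and it is rarely used to train networks today.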
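
To make the dropout rule concrete, here is a minimal sketch in plain Python of the scheme described above: units are zeroed at random during training, and at test time every unit is kept and scaled by the keep probability. The function names and the toy activations are assumptions for illustration only.

```python
import random


def dropout_train(activations, drop_prob, rng):
    """Training: zero each activation independently with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else a for a in activations]


def dropout_test(activations, drop_prob):
    """Test: keep every unit and scale by the keep probability."""
    keep_prob = 1.0 - drop_prob
    return [a * keep_prob for a in activations]


rng = random.Random(0)          # fixed seed so the run is repeatable
acts = [1.0] * 10000            # toy activations, all equal to 1

train_mean = sum(dropout_train(acts, 0.5, rng)) / len(acts)
test_mean = sum(dropout_test(acts, 0.5)) / len(acts)
print(train_mean, test_mean)
```

The two means agree closely, which is the point of the test-time scaling: the next layer sees inputs of the same expected magnitude in both modes. Keras's `Dropout` layer uses the equivalent "inverted" form, scaling the kept activations up during training so that nothing needs to change at test time.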

Keras implementation of AlexNet:

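from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

def AlexNet():
    # Single-tower variant: the two GPU branches are merged (conv1 has 96 = 2 x 48 filters) and LRN is omitted
    model = Sequential()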
    model.add(Conv2D(96,(11,11),strides=(4,4),input_shape=(227,227,3),padding='valid',activation='relu',kernel_initializer='uniform'))
    model.add(MaxPooling2D(pool_size=(3,3),strides=(2,2)))
    model.add(Conv2D(256,(5,5),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))
    model.add(MaxPooling2D(pool_size=(3,3),strides=(2,2)))
    model.add(Conv2D(384,(3,3),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))
    model.add(Conv2D(384,(3,3),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))
    model.add(Conv2D(256,(3,3),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))
    model.add(MaxPooling2D(pool_size=(3,3),strides=(2,2)))
    model.add(Flatten())
    model.add(Dense(4096,activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(4096,activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1000,activation='softmax'))
    return model

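Steady Progress: ZF-Net

ZF-Net won the 2013 ImageNet classification task. Its architecture is essentially AlexNet with tuned hyperparameters, yet it performs considerably better. ZF-Net changes the kernel size of the first convolutional layer from 11 to 7 and its stride from 4 to 2, and uses 384, 384, and 256 filters in the third, fourth, and fifth convolutional layers. The 2013 ImageNet competition was a relatively quiet one, and its winner ZF-Net is less famous than the classic architectures of other years.

Keras implementation of ZF-Net: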

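from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

def ZF_Net():
    model = Sequential()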
    model.add(Conv2D(96,(7,7),strides=(2,2),input_shape=(224,224,3),padding='valid',activation='relu',kernel_initializer='uniform'))  
    model.add(MaxPooling2D(pool_size=(3,3),strides=(2,2)))  
    model.add(Conv2D(256,(5,5),strides=(2,2),padding='same',activation='relu',kernel_initializer='uniform'))  
    model.add(MaxPooling2D(pool_size=(3,3),strides=(2,2)))  
    model.add(Conv2D(384,(3,3),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))  
    model.add(Conv2D(384,(3,3),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))  
    model.add(Conv2D(256,(3,3),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))  
    model.add(MaxPooling2D(pool_size=(3,3),strides=(2,2)))  
    model.add(Flatten())  
    model.add(Dense(4096,activation='relu'))  
    model.add(Dropout(0.5))  
    model.add(Dense(4096,activation='relu'))  
    model.add(Dropout(0.5))  
    model.add(Dense(1000,activation='softmax'))  
    return model

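Going Deeper: VGG-Nets

VGG-Nets were proposed by the Visual Geometry Group (VGG) at the University of Oxford and were the base networks behind the first place in localization and the second place in classification at the 2014 ImageNet competition. VGG can be seen as a deeper version of AlexNet: both consist of conv layers followed by FC layers. At the time this was a very deep network, with well over ten layers, as the paper's title suggests ("Very Deep Convolutional Networks for Large-Scale Image Recognition"). By today's standards, of course, VGG hardly counts as very deep.

The table above describes the VGG-Net configurations and how they were derived. To cope with weight initialization problems, VGG uses a form of pre-training that is common in classical neural networks: train a shallower network first and, once it is stable, use it as the basis for progressively deeper ones. Table 1 reflects this process from left to right. The network performs best at stage D, and the stage D network is VGG-16; the stage E network is VGG-19. The 16 in VGG-16 counts the conv and fc layers only and excludes the max pooling layers.

The figure below shows the VGG-16 architecture.

As the figure shows, VGG-16 has a very tidy structure and is much deeper than AlexNet. It is built from repeated conv->conv->max_pool groups. All VGG convolutions use 'same' padding, so the output of each convolution has the same spatial size as its input, and downsampling is done entirely by max pooling.

The convolutional part is followed by three fully connected layers. The number of filters (the output channels of each convolution) starts at 64 and doubles after each pooling layer, to 128, 256, and 512. VGG's main contributions are the use of small filters and a regular pattern of convolution and pooling.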

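Highlights:

Convolutional layers use smaller filter sizes and strides

Compared with AlexNet, the kernels in VGG-Nets are small. AlexNet's first convolutional layer uses 11×11 kernels, which is very large, whereas VGG-Nets use only small 1×1 and 3×3 kernels, stacked to replace larger filters.

Advantages of 3×3 kernels:

A stack of 3×3 convolutional layers contains more non-linearities than a single layer with a large filter, which makes the decision function more discriminative.
A stack of 3×3 convolutional layers also has fewer parameters than a single large filter. Suppose the input and the output both have C channels. Three 3×3 layers have 3×(3×3×C×C)=27C² parameters, while a single 7×7 layer has 7×7×C×C=49C². Three 3×3 filters can therefore be viewed as a factorization of one 7×7 filter, with non-linearities in between.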
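
A quick numerical check of the 27C² versus 49C² comparison, together with the receptive field that justifies treating three 3×3 layers as a stand-in for one 7×7 layer. This is a sketch: the helper names are illustrative, biases are ignored, and C = 64 is an arbitrary choice.

```python
def conv_params(kernel, in_channels, out_channels):
    """Number of weights in one convolutional layer, ignoring biases."""
    return kernel * kernel * in_channels * out_channels


def stacked_receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1 convolutions."""
    return 1 + layers * (kernel - 1)


C = 64
three_3x3 = 3 * conv_params(3, C, C)   # 27 * C * C
one_7x7 = conv_params(7, C, C)         # 49 * C * C

print(three_3x3, one_7x7, stacked_receptive_field(3, 3))
```

The stacked version covers the same 7×7 input window with 27/49, a little over half, of the weights.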

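Advantages of 1×1 kernels:

A 1×1 convolution applies a linear transformation across the channels without changing the input and output dimensions. The ReLU that follows adds non-linearity, increasing the expressive power of the network.

Keras implementation of VGG-16: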

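from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

def VGG_16():
    model = Sequential()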
    
    model.add(Conv2D(64,(3,3),strides=(1,1),input_shape=(224,224,3),padding='same',activation='relu',kernel_initializer='uniform'))
    model.add(Conv2D(64,(3,3),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))
    model.add(MaxPooling2D(pool_size=(2,2)))
    
    model.add(Conv2D(128,(3,3),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))
    model.add(Conv2D(128,(3,3),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))
    model.add(MaxPooling2D(pool_size=(2,2)))
    
    model.add(Conv2D(256,(3,3),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))
    model.add(Conv2D(256,(3,3),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))
    model.add(Conv2D(256,(3,3),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))
    model.add(MaxPooling2D(pool_size=(2,2)))
    
    model.add(Conv2D(512,(3,3),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))
    model.add(Conv2D(512,(3,3),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))
    model.add(Conv2D(512,(3,3),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))
    model.add(MaxPooling2D(pool_size=(2,2)))
    
    model.add(Conv2D(512,(3,3),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))
    model.add(Conv2D(512,(3,3),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))
    model.add(Conv2D(512,(3,3),strides=(1,1),padding='same',activation='relu',kernel_initializer='uniform'))
    model.add(MaxPooling2D(pool_size=(2,2)))
    
    model.add(Flatten())
    model.add(Dense(4096,activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(4096,activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1000,activation='softmax'))
    
    return model

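Making Waves: GoogLeNet

GoogLeNet beat VGG-Nets to win the 2014 ImageNet classification task, which speaks for its strength. Unlike AlexNet and VGG-Nets, which improve performance mainly by making the network deeper, GoogLeNet takes a different path: while going deeper (22 layers), it also innovates on the structure, replacing the plain convolution-plus-activation operation with the Inception module (an idea first proposed in Network in Network). GoogLeNet pushed CNN research to a new level.

Highlights:

The Inception module
Auxiliary loss units at intermediate layers
The final fully connected layers are replaced by simple global average pooling

The structure in the figure above is the Inception module. All convolutions in it use stride 1, and zero padding keeps the feature maps the same size. Every convolution is immediately followed by a ReLU. Before the output there is a concatenate layer, which stacks the four groups of feature maps, different in type but equal in spatial size, along the channel dimension to form a new feature map. The Inception module does two main things: 1. It extracts features from the input in four ways: 3×3 pooling, and convolutions at three scales, 1×1, 3×3, and 5×5. 2. To reduce computation, and to pass information through fewer connections for a sparser structure, it uses 1×1 convolutions for dimensionality reduction.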

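The role of the 1×1 kernel deserves a closer look: how exactly does it reduce dimensionality? In the figures below, Figure 1 shows a convolution with a 3×3 kernel and Figure 2 a convolution with a 1×1 kernel. For a single-channel input a 1×1 convolution indeed cannot reduce dimensionality, but for a multi-channel input things are different. Suppose there are 256 input feature maps and 256 output feature maps, and suppose the Inception layer performs only 3×3 convolutions. That means 256×256×3×3 = 589,824 multiply-accumulate (MAC) operations per output position, roughly 590,000. This may exceed the compute budget of, say, 0.5 ms for running this layer on a Google server. Instead, we reduce the number of feature maps to be convolved, for example to 64 (256/4). We first apply a 1×1 convolution from 256 to 64 channels, then a 3×3 convolution on 64 channels in each Inception branch, and finally a 1×1 convolution from 64 back to 256 channels.

256×64×1×1 = 16,384
64×64×3×3 = 36,864
64×256×1×1 = 16,384

The total is now about 70,000 (16,384+36,864+16,384 = 69,632), compared with about 590,000 before, a reduction of roughly 8.5 times. This is how small kernels achieve dimensionality reduction.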
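
The counts above can be reproduced exactly with a short script. It is a sketch that counts multiply-accumulate operations per output position only, using a hypothetical helper `macs_per_position`, with the same 256 -> 64 -> 64 -> 256 channel layout as in the text.

```python
def macs_per_position(kernel, in_channels, out_channels):
    """Multiply-accumulate operations needed for one output position."""
    return kernel * kernel * in_channels * out_channels


# Direct 3x3 convolution, 256 -> 256 channels
direct = macs_per_position(3, 256, 256)

# 1x1 reduce (256 -> 64), 3x3 (64 -> 64), 1x1 expand (64 -> 256)
bottleneck = (macs_per_position(1, 256, 64)
              + macs_per_position(3, 64, 64)
              + macs_per_position(1, 64, 256))

print(direct, bottleneck, round(direct / bottleneck, 2))
```

Multiplying both numbers by the number of output positions (height × width) gives the cost for a whole feature map; the ratio between the two designs stays the same.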

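Now consider another question: why must it be a 1×1 kernel; wouldn't 3×3 do as well? Take an input of shape [50,200,200]. Twenty 1×1 kernels produce an output of [20,200,200]. Twenty 3×3 kernels (with 'same' padding) produce the same [20,200,200] output, so why use 1×1? Counting parameters settles it: the 1×1 layer has 50×20×(1×1) = 1,000 weights, while the 3×3 layer has 50×20×(3×3) = 9,000. The 1×1 layer needs only one ninth of the parameters of the 3×3 layer, and the same ratio holds for the number of multiply-accumulate operations, which is why 1×1 kernels are used.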

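GoogLeNet has three loss units, a design meant to help the network converge. Auxiliary loss units are added at intermediate layers so that lower-level features are also trained to be discriminative, which makes the network easier to train. In the paper, the two auxiliary losses are multiplied by 0.3 and added to the final loss to form the total training loss.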
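
The weighting rule amounts to a one-line formula. The sketch below assumes the three losses have already been computed as plain numbers; the function name and the example values are hypothetical.

```python
def googlenet_training_loss(main_loss, aux_losses, aux_weight=0.3):
    """Total training loss: main loss plus the down-weighted auxiliary losses."""
    return main_loss + aux_weight * sum(aux_losses)


# One main classifier and two auxiliary classifiers (example values)
total = googlenet_training_loss(1.0, [2.0, 3.0])
print(total)
```

The auxiliary classifiers matter only during training; at inference time they are discarded and only the main classifier's output is used.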

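Another highlight is that the final fully connected layers are replaced by simple global average pooling, which greatly reduces the number of parameters. In AlexNet, the last three fully connected layers account for roughly 90% of all parameters. The width and depth of GoogLeNet allow it to drop the fully connected layers without hurting accuracy: it reaches 93.3% top-5 accuracy on ImageNet and is faster than VGG.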
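
A rough illustration of what global average pooling does and why it saves parameters. The sketch uses nested Python lists instead of tensors, and the [1024,7,7] feature map with 1000 classes is an assumed example size, not a measurement.

```python
def global_average_pool(feature_maps):
    """Average each channel's H x W map down to a single number."""
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps]


# Two 2x2 channels collapse to a 2-element vector
pooled = global_average_pool([[[1.0, 2.0], [3.0, 4.0]],
                              [[0.0, 0.0], [0.0, 8.0]]])

# Classifier weights for a [1024,7,7] feature map and 1000 classes
fc_on_flattened = 1024 * 7 * 7 * 1000   # flatten, then a dense layer
fc_after_gap = 1024 * 1000              # global average pooling, then a dense layer

print(pooled, fc_on_flattened, fc_after_gap)
```

Pooling first shrinks the classifier by a factor of 49 in this example, because the dense layer no longer needs a separate weight for every spatial position.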

Keras implementation of GoogLeNet:

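from keras.models import Model
from keras.layers import Input, Conv2D, BatchNormalization, MaxPooling2D, AveragePooling2D, Dropout, Flatten, Dense, concatenate

# Simplified version: the auxiliary classifiers are omitted and every Inception branch uses the same filter count
def Conv2d_BN(x, nb_filter,kernel_size, padding='same',strides=(1,1),name=None):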
    if name is not None:
        bn_name = name + '_bn'
        conv_name = name + '_conv'
    else:
        bn_name = None
        conv_name = None

    x = Conv2D(nb_filter,kernel_size,padding=padding,strides=strides,activation='relu',name=conv_name)(x)
    x = BatchNormalization(axis=3,name=bn_name)(x)
    return x

def Inception(x,nb_filter):
    branch1x1 = Conv2d_BN(x,nb_filter,(1,1), padding='same',strides=(1,1),name=None)

    branch3x3 = Conv2d_BN(x,nb_filter,(1,1), padding='same',strides=(1,1),name=None)
    branch3x3 = Conv2d_BN(branch3x3,nb_filter,(3,3), padding='same',strides=(1,1),name=None)

    branch5x5 = Conv2d_BN(x,nb_filter,(1,1), padding='same',strides=(1,1),name=None)
    branch5x5 = Conv2d_BN(branch5x5,nb_filter,(5,5), padding='same',strides=(1,1),name=None)

    branchpool = MaxPooling2D(pool_size=(3,3),strides=(1,1),padding='same')(x)
    branchpool = Conv2d_BN(branchpool,nb_filter,(1,1),padding='same',strides=(1,1),name=None)

    x = concatenate([branch1x1,branch3x3,branch5x5,branchpool],axis=3)

    return x

def GoogLeNet():
    inpt = Input(shape=(224,224,3))
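    # padding='same' zero-pads the input so that the output size is ceil(input_size/stride); ZeroPadding2D((3,3)) followed by padding='valid' is an alternative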
    x = Conv2d_BN(inpt,64,(7,7),strides=(2,2),padding='same')
    x = MaxPooling2D(pool_size=(3,3),strides=(2,2),padding='same')(x)
    x = Conv2d_BN(x,192,(3,3),strides=(1,1),padding='same')
    x = MaxPooling2D(pool_size=(3,3),strides=(2,2),padding='same')(x)
    x = Inception(x,64)#256
    x = Inception(x,120)#480
    x = MaxPooling2D(pool_size=(3,3),strides=(2,2),padding='same')(x)
    x = Inception(x,128)#512
    x = Inception(x,128)
    x = Inception(x,128)
    x = Inception(x,132)#528
    x = Inception(x,208)#832
    x = MaxPooling2D(pool_size=(3,3),strides=(2,2),padding='same')(x)
    x = Inception(x,208)
    x = Inception(x,256)#1024
    x = AveragePooling2D(pool_size=(7,7),strides=(7,7),padding='same')(x)
    x = Flatten()(x)  # 1x1x1024 -> 1024, so the Dense layers receive a vector
    x = Dropout(0.4)(x)
    x = Dense(1000,activation='relu')(x)
    x = Dense(1000,activation='softmax')(x)
    model = Model(inpt,x,name='inception')
    return model

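A Milestone Innovation: ResNet

In 2015 ResNet, introduced by Kaiming He and colleagues, swept ILSVRC and COCO and took first place. ResNet makes a major structural innovation instead of simply stacking more layers, and its new approach to convolutional networks is a milestone in the history of deep learning.

Highlights:

Very deep, with more than a hundred layers
Residual units to address the degradation problem

From the networks above one would expect accuracy to rise as depth increases, provided overfitting is kept in check. One problem with greater depth concerns the signal used to update the parameters: gradients propagate from the output back towards the input, so in a deeper network the gradients reaching the early layers become very small. Those layers essentially stop learning, which is the vanishing gradient problem. The second problem concerns optimization: a deeper network has a larger parameter space and is harder to optimize, so simply adding layers can lead to higher training error. The deep network still converges, but it degrades: adding layers produces a larger error. In the figure below, for example, a 56-layer network performs worse than a 20-layer one. This is not overfitting, because the error on the training set itself is still high; it is the troublesome degradation problem. ResNet introduces a residual module that makes it possible to train much deeper networks.

The residual unit is analyzed in detail here to bring out the essence of ResNet.

As the figure below shows, the data follows two paths: the regular path, and a shortcut that implements an identity mapping through a direct connection, somewhat like a short circuit in an electrical circuit. Experiments show that this shortcut structure handles the degradation problem well. Write the input-output relation of a module as y=H(x). Learning H(x) directly by gradient methods runs into the degradation problem described above. With a shortcut, the trainable part no longer has to fit H(x): if F(x) denotes the part to be optimized, then H(x)=F(x)+x, that is, F(x)=H(x)-x. Under the identity-mapping assumption, y=x plays the role of the observed value, so F(x) corresponds to the residual, hence the name residual network. Why do this? Because the authors argue that learning the residual F(x) is easier than learning H(x) directly. The network only has to learn the difference between output and input, turning an absolute quantity into a relative one (H(x)-x is how much the output changes relative to the input), which is much easier to optimize.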
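
The relation H(x)=F(x)+x can be illustrated with a toy residual block on plain Python lists. This is a simplified sketch, not the layer used in ResNet: F is assumed to be two small linear maps with a ReLU in between, and the ReLU that ResNet applies after the addition is left out.

```python
def linear(weights, x):
    """Matrix-vector product, with weights given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def relu(v):
    return [max(0.0, a) for a in v]


def residual_block(x, w1, w2):
    """y = F(x) + x, with F(x) = W2 * relu(W1 * x)."""
    f = linear(w2, relu(linear(w1, x)))
    return [fi + xi for fi, xi in zip(f, x)]


x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]

# With all weights at zero, F(x) = 0 and the block is exactly the identity
print(residual_block(x, zeros, zeros))
```

This is the intuition behind the design: a block that should do nothing only has to push F towards zero, rather than learn an identity mapping through a stack of non-linear layers.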

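Since the dimensions of x and F(x) may not match, they have to be aligned. The paper adopts two ways of doing this (it actually compares three, but the third, which uses projections on every shortcut, brings only a marginal gain at extra cost and is therefore not adopted):

zero_padding: the identity shortcut is padded with zeros to fill the extra dimensions. This adds no parameters.
projection: a 1×1 convolution on the shortcut increases the dimensions. This adds parameters.

The figure below shows two forms of residual module. On the left is the regular residual module, made of two 3×3 convolutions; as networks get deeper, this structure becomes less practical. The "bottleneck residual block" on the right addresses this: it stacks three convolutional layers, 1×1, 3×3, and 1×1, in that order. The 1×1 convolutions reduce and then restore the number of channels, so the 3×3 convolution operates on a lower-dimensional input, which improves computational efficiency.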
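
To see how much the bottleneck saves, the sketch below counts weights for both designs at the same input and output width. The 256-channel width and the 64-channel bottleneck are assumed example sizes, biases and batch normalization are ignored, and the helper name is illustrative.

```python
def conv_weights(kernel, in_channels, out_channels):
    """Number of weights in one convolutional layer, ignoring biases."""
    return kernel * kernel * in_channels * out_channels


# Regular block: two 3x3 convolutions at 256 channels
regular = 2 * conv_weights(3, 256, 256)

# Bottleneck block: 1x1 (256 -> 64), 3x3 (64 -> 64), 1x1 (64 -> 256)
bottleneck_block = (conv_weights(1, 256, 64)
                    + conv_weights(3, 64, 64)
                    + conv_weights(1, 64, 256))

print(regular, bottleneck_block)
```

At this width the bottleneck uses well under a tenth of the weights of two full 3×3 layers, which is what makes very deep residual networks affordable.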

Keras implementation of ResNet-50:

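from keras.models import Model
from keras.layers import Input, Conv2D, BatchNormalization, MaxPooling2D, AveragePooling2D, ZeroPadding2D, Flatten, Dense, add

def Conv2d_BN(x, nb_filter,kernel_size, strides=(1,1), padding='same',name=None):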
    if name is not None:
        bn_name = name + '_bn'
        conv_name = name + '_conv'
    else:
        bn_name = None
        conv_name = None

    x = Conv2D(nb_filter,kernel_size,padding=padding,strides=strides,activation='relu',name=conv_name)(x)
    x = BatchNormalization(axis=3,name=bn_name)(x)
    return x

def Conv_Block(inpt,nb_filter,kernel_size,strides=(1,1), with_conv_shortcut=False):
    x = Conv2d_BN(inpt,nb_filter=nb_filter[0],kernel_size=(1,1),strides=strides,padding='same')
    x = Conv2d_BN(x, nb_filter=nb_filter[1], kernel_size=(3,3), padding='same')
    x = Conv2d_BN(x, nb_filter=nb_filter[2], kernel_size=(1,1), padding='same')
    if with_conv_shortcut:
        shortcut = Conv2d_BN(inpt,nb_filter=nb_filter[2],strides=strides,kernel_size=kernel_size)
        x = add([x,shortcut])
        return x
    else:
        x = add([x,inpt])
        return x

def ResNet50():
    inpt = Input(shape=(224,224,3))
    x = ZeroPadding2D((3,3))(inpt)
    x = Conv2d_BN(x,nb_filter=64,kernel_size=(7,7),strides=(2,2),padding='valid')
    x = MaxPooling2D(pool_size=(3,3),strides=(2,2),padding='same')(x)
    
    x = Conv_Block(x,nb_filter=[64,64,256],kernel_size=(3,3),strides=(1,1),with_conv_shortcut=True)
    x = Conv_Block(x,nb_filter=[64,64,256],kernel_size=(3,3))
    x = Conv_Block(x,nb_filter=[64,64,256],kernel_size=(3,3))
    
    x = Conv_Block(x,nb_filter=[128,128,512],kernel_size=(3,3),strides=(2,2),with_conv_shortcut=True)
    x = Conv_Block(x,nb_filter=[128,128,512],kernel_size=(3,3))
    x = Conv_Block(x,nb_filter=[128,128,512],kernel_size=(3,3))
    x = Conv_Block(x,nb_filter=[128,128,512],kernel_size=(3,3))
    
    x = Conv_Block(x,nb_filter=[256,256,1024],kernel_size=(3,3),strides=(2,2),with_conv_shortcut=True)
    x = Conv_Block(x,nb_filter=[256,256,1024],kernel_size=(3,3))
    x = Conv_Block(x,nb_filter=[256,256,1024],kernel_size=(3,3))
    x = Conv_Block(x,nb_filter=[256,256,1024],kernel_size=(3,3))
    x = Conv_Block(x,nb_filter=[256,256,1024],kernel_size=(3,3))
    x = Conv_Block(x,nb_filter=[256,256,1024],kernel_size=(3,3))
    
    x = Conv_Block(x,nb_filter=[512,512,2048],kernel_size=(3,3),strides=(2,2),with_conv_shortcut=True)
    x = Conv_Block(x,nb_filter=[512,512,2048],kernel_size=(3,3))
    x = Conv_Block(x,nb_filter=[512,512,2048],kernel_size=(3,3))
    x = AveragePooling2D(pool_size=(7,7))(x)
    x = Flatten()(x)
    x = Dense(1000,activation='softmax')(x)
    
    model = Model(inputs=inpt,outputs=x)
    return model

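Building on the Past: DenseNet

Since ResNet was proposed, variants of it have appeared one after another, each with its own traits and some gain in performance. The last network covered here is DenseNet (Dense Convolutional Network), the CVPR 2017 best paper. It is compared mainly with ResNet and Inception networks; it borrows ideas from them but is a new structure. The architecture is not complicated yet very effective, and it outperforms ResNet across the CIFAR benchmarks. DenseNet can be said to absorb the best part of ResNet and build something more innovative on top of it, pushing performance further.

Highlights:

Dense connections: alleviate vanishing gradients, strengthen feature propagation, encourage feature reuse, and greatly reduce the number of parameters

DenseNet is a convolutional neural network with dense connections. Within a dense block, any two layers are directly connected: the input of each layer is the concatenation of the outputs of all preceding layers, and the feature maps it learns are passed directly to all subsequent layers as input. The figure below sketches a dense block. Each layer inside a block is structured as BN-ReLU-Conv(1×1)-BN-ReLU-Conv(3×3), much like the bottleneck in ResNet, and a DenseNet consists of several such blocks. The layers between dense blocks are called transition layers and consist of BN->Conv(1×1)->averagePooling(2×2).

Don't dense connections bring redundancy? No. The phrase "dense connection" suggests a huge increase in parameters and computation, but DenseNet is in fact more efficient than other networks; the key is the reduced computation per layer and the reuse of features. DenseNet lets the input of layer l directly influence all later layers. The output of layer l is x_l = H_l([x_0, x_1, …, x_{l-1}]), where [x_0, x_1, …, x_{l-1}] is the concatenation of the earlier feature maps along the channel dimension. Because each layer has access to the outputs of all preceding layers, it needs to produce only a few feature maps itself, which is why DenseNet has far fewer parameters than other models. Dense connections effectively link every layer directly to both the input and the loss, which mitigates vanishing gradients, so greater depth is no longer a problem.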
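
Because each layer adds only a fixed number of feature maps (the growth rate) to the running concatenation, the channel count grows linearly rather than exploding. The sketch below tracks channel counts only; the function names are illustrative, the layer counts [6, 12, 24, 16], growth rate 32, and 64 initial channels are the DenseNet-121 settings, and the compression factor of 0.5 is an assumed choice.

```python
def dense_block_inputs(in_channels, num_layers, growth_rate):
    """Input channel count seen by each layer inside one dense block."""
    return [in_channels + i * growth_rate for i in range(num_layers)]


def densenet_channels(block_layers, growth_rate, init_channels, compression):
    """Channel count after each dense block and each transition layer."""
    trace = []
    channels = init_channels
    for i, layers in enumerate(block_layers):
        channels += layers * growth_rate           # every layer appends growth_rate maps
        trace.append(("dense_block", channels))
        if i != len(block_layers) - 1:             # no transition after the last block
            channels = int(channels * compression)
            trace.append(("transition", channels))
    return trace


print(dense_block_inputs(64, 6, 32))
for stage, channels in densenet_channels([6, 12, 24, 16], 32, 64, 0.5):
    print(stage, channels)
```

With a compression factor of 1.0 the transition layers keep all channels and the final count is larger; either way each individual layer still produces only 32 new feature maps.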

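Note that dense connectivity exists only within a dense block; there is no dense connectivity between different dense blocks, as the figure below shows.

There is no free lunch, and networks are no exception. Better convergence at the same depth comes at a cost, and one such cost is DenseNet's enormous memory footprint.

Keras implementation of DenseNet-121: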

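from keras import backend as K
from keras.models import Model
from keras.layers import (Input, Conv2D, ZeroPadding2D, BatchNormalization, Activation, Dropout,
                          Dense, MaxPooling2D, AveragePooling2D, GlobalAveragePooling2D, concatenate)

def DenseNet121(nb_dense_block=4, growth_rate=32, nb_filter=64, reduction=0.0, dropout_rate=0.0, weight_decay=1e-4, classes=1000, weights_path=None):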
    '''Instantiate the DenseNet 121 architecture,
        # Arguments
            nb_dense_block: number of dense blocks to add to end
            growth_rate: number of filters added by each layer inside a dense block
            nb_filter: initial number of filters
            reduction: reduction factor of transition blocks.
            dropout_rate: dropout rate
            weight_decay: weight decay factor
            classes: optional number of classes to classify images
            weights_path: path to pre-trained weights
        # Returns
            A Keras model instance.
    '''
    eps = 1.1e-5

    # compute compression factor
    compression = 1.0 - reduction

    # Handle channel ordering for different backends
    global concat_axis
    if K.image_data_format() == 'channels_last':
      concat_axis = 3
      img_input = Input(shape=(224, 224, 3), name='data')
    else:
      concat_axis = 1
      img_input = Input(shape=(3, 224, 224), name='data')

    # From architecture for ImageNet (Table 1 in the paper)
    nb_filter = 64
    nb_layers = [6,12,24,16] # For DenseNet-121

    # Initial convolution
    x = ZeroPadding2D((3, 3), name='conv1_zeropadding')(img_input)
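    x = Conv2D(nb_filter, (7, 7), strides=(2, 2), name='conv1', use_bias=False)(x)
    # Keras BatchNormalization already learns a per-channel scale and shift, so the separate Scale layer
    # (not a built-in Keras layer) is dropped; weights saved from a model with Scale layers will no longer match
    x = BatchNormalization(epsilon=eps, axis=concat_axis, name='conv1_bn')(x)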
    x = Activation('relu', name='relu1')(x)
    x = ZeroPadding2D((1, 1), name='pool1_zeropadding')(x)
    x = MaxPooling2D((3, 3), strides=(2, 2), name='pool1')(x)

    # Add dense blocks
    for block_idx in range(nb_dense_block - 1):
        stage = block_idx+2
        x, nb_filter = dense_block(x, stage, nb_layers[block_idx], nb_filter, growth_rate, dropout_rate=dropout_rate, weight_decay=weight_decay)

        # Add transition_block
        x = transition_block(x, stage, nb_filter, compression=compression, dropout_rate=dropout_rate, weight_decay=weight_decay)
        nb_filter = int(nb_filter * compression)

    final_stage = stage + 1
    x, nb_filter = dense_block(x, final_stage, nb_layers[-1], nb_filter, growth_rate, dropout_rate=dropout_rate, weight_decay=weight_decay)

    x = BatchNormalization(epsilon=eps, axis=concat_axis, name='conv'+str(final_stage)+'_blk_bn')(x)
    x = Activation('relu', name='relu'+str(final_stage)+'_blk')(x)
    x = GlobalAveragePooling2D(name='pool'+str(final_stage))(x)

    x = Dense(classes, name='fc6')(x)
    x = Activation('softmax', name='prob')(x)

    model = Model(img_input, x, name='densenet')

    if weights_path is not None:
      model.load_weights(weights_path)

    return model


def conv_block(x, stage, branch, nb_filter, dropout_rate=None, weight_decay=1e-4):
    '''Apply BatchNorm, Relu, bottleneck 1x1 Conv2D, 3x3 Conv2D, and option dropout
        # Arguments
            x: input tensor 
            stage: index for dense block
            branch: layer index within each dense block
            nb_filter: number of filters
            dropout_rate: dropout rate
            weight_decay: weight decay factor
    '''
    eps = 1.1e-5
    conv_name_base = 'conv' + str(stage) + '_' + str(branch)
    relu_name_base = 'relu' + str(stage) + '_' + str(branch)

    # 1x1 Convolution (Bottleneck layer)
    inter_channel = nb_filter * 4  
    x = BatchNormalization(epsilon=eps, axis=concat_axis, name=conv_name_base+'_x1_bn')(x)
    x = Activation('relu', name=relu_name_base+'_x1')(x)
    x = Conv2D(inter_channel, (1, 1), name=conv_name_base+'_x1', use_bias=False)(x)

    if dropout_rate:
        x = Dropout(dropout_rate)(x)

    # 3x3 Convolution
    x = BatchNormalization(epsilon=eps, axis=concat_axis, name=conv_name_base+'_x2_bn')(x)
    x = Activation('relu', name=relu_name_base+'_x2')(x)
    x = ZeroPadding2D((1, 1), name=conv_name_base+'_x2_zeropadding')(x)
    x = Conv2D(nb_filter, (3, 3), name=conv_name_base+'_x2', use_bias=False)(x)

    if dropout_rate:
        x = Dropout(dropout_rate)(x)

    return x


def transition_block(x, stage, nb_filter, compression=1.0, dropout_rate=None, weight_decay=1E-4):
    ''' Apply BatchNorm, 1x1 Convolution, averagePooling, optional compression, dropout 
        # Arguments
            x: input tensor
            stage: index for dense block
            nb_filter: number of filters
            compression: calculated as 1 - reduction. Reduces the number of feature maps in the transition block.
            dropout_rate: dropout rate
            weight_decay: weight decay factor
    '''

    eps = 1.1e-5
    conv_name_base = 'conv' + str(stage) + '_blk'
    relu_name_base = 'relu' + str(stage) + '_blk'
    pool_name_base = 'pool' + str(stage) 

    x = BatchNormalization(epsilon=eps, axis=concat_axis, name=conv_name_base+'_bn')(x)
    x = Activation('relu', name=relu_name_base)(x)
    x = Conv2D(int(nb_filter * compression), (1, 1), name=conv_name_base, use_bias=False)(x)

    if dropout_rate:
        x = Dropout(dropout_rate)(x)

    x = AveragePooling2D((2, 2), strides=(2, 2), name=pool_name_base)(x)

    return x


def dense_block(x, stage, nb_layers, nb_filter, growth_rate, dropout_rate=None, weight_decay=1e-4, grow_nb_filters=True):
    ''' Build a dense_block where the output of each conv_block is fed to subsequent ones
        # Arguments
            x: input tensor
            stage: index for dense block
            nb_layers: the number of layers of conv_block to append to the model.
            nb_filter: number of filters
            growth_rate: growth rate
            dropout_rate: dropout rate
            weight_decay: weight decay factor
            grow_nb_filters: flag to decide to allow number of filters to grow
    '''

    eps = 1.1e-5
    concat_feat = x

    for i in range(nb_layers):
        branch = i+1
        x = conv_block(concat_feat, stage, branch, growth_rate, dropout_rate, weight_decay)
        concat_feat = concatenate([concat_feat, x], axis=concat_axis, name='concat_'+str(stage)+'_'+str(branch))

        if grow_nb_filters:
            nb_filter += growth_rate

    return concat_feat, nb_filter
