Deep Learning Fundamentals Series (9) | Dropout vs. Batch Normalization? It's Time to Give Up Dropout

Date: 2019-11-12



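Dropout has been a very popular regularization technique over the past few years and is effective at preventing overfitting. Judging by where deep learning is heading, however, Batch Normalization (BN for short) is gradually replacing Dropout, especially in convolutional layers. This article first introduces how Dropout works and how it is implemented, then looks at how modern deep models use Dropout and compares it with BN experimentally, arguing from both principle and measurement that Dropout is largely a thing of the past and that you should prefer BN wherever possible.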

1. How Dropout works

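According to the Wikipedia definition, dropout refers to dropping out hidden or visible units in a neural network. Typically, during training, a random subset of units is selected at every iteration and temporarily ignored, meaning those units take part in neither the forward pass nor backpropagation.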

The figure above shows a standard neural network; after applying dropout, it becomes the network in the figure below:

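Generally, we reach for regularization techniques such as dropout only when overfitting is likely. When is that? For example, when the network is too deep, training runs too long, or there is not enough data. Why does dropout prevent overfitting effectively? One way to see it: at each training iteration a random subset of units is excluded, so no unit can rely on specific preceding units, and each retains a degree of independence. Equivalently, we are training different networks on the same data; each may overfit, but over many iterations that overfitting tends to cancel out.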

Note that dropout applies only during training. Once training is finished, all units are considered trained, and during validation or testing we use the complete network.
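The training-versus-inference behaviour described above can be sketched in plain Python. This is a minimal illustration, not the article's or Keras's implementation: the function name `dropout`, the flat list of activations, and the "inverted dropout" rescaling of surviving units are assumptions made for the example.

```python
import random


def dropout(activations, rate, training, rng):
    """Inverted dropout on a flat list of activations (illustrative sketch).

    rate is the probability of zeroing a unit and must be below 1.
    """
    if not training or rate == 0.0:
        # Inference: every unit participates and nothing is rescaled.
        return list(activations)
    keep_prob = 1.0 - rate
    # Each unit survives with probability keep_prob; survivors are scaled
    # by 1 / keep_prob so the expected activation stays unchanged.
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)
hidden = [0.5, 1.0, 1.5, 2.0]

# A different random mask is drawn at every training step.
for step in range(3):
    print("train step", step, dropout(hidden, 0.5, training=True, rng=rng))

# At inference time the full set of units is used.
print("inference   ", dropout(hidden, 0.5, training=False, rng=rng))
```

Each training step effectively trains a different thinned network, while the inference call returns the activations untouched.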

2. Implementing Dropout

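Taking Keras as an example, the call is keras.backend.dropout(x, level, noise_shape=None, seed=None), where x is the input tensor and level is the drop probability, that is, the probability that each unit is set to 0 (not the keep probability).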

import tensorflow.keras.backend as K

# 3x3 tensor of values drawn uniformly from [0, 1)
x = K.random_uniform_variable(shape=(3, 3), low=0, high=1)

# The second argument is the probability of dropping a unit
print("dropout with rate 0.5:", K.eval(K.dropout(x, 0.5)))
print("dropout with rate 0.2:", K.eval(K.dropout(x, 0.2)))
print("dropout with rate 0.8:", K.eval(K.dropout(x, 0.8)))

Here is the output:

dropout with rate 0.5: 
[[1.190095  0.        1.2999489]
 [0.        0.3164637 0.       ]
 [0.        0.        0.       ]]
dropout with rate 0.2: 
[[0.74380934 0.67237484 0.81246805]
 [0.8819132  0.19778982 1.2349881 ]
 [1.0369372  0.5945368  0.        ]]
dropout with rate 0.8: 
[[0.        0.        0.       ]
 [0.        0.        4.9399524]
 [4.147749  2.3781471 0.       ]]

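As you can see, the larger the level, the more likely each unit is to be set to 0. Surviving values are scaled by 1/(1 - level), which is why some of the outputs exceed 1.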
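The printed values can be used to sanity-check how level behaves. The sketch below assumes that Keras scales every surviving unit by 1/(1 - level); the helper name `survivor_scale` is hypothetical, and the numbers are copied from the output above. Undoing the scaling should recover the same underlying input value at a given tensor position, whatever level was used.

```python
def survivor_scale(level):
    """Factor applied to units that survive dropout with drop probability `level`."""
    return 1.0 / (1.0 - level)


# (level, printed value) pairs copied from the output above, grouped by position.
observed = {
    "row 0, col 0": [(0.5, 1.190095), (0.2, 0.74380934)],
    "row 1, col 2": [(0.2, 1.2349881), (0.8, 4.9399524)],
}

# Divide each printed value by its scale factor to estimate the original input.
recovered = {
    position: [value / survivor_scale(level) for level, value in pairs]
    for position, pairs in observed.items()
}

for position, values in recovered.items():
    print(position, [round(v, 4) for v in values])
```

For each position the two recovered values agree and fall inside [0, 1), which is consistent with level being the probability of dropping a unit rather than of keeping it.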

In real Keras models, dropout is usually placed after the activation function, for example:

import keras

model = keras.models.Sequential()
model.add(keras.layers.Dense(150, activation="relu"))
# Drop 50% of the Dense layer's activations during training
model.add(keras.layers.Dropout(0.5))

3. Dropout is being abandoned

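As deep learning has developed, Dropout has gradually been replaced by BN in modern convolutional architectures (for background on BN, see my earlier article in this series, Deep Learning Fundamentals Series (7) | Batch Normalization; I won't repeat it here). BN also has a regularizing effect no weaker than Dropout's.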

"We presented an algorithm for constructing, training, and performing inference with batch-normalized networks. The resulting networks can be trained with saturating nonlinearities, are more tolerant to increased training rates, and often do not require Dropout for regularization." (Ioffe and Szegedy, 2015)
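For readers who have not seen the earlier article, here is a minimal sketch of the training-time BN forward pass, written in plain Python under stated assumptions: the function name `batch_norm`, the list-of-lists batch layout, and the fixed `gamma` and `beta` values are illustrative only. A real layer learns gamma and beta and keeps running statistics for inference, which this sketch leaves out.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift.

    batch: list of samples, each a list of feature values.
    gamma, beta: per-feature scale and shift (learned in a real layer).
    """
    n = len(batch)
    columns = list(zip(*batch))  # one tuple of values per feature
    means = [sum(col) / n for col in columns]
    variances = [sum((v - m) ** 2 for v in col) / n
                 for col, m in zip(columns, means)]
    return [
        [g * (v - m) / math.sqrt(var + eps) + b
         for v, m, var, g, b in zip(sample, means, variances, gamma, beta)]
        for sample in batch
    ]


# Two features on very different scales end up on a comparable scale.
batch = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
normalized = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
for row in normalized:
    print([round(v, 3) for v in row])
```

Because the mean and variance come from whichever samples share the mini-batch, the output for a given sample varies slightly from batch to batch, and the BN paper points to this as the source of its regularizing effect.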

As for why Dropout has fallen out of favor, the reasons are as follows:

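Dropout has limited regularizing effect in convolutional layers. Compared with fully connected layers, convolutional layers have far fewer trainable parameters, and their activation functions already handle the spatial transformation of features well, so the regularization benefit in convolutional layers is small;
Dropout is also dated. Where it can help is in fully connected layers, but in modern deep networks fully connected layers are themselves gradually being replaced by global average pooling layers, which not only reduces model size but can also improve performance.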
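A quick parameter count makes both points concrete. The shapes below (a 7x7x512 final feature map, a 4096-unit hidden layer, 1000 output classes) are illustrative assumptions in the style of a VGG-like network, not figures from the article, and the helper names are hypothetical.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weights plus biases of a standard 2D convolution."""
    return kernel_h * kernel_w * in_channels * out_channels + out_channels


def dense_params(in_features, units):
    """Weights plus biases of a fully connected layer."""
    return in_features * units + units


# Assumed final feature map: height x width x channels.
height, width, channels = 7, 7, 512

# One 3x3 convolution that keeps 512 channels.
conv = conv2d_params(3, 3, channels, channels)

# Flatten followed by a 4096-unit fully connected layer.
flatten_dense = dense_params(height * width * channels, 4096)

# Global average pooling (no parameters) followed by a 1000-class classifier.
gap_classifier = dense_params(channels, 1000)

print(f"3x3 conv:        {conv:>12,}")            # 2,359,808
print(f"Flatten + Dense: {flatten_dense:>12,}")   # 102,764,544
print(f"GAP + Dense:     {gap_classifier:>12,}")  # 513,000
```

Under these assumptions a single fully connected layer holds roughly 40 times the parameters of a convolution, and swapping Flatten plus Dense for global average pooling removes almost all of them.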

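In fact, a look at the modern classic models implemented in Keras gives a good picture of where dropout stands today. Open the Keras applications repository: https://github.com/keras-team/keras-applications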

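Across VGG, ResNet, Inception, MobileNetV2 and the other models there, Dropout is nowhere to be found. Only MobileNetV1 still contains Dropout, and not in its convolutional layers; from MobileNetV2 onward the hidden fully connected layers are gone as well, with global average pooling feeding the final classifier directly, as shown in the figure below: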

Other models are similar: one after another, they have dropped Dropout and fully connected layers.
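One way to check this claim yourself is to count layer types in a model. The sketch below is a hedged illustration: `layer_type_counts`, `uses_dropout`, `ToyModel`, and the empty stand-in layer classes are hypothetical names used so the example runs without Keras installed. The only thing it relies on is that a model exposes a `layers` list, as Keras models do.

```python
from collections import Counter


def layer_type_counts(model):
    """Count layers by class name for any object exposing a `layers` list."""
    return Counter(type(layer).__name__ for layer in model.layers)


def uses_dropout(model):
    """True if any top-level layer's class name contains 'Dropout'."""
    return any("Dropout" in name for name in layer_type_counts(model))


# Hypothetical stand-ins for real layer classes, for demonstration only.
class Conv2D: pass
class GlobalAveragePooling2D: pass
class Dropout: pass
class Dense: pass


class ToyModel:
    def __init__(self, layers):
        self.layers = layers


with_dropout = ToyModel([Conv2D(), GlobalAveragePooling2D(), Dropout(), Conv2D()])
without_dropout = ToyModel([Conv2D(), GlobalAveragePooling2D(), Dense()])

print(uses_dropout(with_dropout), uses_dropout(without_dropout))  # True False
print(dict(layer_type_counts(with_dropout)))
```

With Keras installed, the same helpers can be pointed at a real model, for example `keras.applications.MobileNet(weights=None)`, where `weights=None` builds the architecture without downloading pretrained weights. Layers nested inside sub-models are not traversed by this sketch.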

4. Dropout vs. Batch Normalization

We need a simple experiment to test the claims above. It covers five models:

No Dropout and no BN;
Dropout without BN, with a drop probability of 0.2;
Dropout without BN, with a drop probability of 0.5;
Dropout without BN, with a drop probability of 0.8;
BN without Dropout.

The code is as follows:

import keras
from keras.datasets import cifar10
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten, BatchNormalization
from keras.layers import Conv2D, MaxPooling2D
from matplotlib import pyplot as plt
import numpy as np

# Use the same random seed for every model to keep the comparison fair
np.random.seed(7)
batch_size = 32
num_classes = 10
epochs = 40
data_augmentation = True

# The data, split between train and test sets:
(x_train, y_train), (x_test, y_test) = cifar10.load_data()

# Convert class vectors to binary class matrices.
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255

def model(bn=False, dropout=False, level=0.5):
    model = Sequential()
    model.add(Conv2D(32, (3, 3), padding='same', input_shape=x_train.shape[1:]))
    if bn:
        model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(Conv2D(32, (3, 3)))
    if bn:
        model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    if dropout:
        model.add(Dropout(level))

    model.add(Conv2D(64, (3, 3), padding='same'))
    if bn:
        model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(Conv2D(64, (3, 3)))
    if bn:
        model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    if dropout:
        model.add(Dropout(level))

    model.add(Flatten())
    model.add(Dense(512))
    if bn:
        model.add(BatchNormalization())
    model.add(Activation('relu'))
    if dropout:
        model.add(Dropout(level))
    model.add(Dense(num_classes))
    model.add(Activation('softmax'))
    # The BN variant is trained with a 10x larger learning rate than the others
    if bn:
        opt = keras.optimizers.RMSprop(lr=0.001, decay=1e-6)
    else:
        opt = keras.optimizers.RMSprop(lr=0.0001, decay=1e-6)

    model.compile(loss='categorical_crossentropy',
                  optimizer=opt,
                  metrics=['accuracy'])

    # Use data augmentation to obtain more training data
    datagen = ImageDataGenerator(width_shift_range=0.1, height_shift_range=0.1, horizontal_flip=True)
    datagen.fit(x_train)
    history = model.fit_generator(datagen.flow(x_train, y_train, batch_size=batch_size), epochs=epochs,
                                  validation_data=(x_test, y_test), workers=4)
    return history


no_dropout_bn_history = model(False, False)
dropout_low_history = model(False, True, 0.2)
dropout_medium_history = model(False, True, 0.5)
dropout_high_history = model(False, True, 0.8)
bn_history = model(True, False)

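# Compare validation accuracy across the models
# (the history key is 'val_acc' in Keras 2.2.x; newer releases use 'val_accuracy')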
plt.plot(no_dropout_bn_history.history['val_acc'])
plt.plot(dropout_low_history.history['val_acc'])
plt.plot(dropout_medium_history.history['val_acc'])
plt.plot(dropout_high_history.history['val_acc'])
plt.plot(bn_history.history['val_acc'])
plt.title('Model accuracy')
plt.ylabel('Validation Accuracy')
plt.xlabel('Epoch')
plt.legend(['No bn and dropout', 'Dropout with 0.2', 'Dropout with 0.5', 'Dropout with 0.8', 'BN'], loc='lower right')
plt.grid(True)
plt.show()

# Compare validation loss across the models
plt.plot(no_dropout_bn_history.history['val_loss'])
plt.plot(dropout_low_history.history['val_loss'])
plt.plot(dropout_medium_history.history['val_loss'])
plt.plot(dropout_high_history.history['val_loss'])
plt.plot(bn_history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['No bn and dropout', 'Dropout with 0.2', 'Dropout with 0.5', 'Dropout with 0.8', 'BN'], loc='upper right')
plt.grid(True)
plt.show()

The validation accuracy of each model is shown in the figure below:

The validation loss of each model is shown below:

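The figures above show that Dropout behaves quite differently depending on the drop probability. Dropout with 0.2 performs close to No bn and dropout (which can be thought of as Dropout with a keep-prob of 1). Overall, BN beats Dropout on both accuracy and loss: for example, BN reaches about 85% validation accuracy, while Dropout comes in at close to 79%.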
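To put the gap in perspective, the two accuracy figures quoted above can be turned into a relative reduction in validation error. This is a small arithmetic sketch: the function name is hypothetical, and 0.79 and 0.85 are the approximate values reported in the text, not newly measured results.

```python
def relative_error_reduction(baseline_accuracy, improved_accuracy):
    """Fraction of the baseline's error rate removed by the improved model."""
    baseline_error = 1.0 - baseline_accuracy
    improved_error = 1.0 - improved_accuracy
    return (baseline_error - improved_error) / baseline_error


# Approximate validation accuracies quoted in the text: Dropout 79%, BN 85%.
reduction = relative_error_reduction(0.79, 0.85)
print(f"{reduction:.1%}")  # 28.6%
```

On these rounded numbers, moving from Dropout to BN removes a bit over a quarter of the validation errors, which is a more telling measure than the six-point difference in accuracy.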