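Thanks to Professor Hu He of Renmin University of China, whose artificial intelligence course is in-depth and keeps up with current developments.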
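A deep neural network (DNN) has many layers, and every training step propagates the gradient layer by layer from the back to the front. If the factor passed along at each layer is below 1, the gradient can become very small and approach 0, so updating the network with it changes almost nothing: this is the vanishing gradients problem. If the factor is above 1, the gradient becomes very large, the updates keep oscillating, and the network cannot converge. Training DNNs has therefore given rise to many tricks.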
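1. Weight Initialization
Weights were originally initialized at random from a normal distribution, which makes a variance that starts at 1 grow very large from layer to layer. Take the sigmoid function in the figure below: near 0 its gradient is close to linear and very sensitive, but far from 0 it saturates, so a very strong input produces an almost flat response.
Xavier initialization addresses this by setting the weights according to fan-in (the number of input neurons) and fan-out (the number of output neurons).
Initialization strategies have also been designed for different activation functions, as in the figure below (the left column uses a uniform distribution; the right column uses a normal distribution, which is more common).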
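The fan-in/fan-out rule above can be sketched in NumPy. This is a minimal sketch assuming the commonly cited formulas (Xavier: variance 2 / (fan_in + fan_out); He, often paired with ReLU: variance 2 / fan_in); the function names are illustrative and not part of the original post.

```python
import numpy as np

def xavier_normal(fan_in, fan_out, rng):
    # Xavier/Glorot, normal variant: std = sqrt(2 / (fan_in + fan_out))
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def xavier_uniform(fan_in, fan_out, rng):
    # Xavier/Glorot, uniform variant: values in [-r, r], r = sqrt(6 / (fan_in + fan_out))
    r = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-r, r, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out, rng):
    # He initialization (assumed here for ReLU-like activations): std = sqrt(2 / fan_in)
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W_normal = xavier_normal(300, 100, rng)
W_uniform = xavier_uniform(300, 100, rng)
W_he = he_normal(300, 100, rng)
```

Both Xavier variants give the weights the same variance; only the shape of the distribution differs.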
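2. Activation Functions
ReLU is the usual choice, but it outputs 0 for every input below 0, so a neuron whose inputs stay negative stops learning (dying ReLUs).
a. Leaky ReLU
One improvement is Leaky ReLU = max(αx, x), which keeps a small signal when the input is below 0.
In practice: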
import tensorflow as tf  # TensorFlow 1.x API
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("/tmp/data/")

def leaky_relu(z, name=None):
    # Leaky ReLU = max(alpha * z, z); alpha = 0.01 is an example value
    return tf.maximum(0.01 * z, z, name=name)

tf.reset_default_graph()

n_inputs = 28 * 28  # MNIST
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64, shape=(None), name="y")

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=leaky_relu, name="hidden1")
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=leaky_relu, name="hidden2")
    logits = tf.layers.dense(hidden2, n_outputs, name="outputs")

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

learning_rate = 0.01

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()
saver = tf.train.Saver()

n_epochs = 40
batch_size = 50

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        if epoch % 5 == 0:
            # Accuracy on the last training batch and on the validation set
            acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
            acc_test = accuracy.eval(feed_dict={X: mnist.validation.images,
                                                y: mnist.validation.labels})
            print(epoch, "Batch accuracy:", acc_train, "Validation accuracy:", acc_test)
    save_path = saver.save(sess, "./my_model_final.ckpt")
b. ELU
Another improvement is ELU, which uses an exponential curve when the neuron's input is below 0.
# Just specify the activation function when building each layer
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.elu, name="hidden1")
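To see how the two variants treat negative inputs, here is a small NumPy sketch. It assumes the standard definitions, Leaky ReLU = max(αz, z) and ELU = z for z > 0, α(exp(z) − 1) otherwise; the default α values are illustrative choices.

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # Keeps a small slope alpha for negative inputs
    return np.maximum(alpha * z, z)

def elu(z, alpha=1.0):
    # Exponential curve for negative inputs, identity for positive ones.
    # np.minimum keeps expm1 from overflowing on large positive z.
    return np.where(z > 0, z, alpha * np.expm1(np.minimum(z, 0.0)))

z = np.array([-2.0, 0.0, 3.0])
leaky_out = leaky_relu(z)
elu_out = elu(z)
```

For z = -2 the leaky variant returns -0.02, while ELU returns exp(-2) - 1, roughly -0.86, and approaches -α as z becomes more negative.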
c. SELU
A more recent proposal is SELU (only the key code is shown).
# tf.nn.selu ships with TensorFlow 1.4 and later
with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.selu, name="hidden1")
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.selu, name="hidden2")
    logits = tf.layers.dense(hidden2, n_outputs, name="outputs")

# Training: the inputs are standardized to mean 0 and standard deviation 1
means = mnist.train.images.mean(axis=0, keepdims=True)
stds = mnist.train.images.std(axis=0, keepdims=True) + 1e-10

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            X_batch_scaled = (X_batch - means) / stds
            sess.run(training_op, feed_dict={X: X_batch_scaled, y: y_batch})
        if epoch % 5 == 0:
            acc_train = accuracy.eval(feed_dict={X: X_batch_scaled, y: y_batch})
            X_val_scaled = (mnist.validation.images - means) / stds
            acc_test = accuracy.eval(feed_dict={X: X_val_scaled,
                                                y: mnist.validation.labels})
            print(epoch, "Batch accuracy:", acc_train, "Validation accuracy:", acc_test)
    save_path = saver.save(sess, "./my_model_final_selu.ckpt")
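3. Batch Normalization
In 2015 researchers proposed that, since training already works on mini-batches, each batch can be normalized just before the activation function is applied, so that the values reaching the activation have a well-behaved distribution. The normalized values are then scaled and shifted by two parameters, γ (scaling) and β (shifting), which are learned during training, before being passed to the activation function. The steps are as follows: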
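The normalization step for one mini-batch can be sketched in NumPy as below. This assumes the usual formulation: per-feature batch mean and variance, a small `eps` for numerical stability, then scaling by γ and shifting by β. The function name and the `eps` value are illustrative and not taken from the original post.

```python
import numpy as np

def batch_norm_forward(X, gamma, beta, eps=1e-5):
    mu = X.mean(axis=0)                     # 1. mini-batch mean per feature
    var = X.var(axis=0)                     # 2. mini-batch variance per feature
    X_hat = (X - mu) / np.sqrt(var + eps)   # 3. normalize to zero mean, unit variance
    out = gamma * X_hat + beta              # 4. scale and shift
    return out, mu, var

rng = np.random.default_rng(0)
X_batch = rng.normal(5.0, 3.0, size=(200, 4))
out, mu, var = batch_norm_forward(X_batch, gamma=np.ones(4), beta=np.zeros(4))
```

With γ = 1 and β = 0 the output has mean close to 0 and standard deviation close to 1 in every column, whatever the scale of the input.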
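Batch normalization is now implemented in TensorFlow as a standalone layer that can be called directly.
At test time there is no mini-batch, so the mean and standard deviation accumulated during training are used instead. The implementation is as follows: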
from functools import partial

import tensorflow as tf  # TensorFlow 1.x API
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("/tmp/data/")

n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10
learning_rate = 0.01  # same value as in the earlier example
batch_norm_momentum = 0.9

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64, shape=(None), name="y")
training = tf.placeholder_with_default(False, shape=(), name='training')

with tf.name_scope("dnn"):
    he_init = tf.contrib.layers.variance_scaling_initializer()

    # Batch normalization acts as a separate layer
    my_batch_norm_layer = partial(
        tf.layers.batch_normalization,
        training=training,
        momentum=batch_norm_momentum)

    my_dense_layer = partial(
        tf.layers.dense,
        kernel_initializer=he_init)

    hidden1 = my_dense_layer(X, n_hidden1, name="hidden1")
    bn1 = tf.nn.elu(my_batch_norm_layer(hidden1))  # ELU as the activation function
    hidden2 = my_dense_layer(bn1, n_hidden2, name="hidden2")
    bn2 = tf.nn.elu(my_batch_norm_layer(hidden2))
    logits_before_bn = my_dense_layer(bn2, n_outputs, name="outputs")
    logits = my_batch_norm_layer(logits_before_bn)  # the output layer is batch-normalized too

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()
saver = tf.train.Saver()

n_epochs = 20
batch_size = 200

# The moving mean and variance are updated by extra ops that must be run explicitly
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

# Behavior differs between training and testing (controlled by the `training` placeholder)
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run([training_op, extra_update_ops],
                     feed_dict={training: True, X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: mnist.test.images,
                                                y: mnist.test.labels})
        print(epoch, "Test accuracy:", accuracy_val)
    save_path = saver.save(sess, "./my_model_final.ckpt")
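4. Gradient Clipping
Clip the gradients before they are propagated further back, which mitigates the exploding gradients problem to some extent. (Now that batch normalization exists, this method is not used much.)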
threshold = 1.0
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(loss)
capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var)
              for grad, var in grads_and_vars]
training_op = optimizer.apply_gradients(capped_gvs)
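5. Reusing Pretrained Layers
Slightly modifying a previously trained model saves time, and this is done often when training deep models (because they have so many layers).
For similar problems in general, the output layer, which is tied to problem specifics such as the number of classes, and the last hidden layer that feeds it directly cannot be reused as they are; they still have to be trained.
As the figure below shows, once a complex net has been trained and is transferred to a relatively simple one, hidden1 and hidden2 stay fixed, hidden3 is adjusted slightly, and hidden4 and the output layer are trained from scratch. This saves a great deal of time when you do not have your own GPU.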
# Load the saved graph structure (the .meta file written alongside the checkpoint)
saver = tf.train.import_meta_graph("./my_model_final.ckpt.meta")

# Pick out only the operations you need
X = tf.get_default_graph().get_tensor_by_name("X:0")
y = tf.get_default_graph().get_tensor_by_name("y:0")
accuracy = tf.get_default_graph().get_tensor_by_name("eval/accuracy:0")
training_op = tf.get_default_graph().get_operation_by_name("GradientDescent")

# If you are the author of the original model, you can store the important
# operations in a collection with a clear name
for op in (X, y, accuracy, training_op):
    tf.add_to_collection("my_important_ops", op)

# If you are reusing such a model
X, y, accuracy, training_op = tf.get_collection("my_important_ops")

# Training
with tf.Session() as sess:
    saver.restore(sess, "./my_model_final.ckpt")
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: mnist.test.images,
                                                y: mnist.test.labels})
        print(epoch, "Test accuracy:", accuracy_val)
    save_path = saver.save(sess, "./my_new_model_final.ckpt")
a. Freezing the Lower Layers
Keep the parameters of the lower layers fixed during training, which freezes those layers.
# Using MNIST as an example
n_inputs = 28 * 28  # MNIST
n_hidden1 = 300  # reused
n_hidden2 = 50   # reused
n_hidden3 = 50   # reused
n_hidden4 = 20   # new!
n_outputs = 10   # new!

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64, shape=(None), name="y")

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
                              name="hidden1")  # reused frozen
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu,
                              name="hidden2")  # reused frozen
    hidden2_stop = tf.stop_gradient(hidden2)
    hidden3 = tf.layers.dense(hidden2_stop, n_hidden3, activation=tf.nn.relu,
                              name="hidden3")  # reused, not frozen
    hidden4 = tf.layers.dense(hidden3, n_hidden4, activation=tf.nn.relu,
                              name="hidden4")  # new!
    logits = tf.layers.dense(hidden4, n_outputs, name="outputs")  # new!

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy")

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

reuse_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
                               scope="hidden[123]")  # regular expression
reuse_vars_dict = dict([(var.op.name, var) for var in reuse_vars])
restore_saver = tf.train.Saver(reuse_vars_dict)  # to restore layers 1-3

init = tf.global_variables_initializer()
saver = tf.train.Saver()

with tf.Session() as sess:
    init.run()
    restore_saver.restore(sess, "./my_model_final.ckpt")

    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: mnist.test.images,
                                                y: mnist.test.labels})
        print(epoch, "Test accuracy:", accuracy_val)

    save_path = saver.save(sess, "./my_new_model_final.ckpt")
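b. Caching the Frozen Layers
Because the frozen layers do not change, their outputs can be computed once and cached, and training then starts directly from the layers after the frozen ones.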
# Using MNIST as an example
n_inputs = 28 * 28  # MNIST
n_hidden1 = 300  # reused
n_hidden2 = 50   # reused
n_hidden3 = 50   # reused
n_hidden4 = 20   # new!
n_outputs = 10   # new!

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64, shape=(None), name="y")

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
                              name="hidden1")  # reused frozen
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu,
                              name="hidden2")  # reused frozen & cached
    hidden2_stop = tf.stop_gradient(hidden2)
    hidden3 = tf.layers.dense(hidden2_stop, n_hidden3, activation=tf.nn.relu,
                              name="hidden3")  # reused, not frozen
    hidden4 = tf.layers.dense(hidden3, n_hidden4, activation=tf.nn.relu,
                              name="hidden4")  # new!
    logits = tf.layers.dense(hidden4, n_outputs, name="outputs")  # new!

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy")

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

reuse_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
                               scope="hidden[123]")  # regular expression
reuse_vars_dict = dict([(var.op.name, var) for var in reuse_vars])
restore_saver = tf.train.Saver(reuse_vars_dict)  # to restore layers 1-3

init = tf.global_variables_initializer()
saver = tf.train.Saver()

import numpy as np

n_batches = mnist.train.num_examples // batch_size

with tf.Session() as sess:
    init.run()
    restore_saver.restore(sess, "./my_model_final.ckpt")

    # Compute the output of the frozen layers once and cache it
    h2_cache = sess.run(hidden2, feed_dict={X: mnist.train.images})
    h2_cache_test = sess.run(hidden2, feed_dict={X: mnist.test.images})

    for epoch in range(n_epochs):
        shuffled_idx = np.random.permutation(mnist.train.num_examples)
        hidden2_batches = np.array_split(h2_cache[shuffled_idx], n_batches)
        y_batches = np.array_split(mnist.train.labels[shuffled_idx], n_batches)
        for hidden2_batch, y_batch in zip(hidden2_batches, y_batches):
            # Feed the cached activations directly into hidden2
            sess.run(training_op, feed_dict={hidden2: hidden2_batch, y: y_batch})

        accuracy_val = accuracy.eval(feed_dict={hidden2: h2_cache_test,
                                                y: mnist.test.labels})
        print(epoch, "Test accuracy:", accuracy_val)

    save_path = saver.save(sess, "./my_new_model_final.ckpt")
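6. Unsupervised Pretraining
This method gave people a new way of looking at how deep networks can be trained: it makes use of unlabeled data, which is much cheaper to obtain. When part of the training data has no labels, the unlabeled data is first used for pretraining to obtain a reasonably good network; the labeled data is then used for the actual training with backpropagation. This makes deep models more practical.
Pretraining can also be done on a similar model.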
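7. Faster Optimizers
These are improvements on traditional SGD.
They include Momentum optimization (the earliest, based on inertia), Nesterov Accelerated Gradient, AdaGrad (adaptive gradient, where each dimension descends at its own rate), RMSProp, and Adam optimization (which combines AdaGrad-style scaling with momentum; it is the most widely used and the usual default choice).
a. Momentum Optimization
Remember the gradient directions computed earlier and add them to the current gradient as inertia. Think of walking downhill: SGD stands still and only judges where the slope is steepest right now, whereas momentum keeps correcting its direction while already running, which is clearly more effective.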
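A minimal NumPy sketch of the momentum update, assuming the common form m ← βm − η∇J(θ), θ ← θ + m. The quadratic toy loss, the function names, and the hyperparameter values are assumptions chosen only to keep the example self-contained.

```python
import numpy as np

def grad(theta):
    # Gradient of the toy loss J(theta) = 0.5 * sum(theta**2) (illustrative assumption)
    return theta

def momentum_minimize(theta0, lr=0.1, beta=0.9, n_steps=200):
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)          # accumulated "inertia"
    for _ in range(n_steps):
        m = beta * m - lr * grad(theta)   # blend the old direction with the new gradient
        theta = theta + m
    return theta

theta_momentum = momentum_minimize([5.0, -3.0])
```

Setting `beta=0` removes the inertia term and reduces the loop to plain gradient descent.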
b. Nesterov Accelerated Gradient
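Instead of computing the gradient only at the current point, look one step ahead: measuring the gradient a little further along the momentum direction is somewhat more accurate.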
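The look-ahead idea can be sketched as follows, assuming the form m ← βm − η∇J(θ + βm), θ ← θ + m. Only the point where the gradient is evaluated differs from plain momentum. The toy loss and names are illustrative assumptions.

```python
import numpy as np

def grad(theta):
    # Gradient of the toy loss J(theta) = 0.5 * sum(theta**2) (illustrative assumption)
    return theta

def nesterov_minimize(theta0, lr=0.1, beta=0.9, n_steps=200):
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)
    for _ in range(n_steps):
        lookahead = theta + beta * m          # where the momentum is about to take us
        m = beta * m - lr * grad(lookahead)   # gradient measured at the look-ahead point
        theta = theta + m
    return theta

theta_nesterov = nesterov_minimize([5.0, -3.0])
```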
c. AdaGrad
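The squared gradients of each dimension are accumulated and used as a denominator that scales the current gradient, so different dimensions descend at different rates. In the figure below the horizontal axis is much flatter than the vertical one. Traditional gradient descent simply moves along the steepest direction, whereas AdaGrad shrinks the step more along the steep θ2 than along the gentle θ1, so the path heads more directly toward the optimum.
It also has a drawback: s keeps accumulating, the denominator keeps growing, and the steps may eventually become too small to make progress.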
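A single AdaGrad step in NumPy, assuming the usual update s ← s + g², θ ← θ − η g / sqrt(s + ε). The two curvature values are made up to mimic one gentle and one steep dimension.

```python
import numpy as np

def adagrad_step(theta, s, g, lr=0.5, eps=1e-10):
    s = s + g * g                              # accumulate squared gradients per dimension
    theta = theta - lr * g / np.sqrt(s + eps)  # larger accumulated gradient, smaller step
    return theta, s

curvature = np.array([0.1, 10.0])   # gentle theta1, steep theta2 (illustrative values)
theta = np.array([1.0, 1.0])
s = np.zeros(2)
theta, s = adagrad_step(theta, s, curvature * theta)
```

Although the raw gradients differ by a factor of 100, both coordinates move by about the same amount on the first step. Because `s` never decreases, later steps keep shrinking, which is the drawback noted above.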
d. RMSProp (Adadelta)
Accumulate only part of the history: a decay factor is applied so that only the gradients from the most recent steps count.
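The effect of the decay factor can be checked with a small NumPy sketch, assuming the usual update s ← decay·s + (1 − decay)·g². With a constant gradient, the RMSProp accumulator settles at g², while an AdaGrad-style sum keeps growing. Names and values are illustrative.

```python
import numpy as np

def rmsprop_step(theta, s, g, lr=0.01, decay=0.9, eps=1e-10):
    s = decay * s + (1.0 - decay) * g * g      # exponentially decaying average of g**2
    theta = theta - lr * g / np.sqrt(s + eps)
    return theta, s

theta = np.array([0.0])
s_rms = np.zeros(1)
s_ada = np.zeros(1)
g = np.array([2.0])                            # constant gradient, for illustration only
for _ in range(200):
    theta, s_rms = rmsprop_step(theta, s_rms, g)
    s_ada = s_ada + g * g                      # AdaGrad-style accumulation, for comparison
```

After 200 steps `s_rms` is close to 4 (that is, g²), whereas `s_ada` has reached 800, so the RMSProp step size does not collapse over time.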
e. Adam Optimization
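Currently the most widely used method and generally the best-performing one, combining the strengths of AdaGrad-style adaptive scaling and Momentum.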
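A minimal sketch of one Adam step in NumPy, assuming the standard formulation with bias correction: a decaying average of gradients (the momentum part) divided by the square root of a decaying average of squared gradients (the adaptive part). The hyperparameter defaults follow commonly used values and are assumptions here.

```python
import numpy as np

def adam_step(theta, m, v, g, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1.0 - beta1) * g          # momentum-like average of gradients
    v = beta2 * v + (1.0 - beta2) * g * g      # RMSProp-like average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)             # bias correction for the zero initialization
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

curvature = np.array([0.1, 10.0])   # gentle and steep dimension (illustrative values)
theta = np.array([1.0, 1.0])
m = np.zeros(2)
v = np.zeros(2)
theta, m, v = adam_step(theta, m, v, curvature * theta, t=1)
```

On the first step every coordinate moves by roughly `lr`, regardless of how large its raw gradient is.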
# How to call them in TensorFlow
optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate, momentum=0.9)
optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate, momentum=0.9,
                                       use_nesterov=True)
optimizer = tf.train.RMSPropOptimizer(learning_rate=learning_rate, momentum=0.9,
                                      decay=0.9, epsilon=1e-10)
# As you can see, AdamOptimizer needs the least configuration
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
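8. Learning Rate Scheduling
The learning rate setting matters as well. As the figure below shows, if it is too large training does not converge to the global optimum, and if it is too small convergence is very slow. Ideally the learning rate is reduced as training progresses: large at first, small later.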
a. Exponential Scheduling
Decrease the learning rate exponentially.
initial_learning_rate = 0.1
decay_steps = 10000
decay_rate = 1/10
global_step = tf.Variable(0, trainable=False)
learning_rate = tf.train.exponential_decay(initial_learning_rate, global_step,
                                           decay_steps, decay_rate)
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
training_op = optimizer.minimize(loss, global_step=global_step)
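The schedule configured above follows the formula learning_rate = initial_learning_rate · decay_rate^(step / decay_steps), which is what `tf.train.exponential_decay` computes when `staircase` is left at its default. A plain-Python sketch (the helper name is illustrative):

```python
def exponential_decay(initial_lr, step, decay_steps, decay_rate):
    # The learning rate is multiplied by decay_rate once every decay_steps steps,
    # with a smooth interpolation in between
    return initial_lr * decay_rate ** (step / decay_steps)

lr_start = exponential_decay(0.1, 0, 10000, 0.1)
lr_after_10k = exponential_decay(0.1, 10000, 10000, 0.1)
lr_after_20k = exponential_decay(0.1, 20000, 10000, 0.1)
```

With the values used above, the learning rate starts at 0.1 and drops by a factor of 10 every 10,000 steps.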
9. Avoiding Overfitting Through Regularization
These techniques address overfitting in deep models.
a. Early Stopping
Stop training when the error rate on the validation set starts to rise.
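A minimal sketch of the stopping rule in plain Python. It assumes a hypothetical `patience` parameter (how many epochs without improvement to tolerate before stopping), which the original text does not mention; the validation errors are made-up numbers.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation errors."""
    best_error = float("inf")
    best_epoch = -1
    for epoch, err in enumerate(val_errors):
        if err < best_error:
            best_error, best_epoch = err, epoch    # new best model: checkpoint it here
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch               # no improvement for `patience` epochs
    return best_epoch, len(val_errors) - 1

val_errors = [0.9, 0.7, 0.5, 0.45, 0.47, 0.5, 0.55, 0.6]   # illustrative values
best_epoch, stop_epoch = early_stopping_epoch(val_errors, patience=3)
```

In a real training loop the model saved at `best_epoch` would be the one to keep.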
b. l1 and l2 Regularization
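from tensorflow.contrib.framework import arg_scope
from tensorflow.contrib.layers import fully_connected

# Option 1: add the l1 penalty by hand (after constructing the neural network).
# weights1 and weights2 are the layers' weight variables; scale is the regularization strength.
base_loss = tf.reduce_mean(xentropy, name="avg_xentropy")
reg_losses = tf.reduce_sum(tf.abs(weights1)) + tf.reduce_sum(tf.abs(weights2))
loss = tf.add(base_loss, scale * reg_losses, name="loss")

# Option 2: pass a regularizer to every layer and collect the resulting losses
with arg_scope(
        [fully_connected],
        weights_regularizer=tf.contrib.layers.l1_regularizer(scale=0.01)):
    hidden1 = fully_connected(X, n_hidden1, scope="hidden1")
    hidden2 = fully_connected(hidden1, n_hidden2, scope="hidden2")
    logits = fully_connected(hidden2, n_outputs, activation_fn=None, scope="out")

reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
loss = tf.add_n([base_loss] + reg_losses, name="loss")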
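c. Dropout
A newer regularization method: for each neuron a random number is drawn, and the neuron is dropped if the number exceeds a threshold, so a random subset of neurons is discarded at every training step. Results show that dropout is very effective against overfitting. It prevents the remaining neurons from concentrating too many features, reduces the effective complexity of the network, and improves robustness.
With dropout, training and test accuracy end up close to each other, which mitigates overfitting to some extent.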
training = tf.placeholder_with_default(False, shape=(), name='training')

dropout_rate = 0.5  # == 1 - keep_prob
X_drop = tf.layers.dropout(X, dropout_rate, training=training)

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X_drop, n_hidden1, activation=tf.nn.relu,
                              name="hidden1")
    hidden1_drop = tf.layers.dropout(hidden1, dropout_rate, training=training)
    hidden2 = tf.layers.dense(hidden1_drop, n_hidden2, activation=tf.nn.relu,
                              name="hidden2")
    hidden2_drop = tf.layers.dropout(hidden2, dropout_rate, training=training)
    logits = tf.layers.dense(hidden2_drop, n_outputs, name="outputs")
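What a dropout layer does to its input can be sketched in NumPy. This assumes "inverted" dropout, where the kept activations are divided by the keep probability during training so that nothing needs rescaling at test time; the function name is illustrative.

```python
import numpy as np

def dropout(activations, rate, training, rng):
    if not training or rate == 0.0:
        return activations                     # test time: the layer does nothing
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob   # True for the neurons that are kept
    return activations * mask / keep_prob      # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
a = np.ones((100, 100))
a_train = dropout(a, rate=0.5, training=True, rng=rng)
a_test = dropout(a, rate=0.5, training=False, rng=rng)
```

With `rate=0.5`, about half of the entries of `a_train` are 0 and the rest are 2, so the mean stays close to 1.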
d. Max-Norm Regularization
Weights whose norm exceeds a threshold are clipped back to the threshold, which makes the network somewhat more stable.
from tensorflow.contrib.layers import fully_connected

def max_norm_regularizer(threshold, axes=1, name="max_norm", collection="max_norm"):
    def max_norm(weights):
        clipped = tf.clip_by_norm(weights, clip_norm=threshold, axes=axes)
        clip_weights = tf.assign(weights, clipped, name=name)
        tf.add_to_collection(collection, clip_weights)
        return None  # there is no regularization loss term
    return max_norm

max_norm_reg = max_norm_regularizer(threshold=1.0)
hidden1 = fully_connected(X, n_hidden1, scope="hidden1",
                          weights_regularizer=max_norm_reg)
# The clipping ops stored in the "max_norm" collection still have to be run
# after each training step.
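e. Data Augmentation
Deep networks are data-hungry models and need a lot of data. The dataset can be enlarged by, for example, mirroring images left to right, taking random crops, rotating by random angles, adjusting the brightness, or changing the hue. This increases the amount of data and reduces the chance of overfitting.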
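Two of the transformations mentioned above, mirroring and random cropping, can be sketched in NumPy as follows. The function names and the image and crop sizes are illustrative assumptions; note that mirroring only makes sense when it does not change the label (it would not for digits, for instance).

```python
import numpy as np

def random_flip_lr(image, rng):
    # Mirror the image left to right with probability 0.5
    return image[:, ::-1] if rng.random() < 0.5 else image

def random_crop(image, crop_h, crop_w, rng):
    # Cut a crop_h x crop_w window at a random position
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop_h + 1)      # upper bound is exclusive
    left = rng.integers(0, w - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w]

rng = np.random.default_rng(0)
image = np.arange(28 * 28, dtype=float).reshape(28, 28)
flipped = random_flip_lr(image, rng)
cropped = random_crop(image, 24, 24, rng)
```

Applying such functions with fresh random draws every epoch gives the network a slightly different version of each image each time.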
10. Default DNN Configuration