TensorFlow 2.0 is a major slimming-down of the 1.x line: Eager Execution is enabled by default and Keras is now the default high-level API. These changes make TensorFlow considerably easier to use.
This post records a rather tortuous encounter with BatchNormalization in Keras + TensorFlow 2.0, a pitfall that nearly wiped out the benefit of all of TF 2.0's new features. If you are working through the official TF 2.0 tutorials, it is worth a read.
Let's start from the tutorial [1] https://www.tensorflow.org/alpha/tutorials/images/transfer_learning?hl=zh-cn (which covers transfer learning):
IMG_SHAPE = (IMG_SIZE, IMG_SIZE, 3)

# Create the base model from the pre-trained model MobileNet V2
base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE,
                                               include_top=False,
                                               weights='imagenet')
model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES)
])
With this short snippet we reuse the MobileNetV2 backbone to build a classifier, and we can then train the model through the Keras API:
model.compile(optimizer=tf.keras.optimizers.RMSprop(lr=base_learning_rate),
              loss='sparse_categorical_crossentropy',
              metrics=['sparse_categorical_accuracy'])
model.summary()

history = model.fit(train_batches.repeat(),
                    epochs=20,
                    steps_per_epoch=steps_per_epoch,
                    validation_data=validation_batches.repeat(),
                    validation_steps=validation_steps)
Judging from the output, everything looks perfect:
Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= mobilenetv2_1.00_160 (Model) (None, 5, 5, 1280) 2257984 _________________________________________________________________ global_average_pooling2d (Gl (None, 1280) 0 _________________________________________________________________ dense (Dense) (None, 2) 1281 ================================================================= Total params: 2,259,265 Trainable params: 1,281 Non-trainable params: 2,257,984 _________________________________________________________________ Epoch 11/20 581/581 [==============================] - 134s 231ms/step - loss: 0.4208 - accuracy: 0.9484 - val_loss: 0.1907 - val_accuracy: 0.9812 Epoch 12/20 581/581 [==============================] - 114s 197ms/step - loss: 0.3359 - accuracy: 0.9570 - val_loss: 0.1835 - val_accuracy: 0.9844 Epoch 13/20 581/581 [==============================] - 116s 200ms/step - loss: 0.2930 - accuracy: 0.9650 - val_loss: 0.1505 - val_accuracy: 0.9844 Epoch 14/20 581/581 [==============================] - 114s 196ms/step - loss: 0.2561 - accuracy: 0.9701 - val_loss: 0.1575 - val_accuracy: 0.9859 Epoch 15/20 581/581 [==============================] - 119s 206ms/step - loss: 0.2302 - accuracy: 0.9715 - val_loss: 0.1600 - val_accuracy: 0.9812 Epoch 16/20 581/581 [==============================] - 115s 197ms/step - loss: 0.2134 - accuracy: 0.9747 - val_loss: 0.1407 - val_accuracy: 0.9828 Epoch 17/20 581/581 [==============================] - 115s 197ms/step - loss: 0.1546 - accuracy: 0.9813 - val_loss: 0.0944 - val_accuracy: 0.9828 Epoch 18/20 581/581 [==============================] - 116s 200ms/step - loss: 0.1636 - accuracy: 0.9794 - val_loss: 0.0947 - val_accuracy: 0.9844 Epoch 19/20 581/581 [==============================] - 115s 198ms/step - loss: 0.1356 - accuracy: 0.9823 - val_loss: 0.1169 - val_accuracy: 0.9828 Epoch 20/20 581/581 [==============================] - 116s 199ms/step - loss: 0.1243 - accuracy: 0.9849 - val_loss: 0.1121 - val_accuracy: 0.9875
However, this style is not convenient for debugging. We wanted fine-grained control over the iterations and visibility into intermediate results, so we changed the training procedure to this:
optimizer = tf.keras.optimizers.RMSprop(lr=base_learning_rate)
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')

@tf.function
def train_cls_step(image, label):
    with tf.GradientTape() as tape:
        predictions = model(image)
        loss = tf.keras.losses.SparseCategoricalCrossentropy()(label, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    train_accuracy(label, predictions)

for images, labels in train_batches:
    train_cls_step(images, labels)
After retraining, the results were again perfect!
But then we wanted to compare fine-tuning against training from scratch, so we changed the model-construction code to:
base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE,
                                               include_top=False,
                                               weights=None)
so that the weights are randomly initialized. At this point training started to misbehave: the loss would not go down and the accuracy hovered around 50%:
Step #10: loss=0.6937199831008911 acc=46.5625%
Step #20: loss=0.6932525634765625 acc=47.8125%
Step #30: loss=0.699873685836792 acc=49.16666793823242%
Step #40: loss=0.6910845041275024 acc=49.6875%
Step #50: loss=0.6935917139053345 acc=50.0625%
Step #60: loss=0.6965731382369995 acc=49.6875%
Step #70: loss=0.6949992179870605 acc=49.19642639160156%
Step #80: loss=0.6942993402481079 acc=49.84375%
Step #90: loss=0.6933775544166565 acc=49.65277862548828%
Step #100: loss=0.6928421258926392 acc=49.5%
Step #110: loss=0.6883170008659363 acc=49.54545593261719%
Step #120: loss=0.695658802986145 acc=49.453125%
Step #130: loss=0.6875559091567993 acc=49.61538314819336%
Step #140: loss=0.6851695775985718 acc=49.86606979370117%
Step #150: loss=0.6978713274002075 acc=49.875%
Step #160: loss=0.7165156602859497 acc=50.0%
Step #170: loss=0.6945627331733704 acc=49.797794342041016%
Step #180: loss=0.6936900615692139 acc=49.9305534362793%
Step #190: loss=0.6938323974609375 acc=49.83552551269531%
Step #200: loss=0.7030564546585083 acc=49.828125%
Step #210: loss=0.6926192045211792 acc=49.76190185546875%
Step #220: loss=0.6932414770126343 acc=49.786930084228516%
Step #230: loss=0.6924526691436768 acc=49.82337188720703%
Step #240: loss=0.6882281303405762 acc=49.869789123535156%
Step #250: loss=0.6877702474594116 acc=49.86249923706055%
Step #260: loss=0.6933954954147339 acc=49.77163314819336%
Step #270: loss=0.6944763660430908 acc=49.75694274902344%
Step #280: loss=0.6945018768310547 acc=49.49776840209961%
We printed the predictions and found that every output within a batch was exactly identical:
0 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
1 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
2 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
3 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
4 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
5 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
6 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
7 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
8 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
9 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
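For reference, this is roughly how such a dump can be produced (a hypothetical snippet, not the exact code we ran; tf.nn.softmax is applied only to make the outputs readable as probabilities, since the Dense head above has no activation):

# Debugging sketch: print the per-sample outputs for a single batch.
for images, labels in train_batches.take(1):
    predictions = tf.nn.softmax(model(images), axis=-1)
    for i, p in enumerate(predictions):
        print(i, '=', p)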
We had only changed the initial weights; why would that produce such a result?
Was the model simply under-trained, or was the learning rate badly chosen?
After several rounds of tuning we found that no matter how long we trained, and no matter whether we made the learning rate larger or smaller, nothing changed.
If the problem was with the weights, perhaps the random initialization itself was broken? We dumped the initial weights and looked at their statistics; everything was normal.
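For example, a quick sanity check of the freshly initialized weights looks roughly like this (a hypothetical snippet, not verbatim from our notebook):

# Print simple statistics for every weight tensor right after model construction.
for w in model.weights:
    print(w.name, w.numpy().mean(), w.numpy().std())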
From past experience, this symptom (every sample in a batch producing an identical output) shows up when BatchNormalization is mishandled while exporting a model for inference. But how does that explain it happening during training? And why does fine-tuning not hit the problem, given that all we changed was the initial weights?
Googling in that direction turned up quite a few issues against Keras's BatchNormalization. One of them is that the moving mean and moving variance of BatchNormalization are not saved when a model is exported [6] https://github.com/tensorflow/tensorflow/issues/16455, and another issue is directly related to our problem:
[2] https://github.com/tensorflow/tensorflow/issues/19643
[3] https://github.com/tensorflow/tensorflow/issues/23873
In the end the author tracked down the cause and summarized it here:
[4] https://pgaleone.eu/tensorflow/keras/2019/01/19/keras-not-yet-interface-to-tensorflow/
Following that hint, we made the following attempts:
We switched back to training with model.fit(). In the first few epochs the good news was that training accuracy started to climb slowly, but validation accuracy still showed the original problem. Moreover, fetching intermediate results with model.predict_on_batch() showed that the outputs within a batch were still all identical.
Epoch 1/20
581/581 [==============================] - 162s 279ms/step - loss: 0.6768 - sparse_categorical_accuracy: 0.6224 - val_loss: 0.6981 - val_sparse_categorical_accuracy: 0.4984
Epoch 2/20
581/581 [==============================] - 133s 228ms/step - loss: 0.4847 - sparse_categorical_accuracy: 0.7684 - val_loss: 0.6931 - val_sparse_categorical_accuracy: 0.5016
Epoch 3/20
581/581 [==============================] - 130s 223ms/step - loss: 0.3905 - sparse_categorical_accuracy: 0.8250 - val_loss: 0.6996 - val_sparse_categorical_accuracy: 0.4984
Epoch 4/20
581/581 [==============================] - 131s 225ms/step - loss: 0.3113 - sparse_categorical_accuracy: 0.8660 - val_loss: 0.6935 - val_sparse_categorical_accuracy: 0.5016
However, as training went on, things turned around and started to look normal (with the tf.function version nothing changes no matter how long you train; fortunately we did not give up). (Follow-up: there is actually still a problem here, keep reading. Even at the time it felt odd that convergence was this slow.)
Epoch 18/20
581/581 [==============================] - 131s 226ms/step - loss: 0.0731 - sparse_categorical_accuracy: 0.9725 - val_loss: 1.4896 - val_sparse_categorical_accuracy: 0.8703
Epoch 19/20
581/581 [==============================] - 130s 225ms/step - loss: 0.0664 - sparse_categorical_accuracy: 0.9748 - val_loss: 0.6890 - val_sparse_categorical_accuracy: 0.9016
Epoch 20/20
581/581 [==============================] - 126s 217ms/step - loss: 0.0631 - sparse_categorical_accuracy: 0.9768 - val_loss: 1.0290 - val_sparse_categorical_accuracy: 0.9031
The results obtained from model.predict_on_batch() were consistent with this accuracy as well.
The previous experiment confirmed that training purely through the Keras API works. But what is the deeper cause? Could it be that BatchNormalization is not updating its moving mean and moving variance? The answer is yes.
We printed the moving mean and moving variance before and after training with each of the two methods:
def get_bn_vars(collection):
    moving_mean, moving_variance = None, None
    for var in collection:
        name = var.name.lower()
        if "variance" in name:
            moving_variance = var
        if "mean" in name:
            moving_mean = var
        if moving_mean is not None and moving_variance is not None:
            return moving_mean, moving_variance
    raise ValueError("Unable to find moving mean and variance")

mean, variance = get_bn_vars(model.variables)
print(mean)
print(variance)
We found that with model.fit() the mean and variance are indeed updated (although the update rate looks a bit odd), whereas with the tf.function-style loop the two values are never updated.
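As a usage example, the before/after comparison can be sketched like this, reusing the get_bn_vars() helper above (the step count and variable names here are ours):

# Snapshot the moving mean, run a handful of custom training steps, then compare.
mean_before = get_bn_vars(model.variables)[0].numpy().copy()

for step, (images, labels) in enumerate(train_batches):
    train_cls_step(images, labels)
    if step >= 10:
        break

mean_after = get_bn_vars(model.variables)[0].numpy().copy()

# Non-zero if the moving statistics are being updated; zero with the tf.function loop above.
print(abs(mean_after - mean_before).max())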
That also explains why fine-tuning does not show the problem: the mean and variance learned on ImageNet are already good values, so the model works even if they are never updated.
So, would building a model with a dynamic input shape, as described in [4], solve the problem?
class MyModel(Model):
    def __init__(self):
        super(MyModel, self).__init__()
        self.conv1 = Conv2D(32, 3, activation='relu')
        self.batch_norm1 = BatchNormalization()
        self.flatten = Flatten()
        self.d1 = Dense(128, activation='relu')
        self.d2 = Dense(10, activation='softmax')

    def call(self, x):
        x = self.conv1(x)
        x = self.batch_norm1(x)
        x = self.flatten(x)
        x = self.d1(x)
        return self.d2(x)

model = MyModel()
#model.build((None,28,28,1))
model.summary()

@tf.function
def train_step(image, label):
    with tf.GradientTape() as tape:
        predictions = model(image)
        loss = loss_object(label, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    train_loss(loss)
    train_accuracy(label, predictions)
The model looks like this:
Model: "my_model" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= conv2d (Conv2D) multiple 320 _________________________________________________________________ batch_normalization_v2 (Batc multiple 128 _________________________________________________________________ flatten (Flatten) multiple 0 _________________________________________________________________ dense (Dense) multiple 2769024 _________________________________________________________________ dense_1 (Dense) multiple 1290 ================================================================= Total params: 2,770,762 Trainable params: 2,770,698 Non-trainable params: 64
Judging from the Output Shape column, the model is built correctly.
We ran it on MNIST and the results were quite good too!
Just to be safe, we again checked whether the mean and variance were being updated; surprisingly, they were not!
In other words, the fix described in [4] does not actually work in our case.
Since we had localized the problem to BatchNormalization, we recalled that BatchNormalization behaves differently during training and testing: at test time the moving mean and variance must not be updated. Could it be that the tf.function-style training loop does not switch this state automatically?
Looking at the source code, we found that BatchNormalization's call() takes a training argument, and that it defaults to False:
Call arguments:
  inputs: Input tensor (of any rank).
  training: Python boolean indicating whether the layer should behave in
    training mode or in inference mode.
    - `training=True`: The layer will normalize its inputs using the
      mean and variance of the current batch of inputs.
    - `training=False`: The layer will normalize its inputs using the
      mean and variance of its moving statistics, learned during training.
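To see what the flag does in isolation, here is a small standalone experiment (our own illustration, not from the tutorial): call a fresh BatchNormalization layer on the same batch with training=False and then training=True, and watch its moving mean.

# Minimal demonstration of the training flag on a standalone BatchNormalization layer.
bn = tf.keras.layers.BatchNormalization()
x = tf.random.normal((32, 4)) + 5.0     # a batch whose mean is roughly 5

_ = bn(x, training=False)
print(bn.moving_mean.numpy())           # unchanged: still the initial zeros

_ = bn(x, training=True)
print(bn.moving_mean.numpy())           # takes a small step toward the batch mean,
                                        # scaled by (1 - momentum), 0.01 by default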
So we made the following change:
class MyModel(Model):
    def __init__(self):
        super(MyModel, self).__init__()
        self.conv1 = Conv2D(32, 3, activation='relu')
        self.batch_norm1 = BatchNormalization()
        self.flatten = Flatten()
        self.d1 = Dense(128, activation='relu')
        self.d2 = Dense(10, activation='softmax')

    def call(self, x, training=True):
        x = self.conv1(x)
        x = self.batch_norm1(x, training=training)
        x = self.flatten(x)
        x = self.d1(x)
        return self.d2(x)

model = MyModel()
#model.build((None,28,28,1))
model.summary()

@tf.function
def train_step(image, label):
    with tf.GradientTape() as tape:
        predictions = model(image, training=True)
        loss = loss_object(label, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    train_loss(loss)
    train_accuracy(label, predictions)

@tf.function
def test_step(image, label):
    predictions = model(image, training=False)
    t_loss = loss_object(label, predictions)
    test_loss(t_loss)
    test_accuracy(label, predictions)
The result: the moving mean and variance started to update, and test accuracy behaved as expected.
So we can pin down the root cause: BatchNormalization has to be told whether it is in training or in testing mode!
The approach in 3.4 solved our problem, but it relies on subclassing Model, whereas our earlier MobileNetV2 model was built with the more flexible Keras Functional API. Since we cannot control the definition of call(), there is no way to switch flexibly between the training and testing states; the same applies to models built with Sequential.
[5]https://blog.keras.io/keras-as-a-simplified-interface-to-tensorflow-tutorial.html
[7]https://github.com/keras-team/keras/issues/7085
[8]https://github.com/keras-team/keras/issues/6752
From [5] and [8] we learned about two possible approaches,
so the first thing we tried was:
tf.keras.backend.set_learning_phase(True)
With that, the model built from MobileNetV2 also started to work properly.
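In concrete terms, the only change to the earlier custom loop is to set the phase before training starts (a sketch using the names defined above):

# Tell Keras we are in the training phase before running the custom loop.
tf.keras.backend.set_learning_phase(True)

for images, labels in train_batches:
    train_cls_step(images, labels)

# Before evaluating or exporting, switch back to the inference phase:
# tf.keras.backend.set_learning_phase(False)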
Moreover, it seemed to converge much faster than model.fit(). Remembering the earlier puzzle of model.fit() converging slowly, we ran one more experiment and added the same line to the model.fit() version; it also converged much faster there, and a single epoch was enough to get a decent result!
Which raises another question: does model.fit() actually set the learning_phase state? And if not, how does it update the moving mean and variance?
As for the second approach: the tutorial describes how to do this in the 1.x style, and under eager execution there seems to be no way to run those assign operations, so it is listed here for reference only.
update_ops = []
for assign_op in model.updates:
    update_ops.append(assign_op)
# But how should these update_ops be handled under eager execution?
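For context, in 1.x graph mode (the setting described in [5]) those assign operations are typically tied to the train op, roughly like this (a sketch of the classic pattern, not TF 2.0 code; loss is assumed to be a graph tensor):

# Classic TF 1.x graph-mode pattern: make the train op depend on the BN update ops,
# so every session run of train_op also applies the moving mean/variance updates.
with tf.control_dependencies(model.updates):
    train_op = tf.compat.v1.train.RMSPropOptimizer(base_learning_rate).minimize(loss)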
To sum up: [4] gave us the hint that eventually led to a solution, but its diagnosis and fix turned out not to apply to our case. The crux of the matter is how, in Keras + TensorFlow 2.0, to handle layers whose behavior differs between training and testing, and what the difference is between training with model.fit() and with tf.function; in the end, model.fit() seems to hide quite a lot of puzzling behavior.
Our final recommendations: when training a subclassed model with a custom tf.function loop, thread a training argument through call() and pass it on to BatchNormalization (and any other layer whose behavior differs between training and inference); for Functional or Sequential models trained with a custom loop, set the learning phase explicitly with tf.keras.backend.set_learning_phase().
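In code form, the two recommendations boil down to this brief sketch (both patterns appeared above):

# 1) Subclassed models: thread the flag through call() ...
#        def call(self, x, training=True):
#            x = self.batch_norm1(x, training=training)
#    ... and call model(images, training=True) / model(images, training=False)
#    explicitly in the train and test steps.

# 2) Functional / Sequential models trained with a custom tf.function loop:
#    set the global learning phase explicitly.
tf.keras.backend.set_learning_phase(True)    # before training
# ... custom training loop ...
tf.keras.backend.set_learning_phase(False)   # before evaluation / inference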
Finally, why do the TF 2.0 tutorials not mention any of this? Do they assume you are already a Keras expert? [facepalm]