Machine Learning: Artificial Neural Networks

By imitating living things in nature, humans have invented many things. The airplane, for example, was inspired by birds' wings, yet such inventions usually end up differing somewhat from their originals. Artificial neural networks (ANNs) are likewise inspired by the networks of neurons in animal brains.

ANNs are the basic building blocks of Deep Learning, and they have many uses:

ANNs are at the very core of Deep Learning. They are versatile, powerful, and scalable, making them ideal to tackle large and highly complex Machine Learning tasks, such as classifying billions of images (e.g., Google Images), powering speech recognition services (e.g., Apple's Siri), recommending the best videos to watch to hundreds of millions of users every day (e.g., YouTube), or learning to beat the world champion at the game of Go by examining millions of past games and then playing against itself (DeepMind's AlphaGo).

From Biological to Artificial Neurons

ANNs have a long history, which we won't retrace here. Today ANNs are enjoying a renaissance, for several reasons:

  • There is now a huge quantity of high-quality data available to train neural networks, and ANNs shine on large and complex problems
  • The tremendous increase in computing power, for example powerful GPUs, makes it feasible to train large neural networks
  • The training algorithms have been improved, bringing substantial performance gains
  • Some theoretical limitations of ANNs have turned out to be benign in practice; across a great many experiments ANNs perform well and come remarkably close to the global optimum
  • Money is flowing into ANNs, which further fuels their progress

Biological Neurons

ANNs were originally inspired by biological neurons, so let's first take a quick look at how a biological neuron is put together:

A neuron consists of a cell body, dendrites, an axon, and axon terminals:

  • The cell body processes the signals coming in from the dendrites; in an ANN it plays the role of the activation function, which decides the neuron's output
  • The dendrites connect to other neurons; in an ANN they correspond to each neuron's connections to its inputs
  • The axon is the neuron's output

A network of biological neurons:

A deep neural network:

Logical Computations with Neurons

If we restrict a neuron's inputs and outputs to binary values (on or off), we can use ANNs to perform logical computations. Here is an example (a small code sketch follows the list):

  • The leftmost network is the identity function: the state of A is passed directly to C
  • The second implements logical AND: C is activated only when both A and B are activated
  • The third implements logical OR: C is activated when either A or B is activated
  • The fourth activates C only when A is activated and B is not
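As a minimal sketch (the 0/1 encoding and the specific thresholds are my own illustrative choices), these four gates can be written as threshold neurons:

import numpy as np

def heaviside(z):
    # step activation: 1 if z >= 0, else 0
    return (np.asarray(z) >= 0).astype(int)

def identity(a):        return heaviside(a - 0.5)      # C = A
def logical_and(a, b):  return heaviside(a + b - 1.5)  # C = A AND B
def logical_or(a, b):   return heaviside(a + b - 0.5)  # C = A OR B
def a_and_not_b(a, b):  return heaviside(a - b - 0.5)  # C = A AND (NOT B)

a = np.array([0, 0, 1, 1])
b = np.array([0, 1, 0, 1])
print(logical_and(a, b))   # [0 0 0 1]
print(logical_or(a, b))    # [0 1 1 1]
print(a_and_not_b(a, b))   # [0 0 1 0]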

The Perceptron

The Perceptron is one of the simplest ANN architectures. Its neurons are called linear threshold units (LTUs), the name we will use from here on; their inputs and outputs are numbers. First, a diagram:

Each input is connected to the neuron through a weight w. The neuron here is the LTU mentioned above, and it computes a weighted sum of its inputs:
\[ z = w_1x_1 + w_2x_2 + \cdots + w_nx_n = \vec{w}^T \cdot \vec{x} \]
It then applies a step function to that sum to compute the output:
\[ h_w(\vec{x}) = step(z) = step(\vec{w}^T \cdot \vec{x}) \]
Two step functions are commonly used:
\[ heaviside(z) = \begin{cases} 0 & \text{if $z < 0$} \\ 1 & \text{if $z \geq 0$} \end{cases} \]

\[ sgn(z) = \begin{cases} -1 & \text{if $z < 0$} \\ 0 & \text{if $z = 0$} \\ 1 & \text{if $z > 0$} \end{cases} \]

As the figure shows, a single LTU can perform simple linear binary classification: it computes a weighted sum of the inputs, compares it to a threshold, and outputs the corresponding class.
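As a minimal sketch (the weights and the instance below are arbitrary, chosen only for illustration), a single LTU classifying one instance could look like this:

import numpy as np

def ltu_predict(x, w, threshold=0.0):
    # weighted sum followed by a hard threshold (heaviside step)
    z = np.dot(w, x)
    return 1 if z >= threshold else 0

w = np.array([0.3, -0.2])  # illustrative weights
x = np.array([2.0, 0.5])   # one instance
print(ltu_predict(x, w))   # 1, since z = 0.5 >= 0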

A Perceptron is simply a single layer of LTUs. We already know that an LTU is a single neural unit; put several LTUs in one layer as the output layer, add an input layer, and you have a Perceptron. The input layer also carries an extra bias neuron, as shown below:

So how is a Perceptron trained? As the figure suggests, training a Perceptron means learning the connection weights w. Naturally, the larger the gap between an LTU's prediction and the target value, the larger the step by which w is updated, which gives the following rule (a small numpy sketch follows the symbol list below):
\[ w_{i,j}^{(next \ step)} = w_{i,j} + \eta(y_j - \hat{y_j})x_i \]

  • \(w_{i, j}\) is the weight between the i-th input neuron and the j-th output neuron
  • \(x_i\) is the i-th input value of the current training instance
  • \(\hat{y_j}\) is the output of the j-th output neuron for the current training instance
  • \(y_j\) is the target output of the j-th output neuron for the current training instance
  • \(\eta\) is the learning rate
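As a minimal sketch of this rule (the toy dataset, the explicit bias handling, and the hyperparameters are my own illustrative choices), applied with a single output neuron:

import numpy as np

def train_perceptron(X, y, eta=0.1, n_epochs=10):
    # X: (m, n) inputs, y: (m,) targets in {0, 1}; one output neuron
    w = np.zeros(X.shape[1])
    b = 0.0                                 # bias, provided by the bias neuron
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            y_hat = 1 if np.dot(w, x_i) + b >= 0 else 0
            w += eta * (y_i - y_hat) * x_i  # step proportional to the error
            b += eta * (y_i - y_hat)
    return w, b

# linearly separable toy data: logical AND
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print([int(np.dot(w, x) + b >= 0) for x in X])  # [0, 0, 0, 1]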

Because the decision boundary of each output neuron is linear, Perceptrons are incapable of learning complex patterns (just like Logistic Regression classifiers). If the training instances are linearly separable, however, the algorithm converges to a solution.

Let's implement an iris-classification example in code:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()
X = iris.data[:, (2, 3)]  # petal length, petal width
y = (iris.target == 0).astype(int)  # 1 if Iris Setosa, else 0

per_clf = Perceptron(max_iter=100, tol=-np.infty, random_state=42)  # tol=-inf disables tolerance-based stopping
per_clf.fit(X, y)

y_pred = per_clf.predict([[2, 0.5]])

"""
array([1])
"""

In fact, the Perceptron learning algorithm strongly resembles Stochastic Gradient Descent. Scikit-Learn's Perceptron class is equivalent to using an SGDClassifier with loss="perceptron", learning_rate="constant", eta0=1, and penalty=None.
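A quick sketch of that equivalence (reusing the X and y from the iris example above; the prediction is expected to match per_clf's):

from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(loss="perceptron", learning_rate="constant",
                        eta0=1, penalty=None, max_iter=100, random_state=42)
sgd_clf.fit(X, y)
print(sgd_clf.predict([[2, 0.5]]))  # expected: array([1]), same as per_clf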

Compared with Logistic Regression, the Perceptron has the drawback that it does not output class probabilities; its predictions rest on a hard threshold. This is one reason to prefer a Logistic Regression classifier.

If we extend the Perceptron to a multi-layer structure, it can do things a single layer cannot, such as implementing XOR logic:

def heaviside(z):
    return (z >= 0).astype(z.dtype)

def mlp_xor(x1, x2, activation=heaviside):
    return activation(-activation(x1 + x2 - 1.5) + activation(x1 + x2 - 0.5) - 0.5)
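Evaluating all four input combinations (the inputs are float arrays so that the heaviside cast keeps their dtype):

x1 = np.array([0., 0., 1., 1.])
x2 = np.array([0., 1., 0., 1.])
print(mlp_xor(x1, x2))  # [0. 1. 1. 0.], i.e. XOR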

The code above follows the network shown in the figure below:

The code uses the heaviside activation function, but other activation functions such as the sigmoid can be used as well; their decision boundaries are compared below:

Multi-Layer Perceptron and Backpropagation

An MLP consists of one input layer, one or more layers of LTUs (the hidden layers), and one output layer. Every layer except the output layer includes a bias neuron. An ANN with two or more hidden layers is called a deep neural network (DNN).

The figure below shows an MLP with one hidden layer:

For a long time the hard part about MLPs was how to train them. The algorithm used today is backpropagation. Before introducing it, let's briefly look at TensorFlow's reverse-mode autodiff.

TensorFlow first computes the value of every node in a forward pass, then applies the chain rule in a reverse pass to compute the derivative with respect to each node. One diagram makes the process easy to follow:

One detail is worth spelling out. Take \(n_5\) as an example: how is \(\partial{n_7}/\partial{n_5}\) computed? Since \(n_7 = n_5 + n_6\), the partial derivative \(\partial{n_7}/\partial{n_5}\) is simply 1, and the same reasoning applies at every node. ==Computing all the partial derivatives requires two traversals of the graph: a forward pass that computes every node's value, then a reverse pass that computes each node's partial derivative.==
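As a small sketch of the idea (using \(f(x, y) = x^2 y + y + 2\), the example graph from the book's appendix), tf.gradients performs exactly this forward-then-reverse computation:

import tensorflow as tf

x = tf.Variable(3.0, name="x")
y = tf.Variable(4.0, name="y")
f = x * x * y + y + 2            # the forward pass builds the graph node by node

grads = tf.gradients(f, [x, y])  # reverse pass: chain rule from f back to x and y

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(grads))       # [24.0, 10.0]: df/dx = 2xy, df/dy = x^2 + 1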

The crux of training is correcting the weights W, and that is where the partial derivatives computed above come in. The figure below shows the basic principle and flow of backpropagation:

In the figure, the network's final error \(\delta\) is computed first, after the forward pass; the weights w on the connections are then used to compute the error \(\delta\) of each node in the previous layer. In this way we obtain an error \(\delta\) for every node.

As the figure makes plain, a node's gradient and its error together yield the corrected w. The activation functions most commonly used here are the three below (a numpy sketch follows the list):

  • logistic function \(\sigma(z)=1/(1 + e^{-z})\)
  • hyperbolic tangent function \(\tanh(z) = 2\sigma(2z) - 1\)
  • ReLU function \(\mathrm{ReLU}(z) = \max(0, z)\)
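A minimal sketch of these functions and of the derivatives used during the reverse pass:

import numpy as np

def logistic(z):
    return 1 / (1 + np.exp(-z))

def logistic_deriv(z):
    s = logistic(z)
    return s * (1 - s)

def tanh(z):
    return 2 * logistic(2 * z) - 1   # identical to np.tanh(z)

def tanh_deriv(z):
    return 1 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0, z)

def relu_deriv(z):
    return (z > 0).astype(z.dtype)

z = np.linspace(-3., 3., 7)
print(relu(z))  # [0. 0. 0. 0. 1. 2. 3.]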

The charts below compare their curves and the curves of their derivatives:

MLPs are often used for classification, with each output neuron corresponding to one class. The output layer therefore needs special treatment: for a classification task we can drop the activation functions of the output neurons and add a shared softmax function on top, whose probabilities can then drive custom policies such as thresholds.
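A minimal numpy sketch of softmax (subtracting the max is a standard numerical-stability trick, an addition of mine):

import numpy as np

def softmax(logits):
    # shift by the max for numerical stability; the result is unchanged
    exps = np.exp(logits - np.max(logits, axis=-1, keepdims=True))
    return exps / np.sum(exps, axis=-1, keepdims=True)

print(softmax(np.array([2.0, 1.0, 0.1])))  # approx. [0.66 0.24 0.10]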

Training an MLP with TensorFlow’s High-Level API

Implementing a DNN with TensorFlow's high-level API is very simple. Let's walk through the code:

Load the data:

import numpy as np
import tensorflow as tf


(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
X_train = X_train.astype(np.float32).reshape(-1, 28 * 28) / 255.0
X_test = X_test.astype(np.float32).reshape(-1, 28 * 28) / 255.0
y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)
X_valid, X_train = X_train[:5000], X_train[5000:]
y_valid, y_train = y_train[:5000], y_train[5000:]

The core code, training the model:

feature_cols = [tf.feature_column.numeric_column("X", shape=[28 * 28])]
dnn_clf = tf.estimator.DNNClassifier(hidden_units=[300, 100], n_classes=10, feature_columns=feature_cols)

input_fn = tf.estimator.inputs.numpy_input_fn(x={"X": X_train}, y=y_train, num_epochs=40, batch_size=50, shuffle=True)
dnn_clf.train(input_fn=input_fn)

TensorFlow provides an evaluation method:

valid_input_fn = tf.estimator.inputs.numpy_input_fn(x={"X": X_valid}, y=y_valid, shuffle=False)
eval_results = dnn_clf.evaluate(input_fn=valid_input_fn)

"""
{'accuracy': 0.982,
 'average_loss': 0.09565534,
 'loss': 11.956917,
 'global_step': 44000}
"""

Evaluate the test set:

test_input_fn = tf.estimator.inputs.numpy_input_fn(x={"X": X_test}, y=y_test, shuffle=False)
y_pred_iter = dnn_clf.predict(input_fn=test_input_fn)
y_pred = list(y_pred_iter)
y_pred[0]

"""
{'logits': array([ -7.4078193,   2.8550382,   1.8491653,   6.945245 ,  -5.996856 ,
         -0.6053193, -10.372234 ,  22.27766  ,  -4.490141 ,   2.99006  ],
       dtype=float32),
 'probabilities': array([1.2816213e-13, 3.6716565e-09, 1.3428177e-09, 2.1938979e-07,
        5.2545213e-13, 1.1535802e-10, 6.6119684e-15, 9.9999976e-01,
        2.3707813e-12, 4.2024433e-09], dtype=float32),
 'class_ids': array([7]),
 'classes': array([b'7'], dtype=object),
 'all_class_ids': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32),
 'all_classes': array([b'0', b'1', b'2', b'3', b'4', b'5', b'6', b'7', b'8', b'9'],
       dtype=object)}
"""

As the last code block shows, the prediction includes a probability for each class.

Training a DNN Using Plain TensorFlow

For finer-grained control over the network, we need TensorFlow's lower-level API.

Construction Phase

First we define some properties: the dimensionality of the input layer, the number of neurons in each hidden layer, and the number of neurons in the output layer. X and y stand for the input data and the target values:

n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

reset_graph()

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")
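Note that reset_graph() above is not a TensorFlow function; it is a small helper used throughout the book's notebooks, roughly:

def reset_graph(seed=42):
    # clear the default graph and make runs reproducible
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)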

Next we write a function that builds one layer of the network, which may be a hidden layer or the output layer. The two are largely alike; the only difference is whether an activation function is applied:

def neuron_layer(X, n_neurons, name, activation=None):
    with tf.name_scope(name):
        n_inputs = int(X.get_shape()[1])
        stddev = 2 / np.sqrt(n_inputs)
        init = tf.truncated_normal((n_inputs, n_neurons), stddev=stddev)
        W = tf.Variable(init, name="kernel")
        b = tf.Variable(tf.zeros([n_neurons]), name="bias")
        Z = tf.matmul(X, W) + b
        if activation is not None:
            return activation(Z)
        else:
            return Z

A detailed walkthrough of the function:

  • name wraps the layer in its own name scope, which makes the graph structure much clearer in TensorBoard
  • n_inputs is the number of input features seen by this layer
  • The next step builds the weight matrix between this layer and the previous one; its shape (n_inputs, n_neurons) holds the weight of every connection from a neuron in the previous layer to a neuron in this layer, initialized randomly from a truncated Gaussian (normal) distribution
  • b adds one bias term for each neuron in the layer
  • Z computes the layer's outputs as a matrix, with one row per instance holding the values of this layer's neurons
  • If an activation function was specified, it is applied to the result; otherwise the result is returned directly

Using this function, define the network model:

with tf.name_scope("dnn"):
    hidden1 = neuron_layer(X, n_hidden1, name="hidden1", activation=tf.nn.relu)
    hidden2 = neuron_layer(hidden1, n_hidden2, name="hidden2", activation=tf.nn.relu)
    logits = neuron_layer(hidden2, n_outputs, name="outputs")

Next we define the cost function, using cross entropy to measure the error:

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

Define the training step:

learning_rate = 0.01

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

Define the evaluation step:

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

Initialize the variables:

init = tf.global_variables_initializer()
saver = tf.train.Saver()

Execution Phase

The algorithm uses Mini-batch Gradient Descent, so we first define a function that yields shuffled batches:

n_epochs = 40
batch_size = 50

def shuffle_batch(X, y, batch_size):
    rnd_idx = np.random.permutation(len(X))
    n_batches = len(X) // batch_size
    for batch_idx in np.array_split(rnd_idx, n_batches):
        X_batch, y_batch = X[batch_idx], y[batch_idx]
        yield X_batch, y_batch

Run the training:

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_batch = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Batch accuracy:", acc_batch, "Val accuracy:", acc_val)

    save_path = saver.save(sess, "./my_model_final.ckpt")

The training output looks like this:

0 Batch accuracy: 0.9 Val accuracy: 0.9146
1 Batch accuracy: 0.92 Val accuracy: 0.936
2 Batch accuracy: 0.96 Val accuracy: 0.945
...
37 Batch accuracy: 1.0 Val accuracy: 0.9776
38 Batch accuracy: 1.0 Val accuracy: 0.9792
39 Batch accuracy: 1.0 Val accuracy: 0.9776

Using the Neural Network

with tf.Session() as sess:
    saver.restore(sess, "./my_model_final.ckpt") # or better, use save_path
    X_new_scaled = X_test[:20]
    Z = logits.eval(feed_dict={X: X_new_scaled})
    y_pred = np.argmax(Z, axis=1)

print("Predicted classes:", y_pred)
print("Actual classes:   ", y_test[:20])

"""
Predicted classes: [7 2 1 0 4 1 4 9 5 9 0 6 9 0 1 5 9 7 3 4]
Actual classes:    [7 2 1 0 4 1 4 9 5 9 0 6 9 0 1 5 9 7 3 4]
"""

The neuron_layer above is our own function for building network layers. In practice you can use TensorFlow's dense() function instead, with the same effect; only a small part of the code changes:

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1", activation=tf.nn.relu)
    hidden2 = tf.layers.dense(hidden1, n_hidden2, name="hidden2", activation=tf.nn.relu)
    logits = tf.layers.dense(hidden2, n_outputs, name="outputs")

Fine-Tuning Neural Network Hyperparameters

The flexibility of neural networks is also their main drawback: there are simply too many hyperparameters to tweak, for example the network topology, the number of layers, the number of neurons per layer, the activation function of each layer, the weight initialization, and many more. So how do we find the best combination?

We could of course use grid search, but it takes a great deal of time and can only explore a small part of the space. Randomized search is a much better option. Another tool is Oscar, which offers more sophisticated algorithms to find good hyperparameters quickly.

Number of Hidden Layers

For most problems you can start with a single hidden layer and get reasonable results. If the problem is complex, that layer may need a very large number of neurons; in that case, spreading the neurons across several layers tends to yield better performance.

==Neural networks use multi-layer architectures because those architectures bring many benefits; above all, real-world data itself is often hierarchical.==

Take handwritten-digit recognition as an example.

The first hidden layer maps the digit onto very simple shapes, say a horizontal stroke or a vertical stroke. The second hidden layer combines the simple shapes from the previous layer into more complex ones, such as a circle or a longer bar. The deeper the layer, the more complex the shapes it can describe.

orientations, line segments -> rectangles, circles -> more complex shapes

==This layered architecture also lets neural networks adapt easily to new datasets: one network can serve as the hidden layers of another, which saves a great deal of training time.==

In short, one or two hidden layers are enough to get satisfactory results on most problems. For more complex problems, gradually increase the number of hidden layers until the model starts to overfit the training set. Truly complex problems call for many more hidden layers and much more training data.

Number of Neurons per Hidden Layer

The sizes of the input and output layers are dictated by the task at hand. As for the number of neurons in the hidden layers, a traditional approach is the funnel pattern: fewer and fewer neurons at each successive layer, so the network tapers like a funnel. The rationale is what we said above: the deeper the layer, the more complex the structures it describes, and the fewer of them are needed.

These days, however, the common practice is to give every hidden layer the same number of neurons and treat that number as one tunable hyperparameter, which greatly simplifies the search.

A simpler approach still is to choose a model with more layers and more neurons per layer than you actually need, and then use early stopping to halt training.

Activation Functions

The choice of activation function bears on whether gradients saturate; in most cases ReLU is fine for the hidden layers. For classification tasks the output layer generally uses the softmax activation function; for regression tasks the output layer uses no activation function at all.

Exercises

1. Draw an ANN using the original artificial neurons (like the ones in Figure 10-3) that computes A ⊕ B (where ⊕ represents the XOR operation). Hint: A ⊕ B = (A ∧ ¬B) ∨ (¬A ∧ B).

2. Why is it generally preferable to use a Logistic Regression classifier rather than a classical Perceptron (i.e., a single layer of linear threshold units trained using the Perceptron training algorithm)? How can you tweak a Perceptron to make it equivalent to a Logistic Regression classifier?

A classical Perceptron will converge only if the dataset is linearly separable, and it won't be able to estimate class probabilities. In contrast, a Logistic Regression classifier will converge to a good solution even if the dataset is not linearly separable, and it will output class probabilities. If you change the Perceptron's activation function to the logistic activation function (or the softmax activation function if there are multiple neurons), and if you train it using Gradient Descent (or some other optimization algorithm minimizing the cost function, typically cross entropy), then it becomes equivalent to a Logistic Regression classifier.

3. Why was the logistic activation function a key ingredient in training the first MLPs?

The logistic activation function was a key ingredient in training the first MLPs because its derivative is always nonzero, so Gradient Descent can always roll down the slope. When the activation function is a step function, Gradient Descent cannot move, as there is no slope at all.

4. Name three popular activation functions. Can you draw them?

The step function, the logistic function, the hyperbolic tangent, and the rectified linear unit (see Figure 10-8). See Chapter 11 for other examples, such as ELU and variants of the ReLU.

5. Suppose you have an MLP composed of one input layer with 10 passthrough neurons, followed by one hidden layer with 50 artificial neurons, and finally one output layer with 3 artificial neurons. All artificial neurons use the ReLU activation function. (A sketch of the answers follows the question list.)

  • What is the shape of the input matrix X?

  • What about the shape of the hidden layer's weight vector \(W_h\), and the shape of its bias vector \(b_h\)?

  • What is the shape of the output layer’s weight vector \(W_o\), and its bias vector \(b_o\)?

  • What is the shape of the network’s output matrix Y?

  • Write the equation that computes the network’s output matrix Y as a function of X, \(W_h\), \(b_h\), \(W_o\) and \(b_o\).
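For reference, a sketch of the answers (assuming a training batch of m instances): X has shape (m, 10); \(W_h\) has shape (10, 50) and \(b_h\) has shape (50); \(W_o\) has shape (50, 3) and \(b_o\) has shape (3); Y has shape (m, 3); and \(Y = \mathrm{ReLU}(\mathrm{ReLU}(X W_h + b_h) W_o + b_o)\), with each bias vector broadcast across the rows.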

6. How many neurons do you need in the output layer if you want to classify email into spam or ham? What activation function should you use in the output layer? If instead you want to tackle MNIST, how many neurons do you need in the output layer, using what activation function? Answer the same questions for getting your network to predict housing prices as in Chapter 2.

To classify email into spam or ham, you just need one neuron in the output layer of a neural network—for example, indicating the probability that the email is spam. You would typically use the logistic activation function in the output layer when estimating a probability. If instead you want to tackle MNIST, you need 10 neurons in the output layer, and you must replace the logistic function with the softmax activation function, which can handle multiple classes, outputting one probability per class. Now, if you want your neural network to predict housing prices like in Chapter 2, then you need one output neuron, using no activation function at all in the output layer.

7. What is backpropagation and how does it work? What is the difference between backpropagation and reverse-mode autodiff?

Backpropagation is a technique used to train artificial neural networks. It first computes the gradients of the cost function with regards to every model parameter (all the weights and biases), and then it performs a Gradient Descent step using these gradients. This backpropagation step is typically performed thousands or millions of times, using many training batches, until the model parameters converge to values that (hopefully) minimize the cost function. To compute the gradients, backpropagation uses reverse-mode autodiff (although it wasn't called that when backpropagation was invented, and it has been reinvented several times). Reverse-mode autodiff performs a forward pass through a computation graph, computing every node's value for the current training batch, and then it performs a reverse pass, computing all the gradients at once (see Appendix D for more details). So what's the difference? Well, backpropagation refers to the whole process of training an artificial neural network using multiple backpropagation steps, each of which computes gradients and uses them to perform a Gradient Descent step. In contrast, reverse-mode autodiff is simply a technique to compute gradients efficiently, and it happens to be used by backpropagation.

8. Can you list all the hyperparameters you can tweak in an MLP? If the MLP overfits the training data, how could you tweak these hyperparameters to try to solve the problem?

Here is a list of all the hyperparameters you can tweak in a basic MLP: the number of hidden layers, the number of neurons in each hidden layer, and the activation function used in each hidden layer and in the output layer. In general, the ReLU activation function (or one of its variants; see Chapter 11) is a good default for the hidden layers. For the output layer, in general you will want the logistic activation function for binary classification, the softmax activation function for multiclass classification, or no activation function for regression.

If the MLP overfits the training data, you can try reducing the number of hidden layers and reducing the number of neurons per hidden layer.

9. Train a deep MLP on the MNIST dataset and see if you can get over 98% precision. Just like in the last exercise of Chapter 9, try adding all the bells and whistles (i.e., save checkpoints, restore the last checkpoint in case of an interruption, add summaries, plot learning curves using TensorBoard, and so on).

Two points deserve attention in this exercise: how to save checkpoints of the model, and how to implement early stopping. The code follows:

import os
from datetime import datetime


reset_graph()

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")
with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1", activation=tf.nn.relu)
    hidden2 = tf.layers.dense(hidden1, n_hidden2, name="hidden2", activation=tf.nn.relu)
    logits = tf.layers.dense(hidden2, n_outputs, name="outputs")

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")
    loss_summary = tf.summary.scalar('log_loss', loss)
    
learning_rate = 0.01

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)
    
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
    accuracy_summary = tf.summary.scalar('accuracy', accuracy)
    
init = tf.global_variables_initializer()
saver = tf.train.Saver()

def log_dir(prefix=""):
    now = datetime.utcnow().strftime("%Y%m%d%H%M%S")
    root_logdir = "tf_logs"
    if prefix:
        prefix += "_"
    name = prefix + "run_" + now
    return "{}/{}/".format(root_logdir, name)

logdir = log_dir("mnist_dnn")

file_writer = tf.summary.FileWriter(logdir, tf.get_default_graph())

m, n = X_train.shape
n_epochs = 10001
batch_size = 50
n_batches = int(np.ceil(m / batch_size))

checkpoint_path = "/tmp/my_deep_mnist_model.ckpt"
checkpoint_epoch_path = checkpoint_path + ".epoch"
final_model_path = './my_deep_mnist_model'

best_loss = np.infty
epochs_without_progress = 0
max_epochs_without_progress = 50

def shuffle_batch(X, y, batch_size):
    rnd_idx = np.random.permutation(len(X))
    n_batches = len(X) // batch_size
    for batch_idx in np.array_split(rnd_idx, n_batches):
        X_batch, y_batch = X[batch_idx], y[batch_idx]
        yield X_batch, y_batch
        
with tf.Session() as sess:
    if os.path.isfile(checkpoint_epoch_path):
        # if the checkpoint file exists, restore the model and load the epoch number
        with open(checkpoint_epoch_path, "rb") as f:
            start_epoch = int(f.read())
        print("Training was interrupter. Continuing at epoch", start_epoch)
        saver.restore(sess, checkpoint_path)
    else:
        start_epoch = 0
        sess.run(init)
        
    for epoch in range(start_epoch, n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val, loss_val, accuracy_summary_str, loss_summary_str = sess.run([accuracy, loss, accuracy_summary, loss_summary], 
                                                                                  feed_dict={X: X_valid, y: y_valid})
        file_writer.add_summary(accuracy_summary_str, epoch)
        file_writer.add_summary(loss_summary_str, epoch)
        if epoch % 5 == 0:
            print("Epoch:", epoch, "\tValidation accuracy: {:.3f}%".format(accuracy_val * 100), "\tLoss: {:.5f}".format(loss_val))
            saver.save(sess, checkpoint_path)
            
            with open(checkpoint_epoch_path, "wb") as f:
                f.write(b"%d" % (epoch + 1))
            if loss_val < best_loss: 
                saver.save(sess, final_model_path)
                best_loss = loss_val 
            else: 
                epochs_without_progress += 5
                if epochs_without_progress > max_epochs_without_progress:
                    print("Early stopping")
                    break