[Machine Learning] Implementing XOR with a Neural Network

Date: 2019-11-13

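Note: In Andrew Ng's Machine Learning course, the first application of neural networks he presents is that a network with one hidden layer can solve the XOR problem, something a single-layer network (also called a perceptron) cannot do. I found this fascinating at the time and had meant to implement it myself ever since, but I only got started a week ago. I wrote the code by following the digit recognition assignment from the course; on the digit images supplied with the assignment it reaches about 95% recognition accuracy. But once I changed it into a network for XOR, I could not get correct results no matter what I tried. After some reading I found the cause: one parameter was set badly, the learning rate was too small, and there were too few iterations. In short, learning XOR was a major breakthrough for artificial neural networks as an algorithm, and implementing it was an important step in my own understanding of error back propagation (BP), so I am recording it here.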

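What is XOR

In digital logic, exclusive or is a logical operation on two operands, written XOR, EOR, or ⊕. Unlike ordinary OR, it is false when the two values are the same and true when they differ. The XOR truth table is as follows: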

XOR truth table (0 = false, 1 = true)
A	B	A XOR B
0	0	0
0	1	1
1	0	1
1	1	0
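The truth table can be reproduced in a few lines of Python. This is only an illustrative sketch; the helper name `xor` is made up for the example, and it relies on Python's built-in bitwise `^` operator.

```python
from itertools import product


def xor(a, b):
    """Return 1 when exactly one of the two bits is 1, otherwise 0."""
    return a ^ b


# Enumerate the four input pairs in the same order as the truth table.
truth_table = [(a, b, xor(a, b)) for a, b in product((0, 1), repeat=2)]
for a, b, out in truth_table:
    print(a, b, out)
```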

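It is said that in the early days of artificial neural networks (ANNs), the inability to train multi-layer networks (and therefore to learn XOR) caused a crisis for the field, and that only the arrival of the BP algorithm made it possible to train multi-layer networks with hidden layers. Learning XOR is therefore a milestone in the history of ANNs. XOR matters because, unlike other logical relations such as AND and OR, it is not linearly separable: no single straight line in the input plane separates the inputs that map to 1 from those that map to 0.

[Figure: linear separability of AND and OR compared with XOR]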
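A small brute-force search makes the separability claim concrete. The sketch below tries every combination of a bias and two weights on a coarse grid for a single threshold unit; the grid and the helper name `separable_on_grid` are assumptions chosen for illustration. A failed grid search is not a proof by itself, but for XOR the failure is guaranteed: the four constraints b <= 0, b + w1 > 0, b + w2 > 0, and b + w1 + w2 <= 0 cannot hold together, because adding the middle two gives b + w1 + w2 > -b >= 0.

```python
import itertools

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = {
    "AND": np.array([0, 0, 0, 1]),
    "OR": np.array([0, 1, 1, 1]),
    "XOR": np.array([0, 1, 1, 0]),
}


def separable_on_grid(y, grid):
    """True if some (b, w1, w2) on the grid gives step(b + w1*x1 + w2*x2) == y."""
    for b, w1, w2 in itertools.product(grid, repeat=3):
        pred = (b + w1 * X[:, 0] + w2 * X[:, 1] > 0).astype(int)
        if np.array_equal(pred, y):
            return True
    return False


grid = np.linspace(-2, 2, 9)  # assumed search grid: -2, -1.5, ..., 2
found = {name: separable_on_grid(y, grid) for name, y in targets.items()}
print(found)
```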

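In practice, the XOR gate (Exclusive-OR gate) is the logic gate that implements exclusive or in digital logic. The function it computes is addition modulo 2, so XOR gates can be used to implement binary addition in computers.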
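As a rough illustration of the link to binary addition, here is a half adder: XOR produces the sum bit and AND produces the carry bit. The function name `half_adder` is made up for this sketch.

```python
def half_adder(a, b):
    """Add two bits: XOR gives the sum bit, AND gives the carry bit."""
    return a ^ b, a & b


for a in (0, 1):
    for b in (0, 1):
        s, carry = half_adder(a, b)
        # The sum bit is addition modulo 2; carry and sum together encode a + b.
        assert s == (a + b) % 2
        assert 2 * carry + s == a + b
        print(a, b, "-> sum", s, "carry", carry)
```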

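Neural network structure for XOR

The Machine Learning course combines AND, NOR, and OR to implement XNOR, which is exactly the opposite of the XOR we want. We can therefore reuse the network structure from the course:

[Figure: the network structure used in the course]

Counting the input layer, our network has three layers, as the figure shows. The 1s in layer 1 and layer 2 are the bias units of those two layers, and the values on the connecting lines are the parameters that link adjacent layers.

[Figure: the three-layer network with bias units and parameters]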
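To make the AND / NOR / OR construction concrete, the sketch below wires a 2-2-1 sigmoid network with hand-picked weights instead of learned ones. The specific values (such as -30, 20, 20) are assumptions chosen so that each sigmoid saturates near 0 or 1; any sufficiently large weights with the same signs would do. XOR is then the complement of the XNOR output.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Hand-picked weights (assumed values, not learned); column 0 multiplies the bias unit.
theta1 = np.array([[-30.0, 20.0, 20.0],    # hidden unit 1: x1 AND x2
                   [10.0, -20.0, -20.0]])  # hidden unit 2: x1 NOR x2
theta2 = np.array([[-10.0, 20.0, 20.0]])   # output unit: OR of the two hidden units

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
a1 = np.hstack([np.ones((4, 1)), X])                        # add bias column, 4x3
a2 = np.hstack([np.ones((4, 1)), sigmoid(a1 @ theta1.T)])   # hidden layer plus bias, 4x3
xnor = sigmoid(a2 @ theta2.T).ravel()                       # one output per sample
xor = 1.0 - xnor                                            # XOR is the complement of XNOR
print(np.round(xnor, 3), np.round(xor, 3))
```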

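Input: there are four training samples, each with two features: (0, 0), (1, 0), (0, 1), (1, 1).
Ideal output: following the truth table above, 0 when the two features are equal and 1 when they differ.
Parameters: randomly initialized in the range (-1, 1).
For the basics of neural networks and the implementation of forward and back propagation, see the following two articles, which are very well written:

Machine Learning Open Course Notes (4): Neural Network: Representation

Machine Learning Open Course Notes (5): Neural Network: Learning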
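The setup above can be sketched as follows: two weight matrices with one extra column each for the bias unit, drawn uniformly from [-1, 1), and a vectorized forward pass over the four samples. The names `forward` and `rng` and the fixed seed are assumptions made for this example, not part of the listings further down.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, assumed here only for repeatability
epsilon = 1.0

# One weight matrix per pair of adjacent layers; the extra column is for the bias unit.
theta1 = rng.uniform(-epsilon, epsilon, size=(2, 3))  # input (2 + bias) -> hidden (2)
theta2 = rng.uniform(-epsilon, epsilon, size=(1, 3))  # hidden (2 + bias) -> output (1)

X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
y = np.array([[0], [1], [1], [0]])


def forward(X, theta1, theta2):
    """Forward propagation for all samples at once; returns an m x 1 array."""
    m = X.shape[0]
    a1 = np.hstack([np.ones((m, 1)), X])
    a2 = np.hstack([np.ones((m, 1)), 1.0 / (1.0 + np.exp(-(a1 @ theta1.T)))])
    return 1.0 / (1.0 + np.exp(-(a2 @ theta2.T)))


h = forward(X, theta1, theta2)
print(h.shape)
```

With untrained weights the four outputs are arbitrary values between 0 and 1; training has to push them toward the targets in `y`.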

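Code

The raw version:

The implementation below is written entirely from my own understanding, modeled on the assignments of the Machine Learning course. The code quality is not especially high, but every detail of the algorithm is visible.

The comments on lines 66, 69, and 70 mark the three reasons I did not get correct results before. epsilon determines the range of the randomly initialized parameters; for example, with epsilon=1 the parameters lie in (-1, 1).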

 1 # -*- coding: utf-8 -*-
 2 """
 3 Created on Tue Apr  4 10:47:51 2017
 4 
 5 @author: xin
 6 """
 7 # Neural Network for XOR
 8 import numpy as np
 9 import matplotlib.pyplot as plt
10 
11 HIDDEN_LAYER_SIZE = 2
12 INPUT_LAYER = 2  # input feature
13 NUM_LABELS = 1  # output class number
14 X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
15 y = np.array([[0], [1], [1], [0]])
16 
17 
18 def rand_initialize_weights(L_in, L_out, epsilon):
19     """
20     Randomly initialize the weights of a layer with L_in
21     incoming connections and L_out outgoing connections;
22 
23     Note that W should be set to a matrix of size(L_out, 1 + L_in) as
24     the first column of W handles the "bias" terms
25     """
26     epsilon_init = epsilon
27     W = np.random.rand(L_out, 1 + L_in) * 2 * epsilon_init - epsilon_init
28     return W
29 
30 
31 def sigmoid(x):
32     return 1.0 / (1.0 + np.exp(-x))
33 
34 
35 def sigmoid_gradient(z):
36     g = np.multiply(sigmoid(z), (1 - sigmoid(z)))
37     return g
38 
39 
40 def nn_cost_function(theta1, theta2, X, y):
41     m = X.shape[0]  # m=4
42     # accumulate the partial derivatives (gradients) of all parameters
43     D_1 = np.zeros(theta1.shape)  # Δ_1
44     D_2 = np.zeros(theta2.shape)  # Δ_2
45     h_total = np.zeros((m, 1))  # predictions for all samples, m*1, probabilities
46     for t in range(m):
47         a_1 = np.vstack((np.array([[1]]), X[t:t + 1, :].T))  # column vector, 3*1
48         z_2 = np.dot(theta1, a_1)  # 2*1
49         a_2 = np.vstack((np.array([[1]]), sigmoid(z_2)))  # 3*1
50         z_3 = np.dot(theta2, a_2)  # 1*1
51         a_3 = sigmoid(z_3)
52         h = a_3  # the prediction h equals a_3, 1*1
53         h_total[t, 0] = h[0, 0]  # take the scalar out of the 1*1 array
54         delta_3 = h - y[t:t + 1, :].T  # error of each unit in the last layer, δ_3, 1*1
55         delta_2 = np.multiply(np.dot(theta2[:, 1:].T, delta_3), sigmoid_gradient(z_2))  # error of each unit in layer 2 (bias unit excluded), δ_2, 2*1
56         D_2 = D_2 + np.dot(delta_3, a_2.T)  # accumulated gradient for theta2, 1*3
57         D_1 = D_1 + np.dot(delta_2, a_1.T)  # accumulated gradient for theta1, 2*3
58     theta1_grad = (1.0 / m) * D_1  # partial derivatives for theta1: mean over all samples, no regularization term
59     theta2_grad = (1.0 / m) * D_2
60     J = (1.0 / m) * np.sum(-y * np.log(h_total) - (np.array([[1]]) - y) * np.log(1 - h_total))
61     return {'theta1_grad': theta1_grad,
62             'theta2_grad': theta2_grad,
63             'J': J, 'h': h_total}
64 
65 
66 theta1 = rand_initialize_weights(INPUT_LAYER, HIDDEN_LAYER_SIZE, epsilon=1)  # earlier problem 1: epsilon was set too small
67 theta2 = rand_initialize_weights(HIDDEN_LAYER_SIZE, NUM_LABELS, epsilon=1)
68 
69 iter_times = 10000  # earlier problem 2: too few iterations
70 alpha = 0.5  # earlier problem 3: learning rate too small
71 result = {'J': [], 'h': []}
72 theta_s = {}
73 for i in range(iter_times):
74     cost_fun_result = nn_cost_function(theta1=theta1, theta2=theta2, X=X, y=y)
75     theta1_g = cost_fun_result.get('theta1_grad')
76     theta2_g = cost_fun_result.get('theta2_grad')
77     J = cost_fun_result.get('J')
78     h_current = cost_fun_result.get('h')
79     theta1 -= alpha * theta1_g
80     theta2 -= alpha * theta2_g
81     result['J'].append(J)
82     result['h'].append(h_current)
83     # print(i, J, h_current)
84     if i==0 or i==(iter_times-1):
85         print('theta1', theta1)
86         print('theta2', theta2)
87         theta_s['theta1_'+str(i)] = theta1.copy()
88         theta_s['theta2_'+str(i)] = theta2.copy()
89 
90 plt.plot(result.get('J'))
91 plt.show()
92 print(theta_s)
93 print(result.get('h')[0], result.get('h')[-1])

Here is the output:

# Parameters at iteration 0 (the random initialization after one gradient step)
('theta1', array([[ 0.18589823, -0.77059558,  0.62571502],
       [-0.79844165,  0.56069914,  0.21090703]]))
('theta2', array([[ 0.1327994 ,  0.59513332,  0.34334931]]))

# Parameters after training
('theta1', array([[-3.90903729, -7.44497437,  7.20130773],
       [-3.76429211,  6.93482723, -7.21857912]]))
('theta2', array([[ -6.5739346 ,  13.33011993,  13.3891608 ]]))

# Same as above: the parameters at the first and last iterations
{'theta1_0': array([[ 0.18589823, -0.77059558,  0.62571502],
       [-0.79844165,  0.56069914,  0.21090703]]), 'theta2_9999': array([[ -6.5739346 ,  13.33011993,  13.3891608 ]]), 'theta1_9999': array([[-3.90903729, -7.44497437,  7.20130773],
       [-3.76429211,  6.93482723, -7.21857912]]), 'theta2_0': array([[ 0.1327994 ,  0.59513332,  0.34334931]])}

# Predictions h: the first array holds the predictions from the initial parameters, the second those from the parameters of the last iteration
(array([[ 0.66576877],
       [ 0.69036552],
       [ 0.64994307],
       [ 0.67666546]]), array([[ 0.00245224],
       [ 0.99812746],
       [ 0.99812229],
       [ 0.00215507]]))
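The final predictions above are probabilities; thresholding them at 0.5 turns them into class labels. The sketch below copies the four values from the output above and compares the labels with the XOR targets. The 0.5 threshold and the variable names are assumptions for this example.

```python
import numpy as np

# Final predictions copied from the output above.
h_final = np.array([[0.00245224],
                    [0.99812746],
                    [0.99812229],
                    [0.00215507]])
y = np.array([[0], [1], [1], [0]])

pred = (h_final >= 0.5).astype(int)    # probability -> class label
accuracy = float(np.mean(pred == y))   # fraction of the four samples classified correctly
print(pred.ravel(), accuracy)
```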

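The plot below shows how the cost J(θ) changes as the number of iterations increases:

[Figure: cost J(θ) versus iteration number]

A more concise version

I found the following code on Stack Overflow while debugging my own. The poster had run into the same problem, though for a different reason. The code had a small bug, which has been fixed here. Compared with my raw version it is a big improvement: it does not hard-code the number of layers or the number of units per layer, and the code itself is more concise.

Note: line 44 passes the layer's activation a rather than z, so line 11 had to be changed accordingly. Passing a directly is in fact more convenient, because sigmoid'(z) = a(1 - a) when a = sigmoid(z).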
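The equivalence behind that change can be checked numerically. In this sketch `sigmoid_gradient` takes the pre-activation z and `s_prime` takes the activation a; both are compared with a central finite difference. The step size `eps` and the test points are arbitrary choices for the example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_gradient(z):
    """Derivative of sigmoid, computed from the pre-activation z."""
    s = sigmoid(z)
    return s * (1.0 - s)


def s_prime(a):
    """The same derivative, computed from the activation a = sigmoid(z)."""
    return a * (1.0 - a)


z = np.linspace(-5, 5, 11)
a = sigmoid(z)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central difference

print(np.allclose(sigmoid_gradient(z), s_prime(a)))
print(np.allclose(numeric, s_prime(a), atol=1e-6))
```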

 1 # -*- coding: utf-8 -*-
 2 
 3 import numpy as np
 4 import matplotlib.pyplot as plt
 5 
 6 
 7 def sigmoid(x):
 8     return 1/(1+np.exp(-x))
 9 
10 def s_prime(z):
11     return np.multiply(z, 1.0-z)  # the modified line: the argument is a = sigmoid(z), not z
12 
13 def init_weights(layers, epsilon):
14     weights = []
15     for i in range(len(layers)-1):
16         w = np.random.rand(layers[i+1], layers[i]+1)
17         w = w * 2*epsilon - epsilon
18         weights.append(np.mat(w))
19     return weights
20 
21 def fit(X, Y, w):
22     # start with a zero gradient for every parameter
23     w_grad = ([np.asmatrix(np.zeros(np.shape(w[i])))
24               for i in range(len(w))])  # len(w) is the number of weight matrices (layers - 1)
25     m, n = X.shape
26     h_total = np.zeros((m, 1))  # predictions for all samples, m*1, probabilities
27     for i in range(m):
28         x = X[i]
29         y = Y[0,i]
30         # forward propagate
31         a = x
32         a_s = []
33         for j in range(len(w)):
34             a = np.asmatrix(np.append(1, a)).T
35             a_s.append(a)  # stores the activations a of the first L-1 layers
36             z = w[j] * a
37             a = sigmoid(z)
38         h_total[i, 0] = a[0, 0]  # take the scalar out of the 1*1 matrix
39         # back propagate
40         delta = a - y.T
41         w_grad[-1] += delta * a_s[-1].T  # gradient of the last weight matrix
42         # walk backwards over the hidden layers, excluding the input and output layers
43         for j in reversed(range(1, len(w))):
44             delta = np.multiply(w[j].T*delta, s_prime(a_s[j]))[1:]  # a is passed here, not z; [1:] drops the bias row
45             w_grad[j-1] += delta * a_s[j-1].T
46     w_grad = [w_grad[i]/m for i in range(len(w))]
47     J = (1.0 / m) * np.sum(-Y * np.log(h_total) - (np.array([[1]]) - Y) * np.log(1 - h_total))
48     return {'w_grad': w_grad, 'J': J, 'h': h_total}
49 
50 
51 X = np.asmatrix([[0,0],
52                  [0,1],
53                  [1,0],
54                  [1,1]])
55 Y = np.asmatrix([0,1,1,0])
56 layers = [2,2,1]
57 epochs = 5000
58 alpha = 0.5
59 w = init_weights(layers, 1)
60 result = {'J': [], 'h': []}
61 w_s = {}
62 for i in range(epochs):
63     fit_result = fit(X, Y, w)
64     w_grad = fit_result.get('w_grad')
65     J = fit_result.get('J')
66     h_current = fit_result.get('h')
67     result['J'].append(J)
68     result['h'].append(h_current)
69     for j in range(len(w)):
70         w[j] -= alpha * w_grad[j]
71     if i == 0 or i == (epochs - 1):
72         # print('w_grad', w_grad)
73         w_s['w_' + str(i)] = w_grad[:]
74 
75 
76 plt.plot(result.get('J'))
77 plt.show()
78 print(w_s)
79 print(result.get('h')[0], result.get('h')[-1])

Here is the output:

# Gradients at the first and last iterations (line 73 stores w_grad, not the weights w)

{'w_4999': [matrix([[  1.51654104e-04,  -2.30291680e-04,   6.20083292e-04],
        [  9.15463982e-05,  -1.51402782e-04,  -6.12464354e-04]]), matrix([[ 0.0004279 , -0.00051928, -0.00042735]])], 
'w_0': [matrix([[ 0.00172196,  0.0010952 ,  0.00132499],
        [-0.00489422, -0.00489643, -0.00571827]]), matrix([[-0.02787502, -0.01265985, -0.02327431]])]}

# Predictions h: the first array holds the predictions from the initial parameters, the second those from the parameters of the last iteration

(array([[ 0.45311095],
       [ 0.45519066],
       [ 0.4921871 ],
       [ 0.48801121]]), 
array([[ 0.00447994],
       [ 0.49899856],
       [ 0.99677373],
       [ 0.50145936]]))

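Looking at the results above, the last iteration did not produce what we expect, namely the 1st and 4th values close to 0 and the 2nd and 3rd values close to 1. The cost J(θ) changes with the number of iterations as follows:

[Figure: cost J(θ) versus iteration number]

As the plot shows, J(θ) stays at about 0.35 from iteration 2000 onward, so the network may have converged to a local optimum, or the number of iterations may simply be insufficient.

After changing the number of iterations to 10000 (epochs = 10000), the expected result is obtained in most runs. The expected result can also appear with fewer iterations; this probably depends mainly on the initial parameters.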
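One way to probe the dependence on the initial parameters is to repeat training from several random seeds and count how many runs end with the correct labels. The sketch below is a vectorized rewrite under the same settings (2-2-1 network, alpha = 0.5, epsilon = 1, 10000 epochs); the function `train`, the seed list, and the 0.5 threshold are assumptions for this experiment, and the outcome will vary with the seeds chosen.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train(seed, epochs=10000, alpha=0.5, epsilon=1.0):
    """Batch gradient descent on a 2-2-1 sigmoid network; returns the final predictions."""
    rng = np.random.default_rng(seed)
    theta1 = rng.uniform(-epsilon, epsilon, size=(2, 3))
    theta2 = rng.uniform(-epsilon, epsilon, size=(1, 3))
    m = X.shape[0]
    a1 = np.hstack([np.ones((m, 1)), X])
    for _ in range(epochs):
        a2 = np.hstack([np.ones((m, 1)), sigmoid(a1 @ theta1.T)])
        h = sigmoid(a2 @ theta2.T)
        delta3 = h - y                                                   # m x 1
        delta2 = (delta3 @ theta2[:, 1:]) * a2[:, 1:] * (1 - a2[:, 1:])  # m x 2
        theta2 -= alpha * (delta3.T @ a2) / m
        theta1 -= alpha * (delta2.T @ a1) / m
    a2 = np.hstack([np.ones((m, 1)), sigmoid(a1 @ theta1.T)])
    return sigmoid(a2 @ theta2.T)


seeds = range(5)
solved = [bool(np.array_equal((train(s) >= 0.5).astype(float), y)) for s in seeds]
print(solved, sum(solved), "of", len(solved), "runs learned XOR")
```

Runs that stall on a plateau like the one described above show up as False entries in `solved`.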

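Lessons learned

Reading other people's code really is an important way to improve one's own programming. Comparing my raw version with someone else's code, for instance, exposes my own weaknesses. The biggest takeaway: data structures matter a great deal for the structure and logic of code. In my version each layer is handled separately, whereas the later code initializes the whole network at once and stores it in a list, which makes the code both more extensible and more concise.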
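The point about data structures can be illustrated with a short sketch: once the weights live in a list, the same two functions handle a network of any depth. The layer sizes, the seed, and the `rng` argument are assumptions for this example, and plain NumPy arrays are used in place of matrices.

```python
import numpy as np


def init_weights(layers, epsilon, rng):
    """One (n_out, n_in + 1) matrix per pair of adjacent layers, values in [-epsilon, epsilon)."""
    return [rng.uniform(-epsilon, epsilon, size=(n_out, n_in + 1))
            for n_in, n_out in zip(layers[:-1], layers[1:])]


def forward(x, weights):
    """Forward propagation for one sample through however many layers the list holds."""
    a = np.asarray(x, dtype=float)
    for w in weights:
        a = np.append(1.0, a)                  # prepend the bias unit
        a = 1.0 / (1.0 + np.exp(-(w @ a)))     # sigmoid activation of the next layer
    return a


rng = np.random.default_rng(0)
small = init_weights([2, 2, 1], 1.0, rng)      # the XOR network from this post
deep = init_weights([2, 4, 3, 1], 1.0, rng)    # a deeper network, same code
print([w.shape for w in small], [w.shape for w in deep])
print(forward([0, 1], small), forward([0, 1], deep))
```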

Also, the best way to understand the details of an algorithm precisely is to implement it yourself.

<End>