Code Implementation (Machine Learning for Handwritten Digit Recognition)

You can read the previous article here.

I am Xue Yinliang. My thanks go to the English original online book; while studying machine learning I found it a very suitable introduction for beginners. In the spirit of knowledge sharing, I hope to translate it and share it with everyone who wants to learn about machine learning. My translation skill is limited, so readers are welcome to raise questions and point out mistakes, and guidance from experts is even more welcome. Because of the length, the material will be published in several parts; if you are interested, you can follow my collection, which will be updated continuously.


Building on what we learned in the previous article, let's use stochastic gradient descent and the MNIST data set to implement our handwritten digit recognition program. If you haven't covered the prerequisite material yet, please go back to the previous article first; you can follow my collection for future updates. We will use Python (2.7), and the program is only 74 lines of code. But note: if our goal is to learn the ideas of machine learning and apply the technique to other domains, I suggest not focusing too much on the code itself, and certainly not trying to memorize it, because that would be pointless.

The first thing we need is the MNIST data set. If you are a git user (I won't explain what git is; every programmer or technical researcher should already know it), you can fetch the data by cloning the code repository:

git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git

If you have never used git, you can also download the data and code here.

A supplementary note: in an earlier article I said that MNIST has 60,000 training images and 10,000 test images, which is how MNIST is officially described. The data we use here is split a little differently: we carve a validation set out of the training data. That is, we split the 60,000 training images into a set of 50,000 that we use for training, while the remaining 10,000 form a separate validation set (the official 10,000 test images are still used as the test set).

We will also use a Python library called Numpy for its linear algebra routines. If you don't have Numpy installed yet, you can get it here.

Let me first explain the structure of the code. The centerpiece is a Network class, which represents a neural network. Here is the code that initializes a Network object:

class Network(object):

    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x) 
                        for x, y in zip(sizes[:-1], sizes[1:])]

Here, sizes is a list containing the number of neurons in each layer of the network. For example, to create a network with 2 neurons in the first layer, 3 neurons in the second layer, and 1 neuron in the final layer, we would write:

net = Network([2, 3, 1])

The biases and weights are initialized to random values, generated with Numpy's np.random.randn function as Gaussian-distributed numbers with mean 0 and variance 1. These random initial values give our stochastic gradient descent algorithm a place to start from. In later chapters we will see better ways to initialize the weights and biases, but this will do for now. Note that the first layer of the network is the input layer, and we omit biases for those neurons, since biases are only ever used in computing the outputs from later layers.
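As a quick sanity check (a small sketch of my own, assuming the net = Network([2, 3, 1]) object created above and that Numpy has been imported as np), you can inspect the shapes this initialization produces:

print [b.shape for b in net.biases]    # [(3, 1), (1, 1)]  -- one bias column vector per non-input layer
print [w.shape for w in net.weights]   # [(3, 2), (1, 3)]  -- weights[0] connects layer 1 to layer 2, and so on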

All the biases and weights are stored as lists of Numpy matrices. For example, net.weights[1] is the Numpy matrix storing the weights connecting the second and third layers of neurons (not the first and second, because Python list indices start at 0). Since writing net.weights[1] every time is rather verbose, let's denote that matrix w, so that w_jk is the weight for the connection between the k-th neuron in the second layer and the j-th neuron in the third layer. With this notation, the vector of activations of the third layer is obtained by applying the σ function in vectorized form:

a' = σ(wa + b)    (22)

Here a is the vector of activations of the second layer of neurons, and it is easy to see that Equation (22) has the same form as Equation (4).

We also need to define the sigmoid function:

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

Note that when z is a vector or a Numpy array, Numpy automatically applies the sigmoid function elementwise, that is, in vectorized form.
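For example (a tiny sketch, assuming Numpy is imported as np and sigmoid is defined as above):

z = np.array([-1.0, 0.0, 1.0])
print sigmoid(z)    # approximately [0.269  0.5  0.731] -- the sigmoid applied to each element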

Next we add a feedforward method to the Network class: given an input a to the network, it returns the corresponding output. The method simply applies Equation (22) layer by layer:

def feedforward(self, a):
        """Return the output of the network if "a" is input."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a
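Used on the toy network created earlier (again only a sketch; it assumes the full Network class and sigmoid are defined in the current session, and the input values here are made up), feedforward maps a (2, 1) input column vector to a (1, 1) output activation:

x = np.array([[0.5], [0.8]])    # a column vector with one entry per input neuron
print net.feedforward(x)        # a (1, 1) array whose single entry lies between 0 and 1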

Of course, the main thing we want our Network objects to do is learn. To that end we give them an SGD method which implements stochastic gradient descent. Here is the code:

def SGD(self, training_data, epochs, mini_batch_size, eta, test_data=None):
        """Train the neural network using mini-batch stochastic gradient descent. The "training_data" is a list of tuples "(x, y)" representing the training inputs and the desired outputs. The other non-optional parameters are self-explanatory. If "test_data" is provided then the network will be evaluated against the test data after each epoch, and partial progress printed out. This is useful for tracking progress, but slows things down substantially."""
        if test_data: n_test = len(test_data)
        n = len(training_data)
        for j in xrange(epochs):
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in xrange(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print "Epoch {0}: {1} / {2}".format(
                    j, self.evaluate(test_data), n_test)
            else:
                print "Epoch {0} complete".format(j)複製代碼

The training_data is a list of tuples (x, y) representing the training inputs and the desired outputs. epochs is the number of epochs to train for, and mini_batch_size is the size of the mini-batches to use when sampling. eta is the learning rate η. If the optional argument test_data is supplied, the program evaluates the network after each epoch of training and prints out partial progress; this is useful for tracking progress, but slows things down substantially.

In each epoch, the code starts by randomly shuffling the training data and then partitions it into mini-batches of the specified size (mini_batches); this is an easy way of sampling randomly from the training data. Then for each mini_batch we apply a single step of gradient descent, done by the line self.update_mini_batch(mini_batch, eta), which updates the network's weights and biases according to a single iteration of gradient descent using just the training data in that mini-batch. Here is the code for the update_mini_batch method:

def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying gradient descent using backpropagation to a single mini batch. The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta`` is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]

Most of the work is done by this line:

delta_nabla_b, delta_nabla_w = self.backprop(x, y)

This invokes the backpropagation algorithm, which is a fast way of computing the gradient of the cost function. So update_mini_batch works simply by computing these gradients for every training example in the mini_batch and then updating self.weights and self.biases accordingly.

I'm not going to show the code for self.backprop just yet; we will study the backpropagation algorithm and its implementation in a later chapter. For now, just assume that it behaves as claimed, returning the appropriate gradient for the cost associated with the training example x.

Let's look at the full program, including the documentation strings and the parts I omitted above. Inside self.backprop, the helper sigmoid_prime computes the derivative of the σ function, and self.cost_derivative is something you can work out from the code and its comments; we will explain it in detail in the next chapter. All the code can be downloaded here:

""" network.py ~~~~~~~~~~ A module to implement the stochastic gradient descent learning algorithm for a feedforward neural network. Gradients are calculated using backpropagation. Note that I have focused on making the code simple, easily readable, and easily modifiable. It is not optimized, and omits many desirable features. """

#### Libraries
# Standard library
import random

# Third-party libraries
import numpy as np

class Network(object):

    def __init__(self, sizes):
        """The list ``sizes`` contains the number of neurons in the respective layers of the network. For example, if the list was [2, 3, 1] then it would be a three-layer network, with the first layer containing 2 neurons, the second layer 3 neurons, and the third layer 1 neuron. The biases and weights for the network are initialized randomly, using a Gaussian distribution with mean 0, and variance 1. Note that the first layer is assumed to be an input layer, and by convention we won't set any biases for those neurons, since biases are only ever used in computing the outputs from later layers."""
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]

    def feedforward(self, a):
        """Return the output of the network if ``a`` is input."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a

    def SGD(self, training_data, epochs, mini_batch_size, eta, test_data=None):
        """Train the neural network using mini-batch stochastic gradient descent. The ``training_data`` is a list of tuples ``(x, y)`` representing the training inputs and the desired outputs. The other non-optional parameters are self-explanatory. If ``test_data`` is provided then the network will be evaluated against the test data after each epoch, and partial progress printed out. This is useful for tracking progress, but slows things down substantially."""
        if test_data: n_test = len(test_data)
        n = len(training_data)
        for j in xrange(epochs):
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in xrange(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print "Epoch {0}: {1} / {2}".format(
                    j, self.evaluate(test_data), n_test)
            else:
                print "Epoch {0} complete".format(j)

    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying gradient descent using backpropagation to a single mini batch. The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta`` is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]

    def backprop(self, x, y):
        """Return a tuple ``(nabla_b, nabla_w)`` representing the gradient for the cost function C_x. ``nabla_b`` and ``nabla_w`` are layer-by-layer lists of numpy arrays, similar to ``self.biases`` and ``self.weights``."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * \
            sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Note that the variable l in the loop below is used a little
        # differently to the notation in Chapter 2 of the book. Here,
        # l = 1 means the last layer of neurons, l = 2 is the
        # second-last layer, and so on. It's a renumbering of the
        # scheme in the book, used here to take advantage of the fact
        # that Python can use negative indices in lists.
        for l in xrange(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

    def evaluate(self, test_data):
        """Return the number of test inputs for which the neural network outputs the correct result. Note that the neural network's output is assumed to be the index of whichever neuron in the final layer has the highest activation."""
        test_results = [(np.argmax(self.feedforward(x)), y)
                        for (x, y) in test_data]
        return sum(int(x == y) for (x, y) in test_results)

    def cost_derivative(self, output_activations, y):
        """Return the vector of partial derivatives \partial C_x / \partial a for the output activations."""
        return (output_activations-y)

#### Miscellaneous functions
def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))

So how well does the program work? Well, let's start by loading the MNIST data. We do this with a small helper module, mnist_loader.py. Execute the following commands in a Python shell:

>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
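As an optional quick check (the counts follow from the data split described earlier), the three data sets should contain 50,000, 10,000 and 10,000 examples respectively:

>>> len(training_data), len(validation_data), len(test_data)
(50000, 10000, 10000)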

Next we run the network code and set up a Network with 30 hidden neurons:

>>> import network
>>> net = network.Network([784, 30, 10])

Then we train with stochastic gradient descent for 30 epochs (epochs=30), with a mini-batch size of 10 (mini_batch_size=10) and a learning rate of η=3.0:

>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)

If you run the code now, it may take a little while to finish. I suggest you keep reading, set things running, and check the output periodically. If you want results sooner, you can speed things up by decreasing the number of epochs, decreasing the number of hidden neurons, or using only part of the training data. Note that this code is written to help you understand how neural networks work, not to be high-performance code. Of course, once we have trained a good network, it can be ported directly to a web page (in JavaScript) or an app, where it will also run very quickly. As you can see, after just a single epoch of training the network already classifies 9,129 of the 10,000 test images correctly.

Epoch 0: 9129 / 10000
Epoch 1: 9295 / 10000
Epoch 2: 9348 / 10000
...
Epoch 27: 9528 / 10000
Epoch 28: 9542 / 10000
Epoch 29: 9534 / 10000

However, because we initialize the weights and biases randomly, your results will not necessarily be exactly the same as mine.
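If you want repeatable runs while experimenting (an optional sketch of my own, not part of the original program; the seed value is arbitrary), you can seed the two random number generators before constructing the network:

>>> import random
>>> import numpy as np
>>> np.random.seed(12345)   # Numpy's generator drives the weight and bias initialization
>>> random.seed(12345)      # Python's random module drives the mini-batch shuffling
>>> net = network.Network([784, 30, 10])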

Let's change the number of hidden neurons to 100 and see what happens:

>>> net = network.Network([784, 100, 10])
>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)

We find that the accuracy improves; at least in this case, using more hidden neurons helps us get better results.

If we reduce the learning rate to η=0.001:

>>> net = network.Network([784, 100, 10])
>>> net.SGD(training_data, 30, 10, 0.001, test_data=test_data)

the results are far less encouraging:

Epoch 0: 1139 / 10000
Epoch 1: 1136 / 10000
Epoch 2: 1135 / 10000
...
Epoch 27: 2101 / 10000
Epoch 28: 2123 / 10000
Epoch 29: 2142 / 10000

If we increase the learning rate to 0.01, the results improve again. Similarly, whenever you find that changing a parameter moves the results in the right direction, try changing it a few more times; eventually you can settle on the value that suits the problem best.
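If you want to automate that kind of exploration a little (a sketch of my own, not part of the original text; it reuses the training_data and test_data loaded above and trains for only 10 epochs per trial to save time), a simple loop over candidate learning rates works:

>>> for eta in [0.01, 0.1, 1.0, 3.0, 10.0]:
...     net = network.Network([784, 30, 10])    # fresh random weights for every trial
...     net.SGD(training_data, 10, 10, eta, test_data=test_data)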

In general, debugging a neural network can be difficult, especially when an initial choice of hyper-parameters produces results no better than random guessing. Suppose we try 30 hidden neurons as before, but set the learning rate to η=100.0:

>>> net = network.Network([784, 30, 10])
>>> net.SGD(training_data, 30, 10, 100.0, test_data=test_data)

We find that the learning rate is far too high:

Epoch 0: 1009 / 10000
Epoch 1: 1009 / 10000
Epoch 2: 1009 / 10000
Epoch 3: 1009 / 10000
...
Epoch 27: 982 / 10000
Epoch 28: 982 / 10000
Epoch 29: 982 / 10000

At this point we would, of course, lower the learning rate to improve the accuracy. But suppose this were our first attempt: there would be nothing in the output to make us immediately suspect that the learning rate is too large. Instead we might suspect a problem with the network itself. Maybe the way we initialized the weights and biases makes it hard for the network to learn? Maybe the training data is at fault? Maybe we haven't trained for enough epochs? Or maybe we should be using a different learning algorithm altogether? With all these possibilities, when you hit this situation for the first time you simply cannot be sure what is causing the bad results. I won't resolve these questions here; they will be discussed in later articles. For now, the purpose is just to present the source code.

Let's look at the details of how the MNIST data is loaded, which I skipped over earlier; the source code is below. The data structures are described in the docstrings: tuples and lists of Numpy ndarray objects. If you are not familiar with ndarray, you can think of it as a vector.

""" mnist_loader ~~~~~~~~~~~~ A library to load the MNIST image data. For details of the data structures that are returned, see the doc strings for ``load_data`` and ``load_data_wrapper``. In practice, ``load_data_wrapper`` is the function usually called by our neural network code. """

#### Libraries
# Standard library
import cPickle
import gzip

# Third-party libraries
import numpy as np

def load_data():
    """Return the MNIST data as a tuple containing the training data, the validation data, and the test data. The ``training_data`` is returned as a tuple with two entries. The first entry contains the actual training images. This is a numpy ndarray with 50,000 entries. Each entry is, in turn, a numpy ndarray with 784 values, representing the 28 * 28 = 784 pixels in a single MNIST image. The second entry in the ``training_data`` tuple is a numpy ndarray containing 50,000 entries. Those entries are just the digit values (0...9) for the corresponding images contained in the first entry of the tuple. The ``validation_data`` and ``test_data`` are similar, except each contains only 10,000 images. This is a nice data format, but for use in neural networks it's helpful to modify the format of the ``training_data`` a little. That's done in the wrapper function ``load_data_wrapper()``, see below. """
    f = gzip.open('../data/mnist.pkl.gz', 'rb')
    training_data, validation_data, test_data = cPickle.load(f)
    f.close()
    return (training_data, validation_data, test_data)

def load_data_wrapper():
    """Return a tuple containing ``(training_data, validation_data, test_data)``. Based on ``load_data``, but the format is more convenient for use in our implementation of neural networks. In particular, ``training_data`` is a list containing 50,000 2-tuples ``(x, y)``. ``x`` is a 784-dimensional numpy.ndarray containing the input image. ``y`` is a 10-dimensional numpy.ndarray representing the unit vector corresponding to the correct digit for ``x``. ``validation_data`` and ``test_data`` are lists containing 10,000 2-tuples ``(x, y)``. In each case, ``x`` is a 784-dimensional numpy.ndarry containing the input image, and ``y`` is the corresponding classification, i.e., the digit values (integers) corresponding to ``x``. Obviously, this means we're using slightly different formats for the training data and the validation / test data. These formats turn out to be the most convenient for use in our neural network code."""
    tr_d, va_d, te_d = load_data()
    training_inputs = [np.reshape(x, (784, 1)) for x in tr_d[0]]
    training_results = [vectorized_result(y) for y in tr_d[1]]
    training_data = zip(training_inputs, training_results)
    validation_inputs = [np.reshape(x, (784, 1)) for x in va_d[0]]
    validation_data = zip(validation_inputs, va_d[1])
    test_inputs = [np.reshape(x, (784, 1)) for x in te_d[0]]
    test_data = zip(test_inputs, te_d[1])
    return (training_data, validation_data, test_data)

def vectorized_result(j):
    """Return a 10-dimensional unit vector with a 1.0 in the jth position and zeroes elsewhere. This is used to convert a digit (0...9) into a corresponding desired output from the neural network."""
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

We know that an image of a 2 is typically a bit darker than an image of a 1, simply because more of its area is blackened out.

This suggests computing an average darkness for each of the digits 0 through 9. When presented with a new image, we first compute its darkness and then guess the digit with the closest average darkness. This is not hard to implement, so I won't write out the code here; it is in the GitHub repository. This method improves noticeably on random guessing.
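To make the idea concrete, a rough version might look like the sketch below (my own sketch, not the code from the repository; the helper names avg_darknesses and guess_digit are made up for illustration, and it assumes the training_data and test_data returned by load_data_wrapper above, where training labels are 10-dimensional unit vectors and test labels are plain integers):

from collections import defaultdict
import numpy as np

def avg_darknesses(training_data):
    """For each digit 0..9, the average total darkness of its training images."""
    darkness_sums = defaultdict(float)
    counts = defaultdict(int)
    for image, label in training_data:
        digit = np.argmax(label)          # the label is a 10-dimensional unit vector
        darkness_sums[digit] += np.sum(image)
        counts[digit] += 1
    return {d: darkness_sums[d] / counts[d] for d in counts}

def guess_digit(image, averages):
    """Guess the digit whose average darkness is closest to this image's darkness."""
    darkness = np.sum(image)
    return min(averages, key=lambda d: abs(averages[d] - darkness))

With averages = avg_darknesses(training_data), the expression sum(int(guess_digit(x, averages) == y) for x, y in test_data) then counts how many test images this baseline gets right.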

But if you want to push the accuracy as high as possible, we can use a support vector machine, or SVM. Don't worry, for now we don't need to understand the details of how SVMs work, because we can simply use the scikit-learn library, which provides a convenient Python interface to a fast C-based implementation of SVMs. The code is here. It turns out that the SVM is stronger than our algorithm here, which is a bit humbling, so later on we will improve our approach until its accuracy beats the SVM's.
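In case you want to try it yourself, a bare-bones version might look like the sketch below (my own condensed version, not the script from the repository; it assumes scikit-learn is installed, uses the raw load_data format where labels are plain integers, and note that fitting all 50,000 images with default parameters can take quite a while):

import mnist_loader
from sklearn import svm

# load the raw (pixel-vector, integer-label) form of the data
training_data, validation_data, test_data = mnist_loader.load_data()

clf = svm.SVC()                              # an SVM classifier with default parameters
clf.fit(training_data[0], training_data[1])  # 50,000 rows of 784 pixel values, one integer label each

predictions = clf.predict(test_data[0])
num_correct = sum(int(p == y) for p, y in zip(predictions, test_data[1]))
print "%s of %s test images classified correctly" % (num_correct, len(test_data[1]))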

SVMs also have many tunable parameters; if you are interested in tuning them, you can read this blog post by Andreas Mueller.

Toward deep learning


We can carry the same techniques over to attack another problem: deciding whether or not an image is a human face:

One way to attack the problem is with a model like the following:

And each sub-problem can, in turn, be decomposed into even smaller questions:

Carrying this on, we end up with a deep neural network. People now routinely train networks with 5 to 10 hidden layers, and it turns out that on many problems these perform far better than shallow neural networks, that is, networks with just a single hidden layer. The reason is the ability of deep networks to build up a complex hierarchy of concepts.

If my articles are helpful to you, please consider following my collection or leaving a tip; I suggest ¥5, but of course any amount is welcome.


If you have any questions, feel free to contact me at space-x@qq.com; I will try to reply promptly.
