Neural Networks and Deep Learning, Chapter 1, Using Neural Nets to Recognize Handwritten Digits (Part 3): Implementation in Python

Date: 2019-11-12


Implementing our network to classify digits

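Alright, let's write a program that learns how to recognize handwritten digits, using stochastic gradient descent and the MNIST training data. We'll do this with a short Python (2.7) program, just 74 lines of code! The first thing we need is the MNIST data. If you use git, you can obtain it by cloning the code repository: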

git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git

If you don't use git, you can download the data and code from the repository's GitHub page instead.

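Incidentally, when I described the MNIST data earlier, I said it was split into 60,000 training images and 10,000 test images. That's the official MNIST description. Actually, we're going to split the data a little differently. We'll split the 60,000-image MNIST training set into two parts: a set of 50,000 images, which we'll use to train our neural network, and a separate 10,000-image validation set. We won't use the validation data in this chapter, but later in this series we'll find it useful in figuring out how to set certain hyper-parameters of the neural network, things like the learning rate, which aren't directly selected by our learning algorithm. Although the validation data isn't part of the original MNIST specification, many people use MNIST in this fashion, and the use of validation data is common in neural networks. When I refer to the "MNIST training data" from now on, I'll mean our 50,000-image data set, not the original 60,000-image data set.*

*As noted earlier, the MNIST data set is based on two data sets collected by NIST. To construct MNIST, the NIST data sets were stripped down and put into a more convenient format by Yann LeCun, Corinna Cortes, and Christopher J. C. Burges; the original post links to further details. The data set in my repository is in a form that makes it easy to load and manipulate the MNIST data in Python. I obtained this particular form of the data from the LISA machine learning laboratory at the University of Montreal.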

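Apart from the MNIST data we also need a Python library called Numpy, for doing fast linear algebra. If you don't already have Numpy installed, you'll need to install it first.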

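Let me explain the core features of the neural network code before giving a full listing. The centerpiece is a Network class, which we use to represent a neural network. Here's the code we use to initialize a Network object: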

class Network(object):

    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x) 
                        for x, y in zip(sizes[:-1], sizes[1:])]

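In this code, the list sizes contains the number of neurons in the respective layers. So, for example, if we want to create a Network object with 2 neurons in the first layer, 3 neurons in the second layer, and 1 neuron in the final layer, we'd do this with the code: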

net = Network([2, 3, 1])
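To make the effect of this constructor concrete, here is a minimal sketch that repeats the two initialization lines from Network.__init__ for sizes = [2, 3, 1] and prints the resulting array shapes. The names bias_shapes and weight_shapes are introduced here only for illustration.

```python
import numpy as np

# Same initialization as in Network.__init__, written out for sizes = [2, 3, 1].
sizes = [2, 3, 1]
biases = [np.random.randn(y, 1) for y in sizes[1:]]
weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]

# One bias column per non-input layer; one weight matrix per pair of adjacent layers.
bias_shapes = [b.shape for b in biases]
weight_shapes = [w.shape for w in weights]
print(bias_shapes)    # [(3, 1), (1, 1)]
print(weight_shapes)  # [(3, 2), (1, 3)]
```

Each weight matrix has one row per neuron in the later layer and one column per neuron in the earlier layer, which is what lets a layer's weighted inputs be computed with a single matrix product.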


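Note also that the biases and weights are stored as lists of Numpy matrices. So, for example, net.weights[1] is a Numpy matrix storing the weights connecting the second and third layers of neurons. (It's not the first and second layers, since Python's list indexing starts at 0.) Since net.weights[1] is rather verbose, let's just denote that matrix w. Its entry w_jk is the weight for the connection between the k-th neuron in the second layer and the j-th neuron in the third layer. With a the vector of activations of the second layer and b the vector of biases of the third layer, the vector of activations of the third layer is

$$a' = \sigma(w a + b) \tag{22}$$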
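As a rough illustration of Equation (22), the sketch below computes one layer's activations twice: once in the vectorized form and once neuron by neuron. The numerical values of w, b, and a are made up for the example and have no special meaning.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical values for a layer with 2 inputs and 3 neurons.
w = np.array([[0.5, -1.0],
              [1.5,  2.0],
              [0.0,  0.3]])            # w[j, k]: weight from input neuron k to neuron j
b = np.array([[0.1], [-0.2], [0.0]])   # one bias per neuron, stored as a column
a = np.array([[1.0], [2.0]])           # activations of the previous layer, shape (2, 1)

# Vectorized form of equation (22).
a_next = sigmoid(np.dot(w, a) + b)

# The same computation written out neuron by neuron.
a_loop = np.zeros((3, 1))
for j in range(3):
    z_j = sum(w[j, k] * a[k, 0] for k in range(2)) + b[j, 0]
    a_loop[j, 0] = sigmoid(z_j)

print(np.allclose(a_next, a_loop))  # True
```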


With all this in mind, it's easy to write code computing the output from a Network instance. We begin by defining the sigmoid function:

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

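We then add a feedforward method to the Network class, which, given an input a for the network, returns the corresponding output.*

*It is assumed that the input a is an (n, 1) Numpy ndarray, not an (n,) vector. Here, n is the number of inputs to the network. If you try to use an (n,) vector as input you'll get strange results. Although using an (n,) vector appears the more natural choice, using an (n, 1) ndarray makes it particularly easy to modify the code to feedforward multiple inputs at once, and that is sometimes convenient.

All the method does is apply Equation (22) for each layer: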

    def feedforward(self, a):
        """Return the output of the network if "a" is input."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a
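The note about (n, 1) versus (n,) inputs comes down to Numpy broadcasting. The following sketch, using a made-up layer with 2 inputs and 3 neurons, shows the shapes that result in each case, and how stacking inputs as columns feeds several of them forward at once.

```python
import numpy as np

w = np.random.randn(3, 2)   # 3 neurons, 2 inputs
b = np.random.randn(3, 1)   # biases stored as a column, as in Network

column = np.ones((2, 1))    # the (n, 1) form the code expects
flat = np.ones(2)           # the (n,) form

z_column = np.dot(w, column) + b   # (3, 1) + (3, 1) gives (3, 1)
z_flat = np.dot(w, flat) + b       # (3,) + (3, 1) broadcasts to (3, 3)

print(z_column.shape)  # (3, 1)
print(z_flat.shape)    # (3, 3)

# Several inputs at once: stack them as the columns of a (2, m) array.
batch = np.ones((2, 5))
z_batch = np.dot(w, batch) + b     # b is broadcast across the 5 columns
print(z_batch.shape)   # (3, 5)
```

The (3, 3) result raises no error, which is why the mistake is easy to miss: the network silently produces output of the wrong shape.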

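Of course, the main thing we want our Network objects to do is to learn. To that end we'll give them an SGD method which implements stochastic gradient descent. Here's the code. It's a little mysterious in a few places, but I'll break it down piece by piece after the listing.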

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            test_data=None):
        """Train the neural network using mini-batch stochastic
        gradient descent.  The "training_data" is a list of tuples
        "(x, y)" representing the training inputs and the desired
        outputs.  The other non-optional parameters are
        self-explanatory.  If "test_data" is provided then the
        network will be evaluated against the test data after each
        epoch, and partial progress printed out.  This is useful for
        tracking progress, but slows things down substantially."""
        if test_data: n_test = len(test_data)
        n = len(training_data)
        for j in xrange(epochs):
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in xrange(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print "Epoch {0}: {1} / {2}".format(
                    j, self.evaluate(test_data), n_test)
            else:
                print "Epoch {0} complete".format(j)

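The training_data is a list of tuples (x, y) representing the training inputs and the corresponding desired outputs. The variables epochs and mini_batch_size are what you'd expect: the number of epochs to train for, and the size of the mini-batches to use when sampling. eta is the learning rate, η.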
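In each epoch, SGD shuffles the training data and slices it into consecutive mini-batches. A minimal standalone sketch of that partitioning step, using a toy list of ten integers in place of the MNIST tuples (the toy data and the name batch_sizes are assumptions for illustration):

```python
import random

training_data = list(range(10))   # toy stand-in for the list of (x, y) tuples
mini_batch_size = 3
n = len(training_data)

random.shuffle(training_data)     # reorders the list in place
mini_batches = [training_data[k:k + mini_batch_size]
                for k in range(0, n, mini_batch_size)]

batch_sizes = [len(mb) for mb in mini_batches]
print(batch_sizes)  # [3, 3, 3, 1]
```

Every training example lands in exactly one mini-batch per epoch, and the last mini-batch is simply shorter when the batch size does not divide the data set size.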

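The code works as follows. In each epoch, it starts by randomly shuffling the training data, and then partitions it into mini-batches of the appropriate size. This is an easy way of sampling randomly from the training data. Then for each mini_batch we apply a single step of gradient descent. This is done by the code self.update_mini_batch(mini_batch, eta), which updates the network weights and biases according to a single iteration of gradient descent, using just the training data in mini_batch. Here's the code for the update_mini_batch method: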

    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The "mini_batch" is a list of tuples "(x, y)", and "eta"
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw 
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb 
                       for b, nb in zip(self.biases, nabla_b)]

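Most of the work is done by the line

delta_nabla_b, delta_nabla_w = self.backprop(x, y)

The method sums these gradients over every training example in the mini_batch, and then updates self.weights and self.biases accordingly. I'm not going to show the code for self.backprop right now. We'll study how backpropagation works in the next chapter, including the code for self.backprop. For now, just assume that it behaves as claimed, returning the appropriate gradient for the cost associated with the training example x.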
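Read off from the last four lines of update_mini_batch, the update applied to every weight matrix and bias vector can be written as follows. Here m stands for len(mini_batch), η for eta, and the gradients of the per-example cost C_x are the quantities returned by self.backprop; this is a restatement of the code in mathematical notation.

```latex
w \;\rightarrow\; w' = w - \frac{\eta}{m} \sum_{(x, y) \in \text{mini-batch}} \nabla_w C_x ,
\qquad
b \;\rightarrow\; b' = b - \frac{\eta}{m} \sum_{(x, y) \in \text{mini-batch}} \nabla_b C_x
```

In the code, nabla_w and nabla_b accumulate the two sums, and the factor eta/len(mini_batch) turns each sum into an average scaled by the learning rate.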

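Let's look at the full program, including the documentation strings, which I omitted above. Apart from self.backprop the program is self-explanatory: all the heavy lifting is done in self.SGD and self.update_mini_batch, which we've already discussed. The self.backprop method makes use of a few extra functions to help in computing the gradient, for example sigmoid_prime, which computes the derivative of the sigmoid function.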

"""
network.py
~~~~~~~~~~

A module to implement the stochastic gradient descent learning
algorithm for a feedforward neural network.  Gradients are calculated
using backpropagation.  Note that I have focused on making the code
simple, easily readable, and easily modifiable.  It is not optimized,
and omits many desirable features.
"""

#### Libraries
# Standard library
import random

# Third-party libraries
import numpy as np

class Network(object):

    def __init__(self, sizes):
        """The list ``sizes`` contains the number of neurons in the
        respective layers of the network.  For example, if the list
        was [2, 3, 1] then it would be a three-layer network, with the
        first layer containing 2 neurons, the second layer 3 neurons,
        and the third layer 1 neuron.  The biases and weights for the
        network are initialized randomly, using a Gaussian
        distribution with mean 0, and variance 1.  Note that the first
        layer is assumed to be an input layer, and by convention we
        won't set any biases for those neurons, since biases are only
        ever used in computing the outputs from later layers."""
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]

    def feedforward(self, a):
        """Return the output of the network if ``a`` is input."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            test_data=None):
        """Train the neural network using mini-batch stochastic
        gradient descent.  The ``training_data`` is a list of tuples
        ``(x, y)`` representing the training inputs and the desired
        outputs.  The other non-optional parameters are
        self-explanatory.  If ``test_data`` is provided then the
        network will be evaluated against the test data after each
        epoch, and partial progress printed out.  This is useful for
        tracking progress, but slows things down substantially."""
        if test_data: n_test = len(test_data)
        n = len(training_data)
        for j in xrange(epochs):
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in xrange(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print "Epoch {0}: {1} / {2}".format(
                    j, self.evaluate(test_data), n_test)
            else:
                print "Epoch {0} complete".format(j)

    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta``
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]

    def backprop(self, x, y):
        """Return a tuple ``(nabla_b, nabla_w)`` representing the
        gradient for the cost function C_x.  ``nabla_b`` and
        ``nabla_w`` are layer-by-layer lists of numpy arrays, similar
        to ``self.biases`` and ``self.weights``."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * \
            sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Note that the variable l in the loop below is used a little
        # differently to the notation in Chapter 2 of the book.  Here,
        # l = 1 means the last layer of neurons, l = 2 is the
        # second-last layer, and so on.  It's a renumbering of the
        # scheme in the book, used here to take advantage of the fact
        # that Python can use negative indices in lists.
        for l in xrange(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

    def evaluate(self, test_data):
        """Return the number of test inputs for which the neural
        network outputs the correct result. Note that the neural
        network's output is assumed to be the index of whichever
        neuron in the final layer has the highest activation."""
        test_results = [(np.argmax(self.feedforward(x)), y)
                        for (x, y) in test_data]
        return sum(int(x == y) for (x, y) in test_results)

    def cost_derivative(self, output_activations, y):
        """Return the vector of partial derivatives \partial C_x /
        \partial a for the output activations."""
        return (output_activations-y)

#### Miscellaneous functions
def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))
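One quick way to gain confidence in sigmoid_prime is to compare it with a numerical derivative. The sketch below repeats the two functions from the listing and checks them against a central difference; the step size h and the sample points are arbitrary choices made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

z = np.linspace(-5.0, 5.0, 11)   # sample points, including z = 0
h = 1e-5                          # finite-difference step

# Central difference approximation to the derivative of sigmoid.
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
max_error = np.max(np.abs(numeric - sigmoid_prime(z)))

print(max_error < 1e-8)       # True
print(sigmoid_prime(0.0))     # 0.25
```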

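How well does the program recognize handwritten digits? Let's start by loading in the MNIST data. I'll do this using a little helper program, mnist_loader.py, described below. We execute the following commands in a Python shell: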

>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()

Of course, this could also be done in a separate Python program, but working in a Python shell is probably the easiest way.

After loading the MNIST data, we import the network module and set up a Network with 30 hidden neurons:

>>> import network
>>> net = network.Network([784, 30, 10])

Finally, we'll use stochastic gradient descent to learn from the MNIST training_data over 30 epochs, with a mini-batch size of 10 and a learning rate of η = 3.0:

>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)

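Note that if you're running the code as you read along, it will take some time to execute: on a typical machine (as of 2015) it will likely take a few minutes to run. I suggest you set things running, continue to read, and periodically check the output from the code. If you're in a rush you can speed things up by decreasing the number of epochs, by decreasing the number of hidden neurons, or by using only part of the training data. Note that production code would be much faster: these Python scripts are intended to help you understand how neural nets work, not to be high-performance code! And, of course, once we've trained a network it can be run very quickly indeed, on almost any computing platform. For example, once we've learned a good set of weights and biases for a network, it can easily be ported to run in Javascript in a web browser, or as a native app on a mobile device.

In any case, here is a partial transcript of the output of one training run of the neural network. The transcript shows the number of test images correctly recognized by the neural network after each epoch of training. As you can see, after just a single epoch this has reached 9,129 out of 10,000, and the number continues to grow: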

Epoch 0: 9129 / 10000
Epoch 1: 9295 / 10000
Epoch 2: 9348 / 10000
...
Epoch 27: 9528 / 10000
Epoch 28: 9542 / 10000
Epoch 29: 9534 / 10000

Going by the transcript above, the trained network gives us a classification rate of about 95 percent, 95.42 percent at its peak (epoch 28).

Let's rerun the above experiment, changing the number of hidden neurons to 100:

>>> net = network.Network([784, 100, 10])
>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)

Sure enough, this improves the results.

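Of course, to obtain these accuracies I had to make specific choices for the training parameters: the number of epochs, the mini-batch size, and the learning rate η. Suppose, for example, that we had chosen the learning rate to be η = 0.001: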

>>> net = network.Network([784, 100, 10])
>>> net.SGD(training_data, 30, 10, 0.001, test_data=test_data)

The results are much less encouraging:

Epoch 0: 1139 / 10000
Epoch 1: 1136 / 10000
Epoch 2: 1135 / 10000
...
Epoch 27: 2101 / 10000
Epoch 28: 2123 / 10000
Epoch 29: 2142 / 10000

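In general, debugging a neural network can be challenging. This is especially true when the initial choice of hyper-parameters produces results no better than random noise. Suppose we try our 30-hidden-neuron network with the learning rate set to η = 100.0: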

>>> net = network.Network([784, 30, 10])
>>> net.SGD(training_data, 30, 10, 100.0, test_data=test_data)

Epoch 0: 1009 / 10000
Epoch 1: 1009 / 10000
Epoch 2: 1009 / 10000
Epoch 3: 1009 / 10000
...
Epoch 27: 982 / 10000
Epoch 28: 982 / 10000
Epoch 29: 982 / 10000

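Now imagine that we were coming to this problem for the first time. Of course, we know from our earlier experiments that the right thing to do is to decrease the learning rate. But if we were coming to this problem for the first time then there wouldn't be much in the output to guide us on what to do. We might worry not only about the learning rate, but about every other aspect of our neural network. We might wonder whether we've initialized the weights and biases in a way that makes it hard for the network to learn. Or maybe we don't have enough training data to get meaningful learning? Perhaps we haven't run for enough epochs? Or maybe it's impossible for a neural network with this architecture to learn to recognize handwritten digits? Maybe the learning rate is too low? Or too high? When you're coming to a problem for the first time, you're not always sure what the cause is.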
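The effect of a learning rate that is too low or too high can be seen on a much simpler problem than MNIST. The sketch below runs plain gradient descent on the one-variable cost C(v) = v**2. It is only an analogy: the function descend, the cost, and the three learning rates are assumptions chosen for illustration, and a sigmoid network with too large a learning rate typically gets stuck rather than blowing up like this toy.

```python
def descend(eta, steps=30, v=1.0):
    """Run plain gradient descent on C(v) = v**2, whose gradient is 2*v."""
    for _ in range(steps):
        v = v - eta * 2 * v
    return v

too_small = descend(0.001)   # barely moves from the start value 1.0
reasonable = descend(0.3)    # shrinks quickly towards the minimum at 0
too_large = descend(1.5)     # each step overshoots, so the iterate grows

print(too_small)
print(reasonable)
print(too_large)
```

After the same 30 steps, the first run is still close to where it started, the second is essentially at the minimum, and the third has moved far away from it.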

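The lesson to take away from this is that debugging a neural network is not trivial, and, just as for ordinary programming, there is an art to it. You need to learn that art of debugging in order to get good results from neural networks. More generally, we need to develop heuristics for choosing good hyper-parameters and a good architecture. The book discusses all of these at length, including how to choose suitable hyper-parameters.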

Exercise

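Try creating a network with just two layers, an input and an output layer and no hidden layer, with 784 and 10 neurons, respectively. Train the network using stochastic gradient descent. What classification accuracy can you achieve?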

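Earlier, I skipped over the details of how the MNIST data is loaded. It's pretty straightforward. For completeness, here's the code. The data structures used to store the MNIST data are described in the documentation strings. It's straightforward stuff: tuples and lists of Numpy ndarray objects (think of them as vectors if you're not familiar with ndarrays):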

"""
mnist_loader
~~~~~~~~~~~~

A library to load the MNIST image data.  For details of the data
structures that are returned, see the doc strings for ``load_data``
and ``load_data_wrapper``.  In practice, ``load_data_wrapper`` is the
function usually called by our neural network code.
"""

#### Libraries
# Standard library
import cPickle
import gzip

# Third-party libraries
import numpy as np

def load_data():
    """Return the MNIST data as a tuple containing the training data,
    the validation data, and the test data.

    The ``training_data`` is returned as a tuple with two entries.
    The first entry contains the actual training images.  This is a
    numpy ndarray with 50,000 entries.  Each entry is, in turn, a
    numpy ndarray with 784 values, representing the 28 * 28 = 784
    pixels in a single MNIST image.

    The second entry in the ``training_data`` tuple is a numpy ndarray
    containing 50,000 entries.  Those entries are just the digit
    values (0...9) for the corresponding images contained in the first
    entry of the tuple.

    The ``validation_data`` and ``test_data`` are similar, except
    each contains only 10,000 images.

    This is a nice data format, but for use in neural networks it's
    helpful to modify the format of the ``training_data`` a little.
    That's done in the wrapper function ``load_data_wrapper()``, see
    below.
    """
    f = gzip.open('../data/mnist.pkl.gz', 'rb')
    training_data, validation_data, test_data = cPickle.load(f)
    f.close()
    return (training_data, validation_data, test_data)

def load_data_wrapper():
    """Return a tuple containing ``(training_data, validation_data,
    test_data)``. Based on ``load_data``, but the format is more
    convenient for use in our implementation of neural networks.

    In particular, ``training_data`` is a list containing 50,000
    2-tuples ``(x, y)``.  ``x`` is a 784-dimensional numpy.ndarray
    containing the input image.  ``y`` is a 10-dimensional
    numpy.ndarray representing the unit vector corresponding to the
    correct digit for ``x``.

    ``validation_data`` and ``test_data`` are lists containing 10,000
    2-tuples ``(x, y)``.  In each case, ``x`` is a 784-dimensional
    numpy.ndarray containing the input image, and ``y`` is the
    corresponding classification, i.e., the digit values (integers)
    corresponding to ``x``.

    Obviously, this means we're using slightly different formats for
    the training data and the validation / test data.  These formats
    turn out to be the most convenient for use in our neural network
    code."""
    tr_d, va_d, te_d = load_data()
    training_inputs = [np.reshape(x, (784, 1)) for x in tr_d[0]]
    training_results = [vectorized_result(y) for y in tr_d[1]]
    training_data = zip(training_inputs, training_results)
    validation_inputs = [np.reshape(x, (784, 1)) for x in va_d[0]]
    validation_data = zip(validation_inputs, va_d[1])
    test_inputs = [np.reshape(x, (784, 1)) for x in te_d[0]]
    test_data = zip(test_inputs, te_d[1])
    return (training_data, validation_data, test_data)

def vectorized_result(j):
    """Return a 10-dimensional unit vector with a 1.0 in the jth
    position and zeroes elsewhere.  This is used to convert a digit
    (0...9) into a corresponding desired output from the neural
    network."""
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e
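To see how the two label formats relate, the short sketch below repeats vectorized_result and converts a digit to its unit vector and back with np.argmax, which is the same call evaluate uses to read off the network's answer. The digit 3 is an arbitrary example.

```python
import numpy as np

def vectorized_result(j):
    """Same helper as in mnist_loader: a (10, 1) unit vector for digit j."""
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

y = vectorized_result(3)
digit = int(np.argmax(y))   # index of the largest entry recovers the digit

print(y.shape)   # (10, 1)
print(digit)     # 3
```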

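I said above that our program gets pretty good results. What does that mean? Good compared to what? It's informative to have some simple (non-neural-network) baseline tests to compare against, to understand what it means to perform well. The simplest baseline of all, of course, is to randomly guess the digit. That'll be right about ten percent of the time. We're doing much better than that!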

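What about a less trivial baseline? Let's try an extremely simple idea: we'll look at how dark an image is. For instance, an image of a 2 will typically be quite a bit darker than an image of a 1, just because more pixels are blackened out.

This suggests using the training data to compute the average darkness for each digit, 0, 1, 2, …, 9. When presented with a new image, we compute how dark the image is, and then guess that it's whichever digit has the closest average darkness.

This procedure is easy to implement and improves on random guessing, although its accuracy is far below that of the neural network.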
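A minimal sketch of the average-darkness baseline is given below. It is not the author's code: the helper names average_darkness and classify are hypothetical, and the data is synthetic (each fake "digit" d is given a total darkness of roughly d), so the perfect score it gets here says nothing about accuracy on real MNIST images.

```python
import numpy as np

def average_darkness(images, labels):
    """Mean total pixel intensity for each digit 0..9, from training data."""
    totals = images.sum(axis=1)
    return np.array([totals[labels == d].mean() for d in range(10)])

def classify(image, avgs):
    """Return the digit whose average darkness is closest to the image's."""
    return int(np.argmin(np.abs(avgs - image.sum())))

# Synthetic stand-in data (NOT MNIST): 20 images per digit, 784 pixels each,
# with a little noise added on top of a per-pixel intensity of d / 784.
rng = np.random.RandomState(0)
labels = np.repeat(np.arange(10), 20)
images = rng.rand(200, 784) * 0.001 + labels[:, None] / 784.0

avgs = average_darkness(images, labels)
predictions = [classify(img, avgs) for img in images]
print(predictions == labels.tolist())  # True on this artificial data
```

To try the idea on the real data, one would pass the flattened MNIST training images and integer labels to average_darkness and classify the test images the same way.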

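If you run scikit-learn's SVM classifier using the default settings, it gets about 94.35 percent accuracy (the original post links to the code). That's a huge improvement over our naive approach of classifying an image based on how dark it is. Indeed, it means that the SVM is performing only slightly worse than our neural network. In a later chapter we'll introduce new techniques that enable us to improve our neural networks so that they perform better than the SVM.

That's not the end of the story, however. The 94.35 percent result is for scikit-learn's default settings for SVMs. SVMs have a number of tunable parameters, and it's possible to search for parameters which improve this performance. I won't explicitly do this search, but instead refer you to the blog post by Andreas Mueller if you'd like to know more. Mueller shows that with some work optimizing the SVM's parameters it's possible to get the accuracy up to 98.5 percent. In other words, a well-tuned SVM only makes an error on about one digit in 70. That's pretty good! Can neural networks do better?

In fact, they can. At present, well-designed neural networks outperform every other technique for solving MNIST, including SVMs. The current (2013) record is classifying 9,979 of 10,000 images correctly (99.79 percent). This was done by Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. We'll see most of the techniques they used later in the book. At that level the performance is close to human-equivalent, and is arguably better, since quite a few of the MNIST images are difficult even for humans to recognize with confidence, for example:

[Image: hard-to-classify MNIST digits]

I trust you'll agree that those are tough to classify! With images like these in the MNIST data set it's remarkable that neural networks can accurately classify all but 21 of the 10,000 test images. Usually, when programming we believe that solving a complicated problem like recognizing the MNIST digits requires a sophisticated algorithm. But even the neural networks in the Wan et al paper just mentioned involve quite simple algorithms, variations on the algorithm we've seen in this chapter. All the complexity is learned, automatically, from the training data. In some sense, the moral of both our results and those in more sophisticated papers is that for some problems:

sophisticated algorithm ≤ simple learning algorithm + good training data.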

Toward deep learning

Translator's note: the translation last reached this point on 2017-01-11 00:41; I will continue translating from here.

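While our neural network gives impressive performance, that performance is somewhat mysterious. The weights and biases in the network were discovered automatically. And that means we don't immediately have an explanation of how the network does what it does. Can we find some way to understand the principles by which our network is classifying handwritten digits? And, given such principles, can we do better?

To put these questions more starkly, suppose that a few decades hence neural networks lead to artificial intelligence (AI). Will we understand how such intelligent networks work? Perhaps the networks will be opaque to us, with weights and biases we don't understand, because they've been learned automatically. In the early days of AI research people hoped that the effort to build an AI would also help us understand the principles behind intelligence and, maybe, the functioning of the human brain. But perhaps the outcome will be that we end up understanding neither the brain nor how artificial intelligence works!

To address these questions, let's think back to the interpretation of artificial neurons that I gave at the start of the chapter, as a means of weighing evidence. Suppose we want to determine whether an image shows a human face or not:

We could attack this problem the same way we attacked handwriting recognition: by using the pixels in the image as input to a neural network, with the output from the network a single neuron indicating either "Yes, it's a face" or "No, it's not a face".

Let's suppose we do this, but that we're not using a learning algorithm. Instead, we're going to try to design a network by hand, choosing appropriate weights and biases. How might we go about it? Forgetting neural networks entirely for the moment, a heuristic we could use is to decompose the problem into sub-problems: does the image have an eye in the top left? Does it have an eye in the top right? Does it have a nose in the middle? Does it have a mouth in the bottom middle? And so on.

If the answers to several of these questions are "yes", or even just "probably yes", then we'd conclude that the image is likely to be a face. Conversely, if the answers to most of the questions are "no", then the image probably isn't a face.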
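The idea of weighing the answers to these sub-questions can be sketched with a single sigmoid neuron. Everything in the block below is hypothetical: the four inputs stand for the outputs of imagined sub-networks (left eye, right eye, nose, mouth), and the weights and bias are hand-picked for illustration, not learned.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical answers from four sub-networks, each between 0 ("no") and 1 ("yes"):
# left eye, right eye, nose, mouth.
weights = np.array([2.0, 2.0, 2.0, 2.0])   # assumed: equal weight per feature
bias = -5.0                                 # assumed: roughly three "yes" answers needed

def looks_like_face(answers):
    """Combine the sub-network answers into a single face score in (0, 1)."""
    return sigmoid(np.dot(weights, answers) + bias)

mostly_yes = looks_like_face(np.array([1.0, 1.0, 1.0, 0.0]))   # sigmoid(1.0)
mostly_no = looks_like_face(np.array([0.0, 1.0, 0.0, 0.0]))    # sigmoid(-3.0)

print(mostly_yes > 0.5)  # True
print(mostly_no < 0.5)   # True
```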

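Of course, this is just a rough heuristic, and it suffers from many deficiencies. Maybe the person is bald, so they have no hair. Maybe we can only see part of the face, or the face is at an angle, so some of the facial features are obscured. Still, the heuristic suggests that if we can solve the sub-problems using neural networks, then perhaps we can build a neural network for face detection by combining the networks for the sub-problems. Here's a possible architecture, with rectangles denoting the sub-networks. Note that this isn't intended as a realistic approach to solving the face-detection problem; rather, it's to help us build intuition about how networks function.

[Figure: face-detection architecture built from sub-networks]

It's also plausible that the sub-networks can be decomposed. Suppose we're considering the question: "Is there an eye in the top left?" This can be decomposed into questions such as: "Is there an eyebrow?"; "Are there eyelashes?"; "Is there an iris?"; and so on. Of course, these questions should really include positional information as well ("Is the eyebrow in the top left, and above the iris?", that kind of thing), but let's keep it simple. The network to answer the question "Is there an eye in the top left?" can now be decomposed:

Those questions too can be broken down, further and further through multiple layers. Ultimately, we'll be working with sub-networks that answer questions so simple they can easily be answered at the level of single pixels. Those questions might, for example, be about the presence or absence of very simple shapes at particular points in the image. Such questions can be answered by single neurons connected to the raw pixels in the image.

The end result is a network which breaks down a very complicated question (does this image show a face or not) into very simple questions answerable at the level of single pixels. It does this through a series of many layers, with early layers answering very simple and specific questions about the input image, and later layers building up ever more complex and abstract concepts. Networks with this kind of many-layer structure, two or more hidden layers, are called deep neural networks.

Of course, I haven't said how to do this recursive decomposition into sub-networks. It certainly isn't practical to hand-design the weights and biases in the network. Instead, we'd like to use learning algorithms so that the network can automatically learn the weights and biases from training data. Researchers in the 1980s and 1990s tried using stochastic gradient descent and backpropagation to train deep networks. Unfortunately, except for a few special architectures, they didn't have much luck. The networks would learn, but very slowly, too slowly to be useful in practice.

Since 2006, a set of techniques has been developed that enable learning in deep neural nets. These deep learning techniques are based on stochastic gradient descent and backpropagation, but also introduce new ideas. These techniques have enabled much deeper (and larger) networks to be trained: people now routinely train networks with 5 to 10 hidden layers, and these perform far better than shallow neural networks, i.e., networks with just a single hidden layer. The reason, of course, is the ability of deep nets to build up complex concepts. It's a bit like the way conventional programming languages use modular design and ideas about abstraction to enable the creation of complex computer programs. Comparing a deep network to a shallow network is a bit like comparing a programming language with the ability to make function calls to a stripped-down language with no ability to make such calls. Abstraction takes a different form in neural networks than it does in conventional programming, but it's just as important.

Translator's note: this completes the translation of the material on shallow neural networks, finished on 2017-01-22 23:35. Next I will translate the material on deep neural networks and deep learning. Because time was short, there may be awkward passages and typos; I will go back and correct them over time. Thank you for your understanding.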