Posted on August 11, 2016 by ujjwalkarnphp
Convolutional Neural Networks (ConvNets or CNNs) are a category of Neural Networks that have proven very effective in areas such as image recognition and classification. ConvNets have been successful in identifying faces, objects and traffic signs apart from powering vision in robots and self driving cars.html
卷積神經網絡(ConvNets 或者 CNNs)屬於神經網絡的範疇,已經在諸如圖像識別和分類的領域證實了其高效的能力。卷積神經網絡能夠成功識別人臉、物體和交通訊號,從而爲機器人和自動駕駛汽車提供視力。node
In Figure 1 above, a ConvNet is able to recognize scenes and the system is able to suggest relevant captions (「a soccer player is kicking a soccer ball」) while Figure 2 shows an example of ConvNets being used for recognizing everyday objects, humans and animals. Lately, ConvNets have been effective in several Natural Language Processing tasks (such as sentence classification) as well.python
在圖1中,卷積神經網絡能夠識別場景,也能夠提供相關的標籤,好比「橋樑」、「火車」和「網球」;而下圖展現了卷積神經網絡能夠用來識別平常物體、人和動物。最近,卷積神經網絡也在一些天然語言處理任務(好比語句分類)上面展現了良好的效果。git
ConvNets, therefore, are an important tool for most machine learning practitioners today. However, understanding ConvNets and learning to use them for the first time can sometimes be an intimidating experience. The primary purpose of this blog post is to develop an understanding of how Convolutional Neural Networks work on images.github
所以,卷積神經網絡對於今天大多數的機器學習用戶來講都是一個重要的工具。然而,理解卷積神經網絡以及首次學習使用它們有時會很痛苦。那本篇博客的主要目的就是讓咱們對卷積神經網絡如何處理圖像有一個基本的瞭解。web
If you are new to neural networks in general, I would recommend reading this short tutorial on Multi Layer Perceptrons to get an idea about how they work, before proceeding. Multi Layer Perceptrons are referred to as 「Fully Connected Layers」 in this post.網絡
若是你是神經網絡的新手,我建議你閱讀下這篇短小的多層感知器的教程,在進一步閱讀前對神經網絡有必定的理解。在本篇博客中,多層感知器叫作「全鏈接層」。架構
LeNet was one of the very first convolutional neural networks which helped propel the field of Deep Learning. This pioneering work by Yann LeCun was named LeNet5 after many previous successful iterations since the year 1988 [3]. At that time the LeNet architecture was used mainly for character recognition tasks such as reading zip codes, digits, etc.app
LeNet 是推動深度學習領域發展的最先的卷積神經網絡之一。通過屢次成功迭代,到 1988 年,Yann LeCun 把這一先驅工做命名爲 LeNet5。當時,LeNet 架構主要用於字符識別任務,好比讀取郵政編碼、數字等等。
Below, we will develop an intuition of how the LeNet architecture learns to recognize images. There have been several new architectures proposed in the recent years which are improvements over the LeNet, but they all use the main concepts from the LeNet and are relatively easier to understand if you have a clear understanding of the former.
接下來,咱們將會了解 LeNet 架構是如何學會識別圖像的。近年來有許多在 LeNet 上面改進的新架構被提出來,但它們都使用了 LeNet 中的主要概念,若是你對 LeNet 有一個清晰的認識,就相對比較容易理解。
The Convolutional Neural Network in Figure 3 is similar in architecture to the original LeNet and classifies an input image into four categories: dog, cat, boat or bird (the original LeNet was used mainly for character recognition tasks). As evident from the figure above, on receiving a boat image as input, the network correctly assigns the highest probability for boat (0.94) among all four categories. The sum of all probabilities in the output layer should be one (explained later in this post).
圖3中的卷積神經網絡和原始的 LeNet 的結構比較類似,能夠把輸入的圖像分爲四類:狗、貓、船或者鳥(原始的 LeNet 主要用於字符識別任務)。正如上圖說示,當輸入爲一張船的圖片時,網絡能夠正確的從四個類別中把最高的機率分配給船(0.94)。在輸出層全部機率的和應該爲一(本文稍後會解釋)。
There are four main operations in the ConvNet shown in Figure 3 above:
在圖3中ConvNet有四個主要操做:
Classification (Fully Connected Layer)
1.卷積
2.非線性處理(ReLU)
3.池化(亞採樣)
4.分類(全鏈接層)
These operations are the basic building blocks of every Convolutional Neural Network, so understanding how these work is an important step to developing a sound understanding of ConvNets. We will try to understand the intuition behind each of these operations below.
這些操做對於各個卷積神經網絡來講都是基本組件,所以理解它們的工做原理有助於充分了解卷積神經網絡。下面咱們將會嘗試理解各步操做背後的原理。
Essentially, every image can be represented as a matrix of pixel values.
本質上來講,每張圖像均可以表示爲像素值的矩陣:
Channel is a conventional term used to refer to a certain component of an image. An image from a standard digital camera will have three channels – red, green and blue – you can imagine those as three 2d-matrices stacked over each other (one for each color), each having pixel values in the range 0 to 255.
通道 經常使用於表示圖像的某種組成。一個標準數字相機拍攝的圖像會有三通道 - 紅、綠和藍;你能夠把它們看做是互相堆疊在一塊兒的二維矩陣(每個通道表明一個顏色),每一個通道的像素值在 0 到 255 的範圍內。
A grayscale image, on the other hand, has just one channel. For the purpose of this post, we will only consider grayscale images, so we will have a single 2d matrix representing an image. The value of each pixel in the matrix will range from 0 to 255 – zero indicating black and 255 indicating white.
灰度圖像,僅僅只有一個通道。在本篇文章中,咱們僅考慮灰度圖像,這樣咱們就只有一個二維的矩陣來表示圖像。矩陣中各個像素的值在 0 到 255 的範圍內——零表示黑色,255 表示白色。
ConvNets derive their name from the 「convolution」 operator. The primary purpose of Convolution in case of a ConvNet is to extract features from the input image. Convolution preserves the spatial relationship between pixels by learning image features using small squares of input data. We will not go into the mathematical details of Convolution here, but will try to understand how it works over images.
卷積神經網絡的名字就來自於其中的卷積操做。卷積的主要目的是爲了從輸入圖像中提取特徵。卷積能夠經過從輸入的一小塊數據中學到圖像的特徵,並能夠保留像素間的空間關係。咱們在這裏並不會詳細講解卷積的數學細節,但咱們會試着理解卷積是如何處理圖像的。
As we discussed above, every image can be considered as a matrix of pixel values. Consider a 5 x 5 image whose pixel values are only 0 and 1 (note that for a grayscale image, pixel values range from 0 to 255, the green matrix below is a special case where pixel values are only 0 and 1):
正如咱們上面所說,每張圖像均可以看做是像素值的矩陣。考慮一下一個 5 x 5 的圖像,它的像素值僅爲 0 或者 1(注意對於灰度圖像而言,像素值的範圍是 0 到 255,下面像素值爲 0 和 1 的綠色矩陣僅爲特例):
Also, consider another 3 x 3 matrix as shown below:
同時,考慮下另外一個 3 x 3 的矩陣,以下所示:
Then, the Convolution of the 5 x 5 image and the 3 x 3 matrix can be computed as shown in the animation in Figure 5 below:
接下來,5 x 5 的圖像和 3 x 3 的矩陣的卷積能夠按下圖所示的動畫同樣計算:
Take a moment to understand how the computation above is being done. We slide the orange matrix over our original image (green) by 1 pixel (also called ‘stride’) and for every position, we compute element wise multiplication (between the two matrices) and add the multiplication outputs to get the final integer which forms a single element of the output matrix (pink). Note that the 3×3 matrix 「sees」 only a part of the input image in each stride.
如今停下來好好理解下上面的計算是怎麼完成的。咱們用橙色的矩陣在原始圖像(綠色)上滑動,每次滑動一個像素(也叫作「步長」),在每一個位置上,咱們計算對應元素的乘積(兩個矩陣間),並把乘積的和做爲最後的結果,獲得輸出矩陣(粉色)中的每個元素的值。注意,3 x 3 的矩陣每次步長中僅能夠「看到」輸入圖像的一部分。
In CNN terminology, the 3×3 matrix is called a ‘filter‘ or ‘kernel’ or ‘feature detector’ and the matrix formed by sliding the filter over the image and computing the dot product is called the ‘Convolved Feature’ or ‘Activation Map’ or the ‘Feature Map‘. It is important to note that filters acts as feature detectors from the original input image.
在 CNN 的術語中,3x3 的矩陣叫作「濾波器(filter)」或者「核(kernel)」或者「特徵檢測器(feature detector)」,經過在圖像上滑動濾波器並計算點乘獲得矩陣叫作「卷積特徵(Convolved Feature)」或者「激活圖(Activation Map)」或者「特徵圖(Feature Map)」。記住濾波器在原始輸入圖像上的做用是特徵檢測器。
It is evident from the animation above that different values of the filter matrix will produce different Feature Maps for the same input image. As an example, consider the following input image:
從上面圖中的動畫能夠看出,對於一樣的輸入圖像,不一樣值的濾波器將會生成不一樣的特徵圖。好比,對於下面這張輸入圖像:
In the table below, we can see the effects of convolution of the above image with different filters. As shown, we can perform operations such as Edge Detection, Sharpen and Blur just by changing the numeric values of our filter matrix before the convolution operation [8] – this means that different filters can detect different features from an image, for example edges, curves etc. More such examples are available in Section 8.2.4 here.
Another good way to understand the Convolution operation is by looking at the animation in Figure 6 below:
A filter (with red outline) slides over the input image (convolution operation) to produce a feature map. The convolution of another filter (with the green outline), over the same image gives a different feature map as shown. It is important to note that the Convolution operation captures the local dependencies in the original image. Also notice how these two different filters generate different feature maps from the same original image. Remember that the image and the two filters above are just numeric matrices as we have discussed above.
In practice, a CNN learns the values of these filters on its own during the training process (although we still need to specify parameters such as number of filters, filter size, architecture of the networketc. before the training process). The more number of filters we have, the more image features get extracted and the better our network becomes at recognizing patterns in unseen images.
The size of the Feature Map (Convolved Feature) is controlled by three parameters [4] that we need to decide before the convolution step is performed:
Stride: Stride is the number of pixels by which we slide our filter matrix over the input matrix. When the stride is 1 then we move the filters one pixel at a time. When the stride is 2, then the filters jump 2 pixels at a time as we slide them around. Having a larger stride will produce smaller feature maps.
Zero-padding: Sometimes, it is convenient to pad the input matrix with zeros around the border, so that we can apply the filter to bordering elements of our input image matrix. A nice feature of zero padding is that it allows us to control the size of the feature maps. Adding zero-padding is also called wide convolution, and not using zero-padding would be a narrow convolution. This has been explained clearly in [14].
An additional operation called ReLU has been used after every Convolution operation in Figure 3 above. ReLU stands for Rectified Linear Unit and is a non-linear operation. Its output is given by:
ReLU is an element wise operation (applied per pixel) and replaces all negative pixel values in the feature map by zero. The purpose of ReLU is to introduce non-linearity in our ConvNet, since most of the real-world data we would want our ConvNet to learn would be non-linear (Convolution is a linear operation – element wise matrix multiplication and addition, so we account for non-linearity by introducing a non-linear function like ReLU).
The ReLU operation can be understood clearly from Figure 9 below. It shows the ReLU operation applied to one of the feature maps obtained in Figure 6 above. The output feature map here is also referred to as the ‘Rectified’ feature map.
Other non linear functions such as tanh or sigmoid can also be used instead of ReLU, but ReLU has been found to perform better in most situations.
Spatial Pooling (also called subsampling or downsampling) reduces the dimensionality of each feature map but retains the most important information. Spatial Pooling can be of different types: Max, Average, Sum etc.
In case of Max Pooling, we define a spatial neighborhood (for example, a 2×2 window) and take the largest element from the rectified feature map within that window. Instead of taking the largest element we could also take the average (Average Pooling) or sum of all elements in that window. In practice, Max Pooling has been shown to work better.
Figure 10 shows an example of Max Pooling operation on a Rectified Feature map (obtained after convolution + ReLU operation) by using a 2×2 window.
We slide our 2 x 2 window by 2 cells (also called ‘stride’) and take the maximum value in each region. As shown in Figure 10, this reduces the dimensionality of our feature map.
In the network shown in Figure 11, pooling operation is applied separately to each feature map (notice that, due to this, we get three output maps from three input maps).
Figure 12 shows the effect of Pooling on the Rectified Feature Map we received after the ReLU operation in Figure 9 above.
The function of Pooling is to progressively reduce the spatial size of the input representation [4]. In particular, pooling
So far we have seen how Convolution, ReLU and Pooling work. It is important to understand that these layers are the basic building blocks of any CNN. As shown in Figure 13, we have two sets of Convolution, ReLU & Pooling layers – the 2nd Convolution layer performs convolution on the output of the first Pooling Layer using six filters to produce a total of six feature maps. ReLU is then applied individually on all of these six feature maps. We then perform Max Pooling operation separately on each of the six rectified feature maps.
Together these layers extract the useful features from the images, introduce non-linearity in our network and reduce feature dimension while aiming to make the features somewhat equivariant to scale and translation [18].
The output of the 2nd Pooling Layer acts as an input to the Fully Connected Layer, which we will discuss in the next section.
The Fully Connected layer is a traditional Multi Layer Perceptron that uses a softmax activation function in the output layer (other classifiers like SVM can also be used, but will stick to softmax in this post). The term 「Fully Connected」 implies that every neuron in the previous layer is connected to every neuron on the next layer. I recommend reading this post if you are unfamiliar with Multi Layer Perceptrons.
The output from the convolutional and pooling layers represent high-level features of the input image. The purpose of the Fully Connected layer is to use these features for classifying the input image into various classes based on the training dataset. For example, the image classification task we set out to perform has four possible outputs as shown in Figure 14 below (note that Figure 14 does not show connections between the nodes in the fully connected layer)
Apart from classification, adding a fully-connected layer is also a (usually) cheap way of learning non-linear combinations of these features. Most of the features from convolutional and pooling layers may be good for the classification task, but combinations of those features might be even better [11].
The sum of output probabilities from the Fully Connected Layer is 1. This is ensured by using the Softmax as the activation function in the output layer of the Fully Connected Layer. The Softmax function takes a vector of arbitrary real-valued scores and squashes it to a vector of values between zero and one that sum to one.
As discussed above, the Convolution + Pooling layers act as Feature Extractors from the input image while Fully Connected layer acts as a classifier.
Note that in Figure 15 below, since the input image is a boat, the target probability is 1 for Boat class and 0 for other three classes, i.e.
The overall training process of the Convolution Network may be summarized as below:
Step1: We initialize all filters and parameters / weights with random values
Step2:
The network takes a training image as input, goes through the forward propagation step (convolution, ReLU and pooling operations along with forward propagation in the Fully Connected layer) and finds the output probabilities for each class.
Step3:
Calculate the total error at the output layer (summation over all 4 classes)
Step4:
Use Backpropagation to calculate the
gradients
of the error with respect to all weights in the network and use
gradient descent
to update all filter values / weights and parameter values to minimize the output error.
Step5: Repeat steps 2-4 with all images in the training set.
The above steps train the ConvNet – this essentially means that all the weights and parameters of the ConvNet have now been optimized to correctly classify images from the training set.
When a new (unseen) image is input into the ConvNet, the network would go through the forward propagation step and output a probability for each class (for a new image, the output probabilities are calculated using the weights which have been optimized to correctly classify all the previous training examples). If our training set is large enough, the network will (hopefully) generalize well to new images and classify them into correct categories.
Note 1: The steps above have been oversimplified and mathematical details have been avoided to provide intuition into the training process. See [4] and [12] for a mathematical formulation and thorough understanding.
Note 2: In the example above we used two sets of alternating Convolution and Pooling layers. Please note however, that these operations can be repeated any number of times in a single ConvNet. In fact, some of the best performing ConvNets today have tens of Convolution and Pooling layers! Also, it is not necessary to have a Pooling layer after every Convolutional Layer. As can be seen in the Figure 16 below, we can have multiple Convolution + ReLU operations in succession before having a Pooling operation. Also notice how each layer of the ConvNet is visualized in the Figure 16 below.
In general, the more convolution steps we have, the more complicated features our network will be able to learn to recognize. For example, in Image Classification a ConvNet may learn to detect edges from raw pixels in the first layer, then use the edges to detect simple shapes in the second layer, and then use these shapes to deter higher-level features, such as facial shapes in higher layers [14]. This is demonstrated in Figure 17 below – these features were learnt using a Convolutional Deep Belief Network and the figure is included here just for demonstrating the idea (this is only an example: real life convolution filters may detect objects that have no meaning to humans).
Adam Harley created amazing visualizations of a Convolutional Neural Network trained on the MNIST Database of handwritten digits [13]. I highly recommend playing around with it to understand details of how a CNN works.
We will see below how the network works for an input ‘8’. Note that the visualization in Figure 18does not show the ReLU operation separately.
The input image contains 1024 pixels (32 x 32 image) and the first Convolution layer (Convolution Layer 1) is formed by convolution of six unique 5 × 5 (stride 1) filters with the input image. As seen, using six different filters produces a feature map of depth six.
Convolutional Layer 1 is followed by Pooling Layer 1 that does 2 × 2 max pooling (with stride 2) separately over the six feature maps in Convolution Layer 1. You can move your mouse pointer over any pixel in the Pooling Layer and observe the 2 x 2 grid it forms in the previous Convolution Layer (demonstrated in Figure 19). You’ll notice that the pixel having the maximum value (the brightest one) in the 2 x 2 grid makes it to the Pooling layer.
Pooling Layer 1 is followed by sixteen 5 × 5 (stride 1) convolutional filters that perform the convolution operation. This is followed by Pooling Layer 2 that does 2 × 2 max pooling (with stride 2). These two layers use the same concepts as described above.
We then have three fully-connected (FC) layers. There are:
Notice how in Figure 20, each of the 10 nodes in the output layer are connected to all 100 nodes in the 2nd Fully Connected layer (hence the name Fully Connected).
Also, note how the only bright node in the Output Layer corresponds to ‘8’ – this means that the network correctly classifies our handwritten digit (brighter node denotes that the output from it is higher, i.e. 8 has the highest probability among all other digits).
The 3d version of the same visualization is available here.
Convolutional Neural Networks have been around since early 1990s. We discussed the LeNet above which was one of the very first convolutional neural networks. Some other influential architectures are listed below [3] [4].
LeNet (1990s): Already covered in this article.
1990s to 2012: In the years from late 1990s to early 2010s convolutional neural network were in incubation. As more and more data and computing power became available, tasks that convolutional neural networks could tackle became more and more interesting.
AlexNet (2012) – In 2012, Alex Krizhevsky (and others) released AlexNet which was a deeper and much wider version of the LeNet and won by a large margin the difficult ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012. It was a significant breakthrough with respect to the previous approaches and the current widespread application of CNNs can be attributed to this work.
ZF Net (2013) – The ILSVRC 2013 winner was a Convolutional Network from Matthew Zeiler and Rob Fergus. It became known as the ZFNet (short for Zeiler & Fergus Net). It was an improvement on AlexNet by tweaking the architecture hyperparameters.
GoogLeNet (2014) – The ILSVRC 2014 winner was a Convolutional Network from Szegedy et al.from Google. Its main contribution was the development of an Inception Module that dramatically reduced the number of parameters in the network (4M, compared to AlexNet with 60M).
VGGNet (2014) – The runner-up in ILSVRC 2014 was the network that became known as the VGGNet. Its main contribution was in showing that the depth of the network (number of layers) is a critical component for good performance.
ResNets (2015) – Residual Network developed by Kaiming He (and others) was the winner of ILSVRC 2015. ResNets are currently by far state of the art Convolutional Neural Network models and are the default choice for using ConvNets in practice (as of May 2016).
DenseNet (August 2016) – Recently published by Gao Huang (and others), the Densely Connected Convolutional Network has each layer directly connected to every other layer in a feed-forward fashion. The DenseNet has been shown to obtain significant improvements over previous state-of-the-art architectures on five highly competitive object recognition benchmark tasks. Check out the Torch implementation here.
In this post, I have tried to explain the main concepts behind Convolutional Neural Networks in simple terms. There are several details I have oversimplified / skipped, but hopefully this post gave you some intuition around how they work.
This post was originally inspired from Understanding Convolutional Neural Networks for NLP by Denny Britz (which I would recommend reading) and a number of explanations here are based on that post. For a more thorough understanding of some of these concepts, I would encourage you to go through the notes from Stanford’s course on ConvNets as well as other excellent resources mentioned under References below. If you face any issues understanding any of the above concepts or have questions / suggestions, feel free to leave a comment below.
All images and animations used in this post belong to their respective authors as listed in References section below.