Artificial Neural Networks are all the rage. One has to wonder if the catchy name played a role in the model’s own marketing and adoption. I’ve seen business managers giddy to mention that their products use “Artificial Neural Networks” and “Deep Learning”. Would they be so giddy to say their products use “Connected Circles Models” or “Fail and Be Penalized Machines”? But make no mistake: Artificial Neural Networks are the real deal, as evidenced by their success in a number of applications like image recognition, natural language processing, automated trading, and autonomous cars. As a professional data scientist who didn’t fully understand them, I felt embarrassed, like a builder without a table saw. Consequently I’ve done my homework and written this article to help others overcome the same hurdles and head scratchers I did in my own (ongoing) learning process.
Note that R code for the examples presented in this article can be found here in the Machine Learning Problem Bible. Additionally, check out Part 2, Neural Networks – A Worked Example after reading this article to see the details behind designing and coding a neural network from scratch.
We’ll start with a motivational problem. Here we have a collection of grayscale images, each a 2×2 grid of pixels where each pixel has an intensity value between 0 (white) and 255 (black). The goal is to build a model that identifies images with a “stairs” pattern.
At this point, we are only interested in finding a model that could fit the data reasonably. We’ll worry about the fitting methodology later.
For each image, we label the pixels $x_1$, $x_2$, $x_3$, $x_4$ and generate an input vector $x = [x_1, x_2, x_3, x_4]$ which will be the input to our model. We expect our model to predict True (the image has the stairs pattern) or False (the image does not have the stairs pattern).
ImageId | x1 | x2 | x3 | x4 | IsStairs |
---|---|---|---|---|---|
1 | 252 | 4 | 155 | 175 | TRUE |
2 | 175 | 10 | 186 | 200 | TRUE |
3 | 82 | 131 | 230 | 100 | FALSE |
… | … | … | … | … | … |
498 | 36 | 187 | 43 | 249 | FALSE |
499 | 1 | 160 | 169 | 242 | TRUE |
500 | 198 | 134 | 22 | 188 | FALSE |
A simple model we could build is a single layer perceptron. A perceptron uses a weighted linear combination of the inputs to return a prediction score. If the prediction score exceeds a selected threshold, the perceptron predicts True. Otherwise it predicts False. More formally,
$$f(x) = \begin{cases} 1 & \text{if } w_1x_1 + w_2x_2 + w_3x_3 + w_4x_4 > \text{threshold} \\ 0 & \text{otherwise} \end{cases}$$
Let’s re-express this as follows:
$$\hat{y} = w \cdot x + b$$
$$f(x) = \begin{cases} 1 & \text{if } \hat{y} > 0 \\ 0 & \text{otherwise} \end{cases}$$
Here $\hat{y}$ is our prediction score.
Pictorially, we can represent a perceptron as input nodes that feed into an output node.
For our example, suppose we build the following perceptron:
$$\hat{y} = -0.0019x_1 - 0.0016x_2 + 0.0020x_3 + 0.0023x_4 + 0.0003$$
Here’s how the perceptron would perform on some of our training images.
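To make the scoring rule concrete, here is a minimal Python sketch of the perceptron above. The weights and bias are copied from the equation, and the three sample rows come from the data table; the function names `perceptron_score` and `perceptron_predict` are illustrative choices, not part of the original R code.

```python
# Weights and bias copied from the example perceptron in the text.
WEIGHTS = [-0.0019, -0.0016, 0.0020, 0.0023]
BIAS = 0.0003


def perceptron_score(x, weights=WEIGHTS, bias=BIAS):
    """Return the prediction score y_hat = w . x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def perceptron_predict(x):
    """Predict True (stairs) when the score is positive, otherwise False."""
    return perceptron_score(x) > 0


# Rows taken from the data table: (ImageId, pixels, IsStairs label).
samples = [
    (1, [252, 4, 155, 175], True),
    (2, [175, 10, 186, 200], True),
    (3, [82, 131, 230, 100], False),
]

for image_id, pixels, label in samples:
    score = perceptron_score(pixels)
    print(image_id, round(score, 4), perceptron_predict(pixels), label)
```

Comparing the printed prediction with the label in the last column shows where this linear model agrees with the data and where it does not.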
This would certainly be better than randomly guessing and it makes some logical sense. All the stairs patterns have darkly shaded pixels in the bottom row which supports the larger, positive coefficients for $x_3$ and $x_4$. Nonetheless, there are some glaring problems with this model.
Case A
Start with an image, $x = [100, 0, 0, 125]$. Increase $x_3$ from 0 to 60.
Case B
Start with the last image, $x = [100, 0, 60, 125]$. Increase $x_3$ from 60 to 120.
Intuitively, Case A should have a much larger increase in $\hat{y}$ than Case B. However, since our perceptron model is a linear equation, the equivalent +60 change in $x_3$ resulted in an equivalent +0.12 change in $\hat{y}$ for both cases.
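The arithmetic behind the two cases can be checked with a short sketch. It reuses the perceptron weights from the text; the helper name `score` is an illustrative choice.

```python
# Weights and bias copied from the example perceptron in the text.
WEIGHTS = [-0.0019, -0.0016, 0.0020, 0.0023]
BIAS = 0.0003


def score(x):
    """Linear prediction score y_hat = w . x + b."""
    return sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS


# Case A: x3 goes from 0 to 60. Case B: x3 goes from 60 to 120.
case_a_change = score([100, 0, 60, 125]) - score([100, 0, 0, 125])
case_b_change = score([100, 0, 120, 125]) - score([100, 0, 60, 125])

# Both changes equal the x3 weight times 60, regardless of the starting point.
print(round(case_a_change, 4), round(case_b_change, 4))
```

Because the model is linear, the change depends only on the size of the step in $x_3$ (0.0020 × 60), never on where the step starts.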
There are more issues with our linear perceptron, but let’s start by addressing these two.
We can fix problems 1 and 2 above by wrapping our perceptron within a sigmoid function (and subsequently choosing different weights). Recall that the sigmoid function is an S-shaped curve bounded on the vertical axis between 0 and 1, and is thus frequently used to model the probability of a binary event.
$$\text{sigmoid}(z) = \frac{1}{1 + e^{-z}}$$
Following this idea, we can update our model with the following picture and equation.
$$z = w \cdot x = w_1x_1 + w_2x_2 + w_3x_3 + w_4x_4$$
$$\hat{y} = \text{sigmoid}(z) = \frac{1}{1 + e^{-z}}$$
Look familiar? It’s our old friend, logistic regression. However, it’ll serve us well to interpret the model as a linear perceptron with a sigmoid “activation function” because that gives us more room to generalize. Also, since we now interpret $\hat{y}$ as a probability, we must update our decision rule accordingly.
$$f(x) = \begin{cases} 1 & \text{if } \hat{y} > 0.5 \\ 0 & \text{otherwise} \end{cases}$$
Continuing with our example problem, suppose we come up with the following fitted model:
$$[w_1, w_2, w_3, w_4] = [-0.140, -0.145, 0.121, 0.092], \quad b = -0.008$$
$$\hat{y} = \frac{1}{1 + e^{-(-0.140x_1 - 0.145x_2 + 0.121x_3 + 0.092x_4 - 0.008)}}$$
Observe how this model performs on the same sample images from the previous section.
Clearly this fixes problem 1 from above. Observe how it also fixes problem 2.
Case A
Start with an image, $x = [100, 0, 0, 125]$. Increase $x_3$ from 0 to 60.
Case B
Start with the last image, $x = [100, 0, 60, 125]$. Increase $x_3$ from 60 to 120.
Notice how the curvature of the sigmoid function causes Case A to “fire” (increase rapidly) as $z = w \cdot x$ increases, but the pace slows down as $z$ continues to increase. This aligns with our intuition that Case A should reflect a greater increase in the likelihood of stairs versus Case B.
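The same two cases can be run through the fitted sigmoid model. This is a small sketch using the weights and bias quoted above; `sigmoid`, `predict_proba`, and `predict` are illustrative names, and the sigmoid is written in a form that avoids overflow for large negative inputs.

```python
import math

# Fitted weights and bias copied from the text.
WEIGHTS = [-0.140, -0.145, 0.121, 0.092]
BIAS = -0.008


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(x):
    """Probability of stairs: sigmoid of the weighted sum plus bias."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return sigmoid(z)


def predict(x):
    """Decision rule from the text: True when the probability exceeds 0.5."""
    return predict_proba(x) > 0.5


# Case A: x3 goes from 0 to 60. Case B: x3 goes from 60 to 120.
case_a_change = predict_proba([100, 0, 60, 125]) - predict_proba([100, 0, 0, 125])
case_b_change = predict_proba([100, 0, 120, 125]) - predict_proba([100, 0, 60, 125])

print(case_a_change, case_b_change)
```

Unlike the linear perceptron, the two changes are no longer equal: the first +60 step in $x_3$ moves the probability far more than the second.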
Unfortunately this model still has issues.
We can solve both of the above issues by adding an extra layer to our perceptron model. We’ll construct a number of base models like the one above, but then we’ll feed the output of each base model as input into another perceptron. This model is in fact a vanilla neural network. Let’s see how it might work on some examples.
Example 1: Identify the stairs pattern
Alternatively
Example 2: Identify lightly shaded stairs
A single-layer perceptron has a single output layer. Consequently, the models we just built would be called two-layer perceptrons because they have an output layer which is the input to another output layer. However, we could call these same models neural networks, and in this respect the networks have three layers – an input layer, a hidden layer, and an output layer.
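As a rough sketch of the layered structure described here, the forward pass below feeds four inputs into two hidden sigmoid nodes and then into one output sigmoid node. Every weight and bias in it is a hypothetical placeholder chosen only for illustration (none come from the original article), and the function names are likewise assumptions.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def layer_forward(inputs, weights, biases):
    """One fully connected sigmoid layer: one weight row and one bias per node."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Hypothetical parameters, for illustration only.
HIDDEN_WEIGHTS = [
    [-0.05, -0.05, 0.03, 0.03],  # hidden node 1
    [0.02, -0.06, 0.02, 0.02],   # hidden node 2
]
HIDDEN_BIASES = [0.0, 0.0]
OUTPUT_WEIGHTS = [[4.0, 2.0]]    # single output node reading both hidden nodes
OUTPUT_BIASES = [-3.0]


def network_forward(x):
    """Input layer -> hidden layer -> output layer."""
    hidden = layer_forward(x, HIDDEN_WEIGHTS, HIDDEN_BIASES)
    output = layer_forward(hidden, OUTPUT_WEIGHTS, OUTPUT_BIASES)
    return hidden, output[0]


hidden_values, y_hat = network_forward([252, 4, 155, 175])
print(hidden_values, y_hat)
```

The hidden layer is just a list of base models, and the output node is one more perceptron whose inputs are their outputs.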
In our examples we used a sigmoid activation function. However, we could use other activation functions. tanh and relu are common choices. The activation function must be non-linear, otherwise the neural network would simplify to an equivalent single layer perceptron.
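The claim about non-linearity can be checked numerically. In the sketch below (with made-up weight matrices `W1` and `W2`, and biases left out for brevity), two layers with no activation function give exactly the same output as a single layer whose weights are the product of the two.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]


# Made-up weights: a 2-node "hidden" layer and a 1-node output layer.
W1 = [[0.5, -1.0, 0.25, 0.0], [1.5, 0.5, -0.5, 2.0]]
W2 = [[2.0, -3.0]]

x = [100, 0, 60, 125]

# Two layers with a linear (identity) activation between them.
two_layer_output = matvec(W2, matvec(W1, x))

# One layer using the combined weights W2 * W1.
collapsed_output = matvec(matmul(W2, W1), x)

print(two_layer_output, collapsed_output)
```

Inserting a sigmoid, tanh, or relu between the two multiplications breaks this equivalence, which is what lets extra layers add expressive power.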
We can easily extend our model to work for multiclass classification by using multiple nodes in the final output layer. The idea here is that each output node corresponds to one of the $C$ classes we are trying to predict. Instead of squashing the output with the sigmoid function, which maps an element in $\mathbb{R}$ to an element in $[0, 1]$, we can use the softmax function, which maps a vector in $\mathbb{R}^n$ to a vector in $\mathbb{R}^n$ such that the resulting vector elements sum to 1. In other words, we can design the network such that it outputs the vector [prob(class 1), prob(class 2), …, prob(class C)].
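A minimal softmax sketch follows. Subtracting the largest score before exponentiating is a common numerical-stability step that does not change the result; the input scores are arbitrary example values.

```python
import math


def softmax(scores):
    """Map a list of real scores to positive values that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Arbitrary example scores for three classes.
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

The largest score receives the largest probability, and the ordering of the scores is preserved.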
You might be wondering, “Can we extend our vanilla neural network so that its output layer is fed into a 4th layer (and then a 5th, and 6th, etc.)?” Yes. This is what’s commonly referred to as “deep learning”. In practice it can be very effective. However, it’s worth noting that any network you build with more than one hidden layer can be approximated by a network with only one hidden layer. In fact, by the Universal Approximation Theorem, a neural network with a single hidden layer can approximate any continuous function on a closed, bounded region of the input space to arbitrary accuracy, given enough hidden nodes. The reason deep neural network architectures are frequently chosen in favor of single hidden layer architectures is that they tend to converge to a solution faster during the fitting procedure.
At last we come to the fitting procedure. So far we’ve discussed how neural networks could work effectively, but we haven’t discussed how to fit a neural network to labeled training samples. An equivalent question would be, “How can we choose the best weights for a network, given some labeled training samples?” Gradient descent is the common answer (although MLE can work too). Continuing with our example problem, the gradient descent procedure would go something like this:
That’s the basic idea at least. In practice, this poses a number of challenges.
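For a feel of what gradient descent looks like in code, here is a sketch that fits the single sigmoid model (not a full network) to the six rows visible in the data table. Several choices are assumptions made for the example and are not stated in the article: the loss is mean log loss, pixels are scaled to [0, 1], the weights start at zero, and the step size and iteration count are arbitrary.

```python
import math

# The six rows visible in the data table: (pixels, IsStairs as 1/0).
RAW = [
    ([252, 4, 155, 175], 1),
    ([175, 10, 186, 200], 1),
    ([82, 131, 230, 100], 0),
    ([36, 187, 43, 249], 0),
    ([1, 160, 169, 242], 1),
    ([198, 134, 22, 188], 0),
]
# Assumption: scale pixels to [0, 1] so the updates stay well behaved.
DATA = [([p / 255.0 for p in pixels], label) for pixels, label in RAW]


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(weights, bias, x):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def log_loss(weights, bias, data):
    """Mean log loss (assumed loss function L for this sketch)."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = min(max(predict(weights, bias, x), eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)


def gradient_step(weights, bias, data, step_size):
    """Move the weights one step against the gradient of the mean log loss."""
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    for x, y in data:
        error = predict(weights, bias, x) - y  # dL/dz for sigmoid + log loss
        for j, xj in enumerate(x):
            grad_w[j] += error * xj / len(data)
        grad_b += error / len(data)
    new_weights = [w - step_size * g for w, g in zip(weights, grad_w)]
    return new_weights, bias - step_size * grad_b


weights, bias = [0.0] * 4, 0.0
initial_loss = log_loss(weights, bias, DATA)
for _ in range(500):
    weights, bias = gradient_step(weights, bias, DATA, step_size=0.5)
final_loss = log_loss(weights, bias, DATA)

print(initial_loss, final_loss)
```

Each iteration computes the gradient with the current weights, takes a small step downhill, and repeats; the loss after training should be lower than at the start.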
During the fitting procedure, one of the things we’ll need to calculate is the gradient of $L$ with respect to every weight. This is tricky because $L$ depends on every node in the output layer, and each of those nodes depends on every node in the layer before it, and so on. This means calculating $\partial L / \partial w_{ab}$ is a chain-rule nightmare. (Keep in mind that many real-world neural networks have thousands of nodes across tens of layers.) The key to dealing with this is to recognize that most of the $\partial L / \partial w_{ab}$ terms reuse the same intermediate derivatives when you apply the chain rule. If you’re careful about tracking this, you can avoid recalculating the same thing thousands of times.
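The reuse of intermediate derivatives can be seen in a small backpropagation sketch. It assumes a network with one hidden layer of two sigmoid nodes, one sigmoid output node, and a squared-error loss; the weights, the input, and the loss choice are all illustrative assumptions rather than the article's setup. The result is compared against a finite-difference estimate as a sanity check.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, b1, w2, b2):
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    output = sigmoid(sum(w * h for w, h in zip(w2, hidden)) + b2)
    return hidden, output


def loss(x, target, W1, b1, w2, b2):
    """Squared-error loss (an assumption for this sketch)."""
    _, output = forward(x, W1, b1, w2, b2)
    return 0.5 * (output - target) ** 2


def backprop(x, target, W1, b1, w2, b2):
    hidden, output = forward(x, W1, b1, w2, b2)
    # Computed once, then reused by every gradient below.
    delta_out = (output - target) * output * (1 - output)
    grad_w2 = [delta_out * h for h in hidden]
    grad_b2 = delta_out
    # Each hidden delta reuses delta_out instead of re-deriving it.
    delta_hidden = [delta_out * w * h * (1 - h) for w, h in zip(w2, hidden)]
    grad_W1 = [[d * xi for xi in x] for d in delta_hidden]
    grad_b1 = delta_hidden
    return grad_W1, grad_b1, grad_w2, grad_b2


# Hypothetical parameters and a single made-up training sample.
W1 = [[0.2, -0.4, 0.1, 0.3], [-0.5, 0.6, 0.7, -0.1]]
b1 = [0.05, -0.05]
w2 = [1.2, -0.8]
b2 = 0.1
x, target = [0.39, 0.0, 0.24, 0.49], 1.0

grad_W1, grad_b1, grad_w2, grad_b2 = backprop(x, target, W1, b1, w2, b2)

# Finite-difference check on one weight, W1[0][2].
eps = 1e-6
plus = [row[:] for row in W1]
minus = [row[:] for row in W1]
plus[0][2] += eps
minus[0][2] -= eps
numeric = (loss(x, target, plus, b1, w2, b2) - loss(x, target, minus, b1, w2, b2)) / (2 * eps)

print(grad_W1[0][2], numeric)
```

Note that `delta_out` is calculated a single time and then shared by every weight gradient in the network, which is the bookkeeping that the paragraph above describes.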
Another trick is to use special activation functions whose derivatives can be written as a function of their value. For example, the derivative of sigmoid(x) is sigmoid(x)(1 − sigmoid(x)). This is convenient because during the forward pass, when we calculate $\hat{y}$ for each training sample, we have to calculate sigmoid(x) element-wise for some vector x. During backprop we can reuse those values when calculating the gradient of $L$ with respect to the weights, saving time and memory.
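The sigmoid identity follows in a few lines from the definition. The derivation below is a standard calculus exercise, included here for completeness rather than taken from the original text.

$$\begin{aligned}
\frac{d}{dx}\,\text{sigmoid}(x) &= \frac{d}{dx}\left(1 + e^{-x}\right)^{-1} \\
&= \frac{e^{-x}}{\left(1 + e^{-x}\right)^{2}} \\
&= \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} \\
&= \text{sigmoid}(x)\,\bigl(1 - \text{sigmoid}(x)\bigr)
\end{aligned}$$

The last step uses $1 - \text{sigmoid}(x) = \frac{e^{-x}}{1 + e^{-x}}$.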
A third trick is to partition the training data into “mini batches” and update the weights with respect to each batch, one after another. For example, if you partition your training data into {batch1, batch2, batch3}, the first pass over the training data would
where the gradient of L is recalculated after each update.
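A mini-batch pass might be sketched as follows, again using the single sigmoid model and the six visible table rows so that a batch size of 2 yields three batches. The mean log loss, the pixel scaling, the zero starting weights, and the step size are assumptions made for this example.

```python
import math

RAW = [
    ([252, 4, 155, 175], 1),
    ([175, 10, 186, 200], 1),
    ([82, 131, 230, 100], 0),
    ([36, 187, 43, 249], 0),
    ([1, 160, 169, 242], 1),
    ([198, 134, 22, 188], 0),
]
# Assumption: scale pixels to [0, 1].
DATA = [([p / 255.0 for p in pixels], label) for pixels, label in RAW]


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def make_batches(data, batch_size):
    """Split the data into consecutive chunks of at most batch_size rows."""
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]


def batch_gradient(weights, bias, batch):
    """Gradient of the mean log loss over one batch (assumed loss)."""
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    for x, y in batch:
        error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
        for j, xj in enumerate(x):
            grad_w[j] += error * xj / len(batch)
        grad_b += error / len(batch)
    return grad_w, grad_b


def run_epoch(weights, bias, batches, step_size):
    """One pass over the data: one weight update per batch."""
    update_count = 0
    for batch in batches:
        # The gradient is recalculated with the weights from the previous update.
        grad_w, grad_b = batch_gradient(weights, bias, batch)
        weights = [w - step_size * g for w, g in zip(weights, grad_w)]
        bias -= step_size * grad_b
        update_count += 1
    return weights, bias, update_count


batches = make_batches(DATA, batch_size=2)
weights, bias, update_count = run_epoch([0.0] * 4, 0.0, batches, step_size=0.5)
print(len(batches), update_count, weights, bias)
```

Compared with full-batch gradient descent, the weights here move three times per pass over the data instead of once.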
The last technique worth mentioning is to use a GPU rather than a CPU, since GPUs are better suited to performing lots of calculations in parallel.
This is not so much a neural network problem as it is a gradient descent problem. It’s possible that the weights could get stuck in a local minimum during gradient descent. It’s also possible that the weights could overshoot the minimum. One trick for dealing with this is to tinker with different step sizes. Another trick is to increase the number of nodes and/or layers in the network. (Beware of overfitting.) Additionally, some heuristic techniques like using momentum can be effective.
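Overshooting and momentum are easy to see on a toy problem. The sketch below minimizes the one-dimensional loss $(w - 3)^2$, which is a stand-in chosen for illustration and has no local minima, so it shows the effect of step size and momentum only. The step sizes and the momentum coefficient `beta` are arbitrary example values.

```python
def gradient(w):
    """Derivative of the toy loss (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)


def plain_descent(step_size, steps=50, start=0.0):
    """Ordinary gradient descent with a fixed step size."""
    w = start
    for _ in range(steps):
        w -= step_size * gradient(w)
    return w


def momentum_descent(step_size, beta=0.9, steps=300, start=0.0):
    """Gradient descent with momentum: the velocity accumulates past gradients."""
    w, velocity = start, 0.0
    for _ in range(steps):
        velocity = beta * velocity - step_size * gradient(w)
        w += velocity
    return w


small_step = plain_descent(step_size=0.1)        # settles near the minimum
large_step = plain_descent(step_size=1.1)        # overshoots further on every step
with_momentum = momentum_descent(step_size=0.1)  # also settles near the minimum

print(small_step, large_step, with_momentum)
```

With the large step size each update jumps past the minimum by more than the previous miss, so the iterates diverge, which is the overshooting behavior described above.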
How might we write a generic program to fit any neural network with any number of nodes and layers? The answer is, “You don’t, you use TensorFlow”. But if you really wanted to, the hard part is calculating the gradient of the loss function. The trick to doing this is to recognize that you can represent the gradient as a recursive function. A neural network with 5 layers is just a neural network with 4 layers that feeds into some perceptrons. But a neural network with 4 layers is just a neural network with 3 layers that feeds into some perceptrons. And so on it goes. This is more formally known as automatic differentiation.
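To give a flavor of the idea, here is a toy reverse-mode automatic differentiation sketch for scalar values. It is an illustration written for this article's example, not how TensorFlow is implemented; the `Value` class and its methods are hypothetical names. Each value remembers which values it was computed from, and `backward` walks that graph recursively to apply the chain rule.

```python
import math


class Value:
    """A scalar that remembers how it was computed so gradients can flow back."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # values this one was computed from
        self.local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def sigmoid(self):
        s = 1.0 / (1.0 + math.exp(-self.data))
        return Value(s, (self,), (s * (1.0 - s),))

    def backward(self):
        """Fill in .grad for every value that contributed to this one."""
        order, seen = [], set()

        def visit(node):
            # Recursive depth-first walk: parents are listed before children.
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += local * node.grad  # chain rule


# A single sigmoid neuron: y = sigmoid(w * x + b), with made-up numbers.
w, x, b = Value(0.5), Value(2.0), Value(-1.0)
y = (w * x + b).sigmoid()
y.backward()
print(y.data, w.grad, x.grad, b.grad)
```

Stacking more layers only means building a deeper graph of `Value` objects; the same `backward` call handles it without any layer-specific gradient code.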
Now check out Neural Networks – A Worked Example to see how to build a neural network from scratch.