The choice of optimization algorithm for your deep learning model can mean the difference between good results in minutes, hours, or days.
The Adam optimization algorithm is an extension to stochastic gradient descent that has recently seen broader adoption for deep learning applications in computer vision and natural language processing.
In this post, you will get a gentle introduction to the Adam optimization algorithm for use in deep learning.
After reading this post, you will know:

- What the Adam algorithm is and some benefits of using the method to optimize your models.
- How the Adam algorithm works and how it is different from the related methods of AdaGrad and RMSProp.
- How the Adam algorithm can be configured and commonly used configuration parameters.
Let’s get started.
Adam is an optimization algorithm that can be used instead of the classical stochastic gradient descent procedure to update network weights iteratively based on training data.
Adam was presented by Diederik Kingma from OpenAI and Jimmy Ba from the University of Toronto in their 2015 ICLR paper (poster) titled “Adam: A Method for Stochastic Optimization”. I will quote liberally from their paper in this post, unless stated otherwise.
The algorithm is called Adam. It is not an acronym and is not written as “ADAM”.
… the name Adam is derived from adaptive moment estimation.
When introducing the algorithm, the authors list the attractive benefits of using Adam on non-convex optimization problems, as follows:

- Straightforward to implement.
- Computationally efficient.
- Little memory requirements.
- Invariant to diagonal rescaling of the gradients.
- Well suited for problems that are large in terms of data and/or parameters.
- Appropriate for non-stationary objectives.
- Appropriate for problems with very noisy and/or sparse gradients.
- Hyper-parameters have intuitive interpretation and typically require little tuning.
Adam is different to classical stochastic gradient descent.
Stochastic gradient descent maintains a single learning rate (termed alpha) for all weight updates and the learning rate does not change during training.
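As a point of reference, here is a minimal sketch of that vanilla stochastic gradient descent update in NumPy (the function and variable names are illustrative, not from any particular library): a single, fixed learning rate alpha scales every weight update.

```python
import numpy as np

def sgd_step(w, grad, alpha=0.01):
    """One vanilla SGD update: the same fixed alpha scales every parameter."""
    return w - alpha * grad

# Example: every element of w is updated with the same learning rate.
w = np.zeros(3)
grad = np.array([0.5, -0.2, 0.1])
w = sgd_step(w, grad)
```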
Adam, in contrast, maintains a learning rate for each network weight (parameter) and separately adapts it as learning unfolds.
The method computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients.
The authors describe Adam as combining the advantages of two other extensions of stochastic gradient descent. Specifically:

- Adaptive Gradient Algorithm (AdaGrad), which maintains a per-parameter learning rate that improves performance on problems with sparse gradients (e.g. natural language and computer vision problems).
- Root Mean Square Propagation (RMSProp), which also maintains per-parameter learning rates, adapted based on the average of recent magnitudes of the gradients for each weight (i.e. how quickly it is changing). This means the algorithm does well on online and non-stationary problems (e.g. noisy objectives).
Adam realizes the benefits of both AdaGrad and RMSProp.
Whereas RMSProp adapts the parameter learning rates based on the average of the second moments of the gradients (the uncentered variance), Adam also makes use of the average of the first moments (the mean), much like momentum.
Specifically, the algorithm calculates an exponential moving average of the gradient and the squared gradient, and the parameters beta1 and beta2 control the decay rates of these moving averages.
The initial values of the moving averages, together with beta1 and beta2 values close to 1.0 (recommended), bias the moment estimates towards zero. This bias is overcome by first calculating the biased estimates and then calculating bias-corrected estimates.
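Putting the pieces above together, here is a minimal sketch of a single Adam update step in NumPy. This is not the authors' reference implementation; the names are illustrative, and the hyperparameter defaults are the values suggested in the paper.

```python
import numpy as np

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters w given the gradient grad.

    m and v are the exponential moving averages of the gradient and the
    squared gradient; t is the 1-based timestep used for bias correction.
    """
    m = beta1 * m + (1.0 - beta1) * grad        # biased first moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # biased second moment estimate
    m_hat = m / (1.0 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)              # bias-corrected second moment
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# m and v start at zero, which is why the bias correction above is needed.
w, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
grad = np.array([0.5, -0.2, 0.1])
w, m, v = adam_step(w, grad, m, v, t=1)
```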
The paper is quite readable and I would encourage you to read it if you are interested in the specific implementation details.
Adam is a popular algorithm in the field of deep learning because it achieves good results fast.
Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods.
In the original paper, Adam was demonstrated empirically, showing that its convergence meets the expectations of the theoretical analysis. Adam was applied to logistic regression on the MNIST handwritten digit recognition and IMDB sentiment analysis datasets, to a Multilayer Perceptron on the MNIST dataset, and to Convolutional Neural Networks on the CIFAR-10 image recognition dataset. The authors conclude:
Using large models and datasets, we demonstrate Adam can efficiently solve practical deep learning problems.
Figure: Comparison of Adam to Other Optimization Algorithms Training a Multilayer Perceptron. Taken from Adam: A Method for Stochastic Optimization, 2015.
Sebastian Ruder developed a comprehensive review of modern gradient descent optimization algorithms titled “An overview of gradient descent optimization algorithms”, published first as a blog post and then as a technical report in 2016.
The paper is essentially a tour of modern methods. In his section titled “Which optimizer to use?”, he recommends using Adam.
Insofar, RMSprop, Adadelta, and Adam are very similar algorithms that do well in similar circumstances. […] its bias-correction helps Adam slightly outperform RMSprop towards the end of optimization as gradients become sparser. Insofar, Adam might be the best overall choice.
In the Stanford course on deep learning for computer vision titled “CS231n: Convolutional Neural Networks for Visual Recognition” developed by Andrej Karpathy, et al., the Adam algorithm is again suggested as the default optimization method for deep learning applications.
In practice Adam is currently recommended as the default algorithm to use, and often works slightly better than RMSProp. However, it is often also worth trying SGD+Nesterov Momentum as an alternative.
And later stated more plainly:
The two recommended updates to use are either SGD+Nesterov Momentum or Adam.
Adam is being adopted for benchmarks in deep learning papers.
For example, it was used in the paper “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention” on attention in image captioning and “DRAW: A Recurrent Neural Network For Image Generation” on image generation.
Do you know of any other examples of Adam? Let me know in the comments.
Further, learning rate decay can also be used with Adam. The paper uses a decay rate of alpha = alpha/sqrt(t), updated each epoch (t), for the logistic regression demonstration.
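Here is a minimal sketch of that schedule, assuming the decay is applied once per epoch and t starts at 1 (the function name is illustrative):

```python
import math

def decayed_alpha(alpha0, t):
    """Learning rate for epoch t under the alpha / sqrt(t) schedule."""
    return alpha0 / math.sqrt(t)

# With alpha0 = 0.001: epoch 1 -> 0.001, epoch 4 -> 0.0005, epoch 100 -> 0.0001
```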
The Adam paper suggests:
Good default settings for the tested machine learning problems are alpha=0.001, beta1=0.9, beta2=0.999 and epsilon=10^-8
The TensorFlow documentation suggests some tuning of epsilon:
The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1.
We can see that the popular deep learning libraries generally use the default parameters recommended by the paper.
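For example, here is a hedged sketch of configuring Adam with the paper's recommended defaults in Keras; the exact argument names (e.g. learning_rate vs lr) and the default epsilon vary between library versions, so check the documentation for your installation.

```python
from tensorflow import keras

# Adam with the paper's suggested defaults; epsilon may be worth tuning
# (e.g. 1.0 or 0.1, as noted in the TensorFlow documentation quote above).
optimizer = keras.optimizers.Adam(
    learning_rate=0.001,  # alpha
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-8,
)

# Attach the optimizer to a small illustrative model.
model = keras.Sequential([keras.Input(shape=(10,)), keras.layers.Dense(1)])
model.compile(optimizer=optimizer, loss="mse")
```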
Do you know of any other standard configurations for Adam? Let me know in the comments.
This section lists resources to learn more about the Adam optimization algorithm.
Do you know of any other good resources on Adam? Let me know in the comments.
In this post, you discovered the Adam optimization algorithm for deep learning.
Specifically, you learned:

- What the Adam algorithm is and some benefits of using the method to optimize your models.
- How the Adam algorithm works and how it is different from the related methods of AdaGrad and RMSProp.
- How the Adam algorithm can be configured and commonly used configuration parameters.
Do you have any questions? Ask your questions in the comments below and I will do my best to answer.