Authors
Kingma, Ba
Conference
ICLR 2015
Abstract
Adam (Adaptive Moment Estimation) combines the advantages of AdaGrad and RMSProp. It computes adaptive learning rates for each parameter.
Algorithm
- Maintains exponentially decaying averages of past gradients (momentum)
- Also maintains exponentially decaying averages of past squared gradients (adaptive learning rate)
Impact
Adam became the default optimizer for deep learning. While SGD with momentum is still preferred for some vision tasks, Adam is ubiquitous in NLP and general deep learning.