Adam: A Method for Stochastic Optimization

Authors

Diederik P. Kingma, Jimmy Ba

Conference

ICLR 2015

Abstract

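Adam (Adaptive Moment Estimation) is a first-order, gradient-based optimizer that combines the advantages of AdaGrad, which works well with sparse gradients, and RMSProp, which works well in online and non-stationary settings. It computes an individual adaptive learning rate for each parameter from estimates of the first and second moments of the gradients.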

Algorithm

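Maintains an exponentially decaying average of past gradients (the first moment estimate, which acts like momentum)
Also maintains an exponentially decaying average of past squared gradients (the second moment estimate, which scales the learning rate per parameter)
Corrects both averages for their bias toward zero (they are initialized at zero), then moves each parameter by the learning rate times the corrected first moment divided by the square root of the corrected second moment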
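
The update described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `adam_step` and `minimize_quadratic` and the toy objective are assumptions made for this example, and the default hyperparameters (learning rate 0.001, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8) are the values suggested in the paper. The sketch also applies the bias correction that compensates for both averages starting at zero.

```python
import math


def adam_step(params, grads, m, v, t,
              lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a list of float parameters.

    m and v hold the running first and second moment estimates;
    t is the 1-based step count used for bias correction.
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        # Exponentially decaying average of past gradients (momentum).
        m_i = beta1 * m_i + (1 - beta1) * g
        # Exponentially decaying average of past squared gradients.
        v_i = beta2 * v_i + (1 - beta2) * g * g
        # Both averages start at zero, so rescale to remove that bias.
        m_hat = m_i / (1 - beta1 ** t)
        v_hat = v_i / (1 - beta2 ** t)
        # Per-parameter step: larger recent gradients mean a smaller step.
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


def loss(params):
    # Hypothetical toy objective: f(x, y) = x^2 + 100 * y^2.
    x, y = params
    return x * x + 100.0 * y * y


def minimize_quadratic(steps=2000, lr=0.01):
    """Run Adam on the toy objective, starting from (1.0, 1.0)."""
    params = [1.0, 1.0]
    m = [0.0, 0.0]
    v = [0.0, 0.0]
    for t in range(1, steps + 1):
        grads = [2.0 * params[0], 200.0 * params[1]]
        params, m, v = adam_step(params, grads, m, v, t, lr=lr)
    return params
```

On the very first step the corrected first moment equals the gradient and the corrected second moment equals its square, so every parameter moves by roughly the learning rate regardless of how large its gradient is. In the toy objective the two gradients differ by a factor of 100, yet both coordinates take nearly the same first step, which is what "adaptive learning rate for each parameter" means in practice.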

Impact

Adam became a default optimizer across much of deep learning. SGD with momentum is still preferred for some vision tasks, but Adam is ubiquitous in NLP and in general deep learning work.