ABSTRACT

Neural networks are typically trained with stochastic optimization methods that operate on minibatches of data. We review classic stochastic gradient descent (SGD) along with common add-on techniques such as momentum, step-length decay, cyclic annealing, and weight decay. Popular adaptive step-length methods, including AdaGrad, RMSProp, Adam, and its variants, are discussed within a unified framework. In particular, we give attention to recent criticism of adaptive methods, which finds that they offer only marginal value for generalization compared to SGD with effective initial step-length tuning and decay.