ABSTRACT

Many machine learning problems can be cast as convex optimization problems, known as regularized empirical risk minimization, within the general framework of "regularization + loss function" [1]. The regularization term primarily controls the generalization ability of the classifier and prevents over-fitting [2]. Common regularization terms include L1 regularization [3,4], L2 regularization [5], and hybrid L1-L2 regularization [7,8], where the L1 regularizer is convex but non-smooth, whereas the L2 regularizer is convex and smooth. The loss function mainly controls the training accuracy of the classifier. The natural loss function for classification is the zero-one loss, which assigns no loss to a correct decision and a unit loss to any error, so that all errors are equally costly. However, minimizing the zero-one loss is NP-hard [9], which makes it difficult to optimize directly in machine learning. Surrogates of the zero-one loss are therefore used to train simple regularized linear prediction models. These surrogates include the hinge loss [5], the L2 loss [10], the least squares loss [11], and the logistic loss [12], all of which are smooth except the hinge loss.
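For concreteness, the framework sketched above can be written as the regularized empirical risk minimization problem below. The notation (weight vector w, training pairs (x_i, y_i), regularization strength λ, regularizer R) is introduced here only for illustration and is not fixed by the abstract; the L2 loss is written as the squared hinge loss, a common convention that is assumed rather than stated in the text.

\[
\min_{w \in \mathbb{R}^d} \; \lambda R(w) + \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(y_i, w^{\top} x_i\bigr),
\qquad
\begin{aligned}
&\text{hinge:} && \ell(y, t) = \max(0,\, 1 - y t), \\
&\text{L2 (squared hinge):} && \ell(y, t) = \max(0,\, 1 - y t)^2, \\
&\text{least squares:} && \ell(y, t) = (1 - y t)^2, \\
&\text{logistic:} && \ell(y, t) = \log\bigl(1 + e^{-y t}\bigr),
\end{aligned}
\]

where R(w) may be the L1 norm \(\lVert w \rVert_1\), the squared L2 norm \(\lVert w \rVert_2^2\), or a hybrid combination of the two.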