On the Convergence Rate of Stochastic Gradient Descent for Strongly Convex Functions
Based on this information, how can we determine the optimal classifier in the hypothesis space? According to statistical learning theory, the optimal classifier is the one that minimizes the true risk Ep[l(w, (Y,X))]. Empirical Risk Minimization (ERM) is a commonly used principle to estimate the optimal classifier using a finite data sample. It chooses the best classifier as the one that minimizes the empirical risk or regularized empirical risk:
Qn[l(w, (Y,X))] = 1 n
l(w, (Yi, Xi)) +R(w) (7.1)
where R is a regularizer, which penalizes the complexity of w. When Qn[l(w, (Y,X))] is convex in w ∈ W , a gradient descent algorithm
will find the optimum of this function. For simplification, we temporarily assume Qn[l(w, (Y,X))] is differentiable. Then, to run gradient descent, we need to compute the gradient ∇Qn[l(w, (y, x))]. Given the functional form of Qn[l(w, (y, x))], we see that to find the gradient of the empirical risk, we need to evaluate the loss on every sample in the data set. For large data sets, this imposes a lot of computation.