ABSTRACT

In higher dimensions, the exact Newton method multiplies the gradient by the inverse of the Hessian, thus rescaling the descent direction according to the local curvature. Our current example has just a single independent variable, so for us this means: take the first derivative and divide by the second. Like their exact counterpart, approximate Newton methods can work without a learning rate. In that case, they compute a descent direction and follow the scaled gradient as-is. The alternative strategy is to do an approximate line search. By now, you're probably not surprised: Just as approximate Newton is more feasible in practice than exact Newton, approximate line search is more practicable than exact line search. In our example, with line search a single iteration is sufficient to reach the minimum. Inspecting the individual losses, we also see that the algorithm reduces the loss nearly every time it probes the function, which, without line search, had not been the case.
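
To make the one-dimensional Newton step and the idea of an approximate (backtracking) line search concrete, here is a minimal sketch in Python. The loss function `f`, its derivatives, and the backtracking parameters are hypothetical stand-ins for illustration, not the code of the example discussed above.

```python
def f(x):
    return (x - 3.0) ** 2 + 1.0   # example loss with its minimum at x = 3

def f_prime(x):
    return 2.0 * (x - 3.0)        # first derivative

def f_second(x):
    return 2.0                    # second derivative (constant for this quadratic)

def newton_direction(x):
    # In one dimension, the Newton direction is the first derivative
    # divided by the second, with a sign flip so we descend.
    return -f_prime(x) / f_second(x)

def backtracking_line_search(x, direction, alpha=1.0, shrink=0.5, c=1e-4):
    # Approximate line search: start from the full step and shrink it until
    # the loss decreases sufficiently (Armijo condition), instead of solving
    # for the exact minimizer along the direction.
    while f(x + alpha * direction) > f(x) + c * alpha * f_prime(x) * direction:
        alpha *= shrink
    return alpha

x = 0.0
direction = newton_direction(x)
alpha = backtracking_line_search(x, direction)
x = x + alpha * direction
print(x)  # 3.0: the minimum, reached in a single iteration
```

Because this toy loss is quadratic, the Newton direction already points exactly at the minimum; the line search therefore accepts the full step, and a single iteration suffices, mirroring the behavior described above.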