ABSTRACT

The idea behind batch normalization follows directly from the basic mechanism of backpropagation. The solution Ioffe and Szegedy proposed was the following: at each pass, and for every layer, normalize the activations (a minimal sketch in code follows below). If that were all, however, some sort of levelling would occur; batch normalization therefore also learns a per-feature scale and shift, so the network can undo the normalization wherever the original scale of activations is useful.

Transfer, as a general concept, is what happens when we have learned to do one thing and benefit from those skills in learning something else. For example, we may have learned how to make some movement with our left leg; it will then be easier to learn how to do the same with our right leg. Or we may have studied Latin, and then found that it helped us a lot in learning French. In comparison, the typical usage of “transfer learning” in deep learning seems rather narrow at first glance. Concretely, it refers to making use of huge, highly effective models that have already been trained, for a long time, on a huge dataset.
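Returning to batch normalization: to make the normalize-then-rescale mechanism concrete, here is a minimal sketch in plain NumPy. It is not the authors' reference implementation; the function name, shapes, and epsilon value are illustrative.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then rescale and shift.

    x:     activations of one layer, shape (batch_size, num_features)
    gamma: learnable per-feature scale, shape (num_features,)
    beta:  learnable per-feature shift, shape (num_features,)
    """
    # Per-feature statistics computed over the current mini-batch.
    mean = x.mean(axis=0)
    var = x.var(axis=0)

    # Normalization step: zero mean, unit variance for every feature.
    x_hat = (x - mean) / np.sqrt(var + eps)

    # Without gamma and beta, every feature would be forced to the same
    # scale (the "levelling" mentioned above); the learnable parameters
    # let the network undo the normalization where that is beneficial.
    return gamma * x_hat + beta

# Illustrative usage on random activations.
rng = np.random.default_rng(0)
acts = rng.normal(loc=3.0, scale=2.0, size=(32, 4))
gamma = np.ones(4)   # initialized so the layer starts as pure normalization
beta = np.zeros(4)
out = batch_norm_forward(acts, gamma, beta)
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))
```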
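As a sketch of that narrower, deep-learning sense of transfer, the following assumes PyTorch and torchvision with downloadable pretrained ResNet-18 weights; the ten-class output layer is a made-up placeholder for whatever the new task requires.

```python
import torch.nn as nn
from torchvision import models

# Load a model that has already been trained on ImageNet (assumed available
# through torchvision; the exact weights argument may vary by version).
backbone = models.resnet18(weights="IMAGENET1K_V1")

# Freeze the pretrained parameters: the learned features are reused as-is.
for param in backbone.parameters():
    param.requires_grad = False

# Replace the final classification layer with one sized for the new task
# (10 classes here, an illustrative number); only this layer gets trained.
backbone.fc = nn.Linear(backbone.fc.in_features, 10)
```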