The criterion for choosing this function g is the low probability of error, i.e., the algorithm must choose g in such a way that when an unseen pair (x,y) ∈ X ×Y is chosen according to PX×Y and only x is presented to the algorithm, the probability that g(x) = y is minimized over a class of functions g ∈ G . In this case, the best function that one can hope for is based on the conditional distribution P(Y |X) and is given by η(x) = sign [E(Y |X = x)], which is known as the so called Bayes optimal classifier. Note that, when a learning algorithm constructs a function g based on a training set of size l, the function g is a random quantity and it depends on the training set size l. So it is a better idea to represent the function g as gl to emphasize its dependence on training set size l. A natural question, then, is what properties should one expect from gl as the training set size l increases? A learning algorithm is called consistent if gl converges to η (in appropriate convergence mode) as the training set size l tends to infinity. This is the best one can hope for, as it guarantees that as the training set size l increases, gl converges to the “right” function. Of course, “convergence rate” is also important, which specifies how fast gl converges to the right function. Thus, one would always prefer a consistent learning algorithm. It is clear that performance of such an algorithm improves as the training set size increases, or in other words, as we have capacity to label more and more examples.