Consider an input space X and an output space Y, where, given an example from the input space X, one would like to automatically predict its output. In traditional supervised learning, a learning algorithm is typically given a training set of the form {(x_{i}, y_{i})}^{l}_{i=1}, where each pair (x_{i}, y_{i}) ∈ X × Y is drawn independently at random according to an unknown joint probability distribution P_{X × Y}. In the case of a supervised classification problem, Y is a finite set of class labels and the goal of the learning algorithm is to construct a function g : X → Y that predicts the label y for a given x. For example, consider the problem of predicting whether an email is spam. This is a binary classification problem where Y = {+1, −1}, the label “+1” corresponds to a spam email, and the label “−1” corresponds to a non-spam email. Given a training set of email–label pairs, the learning algorithm must learn a function that, given a new email, predicts whether it is spam.
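The setup above can be sketched in a few lines of Python. The sketch below is purely illustrative: the feature representation (a bias term plus hypothetical counts of a "spammy" word and a "work" word) and the toy training pairs are assumptions, not data from the text, and the perceptron update rule stands in for an unspecified learning algorithm that produces g : X → Y with Y = {+1, −1}.

```python
# Illustrative sketch of supervised binary classification.
# Each email x_i is encoded as a feature vector (hypothetical encoding:
# bias, count of a "spammy" word, count of a "work" word);
# y_i is in {+1, -1}, with +1 meaning spam.

def train_perceptron(data, epochs=10):
    """Learn a weight vector w so that g(x) = sign(w . x)."""
    dim = len(data[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in data:
            # On a misclassified pair (x, y), nudge w toward y * x.
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

def g(w, x):
    """The learned function g : X -> Y, returning +1 (spam) or -1 (non-spam)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

# Toy training set {(x_i, y_i)} -- entirely made-up example data.
train = [
    ((1, 3, 0), +1),  # many "spammy" words -> spam
    ((1, 2, 0), +1),
    ((1, 0, 2), -1),  # work-related email -> non-spam
    ((1, 0, 3), -1),
]
w = train_perceptron(train)
print(g(w, (1, 4, 0)))  # a new spam-like email -> +1
print(g(w, (1, 0, 1)))  # a new work-like email -> -1
```

In practice the feature extraction and the learning algorithm would both be far more sophisticated, but the contract is the same: consume labeled pairs from X × Y, output a predictor g.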