Let us begin by recalling the familiar supervised learning framework. A large number of raw data objects, $\{x_i\}_{i=1}^{n}$, are typically collected by an automated process, such as a crawler in a web-page classification setting, a camera in video surveillance applications, or a microphone in speech recognition problems. Given an example $x$, we assume that there is an unknown true conditional distribution $P(y|x)$ over an output space $y \in \mathcal{Y}$. By human annotation effort, the desired outputs for a random subset of objects are obtained by sampling $y_i \sim P(y|x_i)$, $1 \le i \le l$, where $l$, the number of labeled examples, is very often far smaller than the total amount of data collected. Next, a typically high-dimensional numerical representation $\Psi(x) \in \mathcal{X} \subset \mathbb{R}^d$ is chosen for the raw data, and a supervised learning model is induced from the labeled samples as a proxy for the true underlying conditional distribution, i.e., $P(y|x) \approx P(y|\Psi(x), \theta)$, where the model parameters $\theta$ are tuned to fit the labeled examples while being regularized sufficiently to avoid overfitting.
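To make the framework concrete, the following is a minimal sketch, not a definitive implementation: the synthetic data, the noisy labeling rule standing in for the unknown $P(y|x)$, the subset size $l$, and the choice of an identity feature map $\Psi$ and an L2-regularized logistic regression for $P(y|\Psi(x), \theta)$ are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Raw data objects {x_i}, i = 1..n: here, n = 500 synthetic 2-D points
# standing in for crawled pages, video frames, or audio clips.
n, d = 500, 2
X_all = rng.normal(size=(n, d))

# The unknown true conditional P(y|x): for illustration, the label
# depends stochastically on the sign of the first coordinate.
def sample_label(x, rng):
    p_pos = 1.0 / (1.0 + np.exp(-4.0 * x[0]))  # P(y = 1 | x)
    return int(rng.random() < p_pos)

# Human annotation: only a small random subset of l << n objects
# receives labels y_i ~ P(y|x_i).
l = 30
labeled_idx = rng.choice(n, size=l, replace=False)
y_labeled = np.array([sample_label(X_all[i], rng) for i in labeled_idx])

# Representation Psi(x): the identity here; in practice a
# high-dimensional feature map into R^d.
def Psi(X):
    return X

# Proxy model P(y | Psi(x), theta): L2-regularized logistic regression,
# where C controls the regularization strength (smaller C = stronger).
model = LogisticRegression(C=1.0)
model.fit(Psi(X_all[labeled_idx]), y_labeled)

# Sanity check against freshly sampled labels for all n objects.
y_true = np.array([sample_label(x, rng) for x in X_all])
acc = (model.predict(Psi(X_all)) == y_true).mean()
print(f"labeled: {l}/{n}, accuracy: {acc:.2f}")
```

The point of the sketch is the shape of the problem, not the numbers: only $l$ of the $n$ collected objects are ever labeled, yet the fitted parameters $\theta$ are asked to approximate $P(y|x)$ over the entire input space.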