ABSTRACT

In both clustering and semisupervised learning, a broad distinction can be made between approaches that build on the cluster hypothesis – the idea that similar examples should have similar labels, or, in spatial terms, that nearby examples should belong to the same class – and the separation hypothesis – the idea that boundaries between classes should lie in sparsely populated regions of the instance space. Approaches that build on the cluster hypothesis seek to characterize what clusters look like. Approaches that build on the separation hypothesis are not concerned with what clusters look like, but rather with what boundary regions look like.

A related distinction is that between generative models and discriminative models. A generative model describes how instances of each class are generated. As we will see in chapter 7, one way of implementing the cluster hypothesis is via a mixture model: choose a class with probability p(y), then generate an instance with probability p(x|y). A probabilistic classifier p(y|x) arises as a posterior probability, derived by Bayes’ rule:

$$
p(y|x) = \frac{p(y)\,p(x|y)}{\sum_{y'} p(y')\,p(x|y')}.
$$
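
To make the generative story and the Bayes-rule posterior concrete, the following is a minimal sketch (not from the text), assuming two classes with one-dimensional Gaussian class-conditional densities p(x|y); the priors, means, and standard deviations are illustrative placeholders.

```python
import numpy as np
from scipy.stats import norm

# Illustrative mixture-model parameters (assumed, not from the text).
priors = {0: 0.6, 1: 0.4}            # class priors p(y)
params = {0: (-2.0, 1.0),            # class 0: mean and std of p(x|y=0)
          1: (+2.0, 1.0)}            # class 1: mean and std of p(x|y=1)

rng = np.random.default_rng(0)

def generate(n):
    """Generative story: draw y ~ p(y), then draw x ~ p(x|y)."""
    ys = rng.choice(list(priors), size=n, p=list(priors.values()))
    xs = np.array([rng.normal(*params[y]) for y in ys])
    return xs, ys

def posterior(x):
    """Bayes' rule: p(y|x) = p(y) p(x|y) / sum_{y'} p(y') p(x|y')."""
    joint = np.array([priors[y] * norm.pdf(x, *params[y])
                      for y in sorted(priors)])
    return joint / joint.sum()

# Generate a few instances and classify them via the posterior.
xs, ys = generate(5)
for x, y in zip(xs, ys):
    print(f"x = {x:+.2f} (true y = {y})  ->  p(y|x) = {posterior(x).round(3)}")
```

The sketch separates the two roles described above: `generate` plays the part of the generative model, while `posterior` turns the same model into a probabilistic classifier by applying Bayes' rule.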