ABSTRACT

Clustering, classification, and categorization are the triad of processes by which unstructured data is turned into information. In a sense, they discriminate information from noise. Clustering associates related content, while classification assigns meaning to the clusters and, in doing so, can potentially reorganize them. Finally, categorization is the process by which rich data and metadata are associated with the content in each class. This includes tagging, indexing, and assigning document relevance and other statistics. In this chapter, we show how to consider these three processes (clustering, classification, and categorization) simultaneously to create more robust and complete systems of content. The use of regularization to provide mechanisms for optimizing the information created is highlighted. The relationship between statistical approaches and classification is illustrated.