ABSTRACT

This chapter discusses some ways we can try to fix the data so that it works harder to satisfy our belief that there are clusters lurking in it somewhere, if only we can improve it enough. This is what feature analysis is all about. The chapter discusses the problems and issues associated with choosing which features or relationships to measure. The data analyst nominates features based on prior knowledge of the physical process that generates the data and on the questions the analyst will try to answer. Data cleansing, or data scrubbing, is the process of detecting and correcting corrupt or inaccurate records in a record set, table, or database. Feature selection can be quite useful for answering some questions, but not all of them. One of the most interesting but often confounding linear feature extraction techniques is based on random projection. Random projection brings probability into the mix by using randomly generated linear functions that preserve metric topology in a well-specified sense.
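As a rough illustration of the random projection idea mentioned above, the following minimal sketch maps data to a lower dimension with a Gaussian random matrix and compares pairwise distances before and after. It is not the chapter's own construction: the Gaussian matrix with 1/sqrt(k) scaling is one common choice, assumed here, and the function names (random_projection, pairwise_distances, distortion_ratios) are hypothetical names introduced only for this example.

```python
import numpy as np


def random_projection(X, k, seed=0):
    """Project the rows of X (n x d) to k dimensions with a random linear map.

    Assumption: entries of the projection matrix are i.i.d. N(0, 1),
    scaled by 1/sqrt(k) so squared lengths are preserved in expectation.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    R = rng.normal(0.0, 1.0, size=(d, k)) / np.sqrt(k)
    return X @ R


def pairwise_distances(X):
    """Euclidean distance matrix between the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    # Clip tiny negative values caused by floating-point round-off.
    return np.sqrt(np.maximum(D2, 0.0))


def distortion_ratios(X, k, seed=0):
    """Ratio (projected distance / original distance) for each pair of rows."""
    Y = random_projection(X, k, seed)
    DX = pairwise_distances(X)
    DY = pairwise_distances(Y)
    iu = np.triu_indices(X.shape[0], k=1)  # each unordered pair once
    return DY[iu] / DX[iu]


if __name__ == "__main__":
    # Synthetic data, for illustration only.
    X = np.random.default_rng(1).normal(size=(50, 1000))
    r = distortion_ratios(X, k=300)
    print("min ratio:", r.min(), "max ratio:", r.max())
```

Ratios close to 1 indicate that the projection kept the pairwise distances roughly intact. How close they stay to 1 depends on the target dimension k and on the random draw, which is where probability enters.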