ABSTRACT

Before large volumes of geographic data are passed through a machine learning algorithm, the data must be carefully organized. Organizing spatial data prior to clustering involves several theoretical and practical considerations. This chapter guides the reader through the techniques and practical approaches involved in preparing spatial data before clustering algorithms are deployed. Small area classifications are developed for either general-purpose or bespoke uses. The chapter therefore starts by explaining why researchers and practitioners must clarify the purpose for which they intend to develop a classification. Next, the discussion sets out ten principles for selecting input variables: theoretical relevance, objectivity, policy relevance, measurability and replicability, auditability, coverage, comparability, flexibility, updatability, and longevity. This is followed by a discussion of statistical techniques for quality control and of multivariate statistical techniques for judging the appropriateness of input variables. The chapter also explains how to handle outliers and how to work with variables measured in different units. It concludes with a discussion of how to weight variables during spatial data preparation.