ABSTRACT

This chapter highlights the underappreciated value of data visualization as a tool for deciding which data to use, what form that data should take, and how to iteratively improve model performance. A histogram is not in itself “complex,” but using one to evaluate and diagnose potential problems with data is a more sophisticated use of the visualization: it becomes a diagnostic tool rather than a “nice to have.” This is particularly important when the analyst has limited domain knowledge. The students went through this process for all 400 potential predictors. As they developed some facility with the data, building a degree of domain expertise in real time, they were able to anticipate expected distributions, the percentage of missing values, optimal replacement strategies, and optimal transformations.