This book is about exploratory data analysis (EDA) and how interactive graphical methods can help us gain further insights into a dataset and generate new questions and hypotheses. John W. Tukey often referred to EDA as experimental work. Tukey and Wilk (1965) summarize data analysis by saying it “… must be considered as an open-ended, highly interactive, iterative process, whose actual steps are segments of a stubbily branching, tree-like pattern of possible actions.” Visualization of the data is probably one of the most powerful tools in this exploration process, as the role of the researcher in EDA is to explore the data in as many different ways as possible until a plausible “story” of the data emerges. A typical data analysis comprises the following eight steps:
Plan the study
A well-thought-out study design that respects the study goals should be the initial step of any data analysis. For very clear-cut questions like optimizing the yield in a plant, optimal designs can be chosen – see Pukelsheim (2006), for instance.
Unfortunately, statisticians are often consulted only after the data have been collected and thus cannot influence the study design. EDA methods can cope with the “here is the data” situation much more easily because they do not rely on a priori hypotheses and distributions.
Understand the background and collect questions
Analyzing data without any understanding of their background is almost impossible. This is often neglected in the classical teaching of mathematical statistics. Only if procedures and techniques relate to actual data can we find interpretable results and give proper recommendations. Thus, a study of the background and the data sources is extremely important for a successful data analysis.
Check the data for errors
From textbook examples, we are used to looking at what we regard as “clean” datasets. There are no obvious errors and the data seem to be consistent. But even for those datasets, the origin and background of the data sometimes remain somewhat unclear, and a further exploration may reveal consistency problems or other oddities. An internet poll from KDnuggets (http://www.kdnuggets.com/polls/2003/data_preparation.htm) shows that almost two thirds of the respondents spend more than 60% of their time in a data mining project on data cleaning and data preparation. Obviously, typical data mining applications deal with complex mixtures of different generating processes. Even if we assume that the “classical” datasets we face in the daily business of a data analyst are of better quality, there still remains a lot to do when initially checking the data for errors.
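An initial error check of this kind can be largely automated. The following is a minimal sketch, assuming a small hypothetical survey dataset; the field names, records, and plausibility ranges are purely illustrative and not taken from any dataset in this book.

```python
# Hypothetical records; two of the four contain typical errors.
records = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": -3, "income": 61000},   # impossible age
    {"id": 3, "age": 45, "income": None},    # missing value
    {"id": 3, "age": 45, "income": None},    # duplicate id
]

def check_records(records):
    """Collect simple consistency problems instead of failing fast."""
    problems = []
    seen_ids = set()
    for r in records:
        if r["id"] in seen_ids:
            problems.append((r["id"], "duplicate id"))
        seen_ids.add(r["id"])
        if r["age"] is None or not (0 <= r["age"] <= 120):
            problems.append((r["id"], "implausible age"))
        if r["income"] is None:
            problems.append((r["id"], "missing income"))
    return problems

for rid, msg in check_records(records):
    print(rid, msg)
```

Collecting all problems in one pass, rather than stopping at the first, gives an overview of how dirty the data really are before any cleaning decisions are made.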
Explore the data
Although any of the preceding steps might already be targeted toward a possible solution, exploring the data with the initially collected questions in mind is at the core of any data analysis. There is almost no limit to the data-analytic tools that can be used at this stage. Sometimes we might want to collect more data, or similar data with the same background. At other times we need purely computer-science-related techniques to get to the most interesting subset. In any case, we can benefit strongly from the use of (interactive) graphics, and in order to judge correctly what we see, we need an understanding of the fundamental concepts of randomness and statistical distributions.
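Before any graphics are drawn, exploration often starts with simple numeric summaries that suggest where to look next. A minimal sketch with made-up measurements for two hypothetical groups (the group names and values are illustrative only):

```python
import statistics

# Illustrative measurements for two hypothetical groups.
groups = {
    "control":   [4.9, 5.1, 5.0, 4.8, 5.2],
    "treatment": [5.6, 5.9, 5.7, 6.1, 5.8],
}

# Location and spread per group: a first, crude "view" of the data
# that hints at which comparisons are worth plotting.
for name, values in groups.items():
    print(f"{name}: mean={statistics.mean(values):.2f}, "
          f"sd={statistics.stdev(values):.2f}, "
          f"min={min(values)}, max={max(values)}")
```

Such summaries are no substitute for graphical exploration, but they help decide which variables and subsets deserve a closer, interactive look.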
Review the initial questions
Sometimes checking the data for errors (step 3) might take us back to step 1, where the data were collected. Far more frequently, we might want to review the initial questions after gaining further insight into the data during the exploration. New questions might arise, and others might need to be reformulated or posed more precisely.
Generate hypotheses and build statistical models
Statistical tests and models are used to separate signal from noise. For smaller datasets, we need statistical tests to judge whether an apparent effect is more than random variation. For large datasets, most of the effects we see graphically are actually significant, and with really large datasets almost every effect we might want to test will turn out to be significant.
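The sample-size point can be made concrete with a small sketch. The numbers here are illustrative: a fixed, tiny difference between two group means (5% of a known standard deviation) is tested with a two-sample z-statistic, z = d·√(n/2), at growing sample sizes.

```python
import math

effect = 0.05  # difference in means, in units of the (known) standard deviation

def z_stat(effect, n):
    """Two-sample z-statistic for n observations per group,
    assuming unit variance in both groups."""
    return effect * math.sqrt(n / 2)

for n in (100, 10_000, 1_000_000):
    z = z_stat(effect, n)
    print(f"n={n:>9}: z={z:6.2f}, significant at 5%: {z > 1.96}")
```

The same negligible effect is far from significant at n = 100 but highly significant at n = 10,000 and beyond, which is why, with very large datasets, statistical significance alone says little about practical relevance.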
Analyze residuals and review hypotheses and models
Especially in a multivariate context, the residuals of a statistical model may point to further structural features that did not show up in any low-dimensional view. Be it a missing factor, outliers, or a further subsetting of the data, any remaining structure in the residuals calls the corresponding model into question.
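A minimal sketch of how residuals expose a missed structural feature, using made-up data: the true relationship is quadratic, but we fit a straight line, so the residuals still carry the curvature the linear model ignored.

```python
xs = [float(x) for x in range(-5, 6)]
ys = [x * x for x in xs]          # purely quadratic signal, no noise

# Closed-form ordinary least-squares fit of y = a + b*x.
n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
     / sum((x - mx) ** 2 for x in xs))
a = my - b * mx

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# The residuals are not random noise: they form a U-shape,
# positive at both ends and negative in the middle -- a clear
# sign that a (quadratic) term is missing from the model.
print([round(r, 1) for r in residuals])
```

Plotting such residuals against the fitted values or against a candidate variable is usually the quickest way to spot the missing term.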
Interpretation and concluding recommendations
At the end of any analysis, we want to arrive at an interpretation of the results. This is tightly linked to step 2, where we acquire a solid background understanding of the data. Interpreting the results means, in particular, verifying them for plausibility and checking their relevance. Once this is done, we can usually give recommendations concerning the problem addressed by the data analysis.