ABSTRACT

Social data are often highly heterogeneous, coming from a population composed of diverse classes of individuals, each with their own characteristics and behaviors. As a result of this heterogeneity, a model learned on population data may not make accurate predictions on held-out test data or offer analytic insights into the underlying behaviors that motivate interventions. To illustrate, consider Figure 16.1, which shows data collected for a hypothetical nutrition study measuring how the outcome, body mass index (BMI), changes as a function of daily pasta calorie intake. Multivariate linear regression (MLR) analysis finds a negative relationship between these variables in the population (dotted line). The negative trend suggests that, paradoxically, increased pasta consumption is associated with lower BMI. However, unbeknownst to the researchers, the hypothetical population is heterogeneous, composed of classes that vary in their fitness level. These classes (clusters in Figure 16.1) represent, respectively, people who do not exercise, people with a normal activity level, and athletes. When the data are disaggregated by fitness level, the trend within each subgroup is positive (dashed lines), leading to the conclusion that increased pasta consumption is in fact associated with higher BMI. Recommendations for pasta consumption arising from the naive analysis are the opposite of those arising from a more careful analysis that accounts for the confounding effect of the different classes of people. This trend reversal is an example of Simpson's paradox, which has been widely observed in many domains, including biology, psychology, astronomy, and computational social science (Chuang et al., 2009; Kievit et al., 2013; Minchev et al., 2019; Blyth, 1972).
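The trend reversal described above can be reproduced with a small simulation. The following is a minimal sketch on synthetic data, not the data behind Figure 16.1: the group means, the within-group slope, and the noise level are all assumed values chosen only so that the pooled and per-group trends disagree in sign.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fitness classes: (mean daily pasta calories, mean BMI).
# These numbers are illustrative assumptions, not values from the study.
groups = {
    "no exercise": (200.0, 30.0),
    "normal activity": (500.0, 25.0),
    "athletes": (800.0, 20.0),
}
WITHIN_SLOPE = 0.01   # assumed BMI increase per pasta calorie inside a class
N_PER_GROUP = 100

xs, ys, group_slopes = [], [], {}
for name, (mean_cal, mean_bmi) in groups.items():
    # Pasta intake varies around the class mean; BMI rises with intake.
    x = rng.normal(mean_cal, 50.0, N_PER_GROUP)
    y = mean_bmi + WITHIN_SLOPE * (x - mean_cal) + rng.normal(0.0, 0.5, N_PER_GROUP)
    # Least-squares line within this class; polyfit returns [slope, intercept].
    group_slopes[name] = np.polyfit(x, y, 1)[0]
    xs.append(x)
    ys.append(y)

# Naive analysis: one least-squares line fitted to the pooled population.
pooled_slope = np.polyfit(np.concatenate(xs), np.concatenate(ys), 1)[0]

print(f"pooled slope: {pooled_slope:+.4f}")
for name, slope in group_slopes.items():
    print(f"{name:>16} slope: {slope:+.4f}")
```

Under these assumptions the pooled fit has a negative slope while every within-class fit has a positive one, because the classes with higher average intake have lower average BMI. Conditioning on the class variable removes that between-class effect and recovers the within-class trend.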