ABSTRACT

This chapter has been about thinking about your data as a distribution and specifically characterizing that distribution by its central tendency in the form of the arithmetic mean or average. I have urged you to think about that distribution in a particular way: guessing what a single individual’s score might be in that distribution. Lots of statistical analysis is about trying to guess one thing from another (the highfalutin’ term for that is estimation). And we start guessing with the mean, because lots of our variables approximate normal distributions, and the mean has such nice properties for those distributions.

A simple thing like central tendency can get complicated fast. For example, another way of characterizing a distribution would be to find the value of your variable (hloc in our case) that divides the sample into two equal parts. Then you would be right in the middle of the distribution. This is called the median. Another way of characterizing a distribution is called the mode, which is simply the most frequent score. The median and the mode are useful characterizations. In our sample, the mean, as we have seen, is 7.85, the median is 8.0, and the mode is 8.0. This is another clue regarding the degree to which hloc approximates a normal distribution. The more normal the distribution, the closer the mean, median, and mode. When those three measures of central tendency start to diverge, the distribution looks less and less like a bell curve.
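The agreement among the three measures is easy to check directly. Here is a minimal sketch using Python's standard `statistics` module; the scores below are hypothetical stand-ins for the hloc sample (the real data are not reproduced in this abstract), chosen to be roughly bell-shaped:

```python
from statistics import mean, median, mode

# Hypothetical, roughly symmetric scores standing in for hloc.
hloc = [6, 7, 7, 8, 8, 8, 8, 9, 9, 10]

print(mean(hloc))    # arithmetic mean: the sum divided by the count
print(median(hloc))  # the value splitting the sorted sample in half
print(mode(hloc))    # the most frequent score
```

For a sample this symmetric, all three land on (or very near) the same value, which is exactly the clue the chapter describes: the closer they sit, the more bell-shaped the distribution.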

There are actually good reasons why some distributions depart more from normality. For example, I’ve got another data set from Brazil with a scale of depressive symptoms in it. The mean for that distribution is 19.5, the median is 16, and the mode is 7. That pattern indicates a distribution with a long right-hand “tail”: there are a few people reporting rather high levels of depressive symptoms, and they drag the mean up (the tails of a distribution are the two ends of the curve superimposed on top of it, where the curve trails off toward zero). The point dividing the scale into two equal groups is lower than the mean, and the point where “most” people pile up is lower still. Why does this distribution look this way? Think about it: would you want to live in a world where depressive symptoms are normally distributed?
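That mean > median > mode ordering is the signature of right skew, and you can reproduce it with any sample that has a long right tail. The scores below are hypothetical, not the Brazilian data, but they show the same pattern:

```python
from statistics import mean, median, mode

# Hypothetical right-skewed scores: most people report few symptoms,
# while a few very high scores drag the mean upward.
symptoms = [2, 3, 3, 3, 4, 5, 7, 10, 15, 28]

print(mean(symptoms))    # pulled up by the tail
print(median(symptoms))  # lower than the mean
print(mode(symptoms))    # lower still: where scores pile up
```

Here the mean comes out highest, the median lower, and the mode lowest of all, in the same order as the 19.5 / 16 / 7 figures in the text.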

But even the departure of a distribution from normality—what is called the skewness of the distribution—does not preclude treating it as if it were closer to normal, as long as it’s not too skewed and as long as we keep our wits about us (that is, we know that we are doing it). Why would we want to do that? Well, we will see that the payoffs from being able to use the statistical techniques based on the normal distribution are large in terms of analytic strength, but, again, we need to be aware of what we are doing, so as not to say things we really ought not to say.

If we just can’t treat the distribution of a variable as approximating normality, we need not despair—there are lots of other things we can do! But, again, that’s the “gravy” part of this. What I want you to understand are the basics of statistical inference, and concentrating on variables that tend toward normality and taking advantage of that tendency are what we want to do right now.

Also, just for the sake of completeness: I have been emphasizing the arithmetic mean. There are other ways of computing means (such as the “harmonic” mean and the “geometric” mean), but again, that’s not our point here. Our point is to learn the basics, and the main point here is that the arithmetic mean is a good guess regarding somebody’s score value in a distribution, even though you know you will almost certainly be wrong.
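For the curious, the three kinds of mean mentioned above are all one line each in Python’s `statistics` module (the geometric mean requires Python 3.8 or later). The two values below are arbitrary illustrations, not data from the chapter:

```python
from statistics import mean, harmonic_mean, geometric_mean

values = [2.0, 8.0]  # arbitrary illustrative values

print(mean(values))            # arithmetic: (2 + 8) / 2 = 5.0
print(geometric_mean(values))  # sqrt(2 * 8) = 4.0
print(harmonic_mean(values))   # 2 / (1/2 + 1/8) = 3.2
```

Note that the three means generally disagree (arithmetic ≥ geometric ≥ harmonic for positive values); each answers a different question, and for guessing a score in a roughly normal distribution, the arithmetic mean is the one we want.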