ABSTRACT

As we discussed earlier, statistics attempts to find the likely underlying probability distribution that produced the data we observed. In almost all applications of statistics, the true underlying probability distribution (or model) is unknown. As a result, finding the correct model is a process of careful sleuthing, which inevitably includes two general steps: an initial guess at the model form (which distribution), and the estimation of the unknown model parameters. In this book, we use the term model as a generic term for the probability distribution model. Inevitably, the first question in any statistical analysis should be about the form of the distribution. How should we decide which model is appropriate for the problem at hand? This question, a version of the problem of induction originating with Hume [1777], is impossible to answer in general, for two reasons. First, many alternative models may lead to the same likelihood of producing the data we observed. Second, even when we find a unique model that explains the observations made so far, we cannot be sure that the model will remain correct in the future. In Hume's words, our inductive practices have no rational foundation, for no form of reason will certify them.

Philosophical arguments about the impossibility of causal inference aside, statistical thinking is a form of inductive reasoning that follows a quasi-falsificationist approach. The basis of Fisher's statistical reasoning can be interpreted through Popper's falsification theory, an attempt to solve the problem of induction. Popper suggests that there is no positive solution to the problem of induction (no matter how many instances of white swans we may have observed, we are not justified in concluding that all swans are white). But theories, while they cannot be logically proved by empirical observations, can sometimes be refuted by them (e.g., by sighting a black swan). Furthermore, a theory can be "corroborated" if its logical consequences are confirmed by suitable experiments.

Statistical inference starts with an assumption or theory, usually in the form of a specific probability distribution. Because statistical assumptions cannot be directly refuted, inference is usually based on evidence in the data that contradicts the theory; if the evidence is strong, we reject the theory. Once a theory is corroborated, that is, once a probability distribution model is established as the likely representation of the true underlying distribution, the model parameters are estimated. In most tests, statistical inference is presented in terms of specific values of the parameter of interest. This is because the theory about the probability distribution is inevitably subject-matter specific. As a result, the discussion of statistical inference is largely conditional on knowledge of the underlying distribution. The hypothesis testing procedure is the focus of this chapter.
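
The two-step reasoning sketched above can be made concrete in code. The following is a minimal sketch, not a procedure prescribed by this book: it assumes Python with scipy.stats, uses simulated data as a stand-in for observations, and pairs a goodness-of-fit test (seeking evidence against a tentative distributional form) with maximum likelihood estimation of the parameters once the form survives scrutiny.

```python
# A minimal sketch of quasi-falsificationist inference, assuming scipy.stats.
# Step 1: tentatively assume a model form and look for evidence against it.
# Step 2: if the form is corroborated (not proven), estimate its parameters.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(loc=10.0, scale=2.0, size=200)  # stand-in for observed data

# Step 1: test the tentative theory "the data are normally distributed".
# A small p-value is strong evidence in the data against the assumed form.
statistic, p_value = stats.shapiro(data)
if p_value < 0.05:
    print(f"Normality rejected (p = {p_value:.3f}); revise the model form.")
else:
    # Step 2: the form is corroborated, so estimate the unknown parameters.
    mu_hat, sigma_hat = stats.norm.fit(data)  # maximum likelihood estimates
    print(f"Form corroborated (p = {p_value:.3f}); "
          f"estimated mean = {mu_hat:.2f}, sd = {sigma_hat:.2f}")
```

Note that a non-rejection here corroborates the normal form rather than proving it, in keeping with Popper's point: further data could still refute the model.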