ABSTRACT
From the Bayesian perspective, there are two types of quantities: known
and unknown. The goal is to use the known quantities along with a specified
parametric expression to make inferential statements about the unknown
quantities. The definition of such unknown quantities is very general; they can
be any missing data or unknown parameters. In the most basic scenario
there is a single unknown model parameter of interest, $\theta$, and some observed
data $D$. We then stipulate a joint probability function that describes how
these quantities behave in conjunction, $p(\theta, D)$. Starting with the definition
of conditional probability, $p(D \mid \theta) = p(\theta, D)/p(\theta)$, rearrange to get:
$$
p(D, \theta) = p(\theta)\, p(D \mid \theta). \tag{3.1}
$$
This says that the joint distribution of the data and the model parameter is
a product of the unconditional distribution of the model parameter and the
distribution of the data given a value of the parameter. The first quantity
in this product is called the prior distribution of $\theta$, and the second is
just the customary likelihood function.
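As a concrete illustration of this factorization, consider a hypothetical coin-flipping setup (not from the source): a Beta$(a, b)$ prior on $\theta$ paired with a binomial likelihood for $y$ successes in $n$ trials.

% Hypothetical beta-binomial illustration of the factorization in (3.1):
% prior theta ~ Beta(a, b); data D: y successes in n Bernoulli trials.
\begin{align*}
p(\theta) &= \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\,
             \theta^{a-1}(1-\theta)^{b-1},
&
p(D \mid \theta) &= \binom{n}{y}\,\theta^{y}(1-\theta)^{n-y},
\\[4pt]
p(D, \theta) &= p(\theta)\, p(D \mid \theta)
  \;\propto\; \theta^{a+y-1}(1-\theta)^{b+n-y-1}.
\end{align*}

The kernel of a Beta$(a+y,\, b+n-y)$ density is already visible in the joint, which is what the manipulations below exploit.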
Of course, we are really interested in obtaining $p(\theta \mid D)$, which is the
distribution of the unknown quantity given the known quantity. This is where
Bayes' law comes into play. Since $p(D, \theta) = p(\theta, D)$, we can produce the
following equality:
$$
p(\theta)\, p(D \mid \theta) = p(D)\, p(\theta \mid D), \tag{3.2}
$$
simply by applying the definition of conditional probability two different
ways. What Bayes (1763) did was rearrange (3.2) to produce:
$$
p(\theta \mid D) = \frac{p(\theta)\, p(D \mid \theta)}{p(D)}, \tag{3.3}
$$
which gives the desired probability statement on the left-hand side. This
states that the distribution of the unknown parameter conditioned on the
observed data is equal to the product of the prior distribution assigned to
the parameter and the likelihood function, divided by the unconditional
probability of the data.
The form of (3.3) can be expressed as:
$$
\pi(\theta \mid D) = \frac{p(\theta)\, L(\theta \mid D)}
{\int p(\theta)\, L(\theta \mid D)\, d\theta}, \tag{3.4}
$$
where $L(\theta \mid D)$ is an expression for $p(D \mid \theta)$ reminding us that this is a
likelihood function, and $\int p(\theta)\, L(\theta \mid D)\, d\theta$ is an expression for $p(D)$
obtained by integrating the numerator over the support of $\theta$. This term is
typically called the normalizing constant or the prior predictive distribution,
although it is actually the marginal distribution of the data, and it ensures
that $\pi(\theta \mid D)$ integrates to one as required by the definition of a probability
function.
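To make the denominator concrete, the following sketch evaluates $\int p(\theta)\, L(\theta \mid D)\, d\theta$ by numerical quadrature for an assumed beta-binomial model (the prior parameters and data are invented for illustration) and confirms that the normalized posterior integrates to one.

# Numerical sketch of the normalizing constant in (3.4) for an assumed
# beta-binomial model; prior parameters and data are illustrative only.
from scipy import integrate, stats

a, b = 2.0, 2.0    # hypothetical Beta(a, b) prior on theta
y, n = 7, 10       # hypothetical data: 7 successes in 10 trials

def prior(theta):
    return stats.beta.pdf(theta, a, b)       # p(theta)

def likelihood(theta):
    return stats.binom.pmf(y, n, theta)      # L(theta | D) = p(D | theta)

# p(D): integrate the numerator of (3.4) over the support of theta.
p_data, _ = integrate.quad(lambda t: prior(t) * likelihood(t), 0.0, 1.0)

def posterior(theta):
    return prior(theta) * likelihood(theta) / p_data   # pi(theta | D)

total, _ = integrate.quad(posterior, 0.0, 1.0)
print(p_data, total)   # total is 1.0 up to quadrature error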
A more compact and succinct form of (3.4) is developed by
dropping the denominator and using proportional notation, since $p(D)$ does
not depend on $\theta$ and therefore provides no relative inferential information
about more likely values of $\theta$:
$$
\pi(\theta \mid D) \propto p(\theta)\, L(\theta \mid D), \tag{3.5}
$$
meaning that the unnormalized posterior (sampling) distribution of the
parameter of interest is proportional to the prior distribution times the
likelihood function:
$$
\text{Posterior Probability} \;\propto\; \text{Prior Probability} \times \text{Likelihood Function}.
$$
It is often (but not always, see later chapters) easy to renormalize the
posterior distribution as the last stage of the analysis.
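As a closing sketch of (3.5) and this renormalization step, the grid approximation below (again an assumed beta-binomial setup with invented numbers) evaluates the unnormalized product of prior and likelihood, then renormalizes at the end; the conjugate Beta$(a+y,\, b+n-y)$ result provides a check.

# Grid-approximation sketch of (3.5): evaluate prior * likelihood, then
# renormalize as the last stage. Model and numbers are illustrative only.
import numpy as np
from scipy import stats

a, b = 2.0, 2.0                          # hypothetical Beta(a, b) prior
y, n = 7, 10                             # hypothetical data

theta = np.linspace(0.001, 0.999, 999)   # grid over the support of theta
unnorm = stats.beta.pdf(theta, a, b) * stats.binom.pmf(y, n, theta)

# Renormalize so the discretized posterior integrates to one.
spacing = theta[1] - theta[0]
posterior = unnorm / (unnorm.sum() * spacing)

# Check against the known conjugate answer, Beta(a + y, b + n - y).
exact = stats.beta.pdf(theta, a + y, b + n - y)
print(np.abs(posterior - exact).max())   # small grid-approximation error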