ABSTRACT

From the Bayesian perspective, there are two types of quantities: known and unknown. The goal is to use the known quantities along with a specified parametric expression to make inferential statements about the unknown quantities. The definition of such unknown quantities is very general; they can be any missing data or unknown parameters. In the most basic scenario there is a single unknown model parameter of interest, θ, and some observed data D. We then stipulate a joint probability function that describes how these quantities behave in conjunction, p(θ, D). Starting with the definition of conditional probability, p(D | θ) = p(θ, D)/p(θ), rearrange to get:

p(D, θ) = p(θ) p(D | θ).    (3.1)

This says that the joint distribution of the data and the model parameter is a product of the unconditional distribution of the model parameter and the distribution of the data given a value of the parameter. The first quantity in this product is called the prior distribution of θ, and the second is just the customary likelihood function.
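
As a concrete illustration of (3.1) (not an example from the text), the short sketch below evaluates the prior, the likelihood, and their product for a hypothetical beta-binomial setup: a Beta(2, 2) prior on θ and binomial data of y = 7 successes in n = 10 trials. The hyperparameters, the data, and the variable names are all assumptions chosen for illustration.

```python
# Sketch of equation (3.1): p(D, theta) = p(theta) * p(D | theta).
# The Beta(2, 2) prior and the data (7 successes in 10 trials) are
# illustrative assumptions, not values taken from the text.
from scipy import stats

a, b = 2.0, 2.0          # assumed prior hyperparameters
n, y = 10, 7             # assumed observed data D

for theta in (0.3, 0.5, 0.7):
    prior = stats.beta.pdf(theta, a, b)        # p(theta)
    likelihood = stats.binom.pmf(y, n, theta)  # p(D | theta)
    joint = prior * likelihood                 # p(D, theta), equation (3.1)
    print(f"theta={theta:.1f}  prior={prior:.4f}  "
          f"likelihood={likelihood:.4f}  joint={joint:.4f}")
```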

Of course, we are really interested in obtaining p(θ | D), which is the distribution of the unknown quantity given the known quantity. This is where Bayes' law comes into play. Since p(D, θ) = p(θ, D), we can produce the following equality:

p(θ) p(D | θ) = p(D) p(θ | D),    (3.2)

simply by applying the definition of conditional probability two different ways. What Bayes (1763) did was rearrange (3.2) to produce:

p(θ | D) = p(θ) p(D | θ) / p(D),    (3.3)

which gives the desired probability statement on the left-hand side. This states that the distribution of the unknown parameter conditioned on the observed data is equal to the product of the prior distribution assigned to the parameter and the likelihood function, divided by the unconditional probability of the data.
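
To make (3.2) and (3.3) concrete with numbers, the fragment below works through a tiny discrete case that is not part of the original text: θ takes only two values (a fair coin and a biased coin), the data are three heads in three tosses, and both sides of (3.2) are computed directly. Every number here is an illustrative assumption.

```python
# Discrete illustration of (3.2) and (3.3); all values are assumptions.
# theta takes two values: "fair" (heads probability 0.5) and "biased" (0.8).
thetas = {"fair": 0.5, "biased": 0.8}
prior = {"fair": 0.5, "biased": 0.5}       # p(theta)

def likelihood(rate):
    """p(D | theta) for the assumed data D: three heads in three tosses."""
    return rate ** 3

# Marginal probability of the data: p(D) = sum over theta of p(theta) p(D | theta).
p_D = sum(prior[k] * likelihood(thetas[k]) for k in thetas)

for k in thetas:
    posterior = prior[k] * likelihood(thetas[k]) / p_D   # equation (3.3)
    lhs = prior[k] * likelihood(thetas[k])                # left-hand side of (3.2)
    rhs = p_D * posterior                                 # right-hand side of (3.2)
    print(f"{k:>6}: p(theta | D) = {posterior:.3f}   (3.2) check: {lhs:.4f} = {rhs:.4f}")
```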

The form of (3.3) can be expressed as:

π(θ | D) = p(θ) L(θ | D) / ∫ p(θ) L(θ | D) dθ,    (3.4)

where L(θ | D) is an expression for p(D | θ), reminding us that this is a likelihood function, and ∫ p(θ) L(θ | D) dθ is an expression for p(D) obtained by integrating the numerator over the support of θ. This term is typically called the normalizing constant or the prior predictive distribution, although it is actually the marginal distribution of the data, and it ensures that π(θ | D) integrates to one as required by the definition of a probability function.
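
The integral in the denominator of (3.4) often has no convenient closed form, but for a one-dimensional θ it can be approximated on a grid. The sketch below, which reuses the assumed Beta(2, 2)-binomial setup from above rather than anything in the text, approximates p(D) with a simple Riemann sum and confirms that the resulting π(θ | D) integrates to one.

```python
# Grid approximation of the normalizing constant in (3.4).
# The Beta(2, 2) prior and the binomial data are illustrative assumptions.
import numpy as np
from scipy import stats

a, b = 2.0, 2.0
n, y = 10, 7

theta = np.linspace(0.0, 1.0, 2001)        # grid over the support of theta
d_theta = theta[1] - theta[0]
prior = stats.beta.pdf(theta, a, b)        # p(theta)
like = stats.binom.pmf(y, n, theta)        # L(theta | D) = p(D | theta)

p_D = np.sum(prior * like) * d_theta       # marginal distribution of the data
posterior = prior * like / p_D             # pi(theta | D), equation (3.4)

print("p(D) is approximately", round(float(p_D), 6))
print("posterior integrates to", round(float(np.sum(posterior) * d_theta), 6))
```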

A more compact and succinct form of (3.4) is developed by dropping the denominator and using proportional notation, since p(D) does not depend on θ and therefore provides no relative inferential information about more likely values of θ:

π(θ | D) ∝ p(θ) L(θ | D),    (3.5)

meaning that the unnormalized posterior (sampling) distribution of the parameter of interest is proportional to the prior distribution times the likelihood function:

Posterior Probability ∝ Prior Probability × Likelihood Function.

It is often (but not always; see later chapters) easy to renormalize the posterior distribution as the last stage of the analysis.
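
As one last sketch under the same assumed beta-binomial setup, the code below follows (3.5): it forms only the unnormalized product p(θ) L(θ | D), renormalizes on a grid as the final step, and checks the result against the closed-form conjugate posterior Beta(a + y, b + n - y), which is exact for this particular prior-likelihood pair.

```python
# Working with the unnormalized posterior of (3.5) and renormalizing last.
# The beta-binomial setup is an illustrative assumption, not the text's example.
import numpy as np
from scipy import stats

a, b = 2.0, 2.0
n, y = 10, 7

theta = np.linspace(0.0, 1.0, 2001)
d_theta = theta[1] - theta[0]

# Unnormalized posterior: prior times likelihood, equation (3.5).
unnormalized = stats.beta.pdf(theta, a, b) * stats.binom.pmf(y, n, theta)

# Renormalize as the last stage of the analysis.
posterior = unnormalized / (np.sum(unnormalized) * d_theta)

# Conjugacy check: the exact posterior here is Beta(a + y, b + n - y).
exact = stats.beta.pdf(theta, a + y, b + n - y)
print("max absolute difference from the exact Beta posterior:",
      float(np.max(np.abs(posterior - exact))))
```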