
The importance of so-called importance sampling in Markov chain Monte Carlo (MCMC) is not what gives it that name. It is the idea that “any sample can come from any distribution” (Trotter and Tukey, 1956). Suppose that we have a Markov chain $X_1, X_2, \ldots$ having properly normalized density $f$ for its equilibrium distribution. Let $f_\theta$ denote a parametric family of densities, each absolutely continuous with respect to $f$. Then

$$\hat{\mu}_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} g(X_i)\, \frac{f_\theta(X_i)}{f(X_i)} \tag{11.1}$$

is a sensible estimator of

$$\mu(\theta) = E_\theta\{g(X)\} \tag{11.2}$$

for all $\theta$, because by the Markov chain law of large numbers (Meyn and Tweedie, 1993, Theorem 17.1.7),

$$\hat{\mu}_n(\theta) \xrightarrow{\text{a.s.}} E_f\left\{ g(X)\, \frac{f_\theta(X)}{f(X)} \right\} = \int g(x)\, \frac{f_\theta(x)}{f(x)}\, f(x)\, dx = \int g(x)\, f_\theta(x)\, dx$$

(the requirement that $f_\theta$ be absolutely continuous with respect to $f$ ensures that we divide by zero in the middle expressions only with probability zero, so the value of the integral is not affected). With one sample from one distribution $f$ we learn about $\mu(\theta)$ for all $\theta$. Monte Carlo standard errors (MCSEs) for importance sampling are straightforward: we just calculate the MCSE for the functional of the Markov chain (Equation 11.1) that gives our importance sampling estimator. This means we replace $g$ in Equation 1.6 in Chapter 1 (this volume) by $g f_\theta / f$.
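To fix ideas, here is a minimal sketch in Python/NumPy, not from the source: the target $f$ is standard normal (sampled by random-walk Metropolis), $f_\theta$ is the Normal$(\theta, 1)$ density, $g(x) = x$ (so $\mu(\theta) = \theta$), and the MCSE is computed by nonoverlapping batch means as one illustrative recipe. All function names and parameters here are our own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def log_f(x):
    """Log of the normalized target density f: standard normal."""
    return -0.5 * x**2 - 0.5 * np.log(2 * np.pi)

def log_f_theta(x, theta):
    """Log of f_theta: the Normal(theta, 1) density."""
    return -0.5 * (x - theta)**2 - 0.5 * np.log(2 * np.pi)

def metropolis(n, step=2.0):
    """Random-walk Metropolis chain with equilibrium density f."""
    x = np.empty(n)
    x[0] = 0.0
    for i in range(1, n):
        prop = x[i - 1] + step * rng.normal()
        if np.log(rng.uniform()) < log_f(prop) - log_f(x[i - 1]):
            x[i] = prop
        else:
            x[i] = x[i - 1]
    return x

def mu_hat(x, theta, g=lambda x: x):
    """Importance sampling estimator, Equation 11.1."""
    return np.mean(g(x) * np.exp(log_f_theta(x, theta) - log_f(x)))

def mcse(x, theta, g=lambda x: x, n_batches=50):
    """Batch-means MCSE with g replaced by g * f_theta / f."""
    y = g(x) * np.exp(log_f_theta(x, theta) - log_f(x))
    batch_means = np.array([b.mean() for b in np.array_split(y, n_batches)])
    return batch_means.std(ddof=1) / np.sqrt(n_batches)

x = metropolis(100_000)
for theta in (0.0, 0.5, 1.0):
    # One sample from one distribution f estimates mu(theta) for every theta.
    print(f"theta={theta}: {mu_hat(x, theta):.3f} +/- {mcse(x, theta):.3f}")
```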

We are using here both the principle of “importance sampling” (in using the distribution with density $f$ to learn about the distribution with density $f_\theta$) and the principle of “common random numbers” (in using the same sample to learn about $f_\theta$ for all $\theta$). The principle of common random numbers is very important. It means, for example, that

$$\nabla \hat{\mu}_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} g(X_i)\, \frac{\nabla f_\theta(X_i)}{f(X_i)}$$

is a sensible estimator of

$$\nabla \mu(\theta) = \nabla E_\theta\{g(X)\},$$

which relies on the same sample being used for all $\theta$. Clearly, using different samples for different $\theta$ would not work at all.
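Continuing the illustrative sketch above (again our own toy example, not the source's): for the Normal$(\theta, 1)$ family, $\nabla_\theta f_\theta(x) = (x - \theta) f_\theta(x)$, and because the same chain is reused for every $\theta$, the map $\theta \mapsto \hat{\mu}_n(\theta)$ is smooth, so the gradient estimator agrees with its finite differences.

```python
def grad_mu_hat(x, theta, g=lambda x: x):
    """(1/n) * sum of g(X_i) * grad_theta f_theta(X_i) / f(X_i).
    For the Normal(theta, 1) family, grad_theta f_theta(x) = (x - theta) * f_theta(x)."""
    w = np.exp(log_f_theta(x, theta) - log_f(x))   # f_theta / f at each X_i
    return np.mean(g(x) * (x - theta) * w)

theta, eps = 0.5, 1e-6
fd = (mu_hat(x, theta + eps) - mu_hat(x, theta - eps)) / (2 * eps)
print(grad_mu_hat(x, theta), fd)  # agree to many digits: same sample, smooth in theta
# With an independent chain drawn for each theta, the finite difference would be
# swamped by Monte Carlo noise of order 1/(eps * sqrt(n)).
```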

The argument above relies on $f$ and $f_\theta$ being properly normalized densities. If we replace them with unnormalized densities $h$ and $h_\theta$, we need a slightly different estimator (Geweke, 1989). Now we suppose that we have a Markov chain $X_1, X_2, \ldots$ having unnormalized density $h$ for its equilibrium distribution, and we let $h_\theta$ denote a parametric family of unnormalized densities, each absolutely continuous with respect to $h$. Define the so-called “normalized importance weights”

$$w_\theta(x) = \frac{h_\theta(x)/h(x)}{\sum_{i=1}^{n} h_\theta(X_i)/h(X_i)} \tag{11.3}$$

so

$$\tilde{\mu}_n(\theta) = \sum_{i=1}^{n} g(X_i)\, w_\theta(X_i) \tag{11.4}$$

is a sensible estimator of Equation 11.2 for all $\theta$, because of the following. Define

$$d(\theta) = \int h_\theta(x)\, dx, \qquad d = \int h(x)\, dx,$$

so $h_\theta/d(\theta)$ and $h/d$ are properly normalized probability densities. Then by the law of large numbers,

$$\tilde{\mu}_n(\theta) \xrightarrow{\text{a.s.}} \frac{E\left\{ g(X)\, \dfrac{h_\theta(X)}{h(X)} \right\}}{E\left\{ \dfrac{h_\theta(X)}{h(X)} \right\}} = \frac{\displaystyle\int g(x)\, \frac{h_\theta(x)}{h(x)} \cdot \frac{h(x)}{d}\, dx}{\displaystyle\int \frac{h_\theta(x)}{h(x)} \cdot \frac{h(x)}{d}\, dx} = \frac{\dfrac{d(\theta)}{d} \displaystyle\int g(x)\, \frac{h_\theta(x)}{d(\theta)}\, dx}{\dfrac{d(\theta)}{d} \displaystyle\int \frac{h_\theta(x)}{d(\theta)}\, dx} = E_\theta\{g(X)\}$$

(the requirement that $h_\theta$ be absolutely continuous with respect to $h$ ensures that we divide by zero in the middle expressions only with probability zero, so the value of the integral is not affected). MCSEs for importance sampling are now a little more complicated. The estimator (Equation 11.4) is a ratio of two functionals of the Markov chain

$$\tilde{\mu}_n(\theta) = \frac{\dfrac{1}{n} \displaystyle\sum_{i=1}^{n} g(X_i)\, \frac{h_\theta(X_i)}{h(X_i)}}{\dfrac{1}{n} \displaystyle\sum_{i=1}^{n} \frac{h_\theta(X_i)}{h(X_i)}}.$$
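As a final hedged sketch (again our own toy setting, reusing the chain `x` from the first sketch): take $h(x) = e^{-x^2/2}$ and $h_\theta(x) = e^{-(x-\theta)^2/2}$ as illustrative unnormalized densities. The MCSE below linearizes the ratio (a delta-method argument) and then applies batch means; this particular recipe is our choice of illustration, not one the text prescribes.

```python
def mu_tilde(x, theta, g=lambda x: x):
    """Self-normalized estimator, Equations 11.3 and 11.4, from unnormalized
    densities log h(x) = -x^2/2 and log h_theta(x) = -(x - theta)^2/2."""
    log_w = -0.5 * (x - theta)**2 + 0.5 * x**2   # log(h_theta / h)
    log_w -= log_w.max()                         # stabilize before exponentiating
    w = np.exp(log_w)
    w /= w.sum()                                 # normalized importance weights (11.3)
    return np.sum(g(x) * w)

def mcse_ratio(x, theta, g=lambda x: x, n_batches=50):
    """Delta-method MCSE for the ratio of the two functionals of the chain:
    numerator mean of g * h_theta/h, denominator mean of h_theta/h."""
    w = np.exp(-0.5 * (x - theta)**2 + 0.5 * x**2)
    num, den = g(x) * w, w
    ratio = num.mean() / den.mean()
    z = (num - ratio * den) / den.mean()         # linearization of the ratio
    batch_means = np.array([b.mean() for b in np.array_split(z, n_batches)])
    return batch_means.std(ddof=1) / np.sqrt(n_batches)

for theta in (0.0, 0.5, 1.0):
    print(f"theta={theta}: {mu_tilde(x, theta):.3f} +/- {mcse_ratio(x, theta):.3f}")
```

Computing the weights on the log scale and subtracting the maximum before exponentiating leaves Equation 11.3 unchanged (the shift cancels in the ratio) while avoiding overflow.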