Chapter 1

Introduction

One of the most basic issues in statistical modeling is to set problems up correctly, or at least well. This means, typically, that a sample space needs to be defined together with some distribution on this sample space with some parameters. After that one can decide whether the parameters, or even the form of the distribution, are known and, given the motivation and resources, enter into full-blown statistical inference. Great care needs to be taken with data capture or, to put it more precisely, with experimental design, if the model is to be properly postulated, tested and used for prediction. Some of the questions which need to be addressed in carrying out these operations are intrinsically algebraic, or can be recast as algebraic. By algebra here we will typically mean polynomial algebra.

It may not at first be obvious that polynomials have a fundamental role to play. Here is, perhaps, the simplest example possible. Suppose that two (small enough) people stand together on a bathroom scale. Our model is that the measurement is additive, so that if there is no error, and θ1 and θ2 are the two weights, the reading should be

Y = θ1 + θ2
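The rank deficiency behind this can be checked numerically. The following sketch (a minimal numpy illustration, with a hypothetical design matrix for a single joint weighing) shows that the design has rank 1, so the two weights cannot be separated:

```python
import numpy as np

# One joint weighing of two people: Y = theta1 + theta2.
# The design matrix has a single row [1, 1]; illustrative only.
Z = np.array([[1.0, 1.0]])
print(np.linalg.matrix_rank(Z))  # 1 < 2 parameters: theta1, theta2 not separable

# With an unknown zero correction theta0: Y = theta0 + theta1 + theta2.
Z0 = np.array([[1.0, 1.0, 1.0]])
print(np.linalg.matrix_rank(Z0))  # still rank 1, now with 3 parameters
```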

Without any other information it is not possible to estimate, or compute, the individual weights θ1 and θ2. If there is an unknown zero correction θ0 then Y = θ0 + θ1 + θ2 and we are in worse trouble. In a standard regression model we write in matrix notation

Y = Zθ + ε

and our ability to estimate the parameter vector θ, under standard theory, is equated with “Z is N × p of full rank”, that is, Rank(Z) = p ≤ N, where θ is a p-vector and N is the number of design points. An example is the one-dimensional polynomial regression

Y(x) = ∑_{j=0}^{p−1} θj x^j + εx
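With distinct design points the resulting Z-matrix is of Vandermonde type, and its full rank can be verified numerically. A minimal numpy sketch, with illustrative points for p = 3:

```python
import numpy as np

# Vandermonde-type design matrix Z = [a(i)^j] for p = 3 distinct
# design points (the points below are illustrative).
a = np.array([0.0, 1.0, 2.0])
p = len(a)
Z = np.vander(a, N=p, increasing=True)  # columns: 1, x, x^2
print(np.linalg.matrix_rank(Z))  # 3 = p, so theta is estimable
```

Dropping columns of Z gives the Z-matrix of a submodel, which then also has full (column) rank.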

Then, if the experimental design consists of p distinct points a(1), . . . , a(p),

Z = [ a(i)^j ]_{i=1,...,p; j=0,...,p−1}

has full rank, and for submodels with fewer than p terms the Z-matrix also has full rank.

Algebraic methods have been used extensively in the construction of designs with suitable properties. However, particularly in the construction of balanced factorial designs with particular aliasing properties, abstract algebra in the form of group theory has also been used to study the identifiability problem. Most students and professionals in statistics will recall a course on experimental design in which Abelian group theory is used in the form of confounding relations such as

I = ABC

and unless they are experts in experimental design, they may have remained somewhat mystified thereafter. We return to this example in Section 1.3.

Let us consider a simple example: a heuristic proof that there is a unique quadratic curve r through the points (a(1), y1), (a(2), y2), (a(3), y3), that is,

yi = r(a(i)),  i = 1, 2, 3

We can think of a(1), a(2), a(3) as the points of an experimental design at which we have observed y1, y2, y3, respectively, without error. We also assume that a(1), a(2), a(3) are distinct. Define the polynomial

d(x) = (x − a(1))(x − a(2))(x − a(3))

whose zeros are the design points. Take any competing polynomial p(x) through the data, that is, such that p(a(i)) = yi for i = 1, 2, 3. Write

p(x) = s(x)d(x) + r(x)

where r(x) is the remainder when p(x) is divided by d(x); since d(x) has degree three, the remainder r(x) has degree at most two. Now we can appeal to algebra and say that, given the polynomials p(x) and d(x), the remainder r(x) is unique. But it is clear from the equation that

yi = p(a(i)) = r(a(i)), (i = 1, 2, 3)

since by construction d(a(i)) = 0, i = 1, 2, 3.

The polynomial p above can be interpreted in two ways: (i) as a continuous function with value yi at the point a(i) and (ii) as a representation of the function defined only on the design points, again with value yi at a(i), for i = 1, 2, 3. The first interpretation is very convenient when we do regression analysis, and thus we call p an interpolator. The second is more suited to applications in discrete probability.

Here we have tried to solve an identifiability problem directly, by exhibiting a minimal degree interpolator rather than by checking the rank of a Z-matrix.
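The division argument can also be carried out symbolically. The following sketch (using sympy, with illustrative design points and observations) divides a competing cubic interpolator by d(x) and recovers the unique quadratic remainder that still interpolates the data:

```python
from sympy import symbols, div, Poly, expand

x = symbols("x")

# Design points and observations (illustrative values).
a = [0, 1, 2]
y = [1, 3, 7]

# d(x) vanishes exactly at the design points.
d = expand((x - a[0]) * (x - a[1]) * (x - a[2]))

# A competing interpolator p(x) with p(a(i)) = y_i: here a cubic built
# by adding a multiple of d(x) to a quadratic through the data.
quad = 1 + x + x**2           # takes values 1, 3, 7 at x = 0, 1, 2
p = expand(quad + 5 * d)      # still interpolates, but has degree 3

# The remainder on division by d(x) is the minimal degree interpolator.
s, r = div(p, d, x)
print(r)  # the quadratic x**2 + x + 1

assert all(r.subs(x, ai) == yi for ai, yi in zip(a, y))
assert Poly(r, x).degree() <= 2
```

Whatever cubic multiple of d(x) is added to the interpolator, the remainder is the same quadratic, which is the uniqueness claim of the heuristic proof.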