ABSTRACT

Suppose that by some method we have already picked p variables, where p may be zero, out of k variables available to include in our predictor subset. If the remaining variables contain no further information which is useful for predicting the response variable, then we should certainly not include any more. But how do we know when the remaining variables contain no further information? In general, we do not; we can only apply tests and take gambles based upon the outcome of those tests. The simplest such test known to the author is that of augmenting the set of predictor variables with one or more artificial variables whose values are produced using a random number generator. When the selection procedure first picks one of these artificial variables, the procedure is stopped and we go back to the last subset containing none of the artificial variables.

Let us suppose that we have reached the stage in a selection procedure when there is no useful information remaining (though we would not know this in a real case), and that there are 10 remaining variables plus one artificial variable. A priori the chance that the artificial variable will be selected next is then 1 in 11. Hence it is likely that several useless variables will be added before the artificial variable is chosen. For this method to be useful and cause the procedure to stop at about the right place, we need a large number of artificial variables, say of the same order as the number of real variables. This immediately makes the idea much less attractive; doubling the number of variables increases the amount of computation required by a much larger factor.

In Table 4.1, the RSS’s are shown for the five best-fitting subsets of 2, 3, 4 and 5 variables for the four data sets used in examples in section 3.10. The name ‘CLOUDS’ indicates the cloud-seeding data. The numbers of added variables were:

CLOUDS 5, STEAM 9, DETROIT 11 and POLLUTE 10.
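
The a priori argument for needing many artificial variables can be checked numerically. The following is a minimal sketch, not part of the original analysis. It assumes, as in the 1-in-11 argument, that once no useful information remains, every remaining variable (real or artificial) is equally likely to be picked at each step, so that the order of selection is a random permutation. The function names and the parameters n_useless, n_artificial, n_trials and seed are hypothetical and introduced only for illustration.

```python
import random
from fractions import Fraction


def expected_useless_before_stop(n_useless, n_artificial):
    """Exact expected number of useless real variables selected before
    the first artificial variable, under the equal-chance assumption.

    Each useless variable precedes all n_artificial artificial ones
    with probability 1 / (n_artificial + 1), so the expectation is
    n_useless / (n_artificial + 1).
    """
    return Fraction(n_useless, n_artificial + 1)


def simulate_useless_before_stop(n_useless, n_artificial,
                                 n_trials=20000, seed=0):
    """Monte Carlo estimate of the same quantity.

    Requires n_artificial >= 1. True marks an artificial variable,
    False a useless real variable.
    """
    rng = random.Random(seed)
    labels = [False] * n_useless + [True] * n_artificial
    total = 0
    for _ in range(n_trials):
        rng.shuffle(labels)          # random order of selection
        total += labels.index(True)  # real variables picked before stopping
    return total / n_trials


if __name__ == "__main__":
    for m in (1, 5, 10):
        exact = expected_useless_before_stop(10, m)
        approx = simulate_useless_before_stop(10, m)
        print(m, float(exact), round(approx, 2))
```

Under this assumption, 10 useless variables and one artificial variable give an expected 5 useless variables added before the procedure stops; with 10 artificial variables the expectation falls to 10/11, i.e. less than one. This is consistent with the remark that the number of artificial variables should be of the same order as the number of real variables.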