ABSTRACT

This section provides a brief overview of small area estimation and the ELL method of Elbers et al. (2003). We consider a target variable, denoted by Y, for which we seek estimates for a number of small subpopulations. These subpopulations usually correspond to small geographical areas, but can instead represent different subgroups that may be collocated (in which case the technique is sometimes called small domain estimation). In the original ELL method for poverty measures, Y is log-transformed per capita expenditure. For extensions to the under-nourishment measures, log kilocalorie intake per person or per adult equivalent is used instead. For stunting, underweight and wasting in children, Y is standardized height-for-age, weight-for-age, and weight-for-height respectively.

Provided there are at least some sample data available for each small subpopulation, direct estimates of Y for these subpopulations can be derived from the sample survey data, in which Y has been measured directly on the final-stage sampled units (e.g. households or eligible children). However, because sample sizes within even the sampled subpopulations are typically very small, these direct estimates are generally not reliable. The core idea of small area estimation is that auxiliary information, denoted X, which is available from the survey and may also be available from other sources such as a census, even for unsampled parts of the population, can be used to improve the estimates, giving lower standard errors than are possible using direct estimates alone.

In the ELL method, but not in the small area methods covered in Rao (2003), X represents additional variables that have been measured for the whole population, either by a census or via a GIS database. (For the Rao (2003) methods, X is generally available only on the sampled units but, unlike in ELL, the statistical models used may be nonlinear.) For ELL, a linear regression-type relationship between Y and X, namely

\[ Y = X\beta + u \tag{12.2} \]

is estimated from the survey data only, using both the target variable and the auxiliary variables. (For the Rao (2003) methods, this model can be more general and may be nonlinear, as for example in logistic regression, which instead models proportions.) In (12.2), β represents the regression coefficients determining the effect of the X variables on Y, and u is a random error term that represents the part of Y that cannot be explained using the auxiliary information. If we can assume that this same relationship applies to the whole population, it can be used to predict Y for all units for which we have measured X but not necessarily Y. These predictions contain prediction error that may be substantial at household or child level. Even so, when the predicted Y values are aggregated over the subpopulations of interest, the resulting small area estimates will often have smaller standard errors than the direct estimates, because they are based on much larger samples. The idea is to “borrow strength” from the considerably greater coverage of the census data (which, since it includes the whole population, may be several orders of magnitude larger than the survey sample).
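
The following is a minimal simulation sketch of the fit-on-survey, predict-on-census idea described above, and not the full ELL procedure. It assumes a single auxiliary variable, a linear model of the form Y = Xβ + u with independent errors, and no modelling of any more elaborate error structure. The population sizes, the coefficient values, and all names (`census`, `survey`, `area_mean`, `rmse`) are hypothetical and chosen only for illustration. The true Y is generated for every census unit solely so that the two kinds of estimate can be compared with the truth.

```python
import random
from statistics import mean

random.seed(0)

# Hypothetical setup: 20 small areas, 500 households each, 8 sampled per area.
N_AREAS, N_PER_AREA, N_SAMPLED = 20, 500, 8
BETA0, BETA1, SIGMA = 1.0, 0.5, 1.0  # assumed "true" values for the simulation

# "Census": x is known for every unit. y is simulated here only for checking;
# in practice y is observed for the surveyed units alone.
census = []  # rows of (area, x, y)
for area in range(N_AREAS):
    area_shift = random.gauss(0, 1)  # areas differ in their typical x
    for _ in range(N_PER_AREA):
        x = area_shift + random.gauss(0, 1)
        y = BETA0 + BETA1 * x + random.gauss(0, SIGMA)
        census.append((area, x, y))

# "Survey": a small simple random sample drawn within each area.
survey = []
for area in range(N_AREAS):
    units = [row for row in census if row[0] == area]
    survey.extend(random.sample(units, N_SAMPLED))

# Ordinary least squares fit of y = b0 + b1 * x using the survey data only.
xs = [x for _, x, _ in survey]
ys = [y for _, _, y in survey]
x_bar, y_bar = mean(xs), mean(ys)
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar


def area_mean(rows, area, value):
    """Mean of value(row) over the rows belonging to one area."""
    return mean(value(row) for row in rows if row[0] == area)


areas = range(N_AREAS)
true_means = [area_mean(census, a, lambda r: r[2]) for a in areas]
# Direct estimate: mean of the observed y over the few sampled units.
direct = [area_mean(survey, a, lambda r: r[2]) for a in areas]
# Model-based estimate: mean of the predicted y over all census units.
model_based = [area_mean(census, a, lambda r: b0 + b1 * r[1]) for a in areas]


def rmse(estimates):
    """Root mean squared error of area estimates against the true area means."""
    return (sum((e - t) ** 2 for e, t in zip(estimates, true_means))
            / len(true_means)) ** 0.5


rmse_direct, rmse_model = rmse(direct), rmse(model_based)
print(f"RMSE of direct estimates:      {rmse_direct:.3f}")
print(f"RMSE of model-based estimates: {rmse_model:.3f}")
```

Under these assumptions the model-based area means should sit closer to the truth than the direct ones, because the coefficients are estimated from the pooled survey and the unit-level prediction errors largely average out over each area. The advantage depends on the assumed relationship actually holding across the whole population.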