ABSTRACT

The approach described in this section assumes that if two water samples are similar, and one of them has an unknown value in some variable, there is a high probability that this value is similar to the value of the other sample. In order to use this intuitively appealing method, we need to define the notion of similarity. This notion is usually defined using a metric over the multivariate space of the variables used to describe the observations. Many metrics exist in the literature, but a common choice is the Euclidean distance. This distance can be informally defined as the square root of the sum of the squared differences between the values of any two cases, that is,

$$ d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2} \tag{2.1} $$
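As a small illustration of Equation 2.1, the following sketch computes the distance between two cases in base R. The helper name `euclidean` and the two sample vectors are made up for this example and do not come from the algae data.

```r
# Euclidean distance between two numeric vectors of equal length (Equation 2.1).
euclidean <- function(x, y) {
  stopifnot(length(x) == length(y))
  sqrt(sum((x - y)^2))
}

# Two hypothetical water samples described by three numeric variables.
sample_a <- c(8.0, 9.8, 60.8)
sample_b <- c(8.3, 8.0, 57.8)

d_ab <- euclidean(sample_a, sample_b)

# The same value obtained with base R's dist(), whose default metric is Euclidean.
d_check <- as.numeric(dist(rbind(sample_a, sample_b)))
```

Both `d_ab` and `d_check` should agree, since `dist()` applies the same formula to the rows of the matrix it receives.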

The method we describe below will use this metric to find the ten most similar cases of any water sample with some unknown value in a variable, and then use their values to fill in the unknown. We will consider two ways of using their values. The first simply calculates the median of the values of the ten nearest neighbors to fill in the gaps. In case of unknown nominal variables (which do not occur in our algae dataset), we would use the most frequent value (the mode) among the neighbors. The second method uses a weighted average of the values of the neighbors. The weights decrease as the distance from the neighbors to the case increases. We use a Gaussian kernel function to obtain the weights from the distances. If one of the neighbors is at distance d from the case to fill in, its value will enter the weighted average with a weight given by

$$ w(d) = e^{-d} \tag{2.2} $$
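The two strategies can be compared on a toy example. The sketch below is only illustrative: the neighbor values and distances are invented, and five neighbors are used instead of ten to keep it short.

```r
# Hypothetical values of the target variable on five neighbors, and the
# distance of each neighbor to the case being filled in.
neighbor_values <- c(10, 12, 11, 30, 9)
neighbor_dists  <- c(0.2, 0.5, 0.9, 2.5, 0.4)

# Strategy 1: the median of the neighbors' values.
fill_median <- median(neighbor_values)

# Strategy 2: weighted average with weights w(d) = exp(-d) (Equation 2.2),
# so that closer neighbors contribute more to the result.
w <- exp(-neighbor_dists)
fill_weighted <- weighted.mean(neighbor_values, w)
```

In this example the distant neighbor with value 30 receives the smallest weight, so it pulls the weighted average up less than it would pull a plain mean.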

This idea is implemented in the function knnImputation(), available in the book package. The function uses a variant of the Euclidean distance to find the k nearest neighbors of any case. This variant allows the function to be applied to datasets with both nominal and continuous variables. The distance function used is the following:

$$ d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{p} \delta_i(x_i, y_i)} \tag{2.3} $$

where δi() determines the distance between two values on variable i and is given by

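$$
\delta_i(v_1, v_2) =
\begin{cases}
1 & \text{if } i \text{ is nominal and } v_1 \neq v_2 \\
0 & \text{if } i \text{ is nominal and } v_1 = v_2 \\
(v_1 - v_2)^2 & \text{if } i \text{ is numeric}
\end{cases}
\tag{2.4}
$$

These distances are calculated after normalizing the numeric values, that is,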

$$ y_i = \frac{x_i - \bar{x}}{\sigma_x} \tag{2.5} $$
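To make Equations 2.3 to 2.5 concrete, here is a minimal base R sketch of the mixed distance. It is not the code of knnImputation(); the helper names `delta` and `mixed_dist` and the toy data frame are assumptions made for this illustration.

```r
# delta_i from Equation 2.4: 0/1 mismatch for nominal values,
# squared difference for numeric values.
delta <- function(v1, v2, nominal) {
  if (nominal) as.numeric(v1 != v2) else (v1 - v2)^2
}

# Distance from Equation 2.3 between rows i and j of a data frame whose
# numeric columns have already been normalized.
mixed_dist <- function(df, i, j) {
  terms <- vapply(seq_along(df), function(col) {
    v <- df[[col]]
    delta(v[i], v[j], nominal = is.factor(v))
  }, numeric(1))
  sqrt(sum(terms))
}

# Hypothetical toy data: one nominal and two numeric variables.
toy <- data.frame(
  season = factor(c("winter", "spring", "winter")),
  mxPH   = c(8.0, 8.4, 7.6),
  Cl     = c(60, 55, 40)
)

# Normalize the numeric columns (Equation 2.5): subtract the mean and
# divide by the standard deviation.
num_cols <- vapply(toy, is.numeric, logical(1))
toy_norm <- toy
toy_norm[num_cols] <- lapply(toy[num_cols], function(x) (x - mean(x)) / sd(x))

d12 <- mixed_dist(toy_norm, 1, 2)  # seasons differ, so the nominal term is 1
d13 <- mixed_dist(toy_norm, 1, 3)  # same season, so the nominal term is 0
```

Normalizing first keeps a variable measured on a large scale (here `Cl`) from dominating the sum, and it puts the numeric terms on a scale comparable to the 0/1 contribution of the nominal variables.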

Let us now see how to use the knnImputation() function:
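A minimal sketch of such a call follows. It assumes that the book package is installed as DMwR2 (the successor of DMwR) and that the data frame is the algae dataset shipped with it; the result name `algae_filled` is chosen only for this example.

```r
# Assumption: the book package is available as DMwR2. The call is skipped
# when the package is not installed.
if (requireNamespace("DMwR2", quietly = TRUE)) {
  data(algae, package = "DMwR2")
  algae <- as.data.frame(algae)

  # Fill in every unknown using the 10 nearest neighbors. With the default
  # method, the neighbors' values are combined by a distance-weighted average.
  algae_filled <- DMwR2::knnImputation(algae, k = 10)
}
```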

If you prefer to fill in the unknowns with the median of the neighbors' values, you could use the call
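A sketch of the median variant, under the same assumptions (the book package installed as DMwR2, the algae dataset it ships, and an illustrative result name `algae_median`):

```r
# Assumption: the book package is available as DMwR2. The call is skipped
# when the package is not installed.
if (requireNamespace("DMwR2", quietly = TRUE)) {
  data(algae, package = "DMwR2")
  algae <- as.data.frame(algae)

  # meth = "median" fills each unknown with the median of the values of the
  # 10 nearest neighbors instead of their weighted average.
  algae_median <- DMwR2::knnImputation(algae, k = 10, meth = "median")
}
```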

In summary, after these simple instructions we have a data frame free of NA values, and we are better prepared to take full advantage of several R functions.