ABSTRACT

Perhaps the most intuitive way to group observations into clusters is to compare each data point with every other and put the data points that are closest together into the same cluster. For example, if the observations are 5.1, 5.3, 7.8, 7.9, 6.3, we probably want to put 5.1 and 5.3, being only 0.2 apart, in the same cluster, and 7.8 and 7.9 in the same cluster, as they are only 0.1 apart. We then have to decide what to do with 6.3: We could decide that it belongs in its own cluster or put it in one of the other two clusters. If we decide to put it in one of the other clusters, we have to decide which one to put it in: It’s 1.0 and 1.2 away from the observations in the first cluster and 1.5 and 1.6 away from the observations in the second cluster. This simple example illustrates the major decisions we have to make in a distance-based clustering strategy. And, of course, we’re not actually going to do clustering by looking at the individual observations and thinking about it: We are going to decide on a set of rules or “algorithm” and have a computer automatically perform the procedure.
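The arithmetic in this example can be checked with a few lines of code. The following is only a minimal sketch, assuming Python and the absolute difference as the distance between two one-dimensional observations; the function names (pairwise_distances, distances_to_group) are illustrative and do not come from the text. It does not perform any clustering: it only computes the distances on which the decisions described above would be based.

```python
from itertools import combinations

# The five observations from the example.
observations = [5.1, 5.3, 7.8, 7.9, 6.3]


def pairwise_distances(values):
    """Map every unordered pair (a, b) to |a - b|.

    Distances are rounded to one decimal place to hide
    floating-point noise (an assumption suited to this toy data).
    """
    return {(a, b): round(abs(a - b), 1) for a, b in combinations(values, 2)}


def distances_to_group(point, group):
    """Distances from one observation to each member of a tentative cluster."""
    return [round(abs(point - member), 1) for member in group]


distances = pairwise_distances(observations)

# The single closest pair of observations.
closest_pair = min(distances, key=distances.get)

# How far is 6.3 from each of the two tentative clusters?
to_first = distances_to_group(6.3, [5.1, 5.3])
to_second = distances_to_group(6.3, [7.8, 7.9])

print("closest pair:", closest_pair, "at distance", distances[closest_pair])
print("6.3 to first cluster: ", to_first)
print("6.3 to second cluster:", to_second)
```

Even in this sketch, the open questions remain: the code reports that 6.3 is nearer to the first cluster, but whether it should join that cluster or form its own is a rule we still have to choose.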