ABSTRACT

Clustering is the task of partitioning a given set of objects so that similar objects belong to the same group. This general idea is extremely useful in unsupervised learning in which little to no prior knowledge about the data is available. Clustering is usually the first data-analysis technique employed when analyzing big data. The importance of clustering and its applications may be found in a number of books [1] on the subject, and we avoid the detailed discussion here. Instead, we focus on formulating clustering and analyzing some mathematically well-defined problems. The first issue that we need to address when formulating the clustering problem is how is the data represented? In case it is possible to study and measure the defining properties and attributes of the objects that need to be clustered, then individual objects may be represented as points in some Euclidean space ℝ d , where the dimension d denotes the number of attributes that define objects. In this scenario, the similarity or dissimilarity between objects may be defined as a function of the Euclidean distance between the points corresponding to the objects. In many situations, representing data as points in Euclidean space may not be possible, and all that is known is how similar or dissimilar a pair of objects is. As the goal of clustering is to group similar objects in the same group, this pairwise similarity/dissimilarity information suffices. In such cases, what is given is a distance function D: 𝒳 × 𝒳 → ℝ+, where 𝒳 denotes the set of objects. In most practical scenarios, (𝒳, D) is a metric. Any metric (𝒳, D) satisfies the following three properties: (i) ∀x ∈ 𝒳, D(x, x) = 0, (ii) ∀x, y ∈ 𝒳, D(x, y) = D(y, x) (symmetry), and (iii) ∀x, y, z ∈ 𝒳, D(x, z) ≤ D(x, y) + D(y, z) (triangle inequality). The Euclidean setting may be regarded as a special case of the metric setting where 𝒳 = ℝ d and where the distance D(x, y) between two points x, y ∈ 𝒳 is the Euclidean distance (or squared Euclidean distance).