ABSTRACT

Cluster analysis is used to classify observations into a small, finite number of groups based upon two or more variables (Finch, 2005). The term cluster analysis was first used in 1939 by Tryon (Tryon, 1939). ‘Numerical taxonomy’ is another term used for cluster analysis in some areas of biology (Romesburg, 2004). Unlike other statistical analyses, cluster analysis involves no a priori hypothesis. In cluster analysis the variables are arranged in a natural system of groups (Kirkwood, 1989). The heterogeneous data collected are sorted into a series of sets. Data in a cluster are considered to be ‘similar’ or highly correlated to each other. Clusters can be exclusive (a particular variable is included in only one cluster) or overlapping (a particular variable is included in more than one cluster). Cluster analysis is used in a variety of research problems (Hartigan, 1975; Scoltock, 1982; Moore et al., 2010). It is applied extensively in the fields of toxicogenomics (Hamadeh et al., 2002), genetics (Shannon et al., 2003; Makretsov et al., 2004) and molecular biology (Furlan et al., 2011). Cluster analysis only discovers structures in data; it does not explain why such structures exist. Cluster analysis can be carried out using several methods. Three commonly used methods are described below:

Hierarchical cluster analysis

As the name indicates, hierarchical cluster analysis produces a hierarchy of clusters. The clusters thus produced are presented graphically, and this graphical output is known as a dendrogram (from Greek dendron ‘tree’, gramma ‘drawing’). The dendrogram can be used to examine how clusters are formed in hierarchical cluster analysis (Schonlau, 2002). Hierarchical clustering can be of two types. One type is agglomerative clustering, which proceeds from small clusters to large ones: each observation starts as its own cluster, and clusters are merged step by step. The other type is divisive clustering, which proceeds from large clusters to small ones: all observations start in a single cluster, which is split step by step. For illustrative purposes, a dendrogram is given in Figure 13.1.
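
The agglomerative procedure can be illustrated with a minimal sketch in plain Python. It is not a method taken from the cited works: the toy observations, the function names, the use of Euclidean distance and the choice of single linkage (the distance between two clusters is the distance between their closest members) are all assumptions made for illustration. The sketch records each merge together with the distance at which it occurs, which corresponds to the height of the matching branch in a dendrogram.

```python
from itertools import combinations


def euclidean(a, b):
    """Straight-line distance between two observations."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def single_linkage(c1, c2, points):
    """Cluster distance under single linkage (assumed here): closest pair of members."""
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)


def agglomerate(points):
    """Repeatedly merge the two closest clusters; return the merge history."""
    # Start with one cluster per observation, each holding a single index.
    clusters = [(i,) for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: single_linkage(pair[0], pair[1], points),
        )
        height = single_linkage(a, b, points)
        # Replace the two clusters by their union.
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
        history.append((a, b, height))
    return history


# Hypothetical toy data: two tight pairs and one distant observation.
observations = [(1.0, 1.0), (1.0, 2.0), (5.0, 5.0), (5.0, 6.5), (10.0, 10.0)]
merges = agglomerate(observations)
for left, right, height in merges:
    print(f"merge {left} + {right} at height {height:.2f}")
```

Reading the printed merges from first to last traces the dendrogram from its leaves to its root: the two tight pairs join first, then the pairs join each other, and the distant observation is attached last. A divisive procedure would run in the opposite direction, starting from the single all-inclusive cluster. In practice a statistical library would normally be used instead of this quadratic-search sketch.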