ABSTRACT

Clustering divides or partitions objects into groups, called clusters, based on their similarity or dissimilarity to one another. Clustering consists of four steps: relevant feature selection, algorithm design, cluster validation, and visualization and evaluation. Clustering and cluster analysis are important techniques for bioinformatics experimentation and discovery. Distance–based clustering is used to measure similarity or dissimilarity in terms of the distance between data points in the same cluster or in different clusters. Mahalanobis distance is based on the correlation between variables, which helps classify future data as belonging to a specific class. Simple matching distance methods can simplify nominal features by combining feature groups. k–Means clustering partitions n instances into k clusters by assigning each data point to the partition with the nearest centroid. The k–means algorithm works only for numerical data; its variant, the k–modes algorithm, extends it to categorical data. The biggest advantage of hierarchical clustering methods is that they help researchers visualize and represent genes more conveniently.
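As an illustration of the k–means step described above (assign each point to the nearest centroid, then recompute the centroids), the following is a minimal sketch, not the implementation used in this work. It assumes Euclidean distance, hand-picked initial centroids, and a small toy data set; the function names `k_means` and `simple_matching_distance` are hypothetical. The simple matching distance is included because it is the kind of dissimilarity measure that applies to nominal features, where a numerical mean is not defined.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two numeric vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def simple_matching_distance(a, b):
    """Fraction of positions at which two nominal feature vectors differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def k_means(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving.

    `centroids` holds the initial centroids (an assumption of this sketch;
    real implementations usually choose them randomly or with k-means++).
    """
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centroids.append(mean)
            else:
                # Keep the old centroid if a cluster ends up empty.
                new_centroids.append(old)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Toy data (assumed for illustration): two well-separated groups in 2-D.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.0), (8.0, 10.0)]
labels, centroids = k_means(points, centroids=[points[0], points[3]])
print(labels)

# Nominal example: two short sequences differing at one of four positions.
print(simple_matching_distance(("A", "C", "G", "T"), ("A", "C", "T", "T")))
```

On this toy input the first three points fall into one cluster and the last three into the other. The mean in the update step is only meaningful for numerical data. For categorical data, a k–modes style variant would replace it with a per-feature mode and use a matching-based dissimilarity instead of Euclidean distance.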