Validation and Benchmarking | 9 | Data Mining for Bioinformatics

ABSTRACT

This chapter explains the model selection and evaluation techniques used for classification models and describes cluster evaluation techniques. A wide range of performance evaluation techniques is available in data mining. To generate accurate estimates of generalization error, various validation strategies can be used in tandem with data mining. The validation techniques are motivated by two factors: model selection and performance estimation. Models are affected by an imbalance in the number of samples in each class, and bioinformatics is plagued by imbalanced datasets. The holdout method is considered the simplest form of performance estimation; it partitions the data into two disjoint sets: a train set and a test set. In the three-way split, model selection and performance estimates are computed at the same time. k-fold cross-validation is the most prominently used performance estimation technique in data mining and bioinformatics applications. Random subsampling is also referred to as Monte Carlo cross-validation (MCCV).
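
The partitioning schemes named above can be illustrated with a minimal sketch in Python. This is not the chapter's own code: the function names (`holdout_split`, `kfold_splits`, `monte_carlo_splits`), the default test fraction of 0.3, and the choice to shuffle before splitting are all assumptions made for illustration. The sketch only produces index partitions and does not train or score a model.

```python
import random


def holdout_split(n_samples, test_fraction=0.3, seed=0):
    """Holdout: partition sample indices into two disjoint sets (train, test).

    test_fraction and seed are illustrative defaults, not values from the chapter.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


def kfold_splits(n_samples, k=5, seed=0):
    """k-fold cross-validation: each fold serves as the test set exactly once."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def monte_carlo_splits(n_samples, n_repeats=10, test_fraction=0.3, seed=0):
    """Random subsampling (MCCV): repeat the holdout split with fresh shuffles.

    Unlike k-fold, test sets from different repeats may overlap.
    """
    return [
        holdout_split(n_samples, test_fraction, seed + r)
        for r in range(n_repeats)
    ]
```

In practice, each (train, test) pair would be used to fit a model on the train indices and score it on the test indices, with the scores averaged over folds or repeats. For imbalanced classes, a stratified variant that preserves class proportions in every partition is usually preferable; that refinement is omitted here for brevity.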