ABSTRACT

Recent years have seen rapid growth in the amount of biological information available in open databases, including DNA (deoxyribonucleic acid) and protein sequences. This growth has been accompanied by increased attention to computational procedures that automatically classify large volumes of sequence data into groups according to their structure, their role in the chromosomes, and/or their function. Widely used sequence classification procedures model sequences in a form to which traditional machine learning techniques, such as neural networks and support vector machines, can readily be applied. Conventional data analysis methods, however, often fail to handle very large amounts of data efficiently. In this context, data mining tools can be applied to extract knowledge from large data sets. The collection of biological data such as DNA and protein sequences continues to increase rapidly, driven by advances in current technologies and by newer techniques such as microarrays. Consequently, data mining methods are applied to extract significant information from this massive volume of biological sequence data. One significant research area is the classification of protein sequences into classes, subclasses, or families. This chapter provides comprehensive coverage of the concepts and applications of data mining for biological sequences. It reviews related work on biological applications of data mining, covering both fundamental concepts and innovative methods. It also offers insights into biological data mining and suggests areas for future research.