ABSTRACT

Genomic data mining research has made considerable progress in recent decades. Protein function identification is a key step in deriving knowledge from the mining of genomes. This task was originally approached with sequence similarity and alignment-based methodology. Machine learning-based, alignment-free algorithms were then found to be effective alternatives. Through the concerted efforts of different researchers in rigorous algorithm development, these methods were refined, yielding improved prediction accuracies. Recent advances in next-generation sequencing (NGS) methodologies have paved the way for rapid sequencing of genomes. Reliable and fast annotation of these genomes requires big data handling methods and algorithms. More sophisticated tools, including different deep learning-based algorithms, are now routinely employed. This review discusses the different steps involved in rational, robust, and successful data mining. More specifically, the focus is on annotation of protein functions, with special reference to human proteins. The chapter reviews various domain features and descriptors, both sequence and structural, the selection of informative features, and the use of different high-performance classifiers and algorithms. Finally, it provides illustrations and tables for ready reference and use by practicing computational biologists and bioinformaticians.