ABSTRACT

As massively parallel sequencing, proteomics, and mass spectrometry have increased our ability to acquire information about nucleic acid sequences and proteins, the biological diversity we observe has become immense. With whole-genome sequencing now more widely adopted, the variety of the biome can be explored as never before. The inability to propagate most living organisms in vitro is no longer an obstacle to their discovery [1]. It is estimated that only 10% of the genomic sequences discovered are identical or very closely related to known sequences, an additional 20% can be assigned to a particular family, and more than 70% bear no relation to any previously known organism [1-7]. At the current state of the art, multiple protein sequence alignments from closely related species are mere prerequisites for further analysis based on motif, domain, multidomain, or three-dimensional (3D) relatedness [8-10]. This creates a need for new algorithms that can reliably uncover true coding sequences, infer the possible structure and function of newly discovered proteins, evaluate evolutionary trends [11-14], recognize proteins with beneficial medicinal properties, and, ultimately, predict and track emerging pathogens.