ABSTRACT

With the advent of high-throughput deoxyribonucleic acid (DNA) sequencing techniques, a score of prokaryotic and eukaryotic genomes have recently been sequenced to completion (e.g., Escherichia coli [5], yeast [10], human [26], Arabidopsis [25]). This large amount of sequence data can be searched in silico, i.e., on the computer, using a number of bioinformatics tools developed to infer the many open reading frames, or protein-coding regions, that occur in each of these genomes. Eukaryotic genes are more difficult to predict because of their split nature, in which coding (exon) and noncoding (intron) regions of the gene are interspersed. Gene-prediction algorithms are either generic, drawing on all the information available about genes and the proteins they encode across various organisms, or organism-specific, based on sequence information available only for that organism. In either case, such computer-based gene predictions should still be regarded as tentative. Ideally, molecular biologists should pursue further experimental confirmation of gene function.