ABSTRACT

With the advent of high-throughput DNA sequencing techniques, a score of prokaryotic and eukaryotic genomes have recently been sequenced to completion (e.g., Escherichia coli,1 yeast,2 human,3 and Arabidopsis4). This large amount of sequence data can be searched in silico, that is, on the computer, using a number of bioinformatics tools developed to infer the many open reading frames, or protein-coding regions, that occur in each of these genomes. Eukaryotic genes are more difficult to predict because of their split nature, in which coding (exon) and noncoding (intron) regions of the gene are interspersed. Gene prediction algorithms are either generic, based on all the information available about genes and the proteins they encode from various organisms, or organism specific, based on sequence information available only for that organism. In either case, such computer-based gene predictions should be regarded as tentative, and experimental confirmation of gene function should ideally be obtained.