ABSTRACT

We have developed a system that uses neural networks and dynamic programming (DP) to identify protein-coding regions in genomic DNA sequences. Nine scores are calculated on all subintervals of the sequence; each evaluates the likelihood that the subinterval belongs to one of four classes: first exon, internal exon, last exon, or intron. These scores are weighted by a neural network and used as input to a DP algorithm, which finds the highest-scoring combination of introns and exons subject to a few simple constraints on gene structure. The neural network weights are optimized by training on input vectors that measure the difference between the optimal solution predicted by DP and the biologically correct solution; training maximizes the difference in score between the correct parse and a sample of incorrect parses. On a test set of genomic sequences from GenBank, we obtained correlation coefficients for exon nucleotide prediction as high as 0.94. This is superior to the results obtained by purely rule-based systems.
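
The DP step can be illustrated with a minimal sketch. It is not the system's actual implementation. It assumes a hypothetical function score(cls, i, j) that returns the already-weighted score for labelling the half-open interval [i, j) with class cls, and it assumes one simple grammar as a stand-in for the gene structure constraints: first exon, intron, zero or more (internal exon, intron) pairs, last exon. The names best_parse, score, and the class labels are illustrative only.

```python
from typing import Callable, List, Optional, Tuple

NEG_INF = float("-inf")
Segment = Tuple[str, int, int]  # (class label, start, end), end exclusive


def best_parse(n: int, score: Callable[[str, int, int], float]) -> Tuple[float, List[Segment]]:
    """Return (total score, segments) of the highest-scoring parse of [0, n).

    Assumed grammar (an illustration, not the paper's exact constraints):
    first exon, intron, (internal exon, intron)*, last exon.
    """
    # exon[j] / intron[j]: best score of a partial parse of [0, j) whose
    # final segment is an exon / an intron ending at position j.
    exon = [NEG_INF] * (n + 1)
    intron = [NEG_INF] * (n + 1)
    exon_back: List[Optional[Tuple[str, int]]] = [None] * (n + 1)
    intron_back: List[Optional[int]] = [None] * (n + 1)

    for j in range(1, n + 1):
        # The whole prefix [0, j) may be the first exon.
        exon[j], exon_back[j] = score("first", 0, j), ("first", 0)
        for i in range(1, j):
            # Internal exon [i, j) following an intron that ends at i.
            if intron[i] > NEG_INF:
                s = intron[i] + score("internal", i, j)
                if s > exon[j]:
                    exon[j], exon_back[j] = s, ("internal", i)
            # Intron [i, j) following an exon that ends at i.
            s = exon[i] + score("intron", i, j)
            if s > intron[j]:
                intron[j], intron_back[j] = s, i

    # Close the parse with a last exon [i, n) after an intron ending at i.
    best, best_i = NEG_INF, None
    for i in range(2, n):
        if intron[i] > NEG_INF:
            s = intron[i] + score("last", i, n)
            if s > best:
                best, best_i = s, i
    if best_i is None:
        raise ValueError("sequence too short for the assumed grammar")

    # Trace back from the last exon to the first exon.
    segments: List[Segment] = [("last", best_i, n)]
    pos, in_intron = best_i, True
    while True:
        if in_intron:
            i = intron_back[pos]
            segments.append(("intron", i, pos))
        else:
            label, i = exon_back[pos]
            segments.append((label, i, pos))
            if label == "first":
                break
        pos, in_intron = i, not in_intron
    segments.reverse()
    return best, segments
```

The sketch runs in O(n^2) time given constant-time score lookups, which is consistent with scoring all subintervals in advance. In the trained system, score would be the neural network's weighted combination of the nine interval scores; here it is left as a caller-supplied function.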