ABSTRACT

A sensitive technique for protein sequence motif recognition based on neural networks has been developed by Frishman and Argos, and by Vogt et al. It involves three major steps. (i) At each alignment position of a set of N matched sequences, a set of N aligned oligopeptides is specified with a preselected window length. N neural networks are then trained in succession, each on N − 1 amino acid spans after eliminating the ith oligopeptide, and each network is tested for recognition of the ith span that was withheld from it. The average neural network recognition over the N such trials is used as a measure of conservation for the particular windowed region of the multiple alignment. This process is repeated for all possible spans of the given length in the multiple alignment. (ii) The M most conserved regions, delineated by significance thresholds, are regarded as motifs, and the oligopeptides within each are used to train M individual neural networks extensively. (iii) The M networks are then applied in a search for related primary structures in a large databank of known protein sequences. For each of the M networks, the oligopeptide spans in the database sequence with the strongest neural net output are saved and then scored according to the output signals and to the combination that follows the expected N- to C-terminal sequence order. The motifs found in the database search with the highest similarity scores can then be used to retrain the M neural nets, which can subsequently be used for further searches in the databank, thus providing even greater sensitivity in recognizing distantly related family members. This technique was successfully applied to the integrase, DNA-polymerase and immunoglobulin families.
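The leave-one-out conservation measure of step (i) can be illustrated with a minimal sketch. It makes several assumptions that are not from the abstract: the alignment is a list of equal-length strings without gap handling, all function names are hypothetical, and the neural network is replaced by a simple per-position residue frequency profile so that the example stays self-contained. Only the windowing and the train-on-N − 1, test-on-the-ith loop mirror the procedure described above.

```python
from collections import Counter


def windows(alignment, length):
    """Yield (start, spans) for every window of the given length."""
    width = len(alignment[0])
    for start in range(width - length + 1):
        yield start, [seq[start:start + length] for seq in alignment]


def train_profile(spans):
    """Stand-in for network training (assumption): residue frequencies per position."""
    n = len(spans)
    return [
        {res: count / n for res, count in Counter(column).items()}
        for column in zip(*spans)
    ]


def recognize(profile, span):
    """Stand-in for the network output (assumption): mean residue frequency in [0, 1]."""
    return sum(col.get(res, 0.0) for col, res in zip(profile, span)) / len(span)


def loo_conservation(spans):
    """Average recognition of each span by a model trained on the other N - 1 spans."""
    scores = []
    for i, held_out in enumerate(spans):
        model = train_profile(spans[:i] + spans[i + 1:])
        scores.append(recognize(model, held_out))
    return sum(scores) / len(scores)


def conservation_profile(alignment, length):
    """Conservation score for every window position of the alignment."""
    return [(start, loo_conservation(spans))
            for start, spans in windows(alignment, length)]


# Toy alignment (invented for illustration): conserved start, variable end.
alignment = ["ACDKLM", "ACDRST", "ACDEFG", "ACDHIV"]
profile = conservation_profile(alignment, 3)
```

In this toy alignment the first window is identical in all four sequences and the last window shares no residues, so they receive the highest and lowest scores respectively. Selecting the M best-scoring windows by a threshold would correspond to step (ii); the threshold itself is not specified in the abstract.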