ABSTRACT

The Sequence-Structure Gap is Rapidly Increasing. Currently, databases for protein sequences (e.g., SWISS-PROT/TrEMBL* (14)) are expanding rapidly, largely due to large-scale genome sequencing projects: At

* Abbreviations used: 3D, three-dimensional; 3D structure, three-dimensional (coordinates of protein structure); 1D, one-dimensional; 1D structure, one-dimensional (e.g., sequence or string of secondary structure); ASP, method identifying regions of structure ambivalent in response to global changes (1); DSSP, database containing the secondary structure and solvent accessibility for proteins of known 3D structure; HMMSTR, hidden Markov model-based prediction of secondary structure (2); HSSP, database of protein structure-sequence alignments; rmsd, root mean square deviation; JPred, method combining other prediction methods (3, 4); JPred2, divergent profile (PSI-BLAST) based neural network prediction (5); MaxHom, dynamic programming algorithm for conservation weight-based multiple sequence alignment; PDB, Protein Data Bank of experimentally determined 3D structures of proteins (6); PHD, pairwise profile-based neural network prediction of secondary structure; PHDpsi, divergent profile (PSI-BLAST) based neural network prediction (7, 8); PROF, divergent profile-based neural network prediction trained and tested with PSI-BLAST (9); PSI-BLAST, gapped and iterative specific profile-based, fast and accurate alignment method (10); PSIPRED, divergent profile (PSI-BLAST) based neural network prediction (11); SAM-T99sec, neural network prediction using hidden Markov models as input (12); SSpro, profile-based advanced neural network prediction method (13); SWISS-PROT, database of protein sequences (14); U, protein sequence of unknown 3D structure (e.g., protein to be predicted).