ABSTRACT

The first complete protein sequence determined was that of bovine insulin, sequenced by Frederick Sanger and colleagues in the early 1950s. About 10 years later, more than 100 protein sequences had been published, and the first protein sequence database, the Atlas of Protein Sequence and Structure, was created by Margaret Dayhoff, who is also credited as a founder of the field of bioinformatics. However, the atlas contained very few uncharacterized proteins and was mainly used to investigate sequence diversity between homologous proteins (such as globins) from diverse organisms. After the introduction of rapid DNA-sequencing methods in the mid-1970s, more and more protein sequences were predicted by translating sequenced DNA (or cDNA), and thus the number of uncharacterized protein sequences began to increase. With the completion of several large-scale genome-sequencing projects, a large amount of data concerning the number and distribution of proteins has become available. However, several issues should be considered when predicting the protein-coding regions in DNA sequences. First, predicting the correct start and stop codons of a gene, as well as its splicing pattern, may be exceedingly difficult. Second, after a putative open reading frame (ORF) has been predicted and translated into a protein sequence, how do we know that this particular ORF is expressed when there is no experimental evidence? By searching databases, it may be possible to identify other proteins with similar sequences that have been demonstrated to be expressed; because a conserved ORF is likely to be expressed, such a match indicates that the DNA does code for a protein. Third, transcripts of some genes can be alternatively spliced or edited, generating variant mRNAs and leading to the biosynthesis of different polypeptides from a single gene.
Finally, sequence databases often contain raw data derived directly from experiments and various sequencing projects, so deduced sequences may be incorrect because of frameshifted fragments, sequences from pseudogenes, and sequencing errors. Some databases are highly curated, that is, entries in the database are analyzed and verified by human experts.
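
To make the notion of a putative ORF concrete, the following is a minimal sketch of a naive six-frame ORF scan in Python. It assumes an intron-free sequence and the standard genetic code, so it deliberately ignores the splicing and start-codon problems described above; the names `find_orfs`, `reverse_complement`, and `min_codons` are hypothetical and chosen only for this illustration. An ORF reported by such a scan is a candidate only, and says nothing about whether it is expressed.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(dna, min_codons=3):
    """Return (strand, frame, start, protein) for each ATG...stop ORF.

    Assumptions (illustrative only): no introns, ATG as the only start
    codon, and ORFs lacking an in-frame stop codon are not reported.
    For the "-" strand, start is a position on the reverse complement.
    """
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        # Unknown codons (e.g. containing N) become "X".
                        protein = "".join(
                            CODON_TABLE.get(seq[j:j + 3], "X")
                            for j in range(start, i, 3)
                        )
                        orfs.append((strand, frame, start, protein))
                    start = None  # resume the search after the stop codon
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGGCTTGGAAATAAGG"):
        print(orf)
```

In practice, the translated candidates would then be compared against a protein database to look for conserved, experimentally supported homologs, as outlined above.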