ABSTRACT

Gene annotation databases are widely used as public repositories of biological knowledge. Understanding the results of almost any molecular biology experiment involves consulting such annotation databases. Our current knowledge is spread out over a number of databases (DBs), such as Entrez Gene [294], UniProt [29], Protein Data Bank [47], RefSeq [344], RGD, SGD, WormBase, and Gene Ontology (GO) [25], to name just a few. Many such databases focus on specific types of biological entities. For instance, UniProt focuses on proteins, Entrez Gene focuses on genes, EPD focuses on eukaryotic promoters, etc. Other databases aim to provide a wider angle but focus on specific organisms. Examples include RGD for rat, SGD for yeast, WormBase for C. elegans, etc. Obtaining a complete understanding of an experiment usually requires combining information from several such annotation databases. Unique key identifiers (IDs) in the internal structure of each such database represent biological entities such as genes, proteins, and mRNAs. Design and implementation restrictions specific to each database ensure that, within each database, the data are consistent, coherent, and non-redundant. However, most of these annotation databases have been developed by independent groups, which have used completely different designs and completely different sets of key identifiers for the same biological entities. Because of this, the ensemble of such annotation databases, which is the current repository of all our biological knowledge, is inconsistent, incoherent, and highly redundant.