ABSTRACT
Since 1951, when the protein sequence of insulin became available
[1], the amount of available sequence information has been growing
at an ever-increasing rate, with more than 10^8 sequences known
today [2]. Information about protein structures and about the
biochemical properties of proteins is increasing as well, albeit at a
slower rate, because it requires greater experimental effort than
automated sequencing methods. This progress leads to an
ever-growing wealth of information and should also increase our
understanding of proteins in equal measure. However, this is often
not the case. Scientists today are flooded with information about
protein sequences, structures, biochemical properties, interactions,
purification schemes, and many other kinds of biological data. Yet a
widely accepted data model to store, assess, and systematically
classify these data is still lacking. Protein sequences are available
from data repositories such as GenBank or UniProt [3], but the
sequence entries often lack annotation, provide information that
contradicts the content of other entries, and contain only a limited
number of links to other types of information. In particular, the naming
of members of protein families with established naming schemes, such
as the metallo-β-lactamases (MBLs), is sometimes ambiguous. Thus,
large amounts of data are isolated and lack links to other, critically
important pieces of information. While the number of needles is
increasing, the haystack is also getting bigger and bigger.