ABSTRACT

Since 1951, when the protein sequence of insulin became available

[1], the amount of available sequence information has been growing

at an ever-increasing rate, with more than 10^8 sequences known

today [2]. Information about protein structures as well as about

biochemical properties of proteins is increasing as well, albeit at a

slower rate, because obtaining it requires more experimental effort

than automated sequencing does. This progress leads to an

ever-growing wealth of information and should also increase our

understanding of proteins in equal measure. However, this is often

not the case. Scientists today are flooded with information about

protein sequences, structures, biochemical properties, interactions,

purification schemes, and many other kinds of biological data. Yet a

widely accepted data model to store, assess, and systematically

classify these data is still lacking. Protein sequences are available

from data repositories such as GenBank or UniProt [3], but the

sequence entries often lack annotation, provide information that

contradicts the content of other entries, and contain only a limited

number of links to other types of information. In particular, the naming

of members of protein families with established naming schemes, such

as the metallo-β-lactamases (MBLs), is sometimes ambiguous. Thus,

large amounts of data are isolated and lack links to other, critically

important pieces of information. While the number of needles is

increasing, the haystack is also getting bigger and bigger.