ABSTRACT

Comprehensive protein annotation and function assignment are the first steps in target selection and validation for drug discovery. Recent advances in genome sequencing and high-throughput structural genomics have resulted in an explosive growth in the number of protein sequences and structures without an assigned function. Approximately one-third of protein-coding genes in newly sequenced prokaryotic genomes, and even larger numbers in eukaryotic genomes, lack functional assignment. There are extreme cases like the genome of Plasmodium falciparum, where the function of approximately 60% of predicted proteins is unknown [1]. This situation is not limited to protein sequences but has recently expanded to include protein structures: up to 60% of protein structures deposited in the Protein Data Bank (PDB) [2] by some structural genomics centers do not have any function assignment. That vast and constantly growing repository of sequences and structures is a rich potential source for the identification of new drug-discovery targets.

Protein function has multiple definitions. To a cell biologist, function might refer to the network of interactions in which the protein participates or to its localization to a particular cellular compartment. To a biochemist, function refers to the metabolic process in which a protein is involved or to the reaction catalyzed by an enzyme. Developmental biologists or physiologists might include temporal patterns of expression or tissue specificity in their definition of protein function. From a drug-discovery point of view, function assignment means elucidation of biochemical function, although additional levels of annotation can be used as qualifiers to evaluate the prospective usefulness of a potential drug-discovery target. This level of function assignment is usually called the molecular function. Biochemical and/or molecular function can be deduced in many cases from any combination of sequence, structure, and contextual information.
In some cases, further levels of protein function, such as cellular location, interacting partners, and participation in regulatory networks or metabolic pathways, can also be assigned.

Function assignment can be achieved experimentally in the laboratory or computationally, and generally there is strong feedback between the experimental and in silico research components in drug-discovery efforts. Computational findings can help laboratory biologists and chemists direct experimental design, and subsequent experimental findings can suggest new courses of action for computational biologists. For the computational biologist, the terms function annotation and function assignment are often used interchangeably, and the distinction between them is blurred. Preferably, the term function assignment should be reserved for the attribution of an enzymatic function or a gene ontology term. For the individual pieces of evidence that point to a certain biological function, it is more appropriate to use the term function annotation. Frequently, a protein's function is too complex to capture in a single sentence. For instance, tubulin, a component of microtubules and a target for anticancer drugs like Taxol or Vinca alkaloids [3], is not only a structural protein but also a GTP-hydrolyzing enzyme. Its structural role is not limited to being part of the cytoskeleton but includes roles in intracellular protein and organelle traffic, protein and organelle scaffolding, formation of the mitotic spindle during cell division, and motile systems.

Primary sequence determines protein structure, and in turn protein structure determines protein function. Of the elements of this first paradigm, which is central to protein function assignment, function is the only one that cannot be addressed computationally. Therefore, the inference of molecular function from sequence or structure is one of the ultimate goals for postgenomic bioinformatics. The second paradigm in the field of protein function assignment is that similar sequences or similar structures have similar function; hence, function assignment can be performed by transferring the annotation of an experimentally characterized protein to the protein being annotated (fig. 3.1). The second paradigm is not an absolute truth. Enzyme Commission numbers classify structurally similar enzymes as functionally dissimilar and vice versa, and very divergent functions are possible in proteins with high levels of sequence conservation. Structural databases such as SCOP [4] or CATH [5] group functionally dissimilar proteins into structurally similar groups.
Thus, one should carefully evaluate transference of function based on sequence or structural similarity, always taking into account that the computational results are simply models that require experimental confirmation.

Target protein annotation is not limited to de novo function assignment. There is a sizeable degree of contamination in public databases due to function assignments being erroneously transferred to newly annotated proteins. For mission-critical tasks, every annotation should be deemed suspect. Consequently, correction of assigned function, known as reannotation, plays a supplementary yet crucial role in computational function annotation and assignment. In addition, protein annotation can be used to add value to existing assignments. Proteins with known functions can be re-examined to discover new functions that could lead to novel approaches to modulate an already validated druggable target. For instance, through in silico methods it might be possible to identify new ligand-binding or protein-protein interaction sites in a target protein for which pharmacological value is already established.

The challenges of high-throughput automated function assignment in genome annotation projects and the challenges faced by annotators in target validation projects are different. The main goal of genome-scale annotation is to provide function assignments for all proteins in a genome in an acceptable timescale. There is a trade-off between the depth and quality of the annotations and the time devoted to that process. Conversely, annotators evaluating the possible value of a target for drug discovery have accuracy as their first priority. Accuracy demands human curation and examination of the problem from many possible angles to minimize the chances of erroneous assignment. Genome-scale annotation has a certain built-in margin of error, which means that, from a drug-discovery point of view, every public annotation is technically questionable.
Often, the confirmation of the function assignment is a fairly routine process, but in other cases that confirmation requires considerable effort. Despite the need for manual curation of mission-critical targets, high-throughput automated protein annotation methods are still extremely useful in drug discovery as target preprocessing and prioritization tools. In that role, high-throughput automated methods can reduce the number of potential targets from thousands or tens of thousands to a manageable number that can in turn be re-evaluated and validated by expert human curators.
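
As an illustration of the preprocessing role described above, the following is a minimal sketch, in Python, of similarity-based annotation transfer used as a triage step. It is not a method taken from this chapter: the `Hit` record, the function names, and the identity and coverage thresholds are all hypothetical and chosen only for illustration. The sketch treats every transferred function as provisional, in keeping with the point that computational results are models that require confirmation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Hit:
    """One similarity hit between a query protein and a characterized protein.

    Hypothetical record for illustration; real search tools report more fields.
    """
    query_id: str
    subject_id: str
    percent_identity: float          # 0 to 100
    query_coverage: float            # fraction of the query aligned, 0 to 1
    subject_function: Optional[str]  # experimentally supported annotation, if any


def best_transferable_hit(
    hits: List[Hit],
    min_identity: float = 40.0,   # illustrative threshold, not from the source
    min_coverage: float = 0.7,    # illustrative threshold, not from the source
) -> Optional[Hit]:
    """Return the strongest hit that passes both thresholds, or None."""
    eligible = [
        h for h in hits
        if h.subject_function
        and h.percent_identity >= min_identity
        and h.query_coverage >= min_coverage
    ]
    # Rank by identity first, then by coverage; None when nothing qualifies.
    return max(
        eligible,
        key=lambda h: (h.percent_identity, h.query_coverage),
        default=None,
    )


def triage(
    hits_by_query: Dict[str, List[Hit]],
    min_identity: float = 40.0,
    min_coverage: float = 0.7,
) -> Tuple[Dict[str, str], List[str]]:
    """Split queries into provisional assignments and unassigned proteins.

    Provisional assignments are candidates for expert review, not final calls.
    """
    provisional: Dict[str, str] = {}
    unassigned: List[str] = []
    for query_id, hits in hits_by_query.items():
        best = best_transferable_hit(hits, min_identity, min_coverage)
        if best is None:
            unassigned.append(query_id)
        else:
            provisional[query_id] = best.subject_function
    return provisional, unassigned
```

A high-identity hit that covers only a small part of the query is rejected here because it may reflect a single shared domain rather than a shared overall function. Even hits that pass both thresholds can be misleading, since divergent functions occur in proteins with high sequence conservation, so the output is best read as a shortlist for human curators.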