ABSTRACT

This chapter describes the intricacies of handling the databases most widely used in bioinformatics. The evolutionary nature of biological data gives it unique characteristics: it is highly heterogeneous, large in volume, dynamic, hierarchical, and not standardized, and it lacks adequate database management applications, data access tools, and support for data integration and annotation. The chapter categorizes biological databases into two groups, systems point solution databases and general solution databases. For gene expression data, the raw data are obtained in the form of microarray chip images, a product of the microarray experiment. The Protein Data Bank (PDB) is one of the largest repositories of known protein structures. The inherently large number of dimensions, which gives rise to the so-called curse of dimensionality, has ubiquitous effects throughout the sciences, and specifically in bioinformatics. In multisource integration, the problems faced derive from the problems of each independent source. Data cleaning that uses domain knowledge for duplicate record identification and de-duplication is a necessary component of data preprocessing.