ABSTRACT

Modern technology, in which sophisticated instruments are coupled with pervasive use of computers, has made molecular biology a science where the sheer volume of data to be gathered and analyzed poses serious computational problems. Very large data sets are ubiquitous in computational molecular biology: the European Molecular Biology Laboratory (EMBL) nucleotide sequence database has nearly doubled in size every year for the past 10 years, and the archive currently comprises over 1.7 billion records covering almost 1.7 trillion base pairs of sequence. Similarly, the Protein Data Bank (PDB) has seen exponential growth, with over 50,000 protein structures deposited to date (each of which is a large data set in itself). An assembly task may require reconstructing a long genomic sequence from hundreds of thousands of short (100-1000 bp) DNA fragments. Microarray experiments produce information about the expression of hundreds of thousands of genes in hundreds of individuals at once (data sets on the order of gigabytes), and the list goes on.