ABSTRACT

At the dawn of computational biology in the 1960s, datasets were small. Protein sequences were first distributed in the printed Dayhoff atlases [29] and later on CD-ROM, with bioinformaticians eyeballing entire datasets and shuffling data by hand. By the 1990s, bioinformaticians were using spreadsheet programs and scientific software packages to analyze increasingly large datasets that included several phage and bacterial genomes. In 2003, the pregenomic era ended with the online publication of the human genome [7,14,26], and the National Institutes of Health invested heavily in sequencing related organisms to aid in annotation. By the mid-2000s, Sanger sequencing was replaced by faster and cheaper next-generation sequencing technologies, resulting in an explosion of data, with bioinformaticians racing to develop automated and scalable computational tools to analyze and mine it [3].