ABSTRACT

Large-scale sequence analysis is a complex task that involves the integration of results from numerous computational tools. For high-throughput data analysis, these tools must be tied together in a coordinated system that can automate the execution of a set of analyses in sequence or in parallel. To this end, a diverse array of software systems for biological sequence analysis has emerged in recent years. For example, the Ensembl pipeline [1] automates the annotation of several eukaryotic genomes, Mungall et al. [2] have created a robust pipeline for annotation and analysis of the Drosophila genome, GenDB [3] is used as an annotation system for several prokaryotic genomes, and Yuan et al. [4] have published resources for annotating the rice and other plant genomes. These pipelines are extensive in scope, well designed, and meet their objectives. In surveying these and other systems, we have identified three critical areas that are essential for building on the design of existing biological sequence analysis pipelines:

• There is a need for flexible architecture so that one software system can be used to analyse different data sets that may require different analysis tools.