Sequence Searching on Supercomputers

doi:10.4324/9780429501463-9

ABSTRACT

Supercomputers allow the biologist to ask, and to answer, questions that would otherwise be impractical or impossible. An example of current relevance is searching the GenBank database with an entire HIV genome. The use of supercomputers in sequence similarity searching at Los Alamos is not quite like working on similar machines at, for example, an NSF-sponsored supercomputer center. The chapter discusses experiences with two programs: SEQF, a CFT implementation of a Wilbur and Lipman-type search code (ref. 5), and FASTA. A number of optimizations have been made, and further ones are identified. The guiding principle of these optimizations is to minimize the number of calls to system routines for memory allocation and I/O handling. The chapter also discusses the steps involved in taking an existing similarity code and improving its performance. Additionally, buffering results in memory, which eliminates the character- and line-at-a-time nature of the I/O operations, would improve performance.
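The buffering idea can be illustrated with a minimal sketch. It is written in Python for brevity and is not taken from SEQF or FASTA; the names `CountingSink`, `report_line_at_a_time`, `report_buffered`, and the sample hit list are all hypothetical. The sink stands in for a system I/O routine and simply counts how many times it is called, so the two reporting strategies can be compared.

```python
class CountingSink:
    """Hypothetical stand-in for a system-level write routine.

    It records every call so the number of I/O requests can be compared.
    """

    def __init__(self):
        self.calls = 0
        self.data = bytearray()

    def write(self, chunk):
        self.calls += 1
        self.data += chunk
        return len(chunk)


def report_line_at_a_time(hits, sink):
    """Issue one write per result line (the pattern to avoid)."""
    for name, score in hits:
        sink.write(f"{name}\t{score}\n".encode("ascii"))


def report_buffered(hits, sink):
    """Assemble all result lines in memory, then issue a single write."""
    lines = [f"{name}\t{score}\n" for name, score in hits]
    sink.write("".join(lines).encode("ascii"))


# Made-up search results: (sequence name, similarity score).
hits = [(f"SEQ{i:04d}", 100 - i) for i in range(50)]

unbuffered = CountingSink()
report_line_at_a_time(hits, unbuffered)

buffered = CountingSink()
report_buffered(hits, buffered)

print(unbuffered.calls, buffered.calls)
```

Both strategies produce identical output bytes; only the number of calls to the underlying write routine differs. In a real code the buffer would usually be bounded and flushed when full rather than grown to hold every result.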