ABSTRACT

In Chapter 2 we learned how to search databases with text queries. All of those searches relied on exact matches; that is, we expected to find the exact accession number or exactly spelled words. This chapter introduces a much harder database-searching problem. How do you find matches when your query is not a short accession number or a text term, but a DNA sequence that is 500 nucleotides long? In addition to finding all the exact matches, can you find sequences with mismatches that are clearly related to the query but not 100% identical? For hits that are not exact matches, can statistics be calculated to help evaluate which hits are significant and which should be ignored? On top of these challenges, can a search of a database that contains millions of sequences return results in a reasonable time? These and other questions will be answered here. BLAST, a computer program that is one of the most commonly used tools in bioinformatics, will be introduced in this chapter. The next three chapters will explore further uses of BLAST.