Motif Finding | 16 | Biological Sequence Analysis Using the SeqAn C++

ABSTRACT

Motif finding means to find matching substrings of d ≥ 2 given sequences a1, . . . , ad, where matching can either be meant to be exact, i.e., the matching substrings must be exactly the same, or approximate, i.e., differences between the substrings are allowed. The matching substrings are called motifs. In the following, we will first concentrate on the pairwise motif finding problem, that is finding motifs in d = 2 sequences. In contrast to pattern matching (see Chapter 9), where we search for a complete needle sequence in a haystack sequence, (pairwise) motif addresses the problem of finding parts of the needle within the haystack. In most cases, we are only interested in motifs that fulfill certain criteria of quality, for example a minimal length, a minimum alignment score or – in the case of approximate motif finding – a maximum mismatch count. SeqAn offers algorithms for solving various kinds of motif finding problems which are spread over several modules of the library:

• Local Alignments: Algorithms for solving the global alignment problem (see Section 8.5), that is to find an optimal alignment between two complete sequences, can be adapted for motif finding, that is to find optimal alignments between substrings of two sequences. This is called local alignment, and we describe it in Section 10.1. The motifs found by this search are local alignments stored in alignment data structures; see Section 8.2.