ABSTRACT

The problem of identifying meaningful patterns (i.e., motifs) from biological data has been studied extensively due to its paramount importance. This problem in general requires finding short patterns of interest from voluminous data. Several variants of this motif search problem have been identified in the literature. In this chapter we survey some of the algorithms that have been proposed for these problems. We are interested in the following three versions:

Problem 1. Input are n sequences of length m each. Input also are two integers l and d. The problem is to find a motif (i.e., a sequence) M of length l. It is given that each input sequence contains a variant of M . The variants of interest are sequences that are at a hamming distance of d from M . (This problem is also known as the planted (l, d)-motif search problem.)

Problem 2. The input is a database DB of sequences S1, S2, . . . , Sn. Input also are integers l, d, and q. Output should be all the patterns in DB such that each pattern is of length l and it occurs in at least q of the n sequences. A pattern U is considered an occurrence of another pattern V as long as the edit distance between U and V is at most d. We refer to this problem as the edited motif search problem.