ABSTRACT

Institut Gaspard-Monge, University of Marne-la-Vallée, Marne-la-Vallée, France

Marie-France Sagot

INRIA, Laboratoire de Biométrie et Biologie Évolutive, University Claude Bernard, Lyon, France

1. MOTIFS IN SEQUENCES

Conserved patterns of any kind are of great interest in biology because they are likely to represent objects on which strong constraints are acting and which may therefore perform a biological function. Among the objects that may model biological entities, we shall consider only strings in this chapter. As is by now well known, biological sequences, whether DNA, RNA, or proteins, may be represented as strings over an alphabet of four letters (DNA/RNA) or 20 letters (proteins). Some of the basic problems encountered in classical text analysis have their counterpart when the texts are biological sequences; among them is pattern matching. However, this problem comes with a twist once we are in the realm of biology: exact patterns hardly make sense in this case. By exact, we mean identical, and there are, in fact, at least two types of "nonidentical" patterns one must consider in biology. One comes from looking at what "hides" behind each letter of the DNA/RNA or protein alphabet, and the other corresponds to the more familiar notion of "errors." The errors concern mutational events that may affect a molecule during DNA replication. Those of interest to us are point mutations, that is, mutations that each operate on a single letter of a biological sequence: substitution, insertion, or deletion. For some problems, considering substitutions only is sufficient.
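
To make the notion of "errors" concrete, the following Python sketch counts point mutations between two strings. It is only an illustration under stated assumptions: the function names (`hamming_distance`, `edit_distance`, `occurrences_with_substitutions`) and the sample sequences are ours and do not come from the chapter, and the naive scan is not one of the algorithms discussed later. The substitution-only case is modeled by the Hamming distance, and the case allowing substitutions, insertions, and deletions by the classical edit (Levenshtein) distance computed with dynamic programming.

```python
from typing import List


def hamming_distance(u: str, v: str) -> int:
    """Number of substitutions separating two strings of equal length."""
    if len(u) != len(v):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(a != b for a, b in zip(u, v))


def edit_distance(u: str, v: str) -> int:
    """Minimum number of substitutions, insertions, and deletions turning u into v."""
    # previous[j] holds the distance between the current prefix of u and v[:j].
    previous = list(range(len(v) + 1))
    for i, a in enumerate(u, start=1):
        current = [i]  # turning u[:i] into the empty string costs i deletions
        for j, b in enumerate(v, start=1):
            current.append(min(
                previous[j] + 1,             # delete a from u
                current[j - 1] + 1,          # insert b into u
                previous[j - 1] + (a != b),  # substitute a by b (free if equal)
            ))
        previous = current
    return previous[-1]


def occurrences_with_substitutions(pattern: str, text: str, max_subs: int) -> List[int]:
    """Start positions where pattern occurs in text with at most max_subs substitutions.

    Naive scan over every window of the text; for illustration only.
    """
    m = len(pattern)
    return [
        i for i in range(len(text) - m + 1)
        if hamming_distance(pattern, text[i:i + m]) <= max_subs
    ]
```

With these hypothetical helpers, `hamming_distance("ACGT", "AGGT")` is 1 (one substitution), while `edit_distance("ACGT", "AGT")` is also 1 but corresponds to a deletion, which the Hamming distance cannot express because the two strings differ in length.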