Pattern Matching in Strings

doi:10.1201/9781584888239-17

ABSTRACT

The present chapter describes a few standard algorithms used for processing texts. They apply, for example, to themanipulation of texts (text editors), to the storage of textual data (text compression), and to data retrieval systems. The algorithms of the chapter are interesting in diﬀerent respects. First, they are basic components used in the implementations of practical software. Second, they introduce programmingmethods that serve as paradigms in other ﬁelds of computer science (system or software design). Third, they play an important role in theoretical computer science by providing challenging problems. Although data are stored variously, text remains the main form of exchanging information. This

is particularly evident in literature or linguistics where data are composed of huge corpora and dictionaries. This applies as well to computer science where a large amount of data are stored in linear ﬁles. And this is also the case in molecular biology where biological molecules can often be approximated as sequences of nucleotides or amino-acids. Moreover, the quantity of available data in these ﬁelds tends to double every 18months. This is the reason why algorithms should be eﬃcient even if the speed of computers increases regularly. The manipulation of texts involves several problems among which are pattern matching, approx-

imate pattern matching, comparing strings, and text compression. The ﬁrst problem is partially treated in the present chapter, in that we consider only one-dimensional objects. Extensions of the methods to higher dimensional objects and solutions to the second problem appear in the Chapter 15. The third problem includes the comparison of molecular sequences, and is developed in the corresponding chapter. Finally, an entire chapter is devoted to text compression. Pattern matching is the problem of locating a collection of objects (the pattern) inside raw text.

This is the opposite of the database approach in which texts are structured in ﬁelds which themselves