General Pattern Matching

doi:10.1201/9781584888239-19

ABSTRACT

This chapter reviews combinatorial and algorithmic problems related to searching and matching of strings and slightly more complicated structures such as arrays and trees. These problems arise in a vast variety of applications and in connection with various facets of storage, transmission, and retrieval of information. A list would include the design of structures for the eﬃcient management of large repositories of strings, arrays and special types of graphs, fundamental primitives such as the various variants of exact and approximate searching, speciﬁc applications such as the identiﬁcation of periodicities and other regularities, eﬃcient implementations of ancillary functions such as compression and encoding of elementary discrete objects, etc. The main objective of studies in pattern matching is to abstract and identify some primitive manipulations, develop new techniques and eﬃcient algorithms to perform them, both by serial and parallel or distributed computation, and implement the new algorithms. Some initial pattern matching problems and techniques arose in the early 1970s in connection

with emerging technologies and problems of the time, e.g., compiler design. Since then, the range of applications of the tools and methods developed in pattern matching has expanded to include text, image and signal processing, speech analysis and recognition, data compression, computational biology, computational chemistry, computer vision, information retrieval, symbolic computation,

and

computational learning, computer security, graph theory and VLSI, etc. In little more than two decades, an initially sparse set of more or less unrelated results has grown into a considerable body of knowledge. A complete bibliography of string algorithms would contain more than 500 articles. Aho [2] references over 140 papers in his recent survey of string-searching algorithms alone; advanced workshops and schools, books and special issues in major journals have already been dedicated to the subject and more are planned for the future. The interested reader will ﬁnd a few reference books and conference proceedings in the bibliography of this chapter. While each application domain presents peculiarities of its own, a number of pattern matching

primitives are shared, in nearly identical forms, within wide spectra of distant and diverse areas. For instance, searching for identical or similar substrings in strings is of paramount interest to software development and maintenance, philology or plagiarism detection in the humanities, inference of common ancestries in molecular genetics, comparison of geological evolutions, stereo matching for robot vision, etc. Checking the equivalence (i.e., identity up to a rotation) of circular strings ﬁnds use in determining the homology of organisms with circular genomes, comparing closed curves in computervision, establishing theequivalenceofpolygons incomputergraphics, etc. Finding repeated patterns, symmetries, and cadences in strings is of interest to data compression, detectionof recurrent events in symbolic dynamics, genome studies, intrusion detection in distributed computer systems, etc. Similar considerations hold for higher structures. In general, an intermediate objective of studies in pattern matching is to understand and characterize combinatorial structures and properties that are susceptible of exploitation in computational matching and searching on discrete elementary structures. Most pattern matching issues are still subject to extensive investigation both within serial and

parallel models of computation. This survey concentrates on sequential algorithms, but the reader is encouraged to explore for himself the rich repertoire of parallel algorithms developed in recent years. Most of these algorithms bear very little resemblance to their serial counterparts. Similar considerations apply to some algorithms formulated in a probabilistic setting. Pattern matching problems may be classiﬁed according to a number of paradigms. One way is

based on the type of structure (strings, arrays, trees, etc.) in terms of which they are posed. Another, is according to the model of computation used, e.g., RAM, PRAM, Turing machine. Yet another one is according to whether the manipulations that one seeks to optimize need be performed online, oﬀ-line, in real time, etc. One could distinguish further betweenmatching and searching and, within the latter, between exact and approximate searches, or vice versa. The classiﬁcation used here is thus somewhat arbitrary. We assume some familiarity of the reader with exact string searching, both onand oﬀ-line, which is covered in a separate chapter of this volume.We start by reviewing some basic variants of string searching where occurrences of the pattern need not be exact. Next, we review algorithms for string comparisons. Then, we consider pattern matching on two-dimensional arrays and ﬁnally on rooted trees.