ABSTRACT

String set matching is an important operation in computational biology. For example, when proteomics data is used for genome annotation in a process called proteogenomic mapping [1-5], a set of peptide identifi cations obtained using mass spectrometry is matched against a target genome translated in all six reading frames. Given a large number of peptides and long translated genome strings, the fundamental problem here is to effi ciently search a large text corpus (i.e., the translated genome) to identify the locations of individual strings that belong to a large set of patterns (i.e., the set of peptides).