Chapter 9

Pattern Matching and Text Mining

Learning goal: You can find patterns in sequences and natural language texts.

9.1 IN THIS CHAPTER YOU WILL LEARN

• How to express a consensus sequence using a regular expression

• How to use Python regular expression tools to search for substrings in strings

• How to search for the occurrence of a functional motif in a protein sequence

• How to search for a given word or a set of words in a text (e.g., scientific abstracts)

• How to identify motifs in nucleotide sequences (e.g., transcription factor or miRNA binding sites)

9.2.1 Problem Description

A sequence functional motif is defined as a short amino acid or nucleotide sequence containing one or more residues involved in a function. Phosphorylation sites, mannosylation sites, recognition motifs, glycosylation sites, transcription factor binding sites, etc. are examples of functional motifs. Sequence functional motifs can be represented using a special notation called regular expressions. A regular expression, sometimes called a regexp, is a string composed of characters and metacharacters that together represent a set of strings. In other words, if you want to represent several strings with a single expression, you need rules and symbols that carry "meta" meanings such as wildcards, repeated characters, and logical groups. An example often used in biology is the character N in DNA sequences, which stands for any of the four bases. The sequence AGNNT could be AGAAT, AGCTT, AGGGT, or any of the other 13 possible combinations. Regular expressions work in a similar way but use a richer set of special characters.
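The AGNNT example can be tried directly with Python's standard `re` module. The sketch below is only an illustration: the helper name `consensus_to_regex` and the list of candidate sequences are assumptions made for this example, and only the wildcard N is handled (other ambiguity codes are not).

```python
import re

def consensus_to_regex(consensus):
    """Translate a DNA consensus string with N wildcards into a regex pattern.

    Hypothetical helper for illustration: each N becomes the character
    class [ACGT]; every other character is matched literally.
    """
    return "".join(
        "[ACGT]" if base == "N" else re.escape(base)
        for base in consensus
    )

pattern = re.compile(consensus_to_regex("AGNNT"))

# Made-up candidate sequences: the first three fit AGNNT, the last two do not.
candidates = ["AGAAT", "AGCTT", "AGGGT", "AGT", "TGAAT"]

# fullmatch() requires the whole string to fit the pattern.
matches = [seq for seq in candidates if pattern.fullmatch(seq)]

print(pattern.pattern)  # AG[ACGT][ACGT]T
print(matches)          # ['AGAAT', 'AGCTT', 'AGGGT']
```

Here the character class `[ACGT]` plays the role of N: it matches exactly one character from the listed set. The same pattern could be written more compactly as `AG[ACGT]{2}T`, where `{2}` is a repetition metacharacter.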