ABSTRACT
Learning goal: You can nd patterns in sequences and natural lan-guage texts. 9.1 IN THIS CHAPTER YOU WILL LEARN
• How to express a consensus sequence using a regular expression
• How to use Python regular expression tools to search substrings in strings
• How to search for the occurrence of a functional motif in a protein sequence
• How to search a given word or a set of words in a text (e.g., scientic abstracts)
• How to identify motifs in nucleotide sequences (e.g., transcription factor or miRNA binding sites)
9.2.1 Problem Description
A sequence functional motif is dened as a short amino acid or nucleotide sequence containing one or more residues involved in a function. Phosphorylation sites, mannosylation sites, recognition motifs,
glycosylation sites, transcription binding sites, etc. are examples of functional motifs. Sequence functional motifs can be represented using a special notation called regular expressions. A regular expression, sometimes called regexp, is a string syntax composed of characters and metacharacters that represent sets of strings. In other words, if you want to represent several strings with a single expression, it is necessary to introduce rules and symbols that allow “meta” meanings like wildcards, repeated characters, and logical groups. An example oen used in biology is the character N in DNA sequences. e sequence AGNNT could be AGAAT, AGCTT, AGGGT, or one of many other alternatives. Regular expressions work in a similar way but use a more complex set of special characters.