ABSTRACT

Sequence analysis is the main source of information for most newly identified genes. Significant sequence similarity among proteins may imply that the proteins share the same secondary and tertiary structure and have closely related biological functions. This chapter discusses the basics of sequence comparison, scoring schemes, and the statistics of sequence alignments; the statistics are essential for distinguishing true relationships among proteins from chance similarities. The statistical significance of a similarity score for “real” sequences is estimated by computing the probability that a score at least as high could have been obtained for random sequences. Low-complexity sequences, also known as “simple sequences”, are abundant in proteins. These compositionally biased sequences are frequent in structural proteins such as collagens and cell-wall proteins. Low-complexity sequences pose a problem for sequence homology searches: because of their repetitive nature, they often produce high-scoring similarities that are biologically meaningless.
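
The idea of judging a score against random sequences can be illustrated with a minimal sketch. It is not necessarily the method developed in the chapter: it assumes a simple Smith-Waterman local alignment with illustrative match, mismatch, and gap values, and it uses composition-preserving shuffling as one possible model of "random" sequences. The function names `local_alignment_score` and `shuffle_p_value` are hypothetical.

```python
import random


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (linear gap penalty).

    The scoring values are illustrative assumptions, not a real
    substitution matrix such as BLOSUM62.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def shuffle_p_value(a, b, trials=200, seed=0):
    """Empirical probability that a shuffled copy of b scores at least
    as high against a as the real b does.

    Shuffling keeps the residue composition of b but destroys its order.
    """
    rng = random.Random(seed)
    observed = local_alignment_score(a, b)
    letters = list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)
        if local_alignment_score(a, "".join(letters)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (trials + 1)


if __name__ == "__main__":
    print(shuffle_p_value("HEAGAWGHEE", "PAWHEAE"))
    # A homopolymer looks the same after any shuffle, so every shuffled
    # copy matches the observed score and the estimate is 1.0.
    print(shuffle_p_value("AAAAAAAA", "AAAAAAAA", trials=20))
```

In this sketch a low-complexity pair such as two poly-A runs receives a high raw score, yet the shuffle test attributes that score entirely to composition. This is one way to see why such sequences generate high-scoring but uninformative matches.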