ABSTRACT

A good deal has been learned about macromolecular sequence data since the first sequences were determined. Because evolution has preserved the essential features of these molecules, a significant sequence similarity between two macromolecules suggests a related function or origin. Statistical questions arise naturally in this setting: the scientist wants to find biologically significant relationships between sequences, and although statistical significance is neither necessary nor sufficient for biological significance, it is a good indicator. If a database contains 50,000 sequences, we must have an automatic way to reject all but the most interesting results from a search of those sequences. Simulation is too time-consuming for large numbers of comparisons across varying sequence lengths and compositions, but it can be used for specific comparisons of interest. A blend of theory and simulation, however, gives a solution for the logarithmic region, even for general scoring schemes.