ABSTRACT

An elementary measure of sequence dissimilarity, d², is described. The computer algorithm used for its evaluation is discussed in detail. The potential sensitivity of the measure is demonstrated by comparison of sequences with randomly changed letters. The biologic efficacy of d² is demonstrated by using the measure to detect the members of the Alu repetitive sequence family in sequenced primate DNA, and also to detect the members of the IS1 repetitive sequence family in sequenced bacterial DNA. We explore a natural weighting scheme for each word’s contribution to d² and show its utility for finding Alu sequences.