ABSTRACT

The overwhelming and exhausting avalanche of complete genome sequences has led to a concomitant flood of information concerning complete proteomes. In principle, knowing the nucleotide sequence of a whole genome enables the location of all of its genes and therefore the sequences of all the proteins that these genes encode. Fortuitous identities are necessarily more numerous with nucleotide sequences, which use an alphabet of only four letters against twenty for proteins; hence there is a loss in selectivity. Although the raw data came initially from genomic nucleotide sequences, we consider only protein sequence comparisons. The most rigorous algorithm for aligning a pair of sequences was provided by Needleman and Wunsch; it guarantees that the alignment score obtained is the largest possible score for the two sequences under the chosen scoring scheme. The way of building the clusters in the “Systers” databank is simple and ingenious. Syntenic regions shared by different chromosomes mean that the corresponding chromosomal segments contain the same homologous genes.
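
The optimality guarantee of the Needleman and Wunsch algorithm can be illustrated with a minimal sketch of global alignment by dynamic programming. The function name needleman_wunsch and the scoring values (match +1, mismatch -1, linear gap penalty -2) are assumptions chosen for illustration only; real protein comparisons normally use a substitution matrix and more elaborate gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of sequences a and b.

    Returns (score, aligned_a, aligned_b). The scoring parameters are
    illustrative assumptions, not values taken from the original text.
    """
    n, m = len(a), len(b)

    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap  # b[:j] aligned against gaps only

    def sub(i, j):
        # Substitution score for pairing a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the matrix: each cell is the best of the three possible moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

With these assumed parameters, aligning "ACGT" with "AGT" yields a score of 1 (three matches and one gap) and the alignment ACGT over A-GT. Because every cell takes the maximum over all three possible moves, no other global alignment of the two sequences can score higher under the same scoring scheme.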