ABSTRACT

By the mid-1990s, the extraordinary advances in DNA cloning, amplification and sequencing had made feasible the mapping and sequencing of whole genomes. The exponential growth of data led to extensive and progressively well-annotated genome databases, and suites of computational tools for gene prediction, homolog identification and gene structural and expression analyses. For the first time the full complement of DNA sequence information in bacteria, simple eukaryotes, fungi, plants and animals began to be revealed, enabling comparative genomics to interrogate evolutionary relationships and functional indices at increasingly high resolution. Prokaryote genomes were confirmed to be dominated by protein-coding genes, with phenotypic diversity achieved primarily by proteomic variation. On the other hand, animals differing by orders of magnitude in developmental complexity were found to have a similar number and repertoire of protein-coding genes (only about 20,000 in both nematodes and mammals), the ‘G-value enigma’. By contrast, increased developmental complexity correlated with the extent of intronic and intergenic non-protein-coding DNA, indicating that phenotypic radiation and developmental sophistication in multicellular organisms are achieved mainly by regulatory expansion. Mysterious non-protein-coding ‘ultraconserved’ sequences and large numbers of ‘pseudogenes’ were found in mammalian genomes. Different classes of transposable element and retroviral-derived sequences were characterized in plant and animal genomes and shown to be major drivers of phenotypic innovation.