ABSTRACT

Sanger sequencing was considered the gold standard for de novo genome assembly. However, assembling a genome with this first-generation technology is prohibitively expensive and time-consuming: the human genome draft assembly took $3 billion and 13 years to generate. The demand for low-cost, fast genome sequencing provided the impetus for the development of next-generation sequencing (NGS) technologies. The dramatically reduced cost of NGS makes whole-genome shotgun sequencing much more affordable and accessible to individual labs. De novo genome assembly from the enormous number of relatively short reads generated by most NGS platforms, however, poses serious challenges to assembly algorithms that were designed for Sanger sequences. Because NGS reads are short, each read carries less information, which leads to more uncertainty in the assembly process. To compensate, higher coverage is needed, which significantly increases the number of reads and therefore the computational complexity. For example, using Sanger sequences with lengths up to 800 bp, assembling the human genome used approximately 8× coverage; for NGS reads of 35 to 100 bp, the same task needs 50× to 100× coverage [244].
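
To make the coverage figures above concrete, the following minimal sketch estimates how many reads each scenario implies. It assumes the usual definition of coverage, c = N × L / G (N reads of length L over a genome of size G), and a round human genome size of 3 Gb; neither the formula nor the genome size is stated in the text, and the function name `reads_needed` is purely illustrative. Because Sanger reads are "up to" 800 bp, the Sanger count is a lower bound.

```python
import math

# Assumed round figure for the human genome size (not given in the text).
GENOME_SIZE_BP = 3_000_000_000


def reads_needed(genome_size_bp: int, read_length_bp: int, coverage: int) -> int:
    """Smallest read count N such that N * L / G >= coverage.

    Assumes every read has the same length and that coverage is the
    simple average depth c = N * L / G.
    """
    return math.ceil(coverage * genome_size_bp / read_length_bp)


# Sanger: reads up to 800 bp at about 8x coverage.
sanger = reads_needed(GENOME_SIZE_BP, 800, 8)

# NGS: the two ends of the range quoted in the text.
ngs_low = reads_needed(GENOME_SIZE_BP, 100, 50)    # 100 bp reads, 50x
ngs_high = reads_needed(GENOME_SIZE_BP, 35, 100)   # 35 bp reads, 100x

print(f"Sanger reads:     {sanger:,}")
print(f"NGS reads (low):  {ngs_low:,}")
print(f"NGS reads (high): {ngs_high:,}")
print(f"Increase over Sanger: {ngs_low / sanger:.0f}x to {ngs_high / sanger:.0f}x")
```

Under these assumptions, about 30 million Sanger reads correspond to roughly 1.5 to 8.6 billion NGS reads, which illustrates why the input size, and with it the computational load on the assembler, grows by orders of magnitude.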