ABSTRACT

In general, next-generation sequencing (NGS) data analysis is divided into three stages. In the primary analysis stage, bases are called by deconvoluting the optical or physicochemical signals generated during sequencing. Regardless of sequencing platform or application, the base-calling results are usually stored in the standard FASTQ format. Each FASTQ file contains a massive number of reads, which are the sequence readouts of DNA fragments sampled from a sequencing library. In the secondary analysis stage, reads in the FASTQ files are quality checked, preprocessed, and then mapped to a reference genome. The data quality check or control (QC) step examines a number of read quality metrics. Based on the QC results, the sequencing files are preprocessed to filter out low-quality reads, trim off portions of reads that have low-quality base calls, and remove adapter sequences or other artificial sequences (such as polymerase chain reaction [PCR] primers) if they exist. Subsequent mapping (or aligning) of the preprocessed reads to a reference genome aims to determine where in the genome each read comes from, which is critical information for most tertiary analyses (de novo genome assembly being an exception). Tertiary analysis is highly application-specific and is detailed in the chapters of Section III. This chapter focuses on steps in the primary and secondary stages, especially read QC, preprocessing, and mapping, which are shared among most applications (Figure 5.1).
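To make the preprocessing steps described above concrete, the following is a minimal Python sketch rather than a production pipeline. It assumes four-line FASTQ records with Phred+33 quality encoding, and it removes an adapter only when the adapter appears as an exact substring of the read, which is a simplification (dedicated trimming tools typically tolerate mismatches and partial adapters). All names (`Read`, `parse_fastq`, `clip_adapter`, `trim_3prime`, `preprocess`) and the default thresholds are hypothetical choices made for illustration.

```python
from io import StringIO
from typing import Iterable, Iterator, List, NamedTuple, TextIO


class Read(NamedTuple):
    name: str
    seq: str
    qual: str  # quality string, assumed to be Phred+33 encoded


def parse_fastq(handle: TextIO) -> Iterator[Read]:
    """Yield reads from a FASTQ stream with four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line) stops parsing
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield Read(header[1:], seq, qual)


def phred_scores(qual: str, offset: int = 33) -> List[int]:
    """Decode a quality string into per-base Phred scores."""
    return [ord(c) - offset for c in qual]


def mean_quality(read: Read) -> float:
    """Mean Phred score of a read; 0.0 for an empty read."""
    scores = phred_scores(read.qual)
    return sum(scores) / len(scores) if scores else 0.0


def clip_adapter(read: Read, adapter: str) -> Read:
    """Remove the first exact adapter match and everything after it."""
    idx = read.seq.find(adapter)
    if idx == -1:
        return read
    return Read(read.name, read.seq[:idx], read.qual[:idx])


def trim_3prime(read: Read, min_q: int = 20) -> Read:
    """Trim bases from the 3' end while their score is below min_q."""
    scores = phred_scores(read.qual)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return Read(read.name, read.seq[:end], read.qual[:end])


def preprocess(reads: Iterable[Read], adapter: str, min_q: int = 20,
               min_mean_q: float = 20.0, min_len: int = 20) -> Iterator[Read]:
    """Clip adapters, trim low-quality 3' ends, then filter poor reads."""
    for read in reads:
        read = trim_3prime(clip_adapter(read, adapter), min_q)
        if len(read.seq) >= min_len and mean_quality(read) >= min_mean_q:
            yield read
```

A typical use would be `kept = list(preprocess(parse_fastq(open("sample.fastq")), adapter="AGATCGG"))`, where the file name and adapter sequence are placeholders. The surviving reads would then be passed to a read mapper, which is outside the scope of this sketch.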