ABSTRACT

This chapter covers quality control (Q/C) of raw sequence data. Q/C serves as a quick screening tool for excluding data with serious quality issues and flagging data of questionable quality. The most important parameters to check are base quality, nucleotide distribution, GC content distribution, duplication rate, and adapter sequence contamination. One can use a tool called FastQC, which can be downloaded from the Babraham Bioinformatics website. Base quality scores are the single most important parameter in the Q/C of raw sequencing data. The overall GC content of the sequenced reads can also be used as a quality control parameter. Duplicate reads can arise from sources other than PCR duplication, including sequencing artifacts such as poly-A and poly-N reads, noise in cluster detection, and shearing of genomic DNA at the same location in different molecules. The Kmer Content module of FastQC counts the occurrences of each 7-mer at each position in the reads of the sequencing library.
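To make the parameters named above concrete, the following minimal Python sketch computes a mean base quality, a GC fraction, a simple duplication rate, and positional 7-mer counts on a tiny in-memory FASTQ example. It is an illustration only and does not reproduce how FastQC calculates its metrics. The function names, the example reads, and the assumption of Phred+33 quality encoding with four lines per record are all hypothetical choices made for this sketch.

```python
from collections import Counter
import io


def read_fastq(handle):
    """Yield (sequence, quality_string) pairs, assuming 4 lines per record."""
    while True:
        header = handle.readline()
        if not header:
            break
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        qual = handle.readline().strip()
        yield seq, qual


def mean_phred(qual, offset=33):
    """Mean Phred score of one read, assuming Phred+33 encoding."""
    return sum(ord(c) - offset for c in qual) / len(qual)


def gc_fraction(seq):
    """Fraction of bases in the read that are G or C."""
    return sum(base in "GC" for base in seq.upper()) / len(seq)


def duplication_rate(seqs):
    """Fraction of reads that exactly repeat a sequence already seen."""
    return 1 - len(set(seqs)) / len(seqs)


def positional_kmers(seqs, k=7):
    """Count each k-mer at each read position, keyed by (position, kmer)."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[(i, seq[i:i + k])] += 1
    return counts


# Hypothetical example data: three 10-base reads, two of them identical.
EXAMPLE = (
    "@r1\nACGTACGTAC\n+\nIIIIIIIIII\n"
    "@r2\nACGTACGTAC\n+\nIIIII#####\n"
    "@r3\nGGGGCCCCAT\n+\nIIIIIIIIII\n"
)

records = list(read_fastq(io.StringIO(EXAMPLE)))
seqs = [seq for seq, _ in records]

for seq, qual in records:
    print(seq, round(mean_phred(qual), 1), round(gc_fraction(seq), 2))
print("duplication rate:", round(duplication_rate(seqs), 2))
print("most common positional 7-mer:", positional_kmers(seqs).most_common(1))
```

In this sketch a read whose quality string ends in low-scoring characters shows a lower mean quality, and the two identical reads raise both the duplication rate and the count of their shared positional 7-mers, which is the kind of signal the corresponding Q/C checks are meant to surface.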