ABSTRACT

Learning goal: You can find common, unique, and redundant items in data sets.

6.1 IN THIS CHAPTER YOU WILL LEARN

• How to find common items in two or more data sets

• How to merge data sets

• How to remove duplicates from a data set

• How to detect data set overlaps, intersections, and differences with the set data structure

• How to remove noise from NGS raw data
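As a first taste of the operations listed above, the following minimal sketch shows how Python's built-in set type might be used to find common items, merge data sets, remove duplicates, and compute differences. The transcript names are made up for illustration and do not come from any real data set.

```python
# Hypothetical transcript identifiers, for illustration only.
wild_type = {"tx1", "tx2", "tx3"}
treated = {"tx2", "tx3", "tx4"}

common = wild_type & treated          # intersection: items in both sets
merged = wild_type | treated          # union: items in either set
only_wild_type = wild_type - treated  # difference: items only in the first set
not_shared = wild_type ^ treated      # symmetric difference: items in exactly one set

# Removing duplicates from a list while keeping the original order:
# dict keys are unique and preserve insertion order.
observations = ["tx1", "tx2", "tx1", "tx3", "tx2"]
unique_observations = list(dict.fromkeys(observations))

print(sorted(common))
print(sorted(merged))
print(unique_observations)
```

Converting a list to a set with `set(observations)` also removes duplicates, but it does not preserve the order of the items.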

6.2.1 Problem Description

The output of the NGS data analysis program Cuffcompare is the transcripts.tracking file described in the caption of Figure 6.1 and shown in Appendix C, Section C.9, “An Example of the Cuffcompare Output for Three Samples (q1, q2, and q3)” (for three biological samples). The first row of the file for three samples (q1, q2, and q3) looks like the following:

The file is a tab-separated table reporting the results of a comparison among the transcriptomes obtained from different DNA sequence samples. In particular, six samples are taken into account: WT1, WT2, and WT3, which are three replicas of a wild-type cell type (denoted by q1, q2, and q3 in the file, respectively), and T1, T2, and T3 (denoted by q4, q5, and q6 in the file, respectively), which are three replicas of the same cell type after pharmacological treatment (T stands for treated). Replicas are necessary to ensure that the data are robust. To this end, you may want to retain only transcripts that have been observed in the transcriptomes of all replicas, or in at least two out of three of them.
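The "at least two out of three replicas" rule can be sketched as follows. This is only an illustration under stated assumptions: it assumes each row has four leading metadata columns followed by one column per sample (q1 to q6), and that a lone "-" marks a sample in which the transcript was not observed. It also assumes that a transcript is kept when the rule holds in either the wild-type or the treated group, which is one possible reading of the requirement. The function names and the example rows are hypothetical.

```python
# Assumed layout (an assumption, check it against your own file):
# four metadata columns, then one column per sample; "-" means "not observed".
N_META = 4


def observed_count(columns):
    """Count the sample columns in which the transcript was observed."""
    return sum(1 for value in columns if value != "-")


def is_robust(line, min_replicas=2):
    """Return True if the transcript appears in at least min_replicas
    replicas of the wild-type group (q1-q3) or the treated group (q4-q6)."""
    fields = line.rstrip("\n").split("\t")
    samples = fields[N_META:]
    wild_type, treated = samples[0:3], samples[3:6]
    return (observed_count(wild_type) >= min_replicas
            or observed_count(treated) >= min_replicas)


# Made-up rows for illustration only (not real Cuffcompare output).
rows = [
    "TCONS_1\tXLOC_1\t-\tu\tq1:a\tq2:b\t-\t-\t-\t-",   # 2 of 3 wild-type replicas
    "TCONS_2\tXLOC_2\t-\tu\tq1:a\t-\t-\t-\tq5:c\t-",   # 1 wild type, 1 treated
    "TCONS_3\tXLOC_3\t-\tu\t-\t-\t-\tq4:a\tq5:b\tq6:c",  # 3 of 3 treated replicas
]

robust_ids = [row.split("\t")[0] for row in rows if is_robust(row)]
print(robust_ids)  # prints ['TCONS_1', 'TCONS_3']
```

Setting `min_replicas=3` would keep only transcripts observed in all three replicas of a group.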