ABSTRACT

Epigenetics is the study of factors that alter gene expression levels and thereby change phenotypic characteristics. Such changes are observed in several diseases, including cancers. In this research, four types of data are used to predict lung cancer: DNA methylation data, histone data, human genome data, and RNA-Seq data. Four feature selection methods were implemented: ReliefF, gain ratio, principal component analysis, and correlation-based feature selection. Seven classifiers were implemented: random forest, support vector machines with Gaussian and with linear kernel functions, logistic regression, naive Bayes, an artificial neural network, and a convolutional neural network (CNN). The datasets were processed using a custom R script, and Weka 3 and Python were used for the data analysis. With these machine learning and deep learning methods, both the accuracy and the area under the curve (AUC) of lung cancer prediction improved over existing results. The CNN model outperformed the other six classification methods. The AUC improved to 0.998 from 0.864 in existing results, an absolute gain of 0.134 (13.4 percentage points).
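The size of the AUC improvement can be stated either as an absolute difference or relative to the baseline, and the two readings give different percentages. The following minimal sketch computes both from the two AUC values reported above. The function name `auc_gain` is hypothetical and chosen only for illustration; it is not part of the study's R, Weka, or Python code.

```python
def auc_gain(new_auc: float, baseline_auc: float):
    """Return (absolute gain, relative gain in percent) of new_auc over baseline_auc.

    Hypothetical helper for illustration; not taken from the study's code.
    """
    for value in (new_auc, baseline_auc):
        if not 0.0 <= value <= 1.0:
            raise ValueError("AUC values must lie in [0, 1]")
    if baseline_auc == 0.0:
        raise ValueError("relative gain is undefined for a baseline AUC of 0")

    absolute = new_auc - baseline_auc               # difference in AUC units
    relative_pct = 100.0 * absolute / baseline_auc  # gain as a share of the baseline
    return absolute, relative_pct


# AUC values as reported in the abstract.
absolute, relative_pct = auc_gain(0.998, 0.864)
print(f"absolute gain: {absolute:.3f} ({100 * absolute:.1f} percentage points)")
print(f"relative gain: {relative_pct:.1f}%")
```

With the reported values, the absolute gain is 0.134 (13.4 percentage points), while the gain relative to the baseline of 0.864 is about 15.5%.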