ABSTRACT

Predicting the relationship between protein sequence, structure, and function is a highly complex task, which motivates the use of a range of analytical techniques and artificial intelligence tools. This chapter proposes a natural language processing (NLP)-based feature extraction technique, followed by a distributed deep learning approach, for protein secondary structure prediction. The proposed feature extraction method combines ‘prot2vec’, a pre-built embedding, with vectors generated from the input dataset. A convolutional network classifier then processes the extracted features and classifies each structure as α helix, β sheet, or coil. The proposed structure prediction model is implemented in an Apache Spark-based big data framework and achieves an accuracy of 84.9%. Our results indicate that an ensemble NLP approach combined with a deep learning network improves accuracy in classifying protein secondary structures, and that performance improves when the model is implemented in a distributed big data computing environment.