ABSTRACT

Natural language processing (NLP) is a fast-evolving field with popular applications in language translation, speech recognition, text classification, and text summarization, among others. A central step in NLP is vectorizing text data. This chapter discusses text vectorization techniques such as bag-of-words and term frequency-inverse document frequency (TF-IDF). While bag-of-words keeps track of word counts, TF-IDF scores each word according to its importance within a text and across the corpus. Once the text is vectorized, principal component analysis (PCA) can be used to view the vectors in a low-dimensional space, and the vectors can subsequently be fed to machine learning algorithms for analysis. In this chapter, abstracts of scientific literature from four categories are used to train a logistic regression model and a neural network-based NLP model. The model proved to be efficient at categorizing new text data.
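
To make the contrast between the two vectorization schemes concrete, the following is a minimal sketch in plain Python. The toy corpus, the helper names (`bag_of_words`, `idf`, `tf_idf`), and the unsmoothed formula tf × log(N / df) are assumptions chosen for illustration and are not taken from the chapter; libraries such as scikit-learn use smoothed and normalized variants by default, so their scores will differ.

```python
import math
from collections import Counter

# Toy corpus (hypothetical; not the chapter's data set).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Whitespace tokenization and a sorted vocabulary shared by all documents.
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

def bag_of_words(tokens):
    """Return raw word counts, one entry per vocabulary word."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def idf(word):
    """Unsmoothed inverse document frequency: log(N / df)."""
    df = sum(1 for doc in tokenized if word in doc)
    return math.log(len(tokenized) / df)

def tf_idf(tokens):
    """Return (relative term frequency) * idf for each vocabulary word."""
    counts = Counter(tokens)
    return [counts[w] / len(tokens) * idf(w) for w in vocab]

bow = [bag_of_words(doc) for doc in tokenized]
tfidf = [tf_idf(doc) for doc in tokenized]

# Compare the two schemes for the first document.
for word in ("the", "cat"):
    i = vocab.index(word)
    print(word, bow[0][i], round(tfidf[0][i], 3))
```

In the first document, "the" has the higher count under bag-of-words, but "cat" receives the higher TF-IDF score because it occurs in fewer documents of the corpus.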