ABSTRACT

This chapter introduces preprocessing procedures that must be applied before running machine learning algorithms for topic extraction and sentiment analysis. These procedures help ensure the accuracy of the analysis, reduce biases inherent in the data, and help us focus on the relevant aspects of the data. We start with an outline of relevant preprocessing techniques. We then demonstrate basic data cleaning procedures, for example how to remove missing values or drop duplicate entries in a data frame. Finally, we turn to feature extraction techniques at the corpus, document, and token levels, for example how to create weighted and unweighted document-term matrices, how to tokenize text documents and filter tokens by their parts of speech, and how to perform named entity recognition.
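
To give a first impression of the steps named above, the following is a minimal sketch using only the Python standard library. The toy corpus, the regular-expression tokenizer, and the particular TF-IDF weighting (idf = log(N / df)) are assumptions chosen for illustration. They are not necessarily the data, tools, or weighting scheme used in the chapter itself.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus; None marks a missing value.
raw_docs = [
    "The service was great",
    None,
    "The service was great",
    "Delivery was slow but the support was great",
]

# Basic cleaning: drop missing entries, then drop exact duplicates
# (dict.fromkeys keeps the first occurrence and preserves order).
docs = list(dict.fromkeys(d for d in raw_docs if d is not None))

# Token level: lowercase each document and split it into alphabetic tokens.
tokens = [re.findall(r"[a-z]+", d.lower()) for d in docs]

# Corpus level: unweighted document-term matrix (raw term counts).
vocab = sorted({t for doc in tokens for t in doc})
counts = [Counter(doc) for doc in tokens]
dtm = [[c[term] for term in vocab] for c in counts]

# Weighted variant: TF-IDF with idf = log(N / df), one of several
# common definitions (assumed here for illustration).
n_docs = len(docs)
df = {term: sum(1 for c in counts if term in c) for term in vocab}
tfidf = [
    [c[term] * math.log(n_docs / df[term]) for term in vocab]
    for c in counts
]
```

In this sketch, terms that occur in every document receive a weight of zero in the weighted matrix, which illustrates why weighting helps to focus on the terms that distinguish documents from one another.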