ABSTRACT

Institute of Information Systems, Friedrich-Alexander-University Erlangen-Nuremberg, Germany

Pavlina Davcheva

Institute of Information Systems, Friedrich-Alexander-University Erlangen-Nuremberg, Germany

Abstract
6.1 Introduction
    Used Libraries
    Data Set Description
6.2 Process
    6.2.1 Step 1: Preparation and Loading of Data Set
    6.2.2 Step 2: Preprocessing and Feature Extraction
        Substitution of Emoticons
        Tokenization
        Stopwords and Punctuation Marks Removal
        Text-normalization
        Feature Extraction
    6.2.3 Step 3: Sentiment Classification
    6.2.4 Step 4: Evaluation of the Trained Classifier
    6.2.5 Step 5: Visualization of Review Data
6.3 Summary
Bibliography

This chapter presents a process for sentiment classification of product review data. Our data set consists of reviews of products from six categories on Amazon.com. We use the star rating of each review to determine whether the review is positive or negative. We apply emoticon substitution, tokenization, stopword removal, and text normalization, and then generate features that are used to train the classifier. The resulting classifier is evaluated with stratified k-fold cross-validation, using accuracy and a confusion matrix as measures of prediction quality. As a final step, we demonstrate two visualization techniques to reveal the context behind the sentiment.
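To make the outlined process concrete, the following is a minimal sketch in plain Python of the labeling, preprocessing, fold construction, and evaluation measures. It is not the chapter's implementation. The star thresholds (4 to 5 stars positive, 1 to 2 negative, 3 skipped), the emoticon mapping, the stopword list, and all function names are assumptions made for illustration. The classifier itself is left out because the abstract does not name one.

```python
import re
from collections import Counter

# Assumed mappings and word lists; illustrative only, not taken from the chapter.
EMOTICONS = {":)": "emo_pos", ":-)": "emo_pos", ":(": "emo_neg", ":-(": "emo_neg"}
STOPWORDS = {"the", "a", "an", "is", "it", "this", "and", "i"}


def label_from_stars(stars):
    """Map a star rating to a label (assumed thresholds; 3 stars is skipped)."""
    if stars >= 4:
        return "pos"
    if stars <= 2:
        return "neg"
    return None


def preprocess(text):
    """Substitute emoticons, lowercase, tokenize, and drop stopwords."""
    for emoticon, token in EMOTICONS.items():
        text = text.replace(emoticon, f" {token} ")
    # Keeping only letter/underscore runs also discards punctuation and digits.
    tokens = re.findall(r"[a-z_]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def extract_features(tokens):
    """Bag-of-words counts as a simple feature representation."""
    return Counter(tokens)


def stratified_folds(labels, k):
    """Distribute sample indices over k folds, class by class (round-robin)."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    for indices in by_class.values():
        for pos, i in enumerate(indices):
            folds[pos % k].append(i)
    return folds


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_matrix(y_true, y_pred, classes=("pos", "neg")):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in classes] for t in classes]
```

In a full run, each fold would serve once as the test set while a classifier is trained on the remaining folds. The accuracy and confusion matrix would then be computed on the held-out predictions.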