CONTENTS

3.1 Introduction
    3.1.1 Computational Topics
3.2 Anatomy of an email Message
3.3 Reading the email Messages
3.4 Text Mining and Naïve Bayes Classification
3.5 Finding the Words in a Message
    3.5.1 Splitting the Message into Its Header and Body
    3.5.2 Removing Attachments from the Message Body
    3.5.3 Extracting Words from the Message Body
    3.5.4 Completing the Data Preparation Process
3.6 Implementing the Naïve Bayes Classifier
    3.6.1 Test and Training Data
    3.6.2 Probability Estimates from Training Data
    3.6.3 Classifying New Messages
    3.6.4 Computational Considerations
3.7 Recursive Partitioning and Classification Trees
3.8 Organizing an email Message into an R Data Structure
    3.8.1 Processing the Header
    3.8.2 Processing Attachments
    3.8.3 Testing Our Code on More email Data
    3.8.4 Completing the Process
3.9 Deriving Variables from the email Message
    3.9.1 Checking Our Code for Errors
3.10 Exploring the email Feature Set
3.11 Fitting the rpart() Model to the email Data
3.12 Exercises
Bibliography

3.1 Introduction

People are terrific at spotting spam in their mail reader with a quick glance at the subject line and sender, and when that approach is not conclusive, a glimpse at the contents usually settles the question. Can a computer program do the same and spare us the time and irritation of having to sort through these messages in our inbox? Spam filters used by mail readers examine various characteristics of an email before deciding whether to place it in your inbox or spam folder. This decision is based in part on a statistical analysis of a large amount of email that has been hand classified as spam (unwanted) or ham (wanted). In this chapter, we examine over 9000 messages that have been classified by SpamAssassin (https://spamassassin.apache.org) for the purpose of developing and testing spam filters.
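
To make the idea of "examining characteristics" of hand-classified mail concrete, here is a minimal sketch in R. It is only an illustration: the four subject lines are invented rather than drawn from the SpamAssassin messages, and the two characteristics (`numExclaim` and `percentCaps`) are hypothetical names chosen for this example, not necessarily the variables derived later in the chapter.

```r
# Toy hand-classified messages (invented for illustration only;
# these are NOT taken from the SpamAssassin corpus).
msgs <- data.frame(
  subject = c("WIN A FREE PRIZE NOW!!!",
              "Meeting notes from Tuesday",
              "Cheap meds, no prescription!",
              "Re: draft of chapter 3"),
  isSpam = c(TRUE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Hypothetical characteristic 1: number of exclamation marks in the subject.
# Remove everything that is not "!" and count what remains.
msgs$numExclaim <- nchar(gsub("[^!]", "", msgs$subject))

# Hypothetical characteristic 2: percentage of letters that are capitals.
lettersOnly <- gsub("[^A-Za-z]", "", msgs$subject)
msgs$percentCaps <- 100 * nchar(gsub("[^A-Z]", "", lettersOnly)) /
  nchar(lettersOnly)

# Compare the average of each characteristic across the two hand-assigned classes.
classMeans <- aggregate(cbind(numExclaim, percentCaps) ~ isSpam,
                        data = msgs, FUN = mean)
print(classMeans)
```

With real data, a summary like `classMeans` is the kind of evidence a statistical filter draws on: a characteristic is useful when its distribution differs noticeably between the spam and ham messages.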