ABSTRACT

In recent years, the use of the short message service (SMS) has grown significantly, and with it the volume of spam. SMS spam is any unsolicited text, such as promotional content, Web links or other irrelevant notes, sent to a mobile phone for advertising purposes. The low cost of SMS offered by telecom companies is one factor behind its high usage. The surge in unsolicited messages across platforms, including email and SMS, has created a need for better techniques to counteract spam, especially SMS spam, which is a persistent nuisance to users. A variety of methods have therefore been used to detect spam. We use a dataset of real SMS messages from the UCI Machine Learning Repository. The dataset contains a total of 5,574 SMS messages, of which 747 are spam and 4,827 are ham. In preprocessing, we remove stop words, which carry little information for classification. For feature extraction, we use the WordNet Lemmatiser to lemmatise the tokens and the Count Vectoriser to convert the words into vectors. Four machine learning algorithms are trained and tested on this dataset: the naive Bayes classifier, logistic regression, the random forest classifier and the decision tree classifier. We evaluate these models using precision, recall and accuracy, and take accuracy as the primary metric for identifying the most effective algorithm for detecting SMS spam. Among these algorithms, the random forest classifier is the most suitable, achieving the highest accuracy, 98.13 percent.
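
The pipeline summarised above (stop-word removal, word-count features, a classifier, and accuracy, precision and recall) can be illustrated with a minimal sketch. This is not the authors' code. It assumes only the Python standard library, omits lemmatisation, and implements just one of the four classifiers, a multinomial naive Bayes with add-one smoothing. The stop-word list, the function names and the four training messages are invented for illustration and are not taken from the UCI dataset.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the paper does not list its own).
STOP_WORDS = {"a", "an", "the", "to", "is", "you", "your", "for", "of", "and", "at", "are", "in"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip(".,!?:;\"'()") for w in text.lower().split())
    return [w for w in words if w and w not in STOP_WORDS]


def train_naive_bayes(messages, labels):
    """Collect per-class word counts (bag of words) and class frequencies."""
    word_counts = {label: Counter() for label in set(labels)}
    class_counts = Counter(labels)
    for text, label in zip(messages, labels):
        word_counts[label].update(tokenize(text))
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return word_counts, class_counts, vocab


def predict(model, text):
    """Return the class with the highest log-posterior under add-one smoothing."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        n_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for w in tokenize(text):
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1) / (n_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


def evaluate(y_true, y_pred, positive="spam"):
    """Return (accuracy, precision, recall), treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Invented toy messages, for illustration only.
train_messages = [
    "Win a free prize now",
    "Free entry claim your cash prize",
    "Are you coming to lunch today",
    "See you at the meeting tomorrow",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = train_naive_bayes(train_messages, train_labels)
test_messages = ["Claim your free prize", "Lunch tomorrow"]
test_labels = ["spam", "ham"]
predictions = [predict(model, m) for m in test_messages]
accuracy, precision, recall = evaluate(test_labels, predictions)
```

On the real dataset, the same steps would be applied to a train/test split of the 5,574 messages, and each of the four classifiers would be scored with the same `evaluate` function so that their accuracies are directly comparable.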