ABSTRACT

Contents

14.1 Introduction
14.2 Related Work
14.3 Research Contribution
14.4 Twitter Tweet Classification
14.5 Approach
    14.5.1 Data Normalization
    14.5.2 Creation of Topic Dictionaries
    14.5.3 Document Vectorization
    14.5.4 Determining Security Associations
    14.5.5 Pruning of Security Association Rules
    14.5.6 Assessing Security Association Rules
    14.5.7 Performance Constraints
    14.5.8 Application along Temporal Dimension
    14.5.9 Comparison of Dictionary-Based Topic Identification with Statistical Topic Models
14.6 Experimental Results
    14.6.1 Tweet Corpus
    14.6.2 Training Set and Test Set Details
    14.6.3 Creation of Security Dictionaries
    14.6.4 Creation of Security Associations
    14.6.5 Pruning of Security Association Rules

The increasing use of online social networks by criminals is of great concern to law enforcement agencies across the world. Identifying messages relevant to the security domain can serve as a stepping-stone in criminal network analysis. Terrorists have recently moved to Twitter, where they use specific hashtags to spread their ideologies and messages. In this chapter, we discuss an application of machine learning to detecting hidden subversive groups on Twitter by presenting a variant of the rule-based approach for classifying messages of radical groups. The approach incorporates security dictionaries of enriched themes relevant to law enforcement agencies, where each theme is represented by a set of semantically related words. The identified themes are mapped to categories that arise from security associations. The approach handles multilabel classification by assigning two or more categories to a message. The high accuracy of the rule-based approach makes it viable for application in the security domain. Using this approach, we are able to classify messages on the basis of topics of interest to the security community. We present results obtained through experiments with Twitter and discuss the temporal effects of our approach.
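The pipeline summarized above (theme dictionaries, theme identification, association rules, multilabel output) can be illustrated with a minimal sketch. This is not the chapter's implementation. The dictionary entries, the rule set, the category names, and the function names normalize, identify_themes, and classify are all hypothetical and chosen only to show how a message can receive more than one category.

```python
import re

# Hypothetical theme dictionaries: theme -> semantically related words.
# Illustrative assumptions only, not the dictionaries built in the chapter.
THEME_DICTIONARIES = {
    "weapons": {"rifle", "explosive", "ammunition"},
    "recruitment": {"join", "brothers", "pledge"},
    "finance": {"donate", "funds", "transfer"},
}

# Hypothetical association rules: a set of themes (antecedent) -> category.
ASSOCIATION_RULES = [
    (frozenset({"weapons"}), "armed-activity"),
    (frozenset({"recruitment"}), "radicalization"),
    (frozenset({"recruitment", "finance"}), "fundraising-campaign"),
]


def normalize(text):
    """Lowercase the message and split it into word tokens.

    Punctuation and the '#' of hashtags are dropped, so '#pledge' -> 'pledge'.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def identify_themes(tokens):
    """Return every theme whose dictionary shares at least one word with the message."""
    token_set = set(tokens)
    return {theme for theme, words in THEME_DICTIONARIES.items() if token_set & words}


def classify(text):
    """Return all categories whose rule antecedent is contained in the message's themes."""
    themes = identify_themes(normalize(text))
    return sorted({category for antecedent, category in ASSOCIATION_RULES
                   if antecedent <= themes})


if __name__ == "__main__":
    print(classify("Join us, brothers! Donate funds for the cause #pledge"))
```

Because every rule whose antecedent matches contributes its category, a single message can carry several labels, and a message that matches no dictionary receives none. In this sketch the rules are written by hand; the abstract says only that categories arise from security associations.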