ABSTRACT

We address the difficulty of automatic misinformation detection using state-of-the-art natural language processing techniques. Machine learning and natural language processing are often touted as perfect solutions to the problem of detecting misinformation at scale. We argue, however, that current approaches tend to fall short because reliably annotated data are unavailable. Given the scarcity of quality labelled data, we first conduct a data collection effort by leveraging fact-checking websites. Second, we perform a comparative feature analysis of news articles with true versus false content. Third, we conduct a set of text classification experiments using a variety of methods and show that the quality of the training data, in terms of its labelling scheme and balanced coverage of topics, directly affects classification accuracy on test data (news articles that the classifier did not see during training). The feature analysis experiments reveal distinctive linguistic patterns in fake news articles. In predictive classification, some of these features, such as n-grams and semantic features, help considerably in distinguishing false from true news articles, whereas others, such as readability features, tend to be less helpful. We also compare deep learning classification models against feature-based models and show that, given the small size of currently available data, feature-based models are more capable of cross-topic generalisation. This result points to the need for automatic classification methods informed by linguistic, corpus-linguistic, and stylistic research. Finally, we show that data labelled solely on the reputation of their sources do not support accurate classification of test data, which, in turn, motivates future data collection with reliable labelling and diverse topic coverage.
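As a rough illustration of the kind of feature-based classifier discussed above, the following minimal sketch trains a word n-gram model with scikit-learn. It is not the paper's actual pipeline; the toy corpus, labels, and hyperparameters are placeholders introduced purely for demonstration.

```python
# Minimal illustrative sketch of a feature-based (word n-gram) fake-news
# classifier. The toy corpus, labels, and hyperparameters are placeholders,
# not the paper's data or exact pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

articles = [
    "Scientists confirm the new vaccine passed phase three trials.",
    "Miracle cure hidden by doctors revealed in shocking report!",
    "The central bank raised interest rates by a quarter point.",
    "Secret memo proves the moon landing was staged, insiders say.",
]
labels = ["true", "false", "true", "false"]  # placeholder annotations

# Unigram + bigram TF-IDF features feeding a linear classifier.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), lowercase=True),
    LogisticRegression(max_iter=1000),
)

# Hold out half of the (tiny) corpus to mimic evaluation on unseen articles.
X_train, X_test, y_train, y_test = train_test_split(
    articles, labels, test_size=0.5, random_state=0, stratify=labels
)
model.fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

In practice, the accuracy of such a model on genuinely unseen topics depends heavily on how the training data were labelled and how broadly the topics are covered, which is the central point of the abstract.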