Some Research Issues of Harmful and Violent Content Filtering for Social Networks in the Context of Large-Scale and Streaming Data with Apache Spark

doi:10.1201/9780429270567-11

Chapter

ABSTRACT

In recent years, interconnected online resources such as social networks, video-sharing sites, blogs, online news, and forums have grown tremendously. Social networks and forums in particular have become popular and indispensable in people's everyday lives, serving as venues for online entertainment and communication and facilitating e-commerce. Well-known social networks and news sites such as Facebook, Twitter, LinkedIn, BBC News, Sina News, Weibo, Tumblr, Instagram, and YouTube have hundreds of millions of daily active users. These online environments have also become an avenue for harmful or violent content in the form of posts, news, comments, images, etc., which might lead to depression or severe mental illness in the people exposed to it. Detecting and filtering harmful/violent content on social networks is therefore necessary. Large corporations such as Google, Facebook, and Twitter have recently spent considerable effort and money on validating content for harm and violence in their own products. However, most content checking and validation is currently conducted manually by humans. Manual validation of all text-based content (comments, posts, news, etc.) and multimedia content (videos, images, voice, etc.) is highly difficult or even impossible because of the huge amount of data generated daily on these social networks. In this chapter, we therefore propose a novel framework for automatically detecting and filtering harmful/violent content on social networks in the context of Big Data. We built our detection and filtering system on multiple Apache-family platforms. The proposed system architecture detects negative content and interactions from users, that is, abusive/harmful/violent content, in two main types of data: text-based content (posts, comments, news, etc.) and images, using an advanced deep learning architecture, the convolutional neural network (CNN). The CNN is developed and applied on the Apache Spark platform to detect harmful/violent content that might exist in the large-scale, high-velocity text-based and image-based content collected from multiple online resources such as social networks, news sites, and forums. To collect data from these resources, we developed a distributed web crawler based on Apache Nutch that gathers both text-based and image-based content. The collected data is processed and stored in the distributed storage environment of the Apache Hadoop Distributed File System (HDFS). This data is then processed and embedded on the Apache Spark Streaming platform to ensure high-speed data handling. We consider this combination of Apache Spark-based distributed processing components a suitable platform for detecting and filtering both text-based and image-based harmful/violent content, and thus for addressing these social threats.
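
The data flow outlined in the abstract (collected items arrive as a stream, are handled in small batches, are routed by content type, and are then scored and filtered) can be illustrated with a minimal plain-Python sketch. It does not use Apache Nutch, HDFS, or Spark, and the scoring functions are toy placeholders rather than the CNN models the chapter refers to. The names `Item`, `micro_batches`, `filter_batch`, and `SCORERS`, the blocklist, and the 0.2 threshold are all assumptions introduced here for illustration only.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class Item:
    """One collected record. The schema is hypothetical: kind is 'text' or 'image'."""
    item_id: str
    kind: str
    payload: str


def micro_batches(stream: Iterable[Item], size: int) -> Iterator[List[Item]]:
    """Group a stream into fixed-size batches, loosely mimicking micro-batch processing."""
    it = iter(stream)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


# Toy blocklist (assumption); a trained text model would replace this lookup.
BLOCKLIST = {"kill", "attack"}


def score_text(payload: str) -> float:
    """Placeholder text scorer: fraction of words found in the toy blocklist."""
    words = [w.strip(".,!?") for w in payload.lower().split()]
    if not words:
        return 0.0
    return sum(w in BLOCKLIST for w in words) / len(words)


def score_image(payload: str) -> float:
    """Stub image scorer: always 0.0, since no image model exists in this sketch."""
    return 0.0


# Route each content type to its own scorer.
SCORERS: Dict[str, Callable[[str], float]] = {
    "text": score_text,
    "image": score_image,
}


def filter_batch(
    batch: List[Item],
    scorers: Dict[str, Callable[[str], float]] = SCORERS,
    threshold: float = 0.2,
) -> Tuple[List[Item], List[Item]]:
    """Split a batch into (kept, flagged) using the scorer for each item's kind."""
    kept: List[Item] = []
    flagged: List[Item] = []
    for item in batch:
        score = scorers[item.kind](item.payload)
        (flagged if score >= threshold else kept).append(item)
    return kept, flagged


if __name__ == "__main__":
    stream = [
        Item("a", "text", "have a nice day"),
        Item("b", "text", "I will attack you"),
        Item("c", "image", "cat.jpg"),
    ]
    for batch in micro_batches(stream, size=2):
        kept, flagged = filter_batch(batch)
        print([i.item_id for i in kept], [i.item_id for i in flagged])
```

Passing the scorers in as a dictionary keeps the routing step separate from the models, so a trained text or image classifier could be swapped in for either placeholder without changing the batching or filtering logic.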