ABSTRACT

Offensive content is a harmful phenomenon that has spread across the internet. Preventing the spread of such toxic language may therefore help avert numerous harmful psychological, social, and political effects. One way to contribute is to classify social media comments as offensive/hate speech or not. In this chapter, we aim at detecting offensive content and hate speech on social media. We employ a multidomain data set whose comments were collected and annotated by the organizers of a shared task on offensive language and hate speech. We base our system on the AraBERT, MarBERT, and Qarib transformer models, and then improve classification performance using ensemble learning, which combines the models' outputs according to their average F1-measure. Since the studied data set is highly imbalanced, we also perform augmentation of the underrepresented data and investigate its effect on classification. We then analyze common and specific classification errors, and highlight the main shortcomings and potential improvements. Our system, based on transfer and ensemble learning, achieves an improvement over the best-ranked team in the OSACT5 shared task, whose corpora are used in this study.