ABSTRACT

The proliferation of artificial intelligence (AI) use cases in human-interactive contexts has encouraged the research and development of explainable AI (XAI) methods. Cybersecurity professionals are gradually adopting machine learning (ML) to keep up with rapidly changing threats. Whether these models are widely adopted depends largely on how easily domain experts and users can understand and trust their operation. As these black-box models are used to make critical predictions, stakeholders increasingly demand transparency and explainability. In this study, we propose an ML-based system that detects anomalies in the BPF-Extended Tracking Honeypot (BETH) cybersecurity dataset. The study also focuses on the explainability of black-box ML models so that stakeholders and domain experts can understand the algorithms' predictions. After extensive exploratory data analysis, we applied Isolation Forest (IF), One-Class Support Vector Machine (OCSVM), Random Forest (RF), Gradient Boosting Classifier (GBC), Logistic Regression (LR), Stochastic Gradient Descent (SGD), Gaussian Naïve Bayes (GNB), and Extreme Gradient Boosting (XGB). RF, GBC, and XGB achieve the highest accuracy at 99%, with 99% precision, recall, F1 score, specificity, and Cohen's kappa. We apply the Shapash and Eli5 XAI tools to explain the RF model, analyzing overall feature importance, the contribution of individual features to the model, local explanations for any instance, and comparisons between random instances. The explanations show that the feature 'userId' takes the highest precedence over the others. The top five features contributing to the model are 'userId', 'parentProcessId', 'processName', 'processId', and 'args'. The remaining features contribute less to anomaly detection, and individual features contribute to the model's predictions both positively and negatively.
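
To illustrate the workflow summarized above, the following minimal Python sketch trains a Random Forest on a BETH-style feature table and inspects it with Eli5 and Shapash. It is not the authors' exact pipeline: the file name beth_preprocessed.csv, the assumption that categorical fields such as processName and args have already been numerically encoded, and the use of a 'sus' column as the binary label are all illustrative assumptions.

# Minimal sketch (not the authors' exact pipeline): train a Random Forest on
# BETH-style process-event features and explain it with Eli5 and Shapash.
# The CSV name, feature columns, and 'sus' label column are assumptions.
import pandas as pd
import eli5
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from shapash import SmartExplainer

# Hypothetical pre-processed BETH export: categorical fields assumed already encoded as numbers.
df = pd.read_csv("beth_preprocessed.csv")
features = ["userId", "parentProcessId", "processName", "processId", "args"]
X, y = df[features], df["sus"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)

# Global feature weights with Eli5 (text output, suitable for scripts).
print(eli5.format_as_text(eli5.explain_weights(clf, feature_names=features)))

# Shapash: global feature importance plus a local explanation for one test row.
xpl = SmartExplainer(model=clf)
xpl.compile(x=X_test)
xpl.plot.features_importance()               # global ranking of the five features
xpl.plot.local_plot(index=X_test.index[0])   # per-instance contribution plot

In practice, the same SmartExplainer object also supports comparing several instances side by side, which corresponds to the random-instance comparisons described in the abstract.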