ABSTRACT
Malware detection is fundamental to safe and secure computing systems, from the cloud to Internet of Things (IoT) devices and Operational Technology (OT) systems. A malware detection process takes software samples as input, extracts their static and dynamic features, and classifies each sample as malicious or benign using a range of Machine Learning (ML) algorithms and Deep Neural Networks (DNNs). Effective and efficient detection models require significant amounts of training data, but obtaining such data is hindered by the absence of sufficient benchmark datasets and by intellectual property and privacy constraints that prevent data sharing among organizations.
In this work, we present an effective Federated Learning (FL) solution for malware detection. It achieves high detection accuracy with a model that is developed in a distributed fashion among the members of a federation, who are not required to exchange source data. We consider a federation of Edge or near-Edge devices that are deployed as security providers for their organizational networks. Each device trains its own neural network (NN) model on its own data; the local models are combined into a global, aggregated NN model using cross-silo FL, and the global model is distributed to the federation members. We evaluate the FL solution on the EMBER dataset and demonstrate that our approach reaches accuracy above 93%, which is the accuracy achieved by the non-federated, centralized NN model. Our work demonstrates that the FL solution is effective and efficient: it achieves high accuracy without the need to exchange source data, i.e., respecting privacy, while scaling well with the size of the federation. Importantly, the results indicate that organizations have a strong incentive to participate in the federation, because they achieve significantly higher malware detection accuracy than they would by using only their own training data.
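As an illustration of the aggregation step described above, the following minimal sketch combines local model parameters into a global model. It assumes the widely used federated averaging (FedAvg) rule, in which each member's parameters are weighted by the size of its local dataset; the abstract does not state which aggregation rule is used, so this choice, the flattened parameter representation, and all names and values (`federated_average`, `local_models`, `sizes`) are hypothetical.

```python
from typing import List, Sequence

# Hypothetical representation: each model is a flattened list of parameters.
Weights = List[float]


def federated_average(
    local_weights: Sequence[Weights], sample_counts: Sequence[int]
) -> Weights:
    """Combine local parameter vectors into one global vector.

    Assumes FedAvg-style aggregation: each member's parameters are
    weighted by the number of samples in its local training set.
    """
    if not local_weights or len(local_weights) != len(sample_counts):
        raise ValueError("need one sample count per local model")
    total = sum(sample_counts)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(local_weights[0])
    if any(len(w) != dim for w in local_weights):
        raise ValueError("all local models must have the same shape")
    # Weighted mean, computed independently for every parameter position.
    return [
        sum(w[i] * n for w, n in zip(local_weights, sample_counts)) / total
        for i in range(dim)
    ]


# Three hypothetical federation members with differently sized datasets.
local_models = [[0.0, 1.0], [1.0, 1.0], [4.0, 3.0]]
sizes = [100, 100, 200]

# Only parameters and sample counts are shared; raw samples stay local.
global_model = federated_average(local_models, sizes)
print(global_model)  # [2.25, 2.0]
```

In a full training loop, the resulting global parameters would be sent back to every member and used as the starting point for the next round of local training.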
