ABSTRACT

As the volume of incoming samples grows exponentially every year, one of the critical tasks in the defense pipeline is to identify suspicious samples that require further investigation. The task essentially involves filtering a large volume of unknown samples down to a manageable set of potentially malicious samples with sufficiently high confidence. Today the majority of malware samples are metamorphic: the original set of code blocks undergoes various mutations before release. The arbitrary, custom-designed mutation algorithm applied at each outbreak is the crux of the constant battle between attackers and defenders. Therefore, it is crucial to capture the non-linear metamorphic pattern unique to each mutation algorithm in order to detect the variants. This paper compares the performance of various clustering methods on metamorphic malware samples to identify the model best suited to practical threat hunting. Results show that an adversarial autoencoder performs better than well-known techniques such as HDBSCAN, KNN, and SDHASH.