ABSTRACT

In the pay-per-click (PPC) model of online advertising, an advertiser pays the publisher for every click generated on a published advertisement, which creates an incentive for click fraud: the deliberate clicking on an advertisement by its publisher. The highly skewed class distribution of user-click data makes the identification of fraudsters challenging for current machine learning methods. This work therefore proposes a reliable click-fraud detection (CFD) system for the efficient investigation of fraudulent publishers. The proposed CFD system has three novel features. First, the problem of class imbalance is addressed using the synthetic minority oversampling technique (SMOTE) and random under-sampling with boosting (RUSBoost). Second, a novel Hybrid-Manifold Feature Subset Selection (H-MFSS) method is proposed to obtain an optimal subset of informative features. Third, a gradient tree boosting (GTB) model classifies the behavior of fraudsters from the balanced and optimally selected user-click data. Experiments are conducted on the FDMA2012 mobile advertising user-click data in two modes: with all features and with selected features, in each case on both the original data and the data resampled by the sampling methods. Because accuracy is biased towards the majority class, model performance is evaluated using average precision (AP), sensitivity or recall (SE), specificity (SP), and G-mean (GM) instead. The efficacy of the proposed GTB model is further evaluated by comparing its performance with that of 12 conventional machine learning models. The empirical results show that GTB generalizes well, achieving an AP score of 64.86% without sampling, 65.25% with RUSBoost, and 66.78% with SMOTE when using the selected features. The sampling methods and the optimally selected features together yield a significant improvement in classification performance.
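The choice of SE, SP, and GM over accuracy can be illustrated with a minimal sketch. It assumes the usual definitions (SE = TP / (TP + FN), SP = TN / (TN + FP), GM = sqrt(SE × SP)); the function name `imbalance_metrics` and the toy labels are hypothetical and are not taken from the FDMA2012 data or from the proposed system.

```python
import math


def imbalance_metrics(y_true, y_pred, positive=1):
    """Return accuracy, sensitivity (SE), specificity (SP), and G-mean (GM).

    Hypothetical helper for illustration; assumes binary labels where
    `positive` marks the minority (fraud) class.
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1

    # Guard against empty classes to avoid division by zero.
    se = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    gm = math.sqrt(se * sp)
    acc = (tp + tn) / len(y_true) if len(y_true) else 0.0
    return {"accuracy": acc, "SE": se, "SP": sp, "GM": gm}


# Toy data (assumed): 95 legitimate publishers (0) and 5 fraudulent ones (1).
y_true = [0] * 95 + [1] * 5
# A trivial classifier that labels every publisher as legitimate.
y_all_negative = [0] * 100

trivial = imbalance_metrics(y_true, y_all_negative)
```

On this toy data the trivial classifier is correct for 95 of 100 publishers, yet it detects no fraudsters, so its sensitivity and G-mean are both zero. This is the kind of majority-class bias that accuracy alone would hide.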