Experimental Studies on the Impact of Data Sampling with Severely Imbalanced Big Data

doi:10.1201/9781003034971-1

Chapter

ABSTRACT

This introduction presents an overview of the key concepts discussed in the subsequent chapters of this book. The book focuses on data sampling as a way to reduce the impact of class imbalance on machine learning models. It demonstrates that classification performance on several imbalanced big datasets from different application domains can be significantly improved using Random Undersampling without substantially altering the composition of the original data. The book provides an overview of related works. It describes the Machine Learning (ML) classification algorithms and libraries, including the evaluation strategy with its validation techniques and performance metrics. It introduces the datasets and describes how they were processed, how the models were trained, and how performance was evaluated. The studies involve a real-world Medicare fraud problem with severe class imbalance. To ease the process of using ML, engineers build the algorithms into software modules or packages, making sure that they work reliably, quickly, and at scale.
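As a rough illustration of the Random Undersampling idea named in the abstract, the sketch below randomly discards majority-class instances while keeping every minority-class instance. It is a minimal plain-Python example on toy data, not the procedure used in the book: the function name `random_undersample`, the `target_ratio` parameter, and the 990-to-10 label split are all assumptions made here for illustration.

```python
import random
from collections import Counter


def random_undersample(labels, majority_label, target_ratio, seed=0):
    """Return sorted indices kept after randomly discarding majority rows.

    Hypothetical helper: target_ratio is the desired majority:minority
    ratio after sampling (1.0 means a balanced 1:1 result).
    """
    rng = random.Random(seed)
    majority = [i for i, y in enumerate(labels) if y == majority_label]
    minority = [i for i, y in enumerate(labels) if y != majority_label]
    # Keep every minority row; keep only as many majority rows as the
    # requested ratio allows, chosen uniformly at random.
    n_keep = min(len(majority), int(round(target_ratio * len(minority))))
    kept_majority = rng.sample(majority, n_keep)
    return sorted(kept_majority + minority)


# Toy data (assumed): 990 negatives (0) and 10 positives (1).
labels = [0] * 990 + [1] * 10
kept = random_undersample(labels, majority_label=0, target_ratio=1.0)
counts = Counter(labels[i] for i in kept)
print(counts[0], counts[1])
```

In practice only the training portion of the data would be sampled this way, so that validation and test sets keep the original class distribution; the ratio itself is a tuning choice.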