Investigation on Improving the Performance of Class-imbalanced Medical Health Datasets

doi:10.1201/9781003438816-1

Chapter

ABSTRACT

The volume of data has increased significantly in recent years due to technological advancements, especially in the medical field. Machine learning has produced precise outcomes in medical domains such as detection, diagnosis, imaging, and personalized medicine. Machine learning algorithms analyze feature-engineered data using different learning methods, such as supervised and unsupervised learning. In medical applications, these algorithms play a vital role in disease diagnosis and pattern recognition, even in the absence of medical experts. For instance, during the coronavirus (COVID-19) pandemic, machine learning algorithms accurately identified infected persons in the early stages using chest X-ray recordings, real-time polymerase chain reaction (RT-PCR) tests, and blood samples. However, learning algorithms have certain limitations when trained on imbalanced datasets collected for the deadliest diseases, such as coronary heart disease, stroke, respiratory illness, COPD, cancer, diabetes, Alzheimer’s disease, TB, and cirrhosis. To investigate the performance of learning algorithms on such class-imbalanced datasets and to improve it in terms of false positive rates, this chapter applies random oversampling, random undersampling, and SMOTE to handle the class-imbalance problem in two different medical datasets. The balanced datasets are then evaluated using naïve Bayes, decision tree, and k-nearest neighbor classifiers under different evaluation models. The learning algorithms are assessed with familiar metrics: precision, accuracy, recall, F-measure, and ROC area. Experiments on the two class-imbalanced datasets show that SMOTE performs better in handling the class-imbalance problem.
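As an illustration of the three resampling methods named in the abstract, the following is a minimal pure-Python sketch, not the chapter's implementation. The function names (`random_oversample`, `random_undersample`, `smote_like`), the toy data, and the choice of `k` are all assumptions made for the example; `smote_like` is a simplified SMOTE-style interpolation for a single minority class and assumes that class has at least two samples.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        rows = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(rows, target):
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


def smote_like(X, y, minority, k=2, seed=0):
    """Simplified SMOTE-style sketch (hypothetical helper): create synthetic
    minority points by interpolating between a minority sample and one of its
    k nearest minority neighbours, until the minority matches the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    rows = [x for x, lab in zip(X, y) if lab == minority]
    X_out, y_out = list(X), list(y)
    for _ in range(target - counts[minority]):
        base = rng.choice(rows)
        # Nearest minority neighbours by squared Euclidean distance, excluding base.
        neighbours = sorted(
            (r for r in rows if r is not base),
            key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        X_out.append([a + gap * (b - a) for a, b in zip(base, nb)])
        y_out.append(minority)
    return X_out, y_out


# Toy, made-up data: six majority samples (label 0), two minority samples (label 1).
X = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.1], [1.1, 2.2],
     [1.3, 2.0], [1.0, 1.8], [5.0, 6.0], [5.5, 6.2]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

if __name__ == "__main__":
    print(Counter(random_oversample(X, y)[1]))
    print(Counter(random_undersample(X, y)[1]))
    print(Counter(smote_like(X, y, minority=1)[1]))
```

Random oversampling repeats existing minority rows, random undersampling discards majority rows, and the SMOTE-style variant adds new points on the line segments between minority neighbours rather than exact copies. In practice a maintained library would normally be preferred over a hand-written sketch like this one.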