ABSTRACT

A common problem when working with multi-featured datasets is their high dimensionality, which poses barriers to practical, generalizable machine learning. High-dimensional data can also drastically degrade the performance of machine learning algorithms: it is memory inefficient and frequently leads to model overfitting. It is often difficult to visualize such data or to gain insight into its features, such as the presence of outliers.

This chapter will help data analysts reduce data dimensionality using various methodologies such as:

Feature Selection using the Correlation Matrix

t-distributed Stochastic Neighbour Embedding (t-SNE)

Principal Component Analysis (PCA)

Under applications of dimensionality reduction algorithms with visualizations, we first introduce the Boston Housing Dataset, use the correlation matrix to select strongly correlated features, and perform simple linear regression on the reduced feature set. Next, we apply t-SNE to the MNIST Handwritten Digits Dataset and use k-Nearest Neighbors (kNN) for classification. Lastly, we use the UCI Breast Cancer Dataset to perform PCA with Support Vector Machine (SVM) classification. Finally, we explore the benefits of dimensionality reduction methods, providing a comprehensive overview of reduced storage requirements, more efficient models, feature selection guidelines, redundant data removal, and outlier analysis.
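The last of these workflows can be sketched briefly. The snippet below is a minimal illustration, not the chapter's implementation: it assumes scikit-learn's bundled copy of the UCI Breast Cancer dataset (`sklearn.datasets.load_breast_cancer`), standardizes the 30 features, projects them onto two principal components, and classifies with a linear SVM.

```python
# Hedged sketch of the PCA + SVM pipeline; assumes scikit-learn is installed
# and its copy of the UCI Breast Cancer dataset (569 samples, 30 features).
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Standardize, reduce 30 features to 2 principal components, then classify.
model = make_pipeline(StandardScaler(), PCA(n_components=2),
                      SVC(kernel="linear"))
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"Test accuracy with 2 components: {accuracy:.3f}")
```

Because the first two principal components capture most of this dataset's variance, the reduced model typically retains strong accuracy while the 2-D projection can also be plotted directly for visualization.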