Data Science: Data Preprocessing Using Data Reduction Techniques

Glimpse Of Data Preprocessing
According to one survey conducted by Google, users now generate billions of data points every day, and datasets are becoming increasingly detailed. This can pose problems: with so much data in a dataset, the most significant features may be hidden among worthless ones. As a result, data preprocessing has become critical for any dataset. In this article, we'll go over various data reduction methods for removing extraneous data from our datasets and making it easier for our models to run on them.
The scikit-learn documentation lists several feature selection methods. Here, we will apply different feature selection methods to the same dataset and compare their performance.
Dataset Used
The dataset used for carrying out data reduction is the 'Iris' dataset, available in the sklearn.datasets library.
- After importing the library, we load the Iris dataset into the working environment.

Loading Dataset
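A minimal sketch of the loading step (the variable names X, y, and df are my own choices):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Load the Iris dataset bundled with scikit-learn
iris = load_iris()
X, y = iris.data, iris.target          # X: (150, 4) feature matrix, y: class labels

# Peek at the data as a DataFrame
df = pd.DataFrame(X, columns=iris.feature_names)
print(df.head())
```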
The data has four features. To test the effectiveness of the different feature selection methods, we add some noise features to the dataset.
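One way to add such noise columns (a sketch; the number of noise features and the use of non-negative uniform noise are my own choices, the latter so that the chi-squared test used later stays applicable):

```python
# Append a few uninformative noise columns to the original four features.
# Non-negative uniform noise keeps the matrix valid for chi2 later on.
rng = np.random.RandomState(42)
E = rng.uniform(0, 10, size=(X.shape[0], 6))   # 6 noise features
X_noisy = np.hstack([X, E])
print(X_noisy.shape)                           # (150, 10)
```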

Data Reduction Techniques
Using Python, we will perform the following data preprocessing tasks: variance threshold reduction, principal component analysis (PCA), univariate feature selection, and recursive feature elimination.
1. Variance Threshold Reduction
The variance threshold technique is a basic baseline approach to feature selection. It removes all features whose variance does not meet a certain threshold; by default, it removes all zero-variance features, i.e., features that have the same value in every sample.

Variance Threshold
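A minimal sketch of applying VarianceThreshold with its default threshold of 0, assuming the noisy matrix built above:

```python
from sklearn.feature_selection import VarianceThreshold

# With the default threshold=0.0 only constant (zero-variance) columns are dropped
selector = VarianceThreshold(threshold=0.0)
X_vt = selector.fit_transform(X_noisy)

print(X_noisy.shape, '->', X_vt.shape)   # no column is constant, so the shape is unchanged
```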
However, because our dataset has no zero-variance feature, the data is unaffected: the number of features (columns) has not been reduced.
2. Principal Component Analysis (PCA)
Principal component analysis (PCA) is a method for lowering the dimensionality of such datasets, boosting interpretability while minimizing information loss.
Speeding up a learning algorithm that is slow because of a large input dimension is probably the most popular application of PCA. Another frequent application is data visualisation.
Being able to visualise your data is beneficial in many machine learning applications, and visualisation in two or three dimensions is simple. This article uses the four-dimensional Iris dataset; we'll use PCA to reduce the data from four dimensions to two or three, allowing you to plot it and possibly better comprehend it.
So, let’s import the necessary libraries for PCA and run PCA on the Iris Dataset for visualisation.

Importing PCA Library
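The imports for this part might look as follows (standardising the features before PCA is an assumption on my part, since PCA is sensitive to feature scale):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# PCA is sensitive to scale, so standardise the four original features first
X_std = StandardScaler().fit_transform(iris.data)
```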
PCA Projection to 2D
The original data has 4 columns (sepal length, sepal width, petal length, and petal width). In this section, the code projects the original four-dimensional data into two dimensions. The new components are simply the two main dimensions of variation.

4 columns are converted to 2 principal columns
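A sketch of the 2D projection (the column names are my own):

```python
# Project the standardised 4-dimensional data onto 2 principal components
pca_2d = PCA(n_components=2)
principal_components = pca_2d.fit_transform(X_std)

principal_df = pd.DataFrame(principal_components,
                            columns=['principal component 1', 'principal component 2'])
print(pca_2d.explained_variance_ratio_)   # share of variance captured by each component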
We then concatenate the DataFrames along axis=1; resultant_Df is the final DataFrame used for plotting the data.

Concatenating target column into dataframe
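The concatenation step, roughly:

```python
# Attach the class labels so the plot can be coloured by species
target_df = pd.DataFrame(iris.target_names[y], columns=['target'])
resultant_Df = pd.concat([principal_df, target_df], axis=1)
print(resultant_Df.head())
```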
Now, let’s visualize the dataframe

PCA Projection to 3D
The original data has 4 columns (sepal length, sepal width, petal length, and petal width). In this section, the code projects the original four-dimensional data into three dimensions. The new components are simply the three main dimensions of variation.

getting 3 principal component columns
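The 3-component version, as a sketch:

```python
# Project the same standardised data onto 3 principal components
pca_3d = PCA(n_components=3)
components_3d = pca_3d.fit_transform(X_std)
print(pca_3d.explained_variance_ratio_)
```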
Now, let's visualise the 3D graph.

3D Representation
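And a possible 3D plot:

```python
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older matplotlib

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
for label, species in enumerate(iris.target_names):
    mask = (y == label)
    ax.scatter(components_3d[mask, 0],
               components_3d[mask, 1],
               components_3d[mask, 2],
               label=species, s=40)
ax.set_xlabel('PC 1')
ax.set_ylabel('PC 2')
ax.set_zlabel('PC 3')
ax.set_title('3-component PCA of the Iris dataset')
ax.legend()
plt.show()
```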
3. Univariate Feature Selection
- In univariate feature selection, the best features are chosen using univariate statistical tests.
- Each feature is compared with the target variable to evaluate whether there is a statistically significant relationship between them.
- We disregard the other features while analysing the link between one feature and the target variable. That is why it is referred to as “univariate.”
- Each feature has its own score on the test.
- Finally, all of the test results are compared, and the features with the highest scores are chosen.
- Scikit-learn's selector objects (such as SelectKBest and SelectPercentile) accept as input a scoring function that returns univariate scores and p-values (or only scores in the case of SelectKBest and SelectPercentile).
1. f_classif
This test, based on analysis of variance (ANOVA), computes the ANOVA F-value between each feature and the target for the provided sample.

f_classif
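A sketch using SelectKBest with f_classif (keeping the four best features is my own choice):

```python
from sklearn.feature_selection import SelectKBest, f_classif

# Score every column of the noisy matrix with the ANOVA F-test
# and keep the k highest-scoring features
selector_f = SelectKBest(score_func=f_classif, k=4)
X_f = selector_f.fit_transform(X_noisy, y)

print(selector_f.scores_)              # F-scores; the original features dominate the noise
print(X_noisy.shape, '->', X_f.shape)  # (150, 10) -> (150, 4)
```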
2. Chi2
This score can be used to select the features with the highest values of the chi-squared test statistic, computed between each feature and the classes. The data must contain only non-negative features, such as booleans or frequencies (e.g., term counts in document classification).

Chi2
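A sketch with chi2 (the noise columns built earlier are non-negative, so the test applies directly):

```python
from sklearn.feature_selection import SelectKBest, chi2

# chi2 requires non-negative inputs, which X_noisy satisfies
selector_chi2 = SelectKBest(score_func=chi2, k=4)
X_chi2 = selector_chi2.fit_transform(X_noisy, y)

print(selector_chi2.scores_)
print(X_noisy.shape, '->', X_chi2.shape)
```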
3. Mutual_info_classif
This estimates the mutual information for a discrete target variable.
Mutual information (MI) between two random variables is a non-negative value that measures the dependency between the variables. It is equal to zero if and only if the two variables are independent, and higher values mean higher dependency.
The function relies on nonparametric methods based on entropy estimation from k-nearest neighbors distances.

Mutual_info_classif
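A sketch with mutual_info_classif:

```python
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Mutual information makes no linearity assumption about the feature-target link
selector_mi = SelectKBest(score_func=mutual_info_classif, k=4)
X_mi = selector_mi.fit_transform(X_noisy, y)

print(selector_mi.scores_)
print(X_noisy.shape, '->', X_mi.shape)
```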
4. Recursive Feature Elimination
Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), recursive feature elimination (RFE) selects features by recursively considering smaller and smaller sets of features. The estimator is first trained on the original set of features, and the importance of each feature is obtained through the coef_ or feature_importances_ attribute. The least important features are then pruned from the current set. This procedure is repeated recursively on the pruned set until the desired number of features to select is reached.

Recursive Feature Elimination
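A sketch of RFE wrapped around a logistic regression (the choice of estimator and of keeping four features are my own assumptions):

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# RFE repeatedly fits the estimator and drops the weakest feature(s)
# until only n_features_to_select remain
estimator = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=estimator, n_features_to_select=4, step=1)
X_rfe = rfe.fit_transform(X_noisy, y)

print(rfe.support_)   # boolean mask of the selected columns
print(rfe.ranking_)   # rank 1 marks a selected feature
```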
Conclusion
In this article, I evaluated and compared the results of different feature selection algorithms on the same data.
The model performs better when only the features that remain after feature selection are used than when all of the features are used to train the model.
PCA was used to reduce the data to a smaller number of principal components and to visualise it in 2D and 3D.