Data Science: Data Preprocessing Using Data Reduction Techniques

Neel shah
6 min read · Oct 30, 2021


Glimpse Of Data Preprocessing

Today, users generate billions of data points every day, according to one survey conducted by Google. Datasets are becoming increasingly detailed. This can pose problems: with so much data in a dataset, the most significant features may be hidden among worthless ones. As a result, data pre-processing has become critical for any dataset. In this article, we'll go over various data reduction methods for removing extraneous data from our datasets and making it easier for our models to run on them.

The scikit-learn website lists different feature selection methods. Here, we will apply several of them to the same dataset and compare their performance.

Dataset Used

The dataset used for carrying out data reduction is the ‘Iris’ dataset available in the sklearn.datasets module.

  1. After importing the required libraries, load the Iris dataset into the work environment.

Loading Dataset

The data has four features. To test the effectiveness of the different feature selection methods, we add some noise features to the dataset.
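A minimal sketch of this step, assuming 20 uniform noise columns and a fixed random seed (the exact noise setup may differ from the original screenshots):

```python
import numpy as np
from sklearn.datasets import load_iris

# Load the Iris dataset: 150 samples, 4 features.
X, y = load_iris(return_X_y=True)

# Append uninformative noise columns so the feature selection
# methods have something to filter out (20 columns is an assumption).
rng = np.random.RandomState(42)
noise = rng.uniform(0, 0.1, size=(X.shape[0], 20))
X_noisy = np.hstack([X, noise])

print(X.shape, X_noisy.shape)  # (150, 4) (150, 24)
```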

Data Reduction Techniques

Using Python, we perform the following data pre-processing tasks: variance threshold reduction, univariate feature selection, recursive feature elimination, and PCA.

  1. Variance Threshold Reduction

The variance threshold technique is a basic baseline strategy for feature selection: all features whose variance does not meet a certain threshold are removed. By default, it removes all zero-variance features.

Variance Threshold
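A minimal sketch of applying VarianceThreshold to the noisy feature matrix X_noisy built above:

```python
from sklearn.feature_selection import VarianceThreshold

# With the default threshold of 0.0, only features that are constant
# across all samples (zero variance) are removed.
selector = VarianceThreshold()
X_vt = selector.fit_transform(X_noisy)

print(X_noisy.shape, X_vt.shape)  # the column count is unchanged here
```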

However, because our dataset lacks a zero-variance feature, the data is unaffected. We can see that the data hasn’t been reduced in terms of features (columns).

2. Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a method for reducing the dimensionality of such datasets, increasing interpretability while minimizing information loss.

If your learning algorithm is too slow owing to a large input dimension, using PCA to speed it up can be a good solution; this is, without a doubt, PCA's most popular application. Another frequent application is data visualisation.

Being able to visualise your data is beneficial in a variety of machine learning applications. Data visualisation in two or three dimensions is simple. This article uses the four-dimensional Iris dataset; we’ll use PCA to reduce it from four dimensions to two or three, allowing you to plot it and perhaps understand it better.

So, let’s import the necessary libraries for PCA and run PCA on the Iris Dataset for visualisation.

Importing PCA Library

PCA Projection to 2D

The original data has 4 columns (sepal length, sepal width, petal length, and petal width). In this section, the code projects the original data which is 4 dimensional into 2 dimensions. The new components are just the two main dimensions of variation.

4 columns are converted to 2 principal columns
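A sketch of the 2-D projection, assuming the features are standardised first; the variable and column names below are illustrative choices, not necessarily those in the original post:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris(as_frame=True)
features = iris.data                                   # the 4 original columns
target = iris.target.map(dict(enumerate(iris.target_names)))

# Standardise so that each feature contributes equally to the components.
scaled = StandardScaler().fit_transform(features)

# Project the 4-dimensional data onto 2 principal components.
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)
principal_df = pd.DataFrame(components,
                            columns=['principal component 1',
                                     'principal component 2'])

print(principal_df.shape)  # (150, 2)
```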

We concatenate the DataFrame with the target column along axis=1; resultant_Df is the final DataFrame before plotting the data.

Concatenating target column into dataframe
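Assuming the principal_df and target objects from the sketch above:

```python
# Join the 2 principal components and the target labels column-wise.
resultant_Df = pd.concat([principal_df, target.rename('species')], axis=1)
print(resultant_Df.head())
```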

Now, let’s visualize the dataframe
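One possible plot of the two components, coloured by species (colours and figure size are arbitrary choices):

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 6))
for species, colour in zip(iris.target_names, ['r', 'g', 'b']):
    subset = resultant_Df[resultant_Df['species'] == species]
    ax.scatter(subset['principal component 1'],
               subset['principal component 2'],
               c=colour, s=50, label=species)
ax.set_xlabel('principal component 1')
ax.set_ylabel('principal component 2')
ax.set_title('2-component PCA of the Iris dataset')
ax.legend()
plt.show()
```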

PCA Projection to 3D

The original data has 4 columns (sepal length, sepal width, petal length, and petal width). In this section, the code projects the original data which is 4 dimensional into 3 dimensions. The new components are just the three main dimensions of variation.

getting 3 principal component columns
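A sketch of the 3-component projection, reusing the standardised features and target from the 2-D example:

```python
# Project the standardised features onto 3 principal components.
pca_3d = PCA(n_components=3)
components_3d = pca_3d.fit_transform(scaled)
principal_df_3d = pd.DataFrame(components_3d,
                               columns=['principal component 1',
                                        'principal component 2',
                                        'principal component 3'])
resultant_Df_3d = pd.concat([principal_df_3d, target.rename('species')], axis=1)
```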

Now, let's visualise the 3D graph.

3D Representation
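One way to draw the 3-D scatter plot with matplotlib, assuming resultant_Df_3d from the previous sketch:

```python
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
for species, colour in zip(iris.target_names, ['r', 'g', 'b']):
    subset = resultant_Df_3d[resultant_Df_3d['species'] == species]
    ax.scatter(subset['principal component 1'],
               subset['principal component 2'],
               subset['principal component 3'],
               c=colour, s=50, label=species)
ax.set_xlabel('principal component 1')
ax.set_ylabel('principal component 2')
ax.set_zlabel('principal component 3')
ax.legend()
plt.show()
```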

Univariate Feature Selection

  • In univariate feature selection, the best features are chosen using univariate statistical tests.
  • Each feature is compared to the target variable to determine whether there is a statistically significant relationship between them.
  • While analysing the relationship between one feature and the target variable, we disregard the other features. That is why it is referred to as “univariate”.
  • Each feature has its own score on the test.
  • Finally, all of the test scores are compared, and the features with the highest scores are chosen.
  • These selectors take as input a scoring function that returns univariate scores and p-values (or only scores in the case of SelectKBest and SelectPercentile).
  1. f_classif

Also known as Analysis of Variance (ANOVA), this scoring function computes the ANOVA F-value between each feature and the target.

f_classif
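A sketch of using f_classif through SelectKBest on the noisy data; keeping k=4 features is an assumption:

```python
from sklearn.feature_selection import SelectKBest, f_classif

# Keep the k features with the highest ANOVA F-scores.
selector = SelectKBest(score_func=f_classif, k=4)
X_anova = selector.fit_transform(X_noisy, y)

print(selector.scores_[:4])  # scores of the 4 real Iris features
print(X_anova.shape)         # (150, 4)
```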

2. Chi2

This score can be used to select the features with the highest values for the test chi-squared statistic from data, which must contain only non-negative features such as booleans or frequencies (e.g., term counts in document classification), relative to the classes.

Chi2
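The same pattern with the chi-squared statistic, which is valid here because both the Iris measurements and the added noise are non-negative:

```python
from sklearn.feature_selection import SelectKBest, chi2

# Keep the k features with the highest chi-squared scores (k=4 assumed).
selector = SelectKBest(score_func=chi2, k=4)
X_chi2 = selector.fit_transform(X_noisy, y)

print(selector.scores_[:4])
print(X_chi2.shape)  # (150, 4)
```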

3. Mutual_info_classif

Estimate mutual information for a discrete target variable.

Mutual information (MI) between two random variables is a non-negative value which measures the dependency between the variables. It is equal to zero if and only if the two random variables are independent, and higher values mean higher dependency.

The function relies on nonparametric methods based on entropy estimation from k-nearest neighbors distances.

Mutual_info_classif
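And with mutual information, which also captures non-linear dependencies; again k=4 is an assumption:

```python
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Keep the k features with the highest estimated mutual information.
selector = SelectKBest(score_func=mutual_info_classif, k=4)
X_mi = selector.fit_transform(X_noisy, y)

print(selector.scores_[:4])
print(X_mi.shape)  # (150, 4)
```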

Recursive Feature Elimination

Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), recursive feature elimination (RFE) selects features by recursively considering smaller and smaller sets of features. The estimator is trained on the original set of features first, and the importance of each feature is determined using the coef_ or feature_importances_ attribute. The least important features are then pruned from the current set of features. This procedure is repeated recursively on the pruned set until the desired number of features to select is reached.

Recursive Feature Elimination
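A sketch of RFE on the noisy data; the choice of logistic regression as the external estimator and of 4 features to keep are assumptions:

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# The linear model's coefficients supply the feature weights RFE needs.
estimator = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=estimator, n_features_to_select=4, step=1)
rfe.fit(X_noisy, y)

print(rfe.support_[:4])  # ideally True for the original Iris features
print(rfe.ranking_)      # rank 1 marks the selected features
```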

Conclusion

In this article, I evaluated and contrasted the results of different feature selection algorithms on the same data.
The model performs better when only the features retained after feature selection are used than when all of the features are used to train the model.
PCA was used to visualise the data in 2D and 3D with fewer components after feature selection.

