Feature Engineering Techniques

Harsh
6 min read · Feb 6, 2022

Standardization or normalization of quantitative features is a standard step in a machine learning project, but we seldom stop to think about which feature scaling technique to use.

This article by Shay Geller discusses in detail how choosing an appropriate scaling technique can improve the accuracy of ML models. In this post, I will show how to implement various feature scaling techniques in Python and what effect each one has on the data. To perform this analysis, I use the maximum acceleration values recorded for different car models and years.

Photo by Mar Bocatcat on Unsplash
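
All the snippets below operate on a 2-D array named accel. As a minimal, hypothetical setup (the file and column names here are placeholders, not the actual data source), it could be built like this:

import pandas as pd

# Hypothetical loading step: any single numeric column works the same way.
df = pd.read_csv("cars.csv")               # placeholder file name
accel = df[["acceleration"]].to_numpy()    # keep it 2-D: sklearn scalers expect shape (n_samples, n_features)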
1. Max Abs Scaling

It makes sure that the maximum absolute value in each column is 1. It does not shift or change the center of the data.

from sklearn.preprocessing import MaxAbsScaler

scaler = MaxAbsScaler().fit(accel)
transform_data = scaler.transform(accel)
print("Mean of original data = {}".format(accel.mean()))
print("Standard deviation of original data = {}".format(accel.std()))
print("Max of original data = {}".format(accel.max()))
print("\n")
print("Mean of transformed data = {}".format(transform_data.mean()))
print("Standard deviation of transformed data = {}".format(transform_data.std()))
print("Max of transformed data = {}".format(transform_data.max()))
Mean of original data = 15.568090452261307
Standard deviation of original data = 2.7542223175940177
Max of original data = 24.8


Mean of transformed data = 0.6277455827524719
Standard deviation of transformed data = 0.1110573515158878
Max of transformed data = 1.0
Distribution of original data
Max Abs transformed data
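
For intuition, the same result can be reproduced by hand: MaxAbsScaler simply divides each column by its maximum absolute value. A minimal sketch, assuming accel is the array used above:

import numpy as np

manual = accel / np.max(np.abs(accel), axis=0)  # matches the MaxAbsScaler output for this column
print(manual.max())                             # 1.0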

2. Min Max Scaling

Min Max scaling is used to scale values so that they range between 0 and 1 after the transformation.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(accel)
transform_data = scaler.transform(accel)
print("Mean of original data = {}".format(accel.mean()))
print("Standard deviation of original data = {}".format(accel.std()))
print("Max of original data = {}".format(accel.max()))
print("\n")
print("Mean of transformed data = {}".format(transform_data.mean()))
print("Standard deviation of transformed data = {}".format(transform_data.std()))
print("Max of transformed data = {}".format(transform_data.max()))
Mean of original data = 15.568090452261307
Standard deviation of original data = 2.7542223175940177
Max of original data = 24.8


Mean of transformed data = 0.45048157453936344
Standard deviation of transformed data = 0.1639418046186915
Max of transformed data = 0.9999999999999999
Transformed Data
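
The underlying formula is just a shift and a rescale by the column range; a minimal NumPy sketch of what MinMaxScaler computes:

import numpy as np

col_min, col_max = accel.min(axis=0), accel.max(axis=0)
manual = (accel - col_min) / (col_max - col_min)  # values now lie in [0, 1]
print(manual.min(), manual.max())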

3. Normalizer

Normalizer rescales each row (sample) of the data set so that the row has unit norm.

from sklearn.preprocessing import Normalizer

transformer = Normalizer(norm='l1').fit(accel)
transform_data = transformer.transform(accel)
print("Mean of original data = {}".format(accel.mean()))
print("Standard deviation of original data = {}".format(accel.std()))
print("Max of original data = {}".format(accel.max()))
print("\n")
print("Mean of transformed data = {}".format(transform_data.mean()))
print("Standard deviation of transformed data = {}".format(transform_data.std()))
print("Max of transformed data = {}".format(transform_data.max()))
Mean of original data = 15.568090452261307
Standard deviation of original data = 2.7542223175940177
Max of original data = 24.8


Mean of transformed data = 1.0
Standard deviation of transformed data = 0.0
Max of transformed data = 1.0
Transformed Data
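
Note why the output above is constant: with a single positive-valued column, each row's L1 norm is the value itself, so every transformed entry becomes 1. Normalizer is really intended for rows with several features; a minimal sketch of the row-wise computation it performs:

import numpy as np

row_l1 = np.abs(accel).sum(axis=1, keepdims=True)  # L1 norm of each row
manual = accel / row_l1                            # 1.0 everywhere for a single positive column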

4. Power Transformer (Yeo-Johnson)

This scaler applies a power transformation to each feature to make its distribution more Gaussian-like.

from sklearn.preprocessing import PowerTransformer

pt = PowerTransformer(method='yeo-johnson')
pt.fit(accel)
transform_data = pt.transform(accel)
print("Mean of original data = {}".format(accel.mean()))
print("Standard deviation of original data = {}".format(accel.std()))
print("Max of original data = {}".format(accel.max()))
print("\n")
print("Mean of transformed data = {}".format(transform_data.mean()))
print("Standard deviation of transformed data = {}".format(transform_data.std()))
print("Max of transformed data = {}".format(transform_data.max()))
Mean of original data = 15.568090452261307
Standard deviation of original data = 2.7542223175940177
Max of original data = 24.8
Mean of transformed data = -1.0711699534071861e-15
Standard deviation of transformed data = 0.9999999999999999
Max of transformed data = 3.043039377098009
Transformed Data
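
PowerTransformer learns one exponent (lambda) per column by maximum likelihood and, by default, also standardizes the output, which is why the transformed data above has roughly zero mean and unit variance. The fitted parameter can be inspected on the pt object used above:

print(pt.lambdas_)  # learned Yeo-Johnson exponent per column; values near 1 mean little reshaping was needed

# To keep the original scale and only reshape the distribution, standardization can be disabled:
# PowerTransformer(method='yeo-johnson', standardize=False)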

5. Quantile Transformation — Normal

Quantile transformation with a normal output distribution is another technique to map a data set onto a normal (Gaussian) distribution.

from sklearn.preprocessing import quantile_transform

transform_data = quantile_transform(accel, n_quantiles=398, random_state=1, copy=True, output_distribution='normal')
print("Mean of original data = {}".format(accel.mean()))
print("Standard deviation of original data = {}".format(accel.std()))
print("Max of original data = {}".format(accel.max()))
print("\n")
print("Mean of transformed data = {}".format(transform_data.mean()))
print("Standard deviation of transformed data = {}".format(transform_data.std()))
print("Max of transformed data = {}".format(transform_data.max()))
Mean of original data = 15.568090452261307
Standard deviation of original data = 2.7542223175940177
Max of original data = 24.8


Mean of transformed data = 0.0008961428054773563
Standard deviation of transformed data = 1.050687337227496
Max of transformed data = 5.19933758270342
Transformed Data
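
Conceptually, each value is mapped to its empirical quantile (its rank within the column) and then through the inverse CDF of the standard normal. A rough sketch of that intuition, assuming SciPy is available (the real transformer interpolates between n_quantiles landmarks, so results differ slightly):

import numpy as np
from scipy.stats import rankdata, norm

n = len(accel)
ranks = rankdata(accel)                               # empirical ranks 1..n
manual = norm.ppf((ranks - 0.5) / n).reshape(-1, 1)   # approximate normal-score transform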

6. Quantile Transformation — Uniform

Quantile transformation with a uniform output distribution maps the data set onto a uniform distribution, so all transformed values fall between 0 and 1.

from sklearn.preprocessing import quantile_transform

transform_data = quantile_transform(accel, n_quantiles=398, random_state=1, copy=True, output_distribution='uniform')
print("Mean of original data = {}".format(accel.mean()))
print("Standard deviation of original data = {}".format(accel.std()))
print("Max of original data = {}".format(accel.max()))
print("\n")
print("Mean of transformed data = {}".format(transform_data.mean()))
print("Standard deviation of transformed data = {}".format(transform_data.std()))
print("Max of transformed data = {}".format(transform_data.max()))
Mean of original data = 15.568090452261307
Standard deviation of original data = 2.7542223175940177
Max of original data = 24.8
Mean of transformed data = 0.5001329063453287
Standard deviation of transformed data = 0.28934184755702674
Max of transformed data = 1.0
Transformed Data
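
The same transformation is also available as an estimator, QuantileTransformer, which is convenient inside a Pipeline because it can be fit on training data and reused on new data; a minimal equivalent sketch:

from sklearn.preprocessing import QuantileTransformer

qt = QuantileTransformer(n_quantiles=398, output_distribution='uniform', random_state=1)
transform_data = qt.fit_transform(accel)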

7. Robust Scaler

This technique does not actually remove outliers; instead, it centers the data on the median and scales it by the interquartile range (IQR), so the scaling statistics are robust to outliers.

from sklearn.preprocessing import RobustScaler

transformer = RobustScaler(with_centering=True, with_scaling=True).fit(accel)
transform_data = transformer.transform(accel)
print("Mean of original data = {}".format(accel.mean()))
print("Standard deviation of original data = {}".format(accel.std()))
print("Max of original data = {}".format(accel.max()))
print("\n")
print("Mean of transformed data = {}".format(transform_data.mean()))
print("Standard deviation of transformed data = {}".format(transform_data.std()))
print("Max of transformed data = {}".format(transform_data.max()))
Mean of original data = 15.568090452261307
Standard deviation of original data = 2.7542223175940177
Max of original data = 24.8
Mean of transformed data = 0.020325508137703445
Standard deviation of transformed data = 0.8221559156997077
Max of transformed data = 2.776119402985078
Transformed Data
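
The default behavior can be reproduced by hand with the median and the 25th/75th percentiles; a minimal sketch:

import numpy as np

median = np.median(accel, axis=0)
q1, q3 = np.percentile(accel, [25, 75], axis=0)
manual = (accel - median) / (q3 - q1)   # RobustScaler's default centering and scaling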

8. Standard Scaler

This is the most commonly used scaling technique. It subtracts the mean from each data point and then scales the result to unit variance.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(accel)
transform_data = scaler.transform(accel)
print("Mean of original data = {}".format(accel.mean()))
print("Standard deviation of original data = {}".format(accel.std()))
print("Max of original data = {}".format(accel.max()))
print("\n")
print("Mean of transformed data = {}".format(transform_data.mean()))
print("Standard deviation of transformed data = {}".format(transform_data.std()))
print("Max of transformed data = {}".format(transform_data.max()))
Mean of original data = 15.568090452261307
Standard deviation of original data = 2.7542223175940177
Max of original data = 24.8


Mean of transformed data = -2.6779248835179653e-16
Standard deviation of transformed data = 0.9999999999999998
Max of transformed data = 3.351911531892361
Transformed Data
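
The equivalent z-score computation in plain NumPy (StandardScaler uses the population standard deviation, which is also NumPy's default):

import numpy as np

manual = (accel - accel.mean(axis=0)) / accel.std(axis=0)  # z-scores
print(manual.mean(), manual.std())                         # ~0.0 and 1.0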
