Ensuring consistency in the numerical input data is crucial to enhancing the performance of machine learning algorithms. To achieve this uniformity, it is necessary to adjust the data to a standardized range.
Standardization and Normalization are both widely used techniques for adjusting data before feeding it into machine learning models.
In this article, you will learn how to utilize the StandardScaler class to scale the input data.
What is Standardization?
Before diving into the fundamentals of the StandardScaler class, you need to understand the standardization of the data.
Standardization is a data preparation method that involves adjusting the input (features) by first centering them (subtracting the mean from each data point) and then dividing them by the standard deviation, resulting in the data having a mean of 0 and a standard deviation of 1.
The formula for standardization can be written like the following:
- standardized_val = ( input_value - mean ) / standard_deviation
Assume you have a mean value of 10.4 and a standard deviation of 4. To standardize the value 15.9, put the given values into the equation as follows:
standardized_val = ( 15.9 - 10.4 ) / 4
standardized_val = ( 5.5 ) / 4
standardized_val = 1.375
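The same calculation can be reproduced in a few lines of plain Python:

```python
# Manual standardization of a single value
mean = 10.4
std = 4.0
value = 15.9

standardized_val = (value - mean) / std
print(standardized_val)  # approximately 1.375
```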
The StandardScaler class stands out as a widely used tool for implementing data standardization.
What is StandardScaler?
The StandardScaler class provided by scikit-learn applies standardization to the input (feature) variables, making sure they have a mean of approximately 0 and a standard deviation of approximately 1.
It adjusts the data to have a standardized distribution, making it suitable for modeling and ensuring that no single feature disproportionately influences the algorithm due to differences in scale.
Why Bother Using it?
You've already seen the idea behind StandardScaler; to summarize, here are the primary reasons to use it:
It improves the performance of machine learning models
It keeps the data points on a consistent scale
It is essential for algorithms that can be negatively influenced by differences in the scale of the data's features
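To see why differences in scale matter for distance-based algorithms such as KNN, consider two points whose first feature is in the thousands while the second lies between 0 and 1. This is a minimal illustration with made-up values:

```python
import numpy as np

# Two points: feature 1 is in the thousands, feature 2 is between 0 and 1
a = np.array([1000.0, 0.2])
b = np.array([1010.0, 0.9])

# The Euclidean distance is dominated by the large-scale feature:
# the 0.7 difference in feature 2 barely registers
dist = np.linalg.norm(a - b)
print(dist)  # approximately 10.02
```

After standardization, both features would contribute on comparable terms.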
How to Use StandardScaler?
First, import the StandardScaler class from the sklearn.preprocessing module. After that, create an instance of the class by calling StandardScaler(). Finally, call the fit_transform method on that instance, passing in the input data.
# Import required libraries
import numpy as np
from sklearn.preprocessing import StandardScaler

# Creating a 2D array
arr = np.asarray([[12, 0.007],
                  [45, 1.5],
                  [75, 2.005],
                  [7, 0.8],
                  [15, 0.045]])
print("Original Array: \n", arr)

# Instance of the StandardScaler class
scaler = StandardScaler()

# Fitting and then transforming the input data
arr_scaled = scaler.fit_transform(arr)
print("Scaled Array: \n", arr_scaled)
An instance of the StandardScaler class is created and stored in the variable scaler. This instance will be used to standardize the data.
The fit_transform method of the StandardScaler object (scaler) is called with the original data arr as the input.
The fit_transform method computes the mean and standard deviation of each feature (column) in arr and then applies the standardization to the input data.
Here's the original array and the standardized version of the original array.
Original Array:
[[1.200e+01 7.000e-03]
[4.500e+01 1.500e+00]
[7.500e+01 2.005e+00]
[7.000e+00 8.000e-01]
[1.500e+01 4.500e-02]]
Scaled Array:
[[-0.72905466 -1.09507083]
[ 0.55066894 0.79634605]
[ 1.71405403 1.43610862]
[-0.92295217 -0.09045356]
[-0.61271615 -1.04693028]]
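You can confirm what fit_transform learned by inspecting the fitted scaler's mean_ and scale_ attributes, and by checking that each scaled column really has a mean of 0 and a standard deviation of 1:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

arr = np.asarray([[12, 0.007],
                  [45, 1.5],
                  [75, 2.005],
                  [7, 0.8],
                  [15, 0.045]])

scaler = StandardScaler()
arr_scaled = scaler.fit_transform(arr)

# Per-column statistics learned during fitting
print("Means:", scaler.mean_)
print("Standard deviations:", scaler.scale_)

# Each scaled column now has mean ~0 and standard deviation ~1
print(arr_scaled.mean(axis=0))
print(arr_scaled.std(axis=0))
```

Note that StandardScaler uses the population standard deviation (dividing by n, not n - 1) when computing scale_.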
Does Standardization Affect the Accuracy of the Model?
In this section, you'll see how the model's performance is affected after applying standardization to features of the dataset.
Let's see how the model will perform on the raw dataset without standardizing the feature variables.
# Evaluate KNN on the breast cancer dataset
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from numpy import mean
# load dataset
df = datasets.load_breast_cancer()
X = df.data
y = df.target
# Instantiating the model
model = KNeighborsClassifier()
# Evaluating the model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=10, n_jobs=-1)
# Model's average score
print(f'Accuracy: {mean(scores):.2f}')
The breast cancer dataset is loaded from sklearn.datasets, and then the features (df.data) and target (df.target) are stored in the X and y variables.
The K-nearest neighbors (KNN) classifier is instantiated using the KNeighborsClassifier class and stored in the model variable.
The cross_val_score function is used to evaluate the KNN model's performance. It is passed the model (KNeighborsClassifier()), the features (X), and the target (y), and specifies accuracy (scoring='accuracy') as the evaluation metric.
The dataset is split into 10 equal parts (cv=10), so the model is trained and tested 10 times. Setting n_jobs=-1 uses all available CPU cores for faster cross-validation.
Finally, the average of the accuracy scores (mean(scores)) is printed.
Accuracy: 0.93
Without standardizing the dataset's feature variables, the average accuracy score is 93%.
Using StandardScaler for Applying Standardization
# Evaluate KNN on the breast cancer dataset
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from numpy import mean
# loading dataset and configuring features and target variables
df = datasets.load_breast_cancer()
X = df.data
y = df.target
# Standardizing features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Instantiating model
model = KNeighborsClassifier()
# Evaluating the model
scores = cross_val_score(model, X_scaled, y, scoring='accuracy', cv=10, n_jobs=-1)
# Model's average score
print(f'Accuracy: {mean(scores):.2f}')
The dataset's features are scaled with StandardScaler(), and the resulting scaled dataset is stored in the X_scaled variable.
Next, this scaled dataset is passed to the cross_val_score function to compute and display the accuracy.
Accuracy: 0.97
The accuracy score has significantly increased to 97%, compared to the previous score of 93%.
Applying StandardScaler() to standardize the data's features has notably improved the model's performance.
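One caveat: fitting the scaler on the whole dataset before cross-validation lets statistics from the test folds leak into the scaling step. A common way to avoid this is scikit-learn's Pipeline, which refits the scaler on each training fold only. Here is a minimal sketch of that approach:

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from numpy import mean

# Load the breast cancer dataset
df = datasets.load_breast_cancer()
X, y = df.data, df.target

# The pipeline fits StandardScaler on each training fold only,
# so no statistics from the held-out fold leak into the scaling
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=10, n_jobs=-1)
print(f'Accuracy: {mean(scores):.2f}')
```

On this dataset the leakage is small, so the result is close to the one above, but the pipeline version is the methodologically safer habit.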
Conclusion
StandardScaler is used to standardize the input data in a way that ensures that the data points have a balanced scale, which is crucial for machine learning algorithms, especially those that are sensitive to differences in feature scales.
Standardization transforms the data such that the mean of each feature becomes zero (centered at zero), and the standard deviation becomes one.
Let's recall what you've learned:
What actually is StandardScaler
What is standardization and how it is applied to the data points
Impact of StandardScaler on the model's performance
🏆Other articles you might be interested in if you liked this one
✅How do learning rates impact the performance of the ML and DL models?
✅How to build a custom deep learning model using transfer learning?
✅How to build a Flask image recognition app using a deep learning model?
✅How to join, combine, and merge two different datasets using pandas?
✅How to perform data augmentation for deep learning using Keras?
✅Upload and display images on the frontend using Flask in Python.
✅What are Sessions and how to use them in a Flask app as temporary storage?
That's all for now
Keep Coding✌✌