Data is the most valuable asset in machine learning, it solely holds the potential to make a machine learning model robust. Data plays an important role while training a model, the model trained can be underfitted or overfitted and it totally depends on the data.
The data you've gathered should be of high quality, so structure, construct, and clean it in such a way that it has the potential to produce a robust model.
In this article, you'll learn how to use pandas to find and remove columns from one dataset that don't match those in another.
Objective
When you have multiple datasets and want to merge them to augment the data points, some columns of the datasets do not match. You must find and remove them so that they do not have an adverse effect on the model while training.
What You'll Do
There are a few datasets that contain used car information such as price, name, distance driven, fuel type, and more. You will search the dataset for unmatched columns and then remove them.
The datasets used in this article are downloaded from Kaggle.
Dataset Overview
import pandas as pd
# Reading Car Dataset
df = pd.read_csv("car.csv")
print(f"Shape of the Datset: {df.shape}")
----------
Shape of the Dataset: (4340, 8)
The car.csv
dataset has 4380 rows and 8 columns, and you will now find the columns that do not match the columns of the car.csv
dataset and remove them.
The following are the names of the dataset's columns:
print("Column Names:")
for column in df.columns:
print(column.capitalize())
----------
Column Names:
Name
Year
Selling_price
Km_driven
Fuel
Seller_type
Transmission
Owner
You now know which columns to keep in the second dataset and which to remove.
import pandas as pd
# Reading Car Dataset
df = pd.read_csv("used_car.csv")
print(f"Shape of the Datset: {df.shape}")
----------
Shape of the Dataset: (7906, 18)
As you can see the used_car.csv
dataset has 7906 rows and 18 columns. This dataset contains more columns, but you must keep the columns consistent with the car.csv
dataset and remove the remaining ten columns from the used_car.csv
dataset.
Finding Mismatched Columns
In this section, you will identify the columns in the used_car.csv
dataset that are not present in the car.csv
dataset.
columns_to_delete = df2.columns.difference(df.columns)
The above code will find the columns that are not present in the car.csv
dataset from the used_car.csv
dataset using the difference()
method on the dataset's columns
attribute. The result will be stored inside the columns_to_delete
variable.
print("Columns not Present in First Dataset:")
for counter, col in enumerate(columns_to_delete, start=1):
print(f"{counter}- {col}")
The above code will list all the columns that are not present in the car.csv
dataset. Run the above code and you'll get the following output.
Columns not Present in First Dataset:
1- City
2- Region
3- Sales_ID
4- State or Province
5- engine
6- max_power
7- mileage
8- seats
9- sold
10- torque
There are a total of 10
columns that are not present in the car.csv
dataset. The next step is to delete or remove them from the used_car.csv
dataset.
Deleting Mismatched Columns
In this section, you will delete the above-identified columns from the used_car.csv
dataset and pandas
provide a much simpler way to do this.
new_df2 = df2.drop(columns_to_delete, axis=1)
The above code will delete all those 10 columns from the used_car.csv
dataset using the drop()
function on df2
along the columns axis (axis=1
). The new dataset is stored inside the new_df2
variable.
print("Columns Present in the Second Dataset:")
for column in new_df2.columns:
print(column.capitalize())
The above code will print the column names that remain in the used_car.csv
dataset.
Columns Present in the Second Dataset:
Name
Year
Selling_price
Km_driven
Fuel
Seller_type
Transmission
Owner
Now both datasets car.csv
and used_car.csv
contain the same columns. You can now merge both datasets along the row axis to augment the data.
Save new_df2
as the new dataset in CSV format using the pandas to use it for further purposes.
new_df2.to_csv('new_car_data.csv', index=False)
The above code will save new_df2
as new_car_data.csv
in the current directory.
Conclusion
Data cleaning is a critical step in the data preprocessing pipeline, ensuring that your data is of high quality and suitable for building accurate and robust machine learning models.
Any irregularity in data can make a huge difference in the model's training. So you need to make a consistent dataset.
🏆Other articles you might be interested in if you liked this one
✅Merge, combine, and concatenate multiple datasets using pandas.
✅Find and delete duplicate rows from the dataset using pandas.
✅Create multi-threaded Python programs using a threading module.
✅Create temporary files and directories using tempfile module in Python.
✅Upload and display images on the frontend using Flask.
✅How does the learning rate affect the ML and DL models?
That's all for now
Keep Coding✌✌