Find and Delete Mismatched Columns From DataFrames Using pandas

Find and Delete Mismatched Columns From DataFrames Using pandas

·

4 min read

Data is the most valuable asset in machine learning, it solely holds the potential to make a machine learning model robust. Data plays an important role while training a model, the model trained can be underfitted or overfitted and it totally depends on the data.

The data you've gathered should be of high quality, so structure, construct, and clean it in such a way that it has the potential to produce a robust model.

In this article, you'll learn how to use pandas to find and remove columns from one dataset that don't match those in another.

Objective

When you have multiple datasets and want to merge them to augment the data points, some columns of the datasets do not match. You must find and remove them so that they do not have an adverse effect on the model while training.

What You'll Do

There are a few datasets that contain used car information such as price, name, distance driven, fuel type, and more. You will search the dataset for unmatched columns and then remove them.

The datasets used in this article are downloaded from Kaggle.

Dataset Overview

import pandas as pd

# Reading Car Dataset
df = pd.read_csv("car.csv")
print(f"Shape of the Datset: {df.shape}")

----------
Shape of the Dataset: (4340, 8)

The car.csv dataset has 4380 rows and 8 columns, and you will now find the columns that do not match the columns of the car.csv dataset and remove them.

The following are the names of the dataset's columns:

print("Column Names:")
for column in df.columns:
    print(column.capitalize())

----------
Column Names:
Name
Year
Selling_price
Km_driven
Fuel
Seller_type
Transmission
Owner

You now know which columns to keep in the second dataset and which to remove.

import pandas as pd

# Reading Car Dataset
df = pd.read_csv("used_car.csv")
print(f"Shape of the Datset: {df.shape}")

----------
Shape of the Dataset: (7906, 18)

As you can see the used_car.csv dataset has 7906 rows and 18 columns. This dataset contains more columns, but you must keep the columns consistent with the car.csv dataset and remove the remaining ten columns from the used_car.csv dataset.

Finding Mismatched Columns

In this section, you will identify the columns in the used_car.csv dataset that are not present in the car.csv dataset.

columns_to_delete = df2.columns.difference(df.columns)

The above code will find the columns that are not present in the car.csv dataset from the used_car.csv dataset using the difference() method on the dataset's columns attribute. The result will be stored inside the columns_to_delete variable.

print("Columns not Present in First Dataset:")
for counter, col in enumerate(columns_to_delete, start=1):
    print(f"{counter}- {col}")

The above code will list all the columns that are not present in the car.csv dataset. Run the above code and you'll get the following output.

Columns not Present in First Dataset:
1- City
2- Region
3- Sales_ID
4- State or Province
5- engine
6- max_power
7- mileage
8- seats
9- sold
10- torque

There are a total of 10 columns that are not present in the car.csv dataset. The next step is to delete or remove them from the used_car.csv dataset.

Deleting Mismatched Columns

In this section, you will delete the above-identified columns from the used_car.csv dataset and pandas provide a much simpler way to do this.

new_df2 = df2.drop(columns_to_delete, axis=1)

The above code will delete all those 10 columns from the used_car.csv dataset using the drop() function on df2 along the columns axis (axis=1). The new dataset is stored inside the new_df2 variable.

print("Columns Present in the Second Dataset:")
for column in new_df2.columns:
    print(column.capitalize())

The above code will print the column names that remain in the used_car.csv dataset.

Columns Present in the Second Dataset:
Name
Year
Selling_price
Km_driven
Fuel
Seller_type
Transmission
Owner

Now both datasets car.csv and used_car.csv contain the same columns. You can now merge both datasets along the row axis to augment the data.

Save new_df2 as the new dataset in CSV format using the pandas to use it for further purposes.

new_df2.to_csv('new_car_data.csv', index=False)

The above code will save new_df2 as new_car_data.csv in the current directory.

Conclusion

Data cleaning is a critical step in the data preprocessing pipeline, ensuring that your data is of high quality and suitable for building accurate and robust machine learning models.

Any irregularity in data can make a huge difference in the model's training. So you need to make a consistent dataset.


🏆Other articles you might be interested in if you liked this one

Merge, combine, and concatenate multiple datasets using pandas.

Find and delete duplicate rows from the dataset using pandas.

Create multi-threaded Python programs using a threading module.

Create temporary files and directories using tempfile module in Python.

Upload and display images on the frontend using Flask.

How does the learning rate affect the ML and DL models?


That's all for now

Keep Coding✌✌

Did you find this article valuable?

Support Team - GeekPython by becoming a sponsor. Any amount is appreciated!