Pandas provide numerous functions and methods to clean and preprocess the dataset to make it production-ready.
In this article, we'll see the methods provided by pandas to handle missing values in a dataset.
df.fillna()
The DataFrame.fillna()
is used to fill in the missing values with the desired value. Let's see how we can use it.
data = {"A": [2, np.nan, 19, 34, np.nan],
"B": [np.nan, 23, 12, 34, np.nan]}
df = pd.DataFrame(data)
df
--------------------
A B
0 2.0 NaN
1 NaN 23.0
2 19.0 12.0
3 34.0 34.0
4 NaN NaN
Filling Arbitrary Value
filled_df = df.fillna(0)
print(filled_df)
--------------------
A B
0 2.0 0.0
1 0.0 23.0
2 19.0 12.0
3 34.0 34.0
4 0.0 0.0
We passed an arbitrary value (0
) to fill those NaN
values in the dataset df
.
Fill Using a Dataset
We can also use a dataset to fill in the missing values.
df2 = pd.DataFrame({"A": [1,2,3,4,5], "B": [6,7,8,9,10]})
fill_using_df = df.fillna(df2)
print(fill_using_df)
--------------------
A B
0 2.0 6.0
1 2.0 23.0
2 19.0 12.0
3 34.0 34.0
4 5.0 10.0
When using the fillna()
method with df2
, the NaN
values in the original DataFrame df
are replaced by the corresponding values in df2
. If a cell in df
is NaN
, the method will look for the corresponding value in df2
(at the same position) and use that value to fill in the NaN
.
Filling Different Values in Each Column
If we want to fill in different values in each column, we can use the following approach.
values = {"A": 100, "B": 200}
diff_val = df.fillna(value=values)
print(diff_val)
--------------------
A B
0 2.0 200.0
1 100.0 23.0
2 19.0 12.0
3 34.0 34.0
4 100.0 200.0
The value dictionary holds values to fill NaN
in columns A
and B
in the dataset. By using df.fillna(value=values)
, the NaN
value in column A
is filled with the value 100
and NaN
value in column B
is filled with the value 200
.
df.interpolate()
The DataFrame.interpolate()
method provides various interpolation techniques to fill in the missing values.
Instead of filling in hard-coded values, we can use an interpolation method to fill missing values that make the dataset even more expressive and real.
Filling Computed Value
df2 = df.interpolate() # default: linear method and axis=0
print(df2)
--------------------
A B
0 2.0 NaN
1 10.5 23.0
2 19.0 12.0
3 34.0 34.0
4 34.0 34.0
When we use df.interpolate()
, the default linear
interpolation method is used that fills the NaN
values equally spaced ignoring the index.
For example, in column A
, 10.5
is filled which is equally spaced between the values 2.0
and 19.0
with the difference of 8.5
.
But if we see the fourth row in both columns, they are filled with the same value (34.0
) as above them because there were no values to compute in the fifth row.
Filling Nearest Values
data = {"A": [3, np.nan, 2, np.nan, 4],
"B": [1, 4, np.nan, 2, 5]}
df = pd.DataFrame(data)
df3= df.interpolate(method='nearest')
print(df3)
--------------------
A B
0 3.0 1.0
1 3.0 4.0
2 2.0 4.0
3 2.0 2.0
4 4.0 5.0
When we use method='nearest'
, the NaN
values are filled with the nearest valid values.
In this case, the second row in column A
is filled with the value of 3.0
. Why so? The nearest value is decided based on the index close to the NaN
value index. The index 0
(3.0
) is closest to the index 1
. The same is applied to all the NaN
values.
Filling Values Considering Index Values
data = {"A": [3, np.nan, 9, np.nan, 4],
"B": [1, 10, np.nan, 20, 5]}
df = pd.DataFrame(data)
df4= df.interpolate(method='values') # or method='index'
print(df4)
--------------------
A B
0 3.0 1.0
1 6.0 10.0
2 9.0 15.0
3 6.5 20.0
4 4.0 5.0
The NaN
values are filled equally spaced considering the values of the index surrounding the NaN
value index.
df.ffill() and df.bfill()
The DataFrame.ffill()
method is used to fill the last valid value in the missing place whereas the DataFrame.bfill()
method is used to fill the next valid value.
Forward Filling With ffill()
data = {"A": [2, np.nan, 19, 34, np.nan],
"B": [np.nan, 23, 12, 34, np.nan]}
df = pd.DataFrame(data)
forward_fill = df.ffill()
print(forward_fill)
--------------------
A B
0 2.0 NaN
1 2.0 23.0
2 19.0 12.0
3 34.0 34.0
4 34.0 34.0
We can see that NaN
values are filled with the preceding valid values, for instance, the second row of column A
is filled with 2.0
which is the same value above it.
Backward Filling With bfill()
data = {"A": [2, np.nan, 19, 34, np.nan],
"B": [np.nan, 23, 12, 34, np.nan]}
df = pd.DataFrame(data)
backward_fill = df.bfill()
print(backward_fill)
--------------------
A B
0 2.0 23.0
1 19.0 23.0
2 19.0 12.0
3 34.0 34.0
4 NaN NaN
In this case, the NaN
is filled with the next valid values, for instance, the first row of column B
is filled with 23.0
which is the next value in the column.
We can also see that the fifth row of columns A
and B
remains unfilled (NaN
) due to the absence of the next valid value in the dataset.
🏆Other articles you might be interested in if you liked this one
✅Pandas df.ffill() and df.bfill() to handle missing values.
✅Merge, combine, and concatenate multiple datasets using pandas.
✅Find and delete duplicate rows from the dataset using pandas.
✅How to efficiently manage memory use when working with large datasets in pandas?
✅How to find and delete mismatched columns from datasets in pandas?
✅How does the learning rate affect the ML and DL models?
That's all for now
Keep Coding✌✌