3 Methods to Handle Missing Values in Dataset Using Pandas

Pandas provides numerous functions and methods for cleaning and preprocessing a dataset so that it is ready for production use.

In this article, we'll see the methods provided by pandas to handle missing values in a dataset.

df.fillna()

The DataFrame.fillna() method fills missing values with a value of our choice. Let's see how we can use it.

import numpy as np
import pandas as pd

data = {"A": [2, np.nan, 19, 34, np.nan],
        "B": [np.nan, 23, 12, 34, np.nan]}
df = pd.DataFrame(data)
df
--------------------
      A     B
0   2.0   NaN
1   NaN  23.0
2  19.0  12.0
3  34.0  34.0
4   NaN   NaN
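
Before choosing a fill strategy, it can help to see how many values are missing in each column. The following is a minimal sketch using isna() on the same example data; the variable name missing_per_column is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [2, np.nan, 19, 34, np.nan],
                   "B": [np.nan, 23, 12, 34, np.nan]})

# isna() marks each NaN as True; summing the booleans counts them per column
missing_per_column = df.isna().sum()
print(missing_per_column)
```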

Filling Arbitrary Value

filled_df = df.fillna(0)
print(filled_df)
--------------------
      A     B
0   2.0   0.0
1   0.0  23.0
2  19.0  12.0
3  34.0  34.0
4   0.0   0.0

Here, every NaN value in df is replaced with the arbitrary value we passed (0).

Fill Using a Dataset

We can also use another DataFrame to fill in the missing values.

df2 = pd.DataFrame({"A": [1,2,3,4,5], "B": [6,7,8,9,10]})
fill_using_df = df.fillna(df2)
print(fill_using_df)
--------------------
      A     B
0   2.0   6.0
1   2.0  23.0
2  19.0  12.0
3  34.0  34.0
4   5.0  10.0

When we pass df2 to fillna(), each NaN value in df is replaced by the value in df2 that has the same row index and column label. Cells in df that already hold a value are left unchanged.
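
The lookup is done by row and column labels rather than by physical position. As a small sketch of this, the frame below (named lookup here purely for illustration) holds the same values as df2 but with its rows in reverse order, and the filled result should still match the output shown above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [2, np.nan, 19, 34, np.nan],
                   "B": [np.nan, 23, 12, 34, np.nan]})

# Same values as df2, but rows reversed so index labels run 4, 3, 2, 1, 0
lookup = pd.DataFrame({"A": [1, 2, 3, 4, 5],
                       "B": [6, 7, 8, 9, 10]}).iloc[::-1]

# fillna aligns lookup to df by label, so the row order of lookup does not matter
filled = df.fillna(lookup)
print(filled)
```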

Filling Different Values in Each Column

If we want to fill in different values in each column, we can use the following approach.

values = {"A": 100, "B": 200}
diff_val = df.fillna(value=values)
print(diff_val)
--------------------
       A      B
0    2.0  200.0
1  100.0   23.0
2   19.0   12.0
3   34.0   34.0
4  100.0  200.0

The values dictionary maps each column name to the value used to fill its NaN entries. With df.fillna(value=values), the NaN values in column A are filled with 100 and the NaN values in column B are filled with 200.
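
The per-column values do not have to be hard-coded. As one possible variation (the choice of mean for A and median for B is an assumption made only for illustration), the dictionary can be built from statistics of each column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [2, np.nan, 19, 34, np.nan],
                   "B": [np.nan, 23, 12, 34, np.nan]})

# mean() and median() skip NaN by default, so they use only the valid values
values = {"A": df["A"].mean(), "B": df["B"].median()}

stat_filled = df.fillna(value=values)
print(stat_filled)
```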

df.interpolate()

The DataFrame.interpolate() method provides various interpolation techniques to fill in the missing values.

Instead of filling in hard-coded values, we can use an interpolation method to estimate the missing values from the surrounding data, which usually makes the filled dataset more realistic.

Filling Computed Value

df2 = df.interpolate()  # default: linear method and axis=0
print(df2)
--------------------
      A     B
0   2.0   NaN
1  10.5  23.0
2  19.0  12.0
3  34.0  34.0
4  34.0  34.0

When we call df.interpolate() without arguments, the default linear method is used. It treats the values as equally spaced and ignores the index.

For example, in column A, the NaN is filled with 10.5, which lies midway between 2.0 and 19.0 (8.5 away from each).

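But if we look at the fifth row (index 4) in both columns, it is filled with the same value as the row above it (34.0), because there is no later valid value to interpolate toward. The NaN in the first row of column B stays as it is, because by default the filling only runs forward.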
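
How the edges are handled can be adjusted with the limit_direction and limit_area parameters of interpolate(). The following is a small sketch on the same data; the variable names both and inside are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [2, np.nan, 19, 34, np.nan],
                   "B": [np.nan, 23, 12, 34, np.nan]})

# Fill in both directions: the leading NaN in B takes the first valid value
both = df.interpolate(limit_direction="both")

# Fill only NaN values that have valid values on both sides
inside = df.interpolate(limit_area="inside")

print(both)
print(inside)
```

With limit_area="inside", the leading and trailing NaN values are left alone, which can be preferable when repeating the last observed value would be misleading.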

Filling Nearest Values

data = {"A": [3, np.nan, 2, np.nan, 4],
        "B": [1, 4, np.nan, 2, 5]}
df = pd.DataFrame(data)
df3 = df.interpolate(method='nearest')  # this method requires SciPy to be installed
print(df3)
--------------------
     A    B
0  3.0  1.0
1  3.0  4.0
2  2.0  4.0
3  2.0  2.0
4  4.0  5.0

When we use method='nearest', the NaN values are filled with the nearest valid values. This method is delegated to SciPy, so SciPy needs to be installed.

In this case, the second row of column A (index 1) is filled with 3.0. Why so? The nearest value is chosen by distance along the index. Index 1 is equally close to index 0 (3.0) and index 2 (2.0), and in this output the tie goes to the earlier one, so 3.0 is used. The same applies to the other NaN values.

Filling Values Considering Index Values

data = {"A": [3, np.nan, 9, np.nan, 4],
        "B": [1, 10, np.nan, 20, 5]}
df = pd.DataFrame(data)
df4 = df.interpolate(method='values')  # or method='index'
print(df4)
--------------------
     A     B
0  3.0   1.0
1  6.0  10.0
2  9.0  15.0
3  6.5  20.0
4  4.0   5.0

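With method='values' (or 'index'), the numerical values of the index are used as the positions for interpolation, so each NaN is filled in proportion to how far its index is from the surrounding valid indexes. With the default index (0, 1, 2, ...) used here, the result is the same as linear interpolation.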
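
The difference from the default linear method only shows up when the index is unevenly spaced. The following is a minimal sketch on a made-up Series (the data and the names linear and by_index are assumptions for illustration, not part of the example above):

```python
import numpy as np
import pandas as pd

# Unevenly spaced index: the gap from 1 to 10 is much larger than from 0 to 1
s = pd.Series([0.0, np.nan, 100.0], index=[0, 1, 10])

linear = s.interpolate()                   # treats rows as equally spaced
by_index = s.interpolate(method="index")   # uses the index values as positions

print(linear[1])    # 50.0, halfway between 0.0 and 100.0
print(by_index[1])  # 10.0, one tenth of the way from index 0 to index 10
```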

df.ffill() and df.bfill()

The DataFrame.ffill() method fills each missing value with the last valid value before it, whereas the DataFrame.bfill() method fills it with the next valid value after it.

Forward Filling With ffill()

data = {"A": [2, np.nan, 19, 34, np.nan],
        "B": [np.nan, 23, 12, 34, np.nan]}
df = pd.DataFrame(data)

forward_fill = df.ffill()
print(forward_fill)
--------------------
      A     B
0   2.0   NaN
1   2.0  23.0
2  19.0  12.0
3  34.0  34.0
4  34.0  34.0

We can see that the NaN values are filled with the preceding valid values. For instance, the second row of column A is filled with 2.0, the value directly above it. The first row of column B remains NaN because there is no preceding value to carry forward.

Backward Filling With bfill()

data = {"A": [2, np.nan, 19, 34, np.nan],
        "B": [np.nan, 23, 12, 34, np.nan]}
df = pd.DataFrame(data)

backward_fill = df.bfill()
print(backward_fill)
--------------------
      A     B
0   2.0  23.0
1  19.0  23.0
2  19.0  12.0
3  34.0  34.0
4   NaN   NaN

In this case, the NaN values are filled with the next valid values. For instance, the first row of column B is filled with 23.0, which is the next value in the column.

We can also see that the fifth row of columns A and B remains unfilled (NaN) due to the absence of the next valid value in the dataset.
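
Because each method leaves one end of a column unfilled, one common pattern is to chain them. The sketch below also shows the limit parameter, which caps how many consecutive NaN values are filled; the Series s and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [2, np.nan, 19, 34, np.nan],
                   "B": [np.nan, 23, 12, 34, np.nan]})

# Forward fill first, then backward fill whatever is still missing at the top
filled_both_ends = df.ffill().bfill()
print(filled_both_ends)

# limit=1 fills at most one consecutive NaN after each valid value
s = pd.Series([1.0, np.nan, np.nan, 4.0])
limited = s.ffill(limit=1)
print(limited)
```

Whether carrying a value across a long gap is acceptable depends on the data, so limit can be a safer default for time-ordered records.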

That's all for now

Keep Coding✌✌
