Python Pandas: How to remove nan and inf values

In this blog, learn how to effectively handle missing and invalid data using Python pandas for seamless data analysis.

As a data scientist or software engineer, you know that working with data can be challenging, especially when dealing with missing or invalid values. In this post, I’ll show you how to use Python pandas to remove NaN and -inf values from your data.

What are NaN and -inf values?

NaN stands for Not a Number and is a special floating-point value that pandas uses to represent missing or undefined values. NaN can also result from undefined mathematical operations, such as dividing zero by zero or taking the square root of a negative number.

-inf stands for negative infinity and is another special floating-point value. It represents a result that lies below the most negative number a float can hold. -inf values can occur when a calculation overflows in the negative direction, when a negative number is divided by zero, or when taking the logarithm of zero.
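To make this concrete, here is a minimal sketch of a few operations that produce NaN and -inf. The variable names are made up for illustration, and np.errstate is used only to silence NumPy's floating-point warnings in this demo.

```python
import numpy as np
import pandas as pd

values = pd.Series([0.0, -1.0, 4.0])

# Silence NumPy's divide/invalid warnings for this demonstration only
with np.errstate(divide="ignore", invalid="ignore"):
    ratio = values / 0.0                      # 0/0, -1/0, 4/0
    logs = np.log(pd.Series([0.0, 1.0]))      # log(0), log(1)
    roots = np.sqrt(pd.Series([-4.0, 4.0]))   # sqrt(-4), sqrt(4)

print(ratio.tolist())    # [nan, -inf, inf]
print(logs.tolist())     # [-inf, 0.0]
print(roots.tolist())    # [nan, 2.0]

# NaN never compares equal to anything, including itself
print(np.nan == np.nan)  # False
```

Because NaN is not equal to itself, use isna() rather than == to test for it.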

Why remove NaN and -inf values?

NaN and -inf values can cause problems when performing calculations or statistical analysis on your data. They can skew your results, produce incorrect values, or cause errors in your code. Therefore, it’s often necessary to remove them before proceeding with your analysis.
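As a small illustration of how these values distort a summary statistic, the sketch below compares the pandas and NumPy means of the same data. The series is invented for the example.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, -np.inf])

# pandas skips NaN by default, but -inf still takes part in the calculation
pandas_mean = s.mean()

# NumPy does not skip NaN, so it propagates into the result
numpy_mean = np.mean(s.to_numpy())

print(pandas_mean)  # -inf
print(numpy_mean)   # nan
```

Neither result says anything useful about the two valid observations, which is why these values usually need to be handled before analysis.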

How to remove NaN and -inf values in Python pandas

Python pandas provides several methods for removing NaN and -inf values from your data. The most commonly used methods are:

  • dropna(): removes rows or columns with NaN values (-inf values must be converted to NaN first)
  • replace(): replaces NaN and -inf values with a specified value
  • interpolate(): fills NaN values with interpolated values

Using dropna()

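The dropna() method removes rows or columns that contain missing values (NaN or None). By default, it removes every row with at least one missing value. Set axis=1 to remove columns instead of rows. Note that dropna() does not treat -inf as missing, so convert -inf to NaN first (for example with replace()) if you want those rows dropped too.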

import pandas as pd
# create a dataframe that contains NaN values
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [6, -7, 8, -9, 10],
    'C': [11, 12, 13, None, 15],
    'D': [16, 17, 18, 19, 20]
})
print(df)

Output:

   A   B     C   D
0  1   6  11.0  16
1  2  -7  12.0  17
2  3   8  13.0  18
3  4  -9   NaN  19
4  5  10  15.0  20

# drop rows that contain NaN values
df = df.dropna()
print(df)

Output:

   A   B     C   D
0  1   6  11.0  16
1  2  -7  12.0  17
2  3   8  13.0  18
4  5  10  15.0  20

In this example, dropna() removes the fourth row (index 3), because pandas stores the None in column C as NaN.
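The example above only covers NaN. The following sketch, using a made-up DataFrame, shows that dropna() on its own leaves -inf in place, one way to drop -inf rows by converting them to NaN first, and the axis and subset parameters.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [5.0, np.nan, 7.0, 8.0],
    "C": [9.0, 10.0, -np.inf, 12.0],
})

# dropna() alone removes the NaN row (index 1); the -inf row (index 2) stays
only_nan_dropped = df.dropna()

# Convert -inf to NaN first, then drop: rows 1 and 2 are both removed
cleaned = df.replace(-np.inf, np.nan).dropna()

# axis=1 drops columns that contain NaN instead of rows
cols_dropped = df.dropna(axis=1)

# subset limits the NaN check to the listed columns
subset_dropped = df.dropna(subset=["B"])

print(only_nan_dropped.index.tolist())  # [0, 2, 3]
print(cleaned.index.tolist())           # [0, 3]
print(cols_dropped.columns.tolist())    # ['A', 'C']
```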

Using replace()

The replace() method swaps NaN and -inf values for a value you choose. Pass the values to look for as the first argument (to_replace) and the replacement as the second argument (value).

import pandas as pd
import numpy as np
# create a dataframe that contains NaN and -inf values
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [6, -7, 8, -9, 10],
    'C': [11, 12, 13, np.nan, 15],
    'D': [16, 17, -np.inf, 19, 20]
})
print(df)

Output:

   A   B     C     D
0  1   6  11.0  16.0
1  2  -7  12.0  17.0
2  3   8  13.0  -inf
3  4  -9   NaN  19.0
4  5  10  15.0  20.0

# replace NaN and -inf values with 0
df = df.replace([np.nan, -np.inf], 0)
print(df)

Output:

   A   B     C     D
0  1   6  11.0  16.0
1  2  -7  12.0  17.0
2  3   8  13.0   0.0
3  4  -9   0.0  19.0
4  5  10  15.0  20.0

In this example, the replace() method replaces all NaN and -inf values with 0.
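Replacing with 0 is not always appropriate, since 0 may be a meaningful value in your data. As one possible alternative, the sketch below converts both inf and -inf to NaN with replace() and then fills the gaps with each column's median using fillna(). The data and the choice of the median are assumptions made for illustration, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "C": [11.0, 12.0, 13.0, np.nan, 15.0],
    "D": [16.0, 17.0, -np.inf, 19.0, np.inf],
})

# Treat both infinities as missing values
as_missing = df.replace([np.inf, -np.inf], np.nan)

# Fill each column's gaps with that column's median (NaN is skipped when computing it)
filled = as_missing.fillna(as_missing.median())

print(filled["C"].tolist())  # [11.0, 12.0, 13.0, 12.5, 15.0]
print(filled["D"].tolist())  # [16.0, 17.0, 17.0, 19.0, 17.0]
```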

Using interpolate()

The interpolate() method fills NaN values with interpolated values based on the values of neighboring rows or columns. Choose the interpolation technique with the method parameter; the default is 'linear'. interpolate() only fills NaN, so -inf values are left unchanged.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [6, -7, 8, -9, 10],
    'C': [11, 12, 13, np.nan, 15],
    'D': [16, 17, 18, 19, 20]
})
# Interpolate a value to replace NaN based on its neighbors
df = df.interpolate(method='linear')
print(df)

Output:

   A   B     C   D
0  1   6  11.0  16
1  2  -7  12.0  17
2  3   8  13.0  18
3  4  -9  14.0  19
4  5  10  15.0  20

In this example, the interpolate() method fills the NaN value in column C with an interpolated value (14) based on the values of neighboring rows.
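Since interpolate() only fills NaN, a -inf value is treated as ordinary data. A minimal sketch, with an invented series, of converting -inf to NaN first so that it is interpolated as well:

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, -np.inf, 30.0, np.nan, 50.0])

# Interpolating directly leaves the -inf at index 1 untouched
naive = s.interpolate(method="linear")

# Converting -inf to NaN first lets interpolate() fill both gaps
fixed = s.replace(-np.inf, np.nan).interpolate(method="linear")

print(naive.iloc[1])   # -inf
print(fixed.tolist())  # [10.0, 20.0, 30.0, 40.0, 50.0]
```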

Conclusion

NaN and -inf values can cause problems when working with data, but Python pandas provides several methods for removing or replacing them. By using the dropna(), replace(), and interpolate() methods, you can clean your data and proceed with your analysis without worrying about invalid values.

Remember to always carefully consider the impact of removing or replacing NaN and -inf values on your analysis and to document your data cleaning process.
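One lightweight way to document the cleaning step is to count the affected values before removing them. The sketch below is only an illustration with made-up data; what you record will depend on your project.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "C": [11.0, 12.0, 13.0, np.nan, 15.0],
    "D": [16.0, 17.0, -np.inf, 19.0, 20.0],
})

# Count NaN and -inf values per column before cleaning
nan_counts = df.isna().sum()
neg_inf_counts = df.eq(-np.inf).sum()

# Clean, then record how many rows were lost
cleaned = df.replace(-np.inf, np.nan).dropna()
rows_removed = len(df) - len(cleaned)

print(nan_counts.to_dict())      # {'C': 1, 'D': 0}
print(neg_inf_counts.to_dict())  # {'C': 0, 'D': 1}
print(rows_removed)              # 2
```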

