Python Pandas: How to remove NaN and -inf values
As a data scientist or software engineer, you know that working with data can be challenging, especially when dealing with missing or invalid values. In this post, I’ll show you how to use Python pandas to remove NaN and -inf values from your data.
What are NaN and -inf values?
NaN stands for Not a Number and is a special floating-point value used to represent missing or undefined values. NaN values can occur when an operation has no defined result, such as computing 0/0 or taking the square root of a negative number in NumPy floating-point arithmetic. pandas also uses NaN to mark missing entries.
-inf stands for negative infinity and is another special floating-point value. It represents a result that lies below the most negative finite floating-point number. -inf values can occur when a calculation overflows in the negative direction, or from operations such as taking the logarithm of zero or dividing a negative number by zero in NumPy.
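To make these definitions concrete, here is a minimal sketch of a few operations that produce NaN and -inf. It assumes NumPy is installed; the variable names are illustrative only, and np.errstate is used just to silence the warnings NumPy would otherwise emit.

```python
import numpy as np

zero = np.float64(0.0)

# Silence the RuntimeWarnings NumPy raises for these operations.
with np.errstate(divide="ignore", invalid="ignore"):
    nan_from_division = zero / zero               # 0/0 has no defined result
    nan_from_sqrt = np.sqrt(np.float64(-1.0))     # square root of a negative number
    neg_inf_from_log = np.log(zero)               # log(0) tends to negative infinity
    neg_inf_from_division = np.float64(-1.0) / zero  # negative number divided by zero

print(nan_from_division, nan_from_sqrt)
print(neg_inf_from_log, neg_inf_from_division)
```

Note that plain Python floats behave differently: 1.0 / 0.0 raises ZeroDivisionError instead of returning an infinity.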
Why remove NaN and -inf values?
NaN and -inf values can cause problems when performing calculations or statistical analysis on your data. They can skew your results, produce incorrect values, or cause errors in your code. Therefore, it’s often necessary to remove them before proceeding with your analysis.
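As a small illustration of how these values distort a result, the sketch below computes a mean over a column containing both. The data is made up for the example. It relies on the default behavior of Series.mean(), which skips NaN but not -inf.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with one NaN and one -inf.
s = pd.Series([1.0, 2.0, np.nan, -np.inf])

# NumPy propagates NaN: any NaN in the input makes the mean NaN.
numpy_mean = np.mean(s.to_numpy())

# pandas skips NaN by default, but -inf still drags the mean to -inf.
pandas_mean = s.mean()

print(numpy_mean, pandas_mean)
```

Neither result describes the two valid measurements, which is why cleaning usually comes before analysis.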
How to remove NaN and -inf values in Python pandas
Python pandas provides several methods for removing NaN and -inf values from your data. The most commonly used methods are:
dropna(): removes rows or columns with NaN values (it does not treat -inf as missing, so -inf must be converted to NaN first)
replace(): replaces NaN and -inf values with a specified value
interpolate(): fills NaN values with interpolated values
Using dropna()
The dropna() method removes rows or columns with NaN values from your data. By default, it removes all rows with at least one NaN value. You can set the axis parameter to remove columns instead of rows. Note that dropna() does not treat -inf as missing; to drop those rows as well, replace -inf with NaN before calling dropna().
import pandas as pd

# create a dataframe that contains NaN values
# (pandas stores the None in column C as NaN)
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [6, -7, 8, -9, 10],
    'C': [11, 12, 13, None, 15],
    'D': [16, 17, 18, 19, 20]
})
print(df)
Output:
   A   B     C   D
0  1   6  11.0  16
1  2  -7  12.0  17
2  3   8  13.0  18
3  4  -9   NaN  19
4  5  10  15.0  20
# drop rows that contain NaN values
df = df.dropna()
print(df)
Output:
   A   B     C   D
0  1   6  11.0  16
1  2  -7  12.0  17
2  3   8  13.0  18
4  5  10  15.0  20
In this example, the dropna() method removes the fourth row (index 3) from the DataFrame, which contains a None value in column C. pandas stores that None as NaN, which is why dropna() catches it.
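Because dropna() only looks at NaN, a row holding -inf survives it. The following sketch, using a small made-up DataFrame, shows that behavior, one common way around it (converting infinities to NaN with replace() first), and the axis parameter for dropping columns instead of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data: NaN in row 1, -inf in row 2.
df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0],
    "B": [4.0, np.nan, 6.0],
    "C": [7.0, 8.0, -np.inf],
})

# dropna() alone removes only the NaN row; the -inf row is kept.
only_nan_dropped = df.dropna()

# Convert both infinities to NaN first, then drop.
cleaned = df.replace([np.inf, -np.inf], np.nan).dropna()

# axis=1 drops columns that contain NaN instead of rows.
without_nan_columns = df.dropna(axis=1)

print(only_nan_dropped)
print(cleaned)
print(without_nan_columns)
```

Chaining replace() and dropna() leaves the original DataFrame untouched, since both return new objects.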
Using replace()
The replace() method replaces NaN and -inf values with a specified value. You pass the values to look for as the first argument (to_replace) and the replacement as the second argument (value).
import pandas as pd
import numpy as np

# create a dataframe that contains NaN and -inf values
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [6, -7, 8, -9, 10],
    'C': [11, 12, 13, np.nan, 15],
    'D': [16, 17, -np.inf, 19, 20]
})
print(df)
Output:
   A   B     C     D
0  1   6  11.0  16.0
1  2  -7  12.0  17.0
2  3   8  13.0  -inf
3  4  -9   NaN  19.0
4  5  10  15.0  20.0
# replace NaN and -inf values with 0
df = df.replace([np.nan, -np.inf], 0)
print(df)
Output:
   A   B     C     D
0  1   6  11.0  16.0
1  2  -7  12.0  17.0
2  3   8  13.0   0.0
3  4  -9   0.0  19.0
4  5  10  15.0  20.0
In this example, the replace() method replaces all NaN and -inf values with 0.
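The example above handles only negative infinity. If your data may also contain positive infinity, one possible variation is to map both infinities to NaN and then fill every NaN in a single step with fillna(). The sketch below uses made-up data, and the fill value 0 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data containing NaN, +inf and -inf.
df = pd.DataFrame({
    "C": [11.0, np.nan, 15.0],
    "D": [np.inf, 17.0, -np.inf],
})

# Map both infinities to NaN, then fill all NaN values with 0.
filled = df.replace([np.inf, -np.inf], np.nan).fillna(0)

# Sanity check: True only if every remaining value is finite.
all_finite = bool(np.isfinite(filled.to_numpy()).all())

print(filled)
print(all_finite)
```

Whether 0 is a sensible replacement depends on the column; a mean or median may distort the analysis less.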
Using interpolate()
The interpolate() method fills NaN values with interpolated values based on the values of neighboring rows or columns. By default it works down each column. You can choose the interpolation technique with the method parameter.
import pandas as pd
import numpy as np

# create a dataframe that contains a NaN value in column C
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [6, -7, 8, -9, 10],
    'C': [11, 12, 13, np.nan, 15],
    'D': [16, 17, 18, 19, 20]
})
# Interpolate a value to replace NaN based on its neighbors
df = df.interpolate(method='linear')
print(df)
Output:
   A   B     C   D
0  1   6  11.0  16
1  2  -7  12.0  17
2  3   8  13.0  18
3  4  -9  14.0  19
4  5  10  15.0  20
In this example, the interpolate() method fills the NaN value in column C with an interpolated value (14) based on the values of neighboring rows.
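Like dropna(), interpolate() fills only NaN, so a -inf value is left in place. A minimal sketch of one way to handle this, assuming you want infinities treated as missing, is to convert them to NaN before interpolating. The Series here is invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical series with one -inf and one NaN.
s = pd.Series([1.0, -np.inf, 3.0, np.nan, 5.0])

# Treat -inf as missing, then fill every gap by linear interpolation.
interpolated = s.replace(-np.inf, np.nan).interpolate(method="linear")

print(interpolated.tolist())
```

Linear interpolation assumes the rows are evenly spaced; if they are not, a different method value may be more appropriate.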
Conclusion
NaN and -inf values can cause problems when working with data, but Python pandas provides several methods for removing or replacing them. By using the dropna(), replace(), and interpolate() methods, you can clean your data and proceed with your analysis without worrying about invalid values.
Remember to always carefully consider the impact of removing or replacing NaN and -inf values on your analysis and to document your data cleaning process.