How to Detect and Exclude Outliers in a Pandas DataFrame

As a data scientist or software engineer, you may come across datasets with outliers that can affect your analysis and predictions. Outliers are data points that are significantly different from other data points in the dataset. These can be caused by measurement errors, data entry errors, or even real events that are rare but impactful.

Detecting and excluding outliers is crucial to ensure the accuracy and reliability of your analysis. In this blog post, we will discuss how to detect and exclude outliers in a pandas DataFrame.

Understanding Outliers

Before we dive into the techniques to detect and exclude outliers, let’s understand what outliers are and how they can affect your analysis.

Outliers can be identified using statistical methods such as the z-score and the interquartile range (IQR). The z-score measures how many standard deviations a data point is away from the mean, while the IQR measures the spread of the middle 50% of the data. A data point can be considered an outlier if its z-score is large in absolute value (a common cutoff is 3) or if it lies more than 1.5 times the IQR below the first quartile or above the third quartile.

Outliers can affect your analysis in several ways. They pull the mean toward the extreme values and inflate the standard deviation, making it difficult to obtain accurate estimates. Outliers can also tilt a regression line and lead to incorrect predictions. Therefore, it is crucial to identify outliers and decide how to handle them before conducting any analysis.
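
As a quick illustration of this effect, the sketch below compares the mean, standard deviation, and median of a small series with and without one extreme value. The heights are made up for illustration.

```python
import pandas as pd

# Hypothetical heights in cm; the second series adds one extreme value
clean = pd.Series([160, 165, 170, 175, 180])
with_outlier = pd.Series([160, 165, 170, 175, 180, 400])

# Put the same three statistics side by side for both series
summary = pd.DataFrame({
    'clean': clean.agg(['mean', 'std', 'median']),
    'with_outlier': with_outlier.agg(['mean', 'std', 'median']),
})
print(summary)
```

The mean and standard deviation shift substantially, while the median barely moves, which is one reason robust statistics such as the median and the IQR are often used when outliers are suspected.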

Detecting Outliers

There are various techniques to detect outliers in a pandas DataFrame. Let’s discuss some of the most commonly used methods.

Z-Score

The z-score is a statistical measure that indicates how many standard deviations a data point is away from the mean. The z-score can be calculated using the following formula:

z = (x - mean) / std

where x is the data point, mean is the mean of the dataset, and std is the standard deviation of the dataset.

To identify outliers using the z-score, we can set a threshold value, say 3. Any data point with a z-score greater than 3 or less than -3 can be considered an outlier. We can use the scipy library in Python to calculate the z-score and identify outliers.

import pandas as pd
import numpy as np
from scipy import stats

# Create a sample DataFrame of student heights
df = pd.DataFrame({'Height': [170, 160, 130, 190, 180, 150, 140, 200, 175, 165]})

# Calculate the z-score for each student's height
z = np.abs(stats.zscore(df['Height']))

# Identify outliers as students with a z-score greater than 3
threshold = 3
outliers = df[z > threshold]

# Print the outliers
print(outliers)

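In the above code, we calculate the z-score for column Height using stats.zscore() and set a threshold of 3 to identify outliers. We then filter the DataFrame to obtain the outliers. For this small sample, no height lies more than 3 standard deviations from the mean (the largest absolute z-score is about 1.74), so the printed result is an empty DataFrame.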
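
A threshold of 3 needs a reasonably large sample to be useful. With the population standard deviation (ddof=0, the stats.zscore() default), the largest absolute z-score that n points can produce is sqrt(n - 1), which is exactly 3 for n = 10, so a strict "greater than 3" test can never flag a row in a ten-row sample. The following sketch uses a larger, made-up sample with one extreme value and applies the formula directly in pandas; the variable names are chosen for this example only.

```python
import pandas as pd

# Hypothetical sample: 20 typical heights plus one extreme value (260)
heights = [160, 165, 170, 175, 180] * 4 + [260]
df_z = pd.DataFrame({'Height': heights})

# z = (x - mean) / std, using the population standard deviation (ddof=0)
mean = df_z['Height'].mean()
std = df_z['Height'].std(ddof=0)
z_scores = (df_z['Height'] - mean) / std

# Flag rows whose absolute z-score exceeds the threshold
threshold = 3
z_outliers = df_z[z_scores.abs() > threshold]
print(z_outliers)
```

Here only the row with a height of 260 is flagged; every other row has an absolute z-score below 1.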

Interquartile Range (IQR)

The interquartile range (IQR) is a measure of the spread of the middle 50% of the data. It is calculated as the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of the dataset. Any data point that lies more than 1.5 times the IQR below Q1 or more than 1.5 times the IQR above Q3 can be considered an outlier.

To identify outliers using the IQR, we can use the quantile() function in pandas to calculate the 25th and 75th percentiles of the dataset. We can then calculate the IQR and use it to identify outliers.

# calculate IQR for column Height
Q1 = df['Height'].quantile(0.25)
Q3 = df['Height'].quantile(0.75)
IQR = Q3 - Q1

# identify outliers
threshold = 1.5
outliers = df[(df['Height'] < Q1 - threshold * IQR) | (df['Height'] > Q3 + threshold * IQR)]

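In the above code, we calculate the 25th and 75th percentiles of column Height using quantile() and calculate the IQR. We then set a threshold of 1.5 to identify outliers and filter the DataFrame to obtain the outliers. For the sample heights, Q1 is 152.5 and Q3 is 178.75, so the bounds are 113.125 and 218.125 and no row is flagged.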
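
The same steps can be wrapped in a small helper so the bounds are reusable across columns. The function name iqr_bounds, the parameter k, and the sample data are hypothetical, chosen only for this sketch.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical sample: 20 typical heights plus one extreme value (260)
df_iqr = pd.DataFrame({'Height': [160, 165, 170, 175, 180] * 4 + [260]})

lower, upper = iqr_bounds(df_iqr['Height'])
is_outlier = (df_iqr['Height'] < lower) | (df_iqr['Height'] > upper)
print(lower, upper)
print(df_iqr[is_outlier])
```

For this sample the fences are 150 and 190, so only the 260 row falls outside them.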

Handling Outliers

Once we have identified the outliers in our dataset, we can either exclude them from our analysis or replace them with more representative values.

Excluding Outliers

To exclude outliers from our analysis, we can simply remove the rows containing the outliers from our DataFrame. We can use the drop() method in pandas, passing it the index labels of the outlier rows.

# drop rows containing outliers
df = df.drop(outliers.index)

In the above code, drop() takes the index labels of the outlier rows identified in the previous section (outliers.index) and returns a new DataFrame without them, which we assign back to df.
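
An alternative sketch filters with a boolean mask instead of calling drop(). It keeps only the rows inside the IQR bounds using Series.between(), which is inclusive on both ends by default. The sample data and variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: 20 typical heights plus one extreme value (260)
df_drop = pd.DataFrame({'Height': [160, 165, 170, 175, 180] * 4 + [260]})

q1 = df_drop['Height'].quantile(0.25)
q3 = df_drop['Height'].quantile(0.75)
iqr = q3 - q1

# Keep rows within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]; everything else is excluded
keep = df_drop['Height'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_clean = df_drop[keep].reset_index(drop=True)
print(len(df_drop), len(df_clean))
```

Because the filtered rows go into a new variable, the original DataFrame stays untouched, which makes it easy to compare results with and without the outliers.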

Replacing Outliers

To replace outliers rather than drop them, we can substitute a more representative value. Common options are imputation, which replaces the outlier with an estimate such as the column median or a value predicted from other features of the dataset, and interpolation, which treats the outlier as missing and fills it in from neighboring data points.

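# replace outliers with the median value
# (recompute the IQR mask so it matches the current rows of df)
is_outlier = (df['Height'] < Q1 - threshold * IQR) | (df['Height'] > Q3 + threshold * IQR)
df.loc[is_outlier, 'Height'] = df['Height'].median()

In the above code, we rebuild the outlier mask from the IQR bounds and replace the flagged values in column Height with the median value of the column. The mask is recomputed because z and threshold from the z-score example no longer apply here: threshold was reset to 1.5 in the IQR example, and z would no longer line up with df if rows had been dropped.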
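
The median is only one choice. For ordered data such as a time series, interpolation may be a better fit: the sketch below marks the outlier as missing with Series.mask() and fills it from its neighbors with Series.interpolate(). The readings are hypothetical, and treating the rows as evenly spaced is an assumption of this example.

```python
import pandas as pd

# Hypothetical ordered readings with one spike at position 2
readings = pd.Series([170, 172, 400, 176, 178])

q1 = readings.quantile(0.25)
q3 = readings.quantile(0.75)
iqr = q3 - q1
is_outlier = (readings < q1 - 1.5 * iqr) | (readings > q3 + 1.5 * iqr)

# Turn outliers into NaN, then fill them by linear interpolation
repaired = readings.mask(is_outlier).interpolate()
print(repaired.tolist())
```

The spike is replaced by 174.0, the midpoint of its two neighbors, and the other readings are unchanged.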

Conclusion

Detecting and excluding outliers is crucial to ensure the accuracy and reliability of your analysis. In this blog post, we discussed how to detect outliers in a pandas DataFrame using statistical methods such as the z-score and the interquartile range, and how to handle them by dropping the affected rows or replacing their values.

