How to Filter Pandas DataFrame by Substring Criteria

In this blog, we will delve into the process of filtering DataFrames based on substring criteria, a common task encountered by data scientists and software engineers when dealing with extensive datasets. We’ll explore how to accomplish this using the widely-used Python library, pandas.

As a data scientist or software engineer, it’s common to work with large datasets that require filtering based on specific criteria. One common task is filtering a DataFrame based on a substring criteria. In this blog post, we’ll explore how to achieve this using the popular Python library, pandas.

Table of Contents

  1. What is pandas?
  2. The Problem
  3. The Solution
  4. Advanced Filtering
  5. Common Errors and Solutions
  6. Conclusion

What is pandas?

Pandas is a powerful open-source data analysis and manipulation library for Python. It provides data structures for efficiently storing and manipulating large datasets, as well as a wide range of functions for data analysis tasks such as filtering, aggregating, and transforming data.

The Problem

Suppose we have a pandas DataFrame with a column containing strings, and we want to filter this DataFrame to include only rows where the string in this column contains a specific substring.

For example, consider the following DataFrame:

import pandas as pd

data = {
    'names': ['John', 'Jane', 'Bob', 'Alice', 'Mary'],
    'emails': ['john@email.com', 'jane@email.com', 'bob@gmail.com', 
               'alice@yahoo.com', 'mary@hotmail.com']
}

df = pd.DataFrame(data)
   names            emails
0   John    john@email.com
1   Jane    jane@email.com
2    Bob     bob@gmail.com
3  Alice  alice@yahoo.com
4   Mary  mary@hotmail.com

Suppose we want to filter this DataFrame to include only rows where the email address contains the substring gmail. The expected output should be:

  names         emails
2   Bob  bob@gmail.com

The Solution

To filter a pandas DataFrame based on substring criteria, we can use the str.contains() method. This method returns a boolean mask indicating whether each element in the specified column contains the specified substring.

Here’s how we can use the str.contains() method to filter the DataFrame in the previous example:

filtered_df = df[df['emails'].str.contains('gmail')]

This code first selects the ‘emails’ column of the DataFrame using the indexing operator ([]). The str.contains() method is then applied to this column, with the argument ‘gmail’ specifying the substring to search for. This method returns a boolean mask indicating which elements of the column contain the substring ‘gmail’.

Finally, this boolean mask is used to filter the original DataFrame using the indexing operator ([]), resulting in the filtered DataFrame:

  names         emails
2   Bob  bob@gmail.com

Advanced Filtering

The str.contains() method supports several optional parameters that allow for more advanced filtering. Here are a few examples:

Case Sensitivity

By default, the str.contains() method is case-sensitive. This means that it will only match substrings that have the same case as the specified substring.

To perform a case-insensitive search, we can set the case parameter to False:

filtered_df = df[df['emails'].str.contains('GMAIL', case=False)]
print(filtered_df)

This code filters the DataFrame to include only rows where the emails column contains the substring GMAIL, regardless of case.

Output:

  names         emails
2   Bob  bob@gmail.com

Regular Expressions

The str.contains() method also supports regular expressions. Regular expressions are a powerful tool for matching and manipulating text.

For example, suppose we want to filter the DataFrame to include only rows where the email address contains either gmail.com or yahoo.com. We can achieve this using a regular expression:

filtered_df = df[df['emails'].str.contains('gmail.com|yahoo.com')]
print(filtered_df)

This code filters the DataFrame to include only rows where the emails column contains either the substring gmail.com or yahoo.com.

Output:

   names           emails
2    Bob    bob@gmail.com
3  Alice  alice@yahoo.com

Common Errors and Solutions

Error 1: Incorrect Column Name: Description: If the column name is misspelled, Pandas won’t find the column, resulting in a KeyError.

Solution: Double-check the column name for accuracy.

Example:

# Incorrect column name
df[df['name'].str.contains('Alice')]  # KeyError: 'name'

Error 2: Case Sensitivity Issues: Description: Case-sensitive searches may lead to unexpected results.

Solution: Use the case parameter or convert the column to lowercase before filtering.

Example:

# Case-sensitive search
df[df['names'].str.contains('alice')]  # Empty DataFrame

Conclusion

Filtering a pandas DataFrame based on substring criteria is a common task in data analysis and manipulation. In this blog post, we explored how to achieve this using the str.contains() method. We also looked at some advanced filtering options, such as case-insensitive search and regular expressions.

By mastering these techniques, you’ll be able to efficiently filter large datasets based on specific criteria, making your data analysis tasks more efficient and effective.


About Saturn Cloud

Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Request a demo today to learn more.