How to Filter Pandas DataFrames by Column of Strings

In this post, we explore how to filter a Pandas DataFrame by a column of strings using the string-handling features of this widely used Python library.

Pandas is a popular library in Python that is used extensively in data science and software engineering. It provides data structures and tools for data manipulation, analysis, and visualization. In this article, we will discuss how to filter Pandas DataFrames by a column of strings.

Introduction

Pandas DataFrames are two-dimensional labeled data structures that can hold data of different types, including strings. Filtering DataFrames by a column of strings is a common task in data science and software engineering. This can be achieved using the str accessor of a DataFrame column (a Series). The str accessor exposes vectorized string methods that search and manipulate every value in the column at once, and many of them return a Boolean mask that can be used to filter rows.
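As a quick illustration of the idea, here is a minimal sketch of a few ways to build a Boolean mask from a string column. It reuses the sample data from the examples in this article; the variable names are only illustrative.

```python
import pandas as pd

# Sample data matching the examples used in this article
df = pd.DataFrame({
    "Name": ["John", "Jane", "Mary", "Adam"],
    "City": ["New York", "Los Angeles", "Chicago", "Houston"],
})

# Rows whose city starts with "New"
starts_with_new = df[df["City"].str.startswith("New")]

# Rows whose city ends with "on"
ends_with_on = df[df["City"].str.endswith("on")]

# Rows whose city is exactly "Chicago" (plain comparison, no str accessor needed)
exact_match = df[df["City"] == "Chicago"]

print(starts_with_new)
print(ends_with_on)
print(exact_match)
```

Each expression inside the square brackets produces one True or False value per row, and indexing the DataFrame with that mask keeps only the True rows.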

Filtering by a Single String

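To filter a DataFrame by a single string value in a column, we can use the str.contains() method. It returns a Boolean mask that is True for every row whose value contains the given string as a substring, and that mask can be used to select the matching rows. By default the argument is interpreted as a regular expression; pass regex=False to match it literally, or compare with == when the whole value must match exactly.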

import pandas as pd

# Create a DataFrame
data = {"Name": ["John", "Jane", "Mary", "Adam"],
        "City": ["New York", "Los Angeles", "Chicago", "Houston"]}
df = pd.DataFrame(data)

# Filter the DataFrame by a string value in the "City" column
filtered_df = df[df["City"].str.contains("Los Angeles")]

print(filtered_df)

The output should be:

   Name         City
1  Jane  Los Angeles

In this example, we filtered the DataFrame by the string value “Los Angeles” in the “City” column using the str.contains() method.
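Real data is often messier than this example. The sketch below assumes a hypothetical variant of the data with inconsistent letter case and a missing value, and uses the case, na, and regex parameters of str.contains() to handle both. Depending on the pandas version and column dtype, leaving out na=False can put missing values into the mask, which Boolean indexing may reject.

```python
import pandas as pd

# Hypothetical data with mixed letter case and a missing city
df = pd.DataFrame({
    "Name": ["John", "Jane", "Mary", "Adam"],
    "City": ["New York", "los angeles", None, "Houston"],
})

# case=False ignores letter case
# na=False treats missing values as non-matches
# regex=False treats the search term as a literal substring
mask = df["City"].str.contains("Los Angeles", case=False, na=False, regex=False)
filtered_df = df[mask]

print(filtered_df)
```

Only the row for Jane is selected here, even though her city is stored in lowercase.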

Filtering by Multiple Strings

To filter a DataFrame by multiple string values in a column, we can use the str.contains() method with a regular expression. Joining the strings with the alternation operator (|) produces a pattern that matches a row if its value contains any one of them.

import pandas as pd

# Create a DataFrame
data = {"Name": ["John", "Jane", "Mary", "Adam"],
        "City": ["New York", "Los Angeles", "Chicago", "Houston"]}
df = pd.DataFrame(data)

# Filter the DataFrame by multiple string values in the "City" column
filtered_df = df[df["City"].str.contains("Los Angeles|Chicago")]

print(filtered_df)

The output should be:

   Name         City
1  Jane  Los Angeles
2  Mary     Chicago

In this example, we filtered the DataFrame by the string values Los Angeles and Chicago in the City column using the str.contains() method with a regular expression.
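When the search terms come from a list, the pattern can be built programmatically. The following is a sketch under the assumption that the terms should be matched literally; the cities list is a hypothetical input, and re.escape() from the standard library guards against terms that contain regular expression metacharacters such as "." or "(".

```python
import re

import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Jane", "Mary", "Adam"],
    "City": ["New York", "Los Angeles", "Chicago", "Houston"],
})

# Hypothetical list of search terms
cities = ["Los Angeles", "Chicago"]

# Escape each term so it is matched literally, then join the terms with "|"
pattern = "|".join(re.escape(city) for city in cities)

filtered_df = df[df["City"].str.contains(pattern)]

print(filtered_df)
```

This selects the same two rows as the hand-written pattern, and it keeps working if the list later includes terms with special characters.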

Filtering by a List of Strings

To filter a DataFrame by a list of string values in a column, we can use the isin() method. The isin() method returns a Boolean mask that is True for rows whose value exactly equals one of the listed strings. Unlike str.contains(), it does not match substrings.

import pandas as pd

# Create a DataFrame
data = {"Name": ["John", "Jane", "Mary", "Adam"],
        "City": ["New York", "Los Angeles", "Chicago", "Houston"]}
df = pd.DataFrame(data)

# Filter the DataFrame by a list of string values in the "City" column
filtered_df = df[df["City"].isin(["Los Angeles", "Chicago"])]

print(filtered_df)

The output should be:

   Name         City
1  Jane  Los Angeles
2  Mary     Chicago

In this example, we filtered the DataFrame by the string values Los Angeles and Chicago in the City column using the isin() method.
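Two details of isin() are worth seeing in code: it compares whole values rather than substrings, and its mask can be inverted with the ~ operator to exclude values instead. A minimal sketch, reusing the same sample data with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Jane", "Mary", "Adam"],
    "City": ["New York", "Los Angeles", "Chicago", "Houston"],
})

cities = ["Los Angeles", "Chicago"]

# isin() compares whole values, so the partial string "Los" matches no rows
partial_df = df[df["City"].isin(["Los"])]

# ~ inverts the mask, keeping rows whose city is NOT in the list
excluded_df = df[~df["City"].isin(cities)]

print(partial_df)
print(excluded_df)
```

The first result is empty, while the second contains the rows for John and Adam.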

Conclusion

Filtering Pandas DataFrames by a column of strings is a common task in data science and software engineering. In this article, we discussed how to filter DataFrames by a single string value with str.contains(), by multiple string values with a regular expression, and by a list of exact values with isin(). Each method produces a Boolean mask that selects the rows whose string values meet the chosen criteria.

