How to Apply Regex to a Pandas DataFrame

In this blog, explore the step-by-step process of applying regular expressions (regex) to manipulate and extract specific data from a pandas DataFrame. As a data scientist or software engineer, harness the power of regex in conjunction with the popular Python library, Pandas, for efficient pattern matching and text processing in data manipulation and analysis tasks.

As a data scientist or software engineer, you may come across situations where you need to extract or manipulate specific data from a pandas DataFrame using regular expressions (regex). Pandas is a popular Python library for data manipulation and analysis, and regex is a powerful tool for pattern matching and text processing. In this article, we will explain how to apply regex to a pandas DataFrame, step-by-step.

What is Regex?

Regex is a sequence of characters that defines a search pattern. It is a powerful tool for pattern matching and text processing. In Python, regex is implemented in the re module. Regex patterns can be used to match, search, replace or extract specific text from a string. The basic syntax for a regex pattern in Python is as follows:

import re

pattern = r'expression'
match = re.search(pattern, string)

pattern is the regex pattern we want to match, string is the input string we want to search in, and re.search() function searches the input string for the pattern and returns the first match it finds.

What is Pandas?

Pandas is a popular Python library for data manipulation and analysis. It provides data structures for efficiently storing and manipulating large datasets, as well as tools for data cleaning, reshaping, merging, and aggregation. A pandas DataFrame is a 2-dimensional table with rows and columns, similar to a spreadsheet or SQL table. It is a powerful tool for data analysis, especially when combined with other Python libraries such as NumPy, Matplotlib, and Scikit-learn.

Applying Regex to a Pandas DataFrame

To apply regex to a pandas DataFrame, we need to use the pandas str accessor. The str accessor allows us to apply string methods to each element of a pandas Series or DataFrame. Here are the steps to apply regex to a pandas DataFrame:

  1. Import the pandas library and load the data into a pandas DataFrame.
import pandas as pd

df = pd.read_csv('data.csv')
  1. Select the column or columns that contain the text we want to match with regex.
text_column = df['text_column']
  1. Use the str accessor to apply regex to the text column.
import re

pattern = r'expression'
match = text_column.str.match(pattern)

text_column.str returns a pandas Series with string methods, and match() method applies the regex pattern to each element of the Series and returns a boolean Series with True or False values indicating whether the pattern matches the element. Along with match(), we can use contains() to get boolean for filtering and extract() or replace() to manipulate the content of the DataFrame.

  1. Use the boolean Series to filter the DataFrame.
df_filtered = df[match]

This will create a new DataFrame with only the rows that match the regex pattern in the text column.

Examples

Let’s see some examples of applying regex to a pandas DataFrame.

Example 1: Extracting Phone Numbers

Suppose we have a DataFrame with a column containing phone numbers in the format "(123) 456-7890". We want to extract the area code and the last 4 digits of each phone number. Here’s how we can do it with regex:

import pandas as pd

df = pd.DataFrame({'phone': ['(123) 456-7890', '(456) 789-0123', '(789) 012-3456']})
phone_column = df['phone']

area_code = phone_column.str.extract(r'\((\d{3})\)')
last_four_digits = phone_column.str.extract(r'-(\d{4})')

df['area_code'] = area_code
df['last_four_digits'] = last_four_digits

print(df)

This will produce the output:

            phone area_code last_four_digits
0  (123) 456-7890       123             7890
1  (456) 789-0123       456             0123
2  (789) 012-3456       789             3456

Example 2: Cleaning Text Data

Suppose we have a DataFrame with a column containing text data with extra spaces and special characters. We want to clean the text data by removing extra spaces and special characters. Here’s how we can do it with regex:

import pandas as pd

df = pd.DataFrame({'text': ['hello   world!', '   how are you? ', '   $%!# 123']})
text_column = df['text']

clean_text = text_column.str.replace(r'[^a-zA-Z0-9 ]+', '').str.replace(r'\s+', ' ').str.strip()

df['clean_text'] = clean_text

print(df)

This will produce the output:

               text    clean_text
0    hello   world!   hello world
1     how are you?     how are you
2        $%!# 123           123

Example 3: Filtering Rows Based on Pattern

Let’s consider a DataFrame containing a column of email addresses, and you want to filter out rows where the email addresses follow a specific pattern. Let’s say you want to keep only those email addresses that end with "@example.com."

import pandas as pd

# Creating a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Email': ['alice@example.com', 'bob@gmail.com', 'charlie@example.com', 'david@example.com']}
df = pd.DataFrame(data)

# Applying regex to filter rows
pattern = r'@example\.com$'
filtered_df = df[df['Email'].str.contains(pattern)]

print(filtered_df)

Output:

      Name                Email
0    Alice    alice@example.com
2  Charlie  charlie@example.com
3    David    david@example.com

Conclusion

In this article, we explained how to apply regex to a pandas DataFrame. We showed how to use the str accessor to apply regex to a text column, and how to filter the DataFrame based on the regex results. We also provided some examples of using regex to extract or clean text data. By applying regex to a pandas DataFrame, you can efficiently manipulate and analyze text data for your data science or software engineering projects.


About Saturn Cloud

Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Request a demo today to learn more.