How to Extract First and Last Words from Strings as a New Column in Pandas

As a data scientist or software engineer, you will often work with large datasets that contain text. A common task is to extract specific parts of each string, such as the first and last words. The Python library Pandas provides a straightforward way to do this.

In this article, we’ll walk through a step-by-step guide on how to extract the first and last words from strings as a new column in Pandas. We’ll use a sample dataset to demonstrate the process.

Table of Contents

  1. What is Pandas?
  2. The Dataset
  3. Step-by-Step Guide
  4. Best Practices
  5. Conclusion

What is Pandas?

Before we dive into the specifics of this task, let’s briefly review what Pandas is. Pandas is a Python library that provides powerful data manipulation and analysis capabilities. It’s particularly useful for working with structured data, such as CSV files or Excel spreadsheets, and for performing tasks such as data cleaning, merging, and filtering.

Pandas has two primary data structures: Series and DataFrame. A Series is a one-dimensional labeled array that can hold any data type, while a DataFrame is a two-dimensional labeled data structure with columns of potentially different types.
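As a small illustration of the two structures (the variable names `names` and `frame` are chosen here for the example only), the sketch below builds a Series and then uses it as one column of a DataFrame:

```python
import pandas as pd

# A Series: one-dimensional, labeled values
names = pd.Series(["Apple iPhone 12 Pro Max", "Google Pixel 5"], name="product_name")

# A DataFrame: two-dimensional, and its columns may have different types
frame = pd.DataFrame({"product_id": [1, 5], "product_name": names})

# Selecting a single column of a DataFrame gives back a Series
print(type(frame["product_name"]))
print(frame.dtypes)
```

The string operations used later in this article are applied to a single column, which is why they are reached through a Series.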

The Dataset

To demonstrate how to extract the first and last words from strings in Pandas, we’ll use a sample dataset of product names. The dataset contains the following columns:

  • product_id: A unique identifier for each product.
  • product_name: The name of each product.

Here’s a sample of the dataset:

product_id  product_name
1           Apple iPhone 12 Pro Max
2           Samsung Galaxy S21 Ultra
3           Sony WH-1000XM4 Wireless Headphones
4           Bose QuietComfort 35 II Headphones
5           Google Pixel 5

Our goal is to extract the first and last words from the product_name column and add them as new columns to the dataset.

Step-by-Step Guide

Now that we have our sample dataset, let’s walk through the steps required to extract the first and last words from strings as a new column in Pandas.

Step 1: Load the Dataset

The first step is to load the dataset into a Pandas DataFrame. We can do this using the read_csv() function, which reads a CSV file and returns a DataFrame.

import pandas as pd

df = pd.read_csv("products.csv")
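If you don't have a products.csv file at hand, one way to follow along is to build the same five rows in memory. This is a stand-in for the file, which is assumed here to hold exactly the sample rows shown earlier:

```python
import pandas as pd

# In-memory copy of the sample rows (a stand-in for products.csv)
df = pd.DataFrame(
    {
        "product_id": [1, 2, 3, 4, 5],
        "product_name": [
            "Apple iPhone 12 Pro Max",
            "Samsung Galaxy S21 Ultra",
            "Sony WH-1000XM4 Wireless Headphones",
            "Bose QuietComfort 35 II Headphones",
            "Google Pixel 5",
        ],
    }
)

print(df.shape)
```

The rest of the steps work the same way on this DataFrame as on one loaded from a CSV file.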

Step 2: Extract the First and Last Words

Method 1: Using str.split()

Next, we need to extract the first and last words from the product_name column. We can do this using the str accessor, which allows us to perform string operations on a Pandas Series.

To extract the first word, we can use the str.split() method, which splits a string into a list of words. We can then access the first element of the list using indexing ([0]).

To extract the last word, we can use a similar approach. We first split the string into a list of words using str.split(), and then access the last element of the list using indexing ([-1]).
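To make the idea concrete, here is the same split-and-index logic on a single plain Python string, taken from the sample data:

```python
# One product name from the sample dataset
name = "Sony WH-1000XM4 Wireless Headphones"

# split() with no argument splits on runs of whitespace
words = name.split()

first_word = words[0]   # first element of the list
last_word = words[-1]   # last element of the list

print(first_word, last_word)
```

The Pandas version applies these same two steps to every row of the column through the str accessor.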

Here’s the code to extract the first and last words:

df["first_word"] = df["product_name"].str.split().str[0]
df["last_word"] = df["product_name"].str.split().str[-1]
print(df)

This will output the following:

   product_id                         product_name first_word   last_word
0           1              Apple iPhone 12 Pro Max      Apple         Max
1           2             Samsung Galaxy S21 Ultra    Samsung       Ultra
2           3  Sony WH-1000XM4 Wireless Headphones       Sony  Headphones
3           4   Bose QuietComfort 35 II Headphones       Bose  Headphones
4           5                       Google Pixel 5     Google           5
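Every name in the sample has several words. It can be worth checking how the same approach behaves on less tidy input. The sketch below uses a few hypothetical values that are not part of the dataset: a single word, a name with extra spaces, and an empty string.

```python
import pandas as pd

# Hypothetical edge cases that do not appear in the sample dataset
names = pd.Series(["Pixel", "  Galaxy  S21  ", ""])

# Split once and reuse the result for both columns
words = names.str.split()
first = words.str[0]
last = words.str[-1]

print(pd.DataFrame({"name": names, "first_word": first, "last_word": last}))
```

A single-word name yields the same value for both columns, surrounding whitespace is ignored, and an empty string produces a missing value rather than an error.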

Method 2: Using Regular Expressions (RegEx)

While the str.split() method is effective, another approach involves using regular expressions (RegEx) to match and extract the first and last words. This method can be useful when dealing with more complex string patterns.

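import pandas as pd

# Method 2: Using Regular Expressions
# expand=False returns a Series when the pattern has a single capture group
df['first_word'] = df['product_name'].str.extract(r'^(\w+)', expand=False)
df['last_word'] = df['product_name'].str.extract(r'(\w+)$', expand=False)
print(df)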

In this code, ^(\w+) matches the start of the string (^) followed by one or more word characters (\w+). (\w+)$ matches one or more word characters at the end of the string.

Output:

   product_id                         product_name first_word   last_word
0           1              Apple iPhone 12 Pro Max      Apple         Max
1           2             Samsung Galaxy S21 Ultra    Samsung       Ultra
2           3  Sony WH-1000XM4 Wireless Headphones       Sony  Headphones
3           4   Bose QuietComfort 35 II Headphones       Bose  Headphones
4           5                       Google Pixel 5     Google           5
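The two methods agree on this sample, but they are not equivalent in general: \w matches only letters, digits, and underscores, so it stops at a hyphen. The sketch below uses a shortened, hypothetical product name to show the difference, along with a \S+ pattern (an alternative suggested here, not part of the method above) that captures the whole whitespace-delimited token:

```python
import pandas as pd

# "Sony WH-1000XM4" is a shortened, hypothetical name ending in a hyphenated token
names = pd.Series(["Sony WH-1000XM4", "Google Pixel 5"])

# \w+ cannot cross the hyphen, so only the part after it is captured
last_w = names.str.extract(r"(\w+)$", expand=False)

# \S+ matches any run of non-whitespace, so the whole token is captured
last_s = names.str.extract(r"(\S+)\s*$", expand=False)

# str.split() for comparison
last_split = names.str.split().str[-1]

print(pd.DataFrame({"regex_w": last_w, "regex_S": last_s, "split": last_split}))
```

Which behavior is right depends on what counts as a word in your data, so it helps to decide that before choosing a pattern.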

Best Practices

Practice 1: Handling Missing Values

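String methods under the str accessor propagate missing values rather than raising an error, so a missing product name produces a missing first and last word. Decide how to handle these rows before extracting: use fillna() to replace NaN values with a default or meaningful value, or dropna() to exclude them.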
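A minimal sketch of both behaviors, using a hypothetical column with one missing name and "unknown" as an arbitrary placeholder:

```python
import pandas as pd

# Hypothetical column with one missing product name
names = pd.Series(["Apple iPhone 12 Pro Max", None, "Google Pixel 5"])

# Without any handling, the missing value is carried through
first_raw = names.str.split().str[0]

# Replace missing names with a placeholder before extracting
# ("unknown" is an arbitrary choice for this example)
first_filled = names.fillna("unknown").str.split().str[0]

print(first_raw.isna().tolist())
print(first_filled.tolist())
```

Whether a placeholder or a missing value is more useful depends on how the new column will be consumed downstream.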

Practice 2: Regex Validation

When using regular expressions, validate them against sample data to ensure they capture the desired patterns. Test with various cases to avoid unexpected errors.
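One lightweight way to do this is to check the patterns against a handful of inputs with known answers before running them on the full column. The sketch below uses Python's built-in re module; the sample strings beyond the dataset ("Pixel", "Headphones!", and the blank strings) are hypothetical test cases.

```python
import re

FIRST = re.compile(r"^(\w+)")
LAST = re.compile(r"(\w+)$")

# Inputs paired with the (first, last) words we expect
samples = {
    "Apple iPhone 12 Pro Max": ("Apple", "Max"),
    "Google Pixel 5": ("Google", "5"),
    "Pixel": ("Pixel", "Pixel"),
}

for text, (want_first, want_last) in samples.items():
    got_first = FIRST.search(text).group(1)
    got_last = LAST.search(text).group(1)
    assert (got_first, got_last) == (want_first, want_last), text

# Inputs where the last-word pattern finds nothing: search() returns None
no_match = [s for s in ["", "   ", "Headphones!"] if LAST.search(s) is None]
print(no_match)
```

Cases like trailing punctuation are easy to miss, and in Pandas they show up as missing values in the new column.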

Practice 3: Performance Considerations

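For large datasets, consider the performance of different methods. The str methods are concise and handle missing values for you, but on ordinary object-dtype columns they still process strings one at a time in Python, so measure before assuming they are faster than apply(). One easy saving is to call str.split() once and reuse the result for both new columns.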
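A rough way to compare approaches on your own data is to time them. The sketch below is only a measurement harness with made-up data and hypothetical function names; the timings depend on your machine, data, and Pandas version, so no result is claimed here.

```python
import timeit

import pandas as pd

# Made-up data: two sample names repeated to get a larger column
names = pd.Series(["Apple iPhone 12 Pro Max", "Google Pixel 5"] * 10_000)


def with_str_accessor():
    # Split once and reuse the result for both columns
    words = names.str.split()
    return words.str[0], words.str[-1]


def with_apply():
    # Row-wise Python functions; assumes there are no missing values
    return (
        names.apply(lambda s: s.split()[0]),
        names.apply(lambda s: s.split()[-1]),
    )


for fn in (with_str_accessor, with_apply):
    seconds = timeit.timeit(fn, number=3)
    print(f"{fn.__name__}: {seconds:.3f}s for 3 runs")
```

Both functions return the same values, so the choice can be made on measured speed and on how each handles missing data.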

Conclusion

In this article, we’ve shown you how to extract the first and last words from strings as a new column in Pandas. We’ve walked through the steps required to load a dataset, extract the first and last words using the str accessor, and view the result.

These are just two examples of the many powerful data manipulation capabilities that Pandas provides. Whether you’re a data scientist or a software engineer, Pandas is an essential tool for working with structured data in Python.

