How to Split Text in a Column into Multiple Rows using Pandas


As a data scientist or software engineer, you may often find yourself working with large datasets that require cleaning and transformation. One common task is splitting text in a column into multiple rows. This is a crucial step in data preprocessing, especially when dealing with unstructured data such as text. In this article, we will explore how to split text in a column into multiple rows using Pandas, a popular data manipulation library in Python.

What is Pandas?

Pandas is an open-source library built on top of the Python programming language for data manipulation and analysis. It provides data structures and functions for efficiently handling and processing large datasets. Pandas is widely used in data science, machine learning, and finance industries, among others.

What is Text Splitting in Pandas?

Text splitting is a process of dividing a string into multiple parts based on a delimiter. In Pandas, text splitting is commonly used to split a column containing text data into multiple rows. This is useful when a single cell in a dataset contains multiple values, such as tags, keywords, or categories. By splitting the text into multiple rows, we can easily analyze and manipulate the data.
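
As a small illustration of the idea, the sketch below splits one string with plain Python and then applies the same split to every cell of a Pandas Series. The sample strings are made up for this example and are not part of the article's dataset.

```python
import pandas as pd

# Plain Python: one delimited string becomes a list of parts.
parts = "tags, keywords, categories".split(", ")
print(parts)

# Pandas: the same split applied to every cell of a Series.
cells = pd.Series(["red, green", "blue"])  # hypothetical sample values
split_cells = cells.str.split(", ")
print(split_cells.tolist())
```

Each cell now holds a list, which is the intermediate form that is later turned into separate rows.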

How to Split Text in a Column into Multiple Rows

We can split text in a column into multiple rows in two steps. The Series.str.split() method splits each string in a column into a list of substrings based on a specified delimiter. The DataFrame.explode() method then turns each element of those lists into its own row, repeating the values of the other columns.

The following steps demonstrate how to split text in a column into multiple rows using Pandas:

Step 1: Import Libraries and Load Data

First, we need to import the required libraries and load the dataset. In this example, we will use a small sample dataset whose Skills column holds comma-separated text. Suppose data.csv contains the following DataFrame:

    Name                         Skills
0   John  Python, SQL, Machine Learning
1  Alice               Java, JavaScript
2    Bob           Data Analysis, Excel

import pandas as pd

# Load data
data = pd.read_csv('data.csv')
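
If you do not have a data.csv file at hand, the same sample can be built directly in memory. This is only a convenience sketch for following along; the values are copied from the table shown in this step.

```python
import pandas as pd

# Build the sample DataFrame in memory instead of reading data.csv.
data = pd.DataFrame({
    "Name": ["John", "Alice", "Bob"],
    "Skills": [
        "Python, SQL, Machine Learning",
        "Java, JavaScript",
        "Data Analysis, Excel",
    ],
})
print(data)
```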

Step 2: Split Text into a List

Next, we need to split the text in the column into a list of substrings. We can use the str.split() function to achieve this.

# Split text into a list
data['Skills'] = data['Skills'].str.split(',')

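The above code splits the text in the 'Skills' column into a list of substrings on the comma delimiter. Because the values in this dataset are separated by a comma followed by a space, every substring after the first keeps a leading space.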
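
One detail worth checking at this stage is whitespace around the delimiter. The sketch below compares splitting on a bare comma with splitting on a comma followed by a space, using a single sample value. It assumes the separator is used consistently; for messier data, stripping whitespace afterwards is usually the safer route.

```python
import pandas as pd

skills = pd.Series(["Python, SQL, Machine Learning"])

# Splitting on a bare comma keeps the space that followed each comma.
raw = skills.str.split(",")
print(raw[0])

# Splitting on comma-plus-space avoids that when the separator is consistent.
clean = skills.str.split(", ")
print(clean[0])
```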

Step 3: Convert List into Multiple Rows

Finally, we need to convert the list of substrings into multiple rows. We can use the explode() function to achieve this.

# Convert list into multiple rows
data = data.explode('Skills')
print(data)

Output:

    Name             Skills
0   John             Python
0   John                SQL
0   John   Machine Learning
1  Alice               Java
1  Alice         JavaScript
2    Bob      Data Analysis
2    Bob              Excel

The above code converts the list of substrings in the 'Skills' column into multiple rows.
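
The split and explode steps can also be written as one chained expression. The sketch below is one possible way to do that, not the only one. It rebuilds the sample data in memory so it runs on its own, and it passes ignore_index=True (available in pandas 1.1 and later) so the result gets a fresh 0 to n-1 index instead of repeating the original row labels.

```python
import pandas as pd

data = pd.DataFrame({
    "Name": ["John", "Alice", "Bob"],
    "Skills": [
        "Python, SQL, Machine Learning",
        "Java, JavaScript",
        "Data Analysis, Excel",
    ],
})

# Split each cell into a list, then give every list element its own row.
result = (
    data.assign(Skills=data["Skills"].str.split(", "))
        .explode("Skills", ignore_index=True)
)
print(result)
```

Using assign() leaves the original DataFrame unchanged, which can be convenient when you want to compare the data before and after the transformation.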

Step 4: Clean and Transform Data

After splitting the text in the column into multiple rows, we can clean and transform the data as required. For example, we can remove duplicates, lowercase the text, or apply a function to each row.

# Clean and transform data
data = data.drop_duplicates()
data = data.reset_index(drop=True)
print(data)

Output:

    Name             Skills
0   John             Python
1   John                SQL
2   John   Machine Learning
3  Alice               Java
4  Alice         JavaScript
5    Bob      Data Analysis
6    Bob              Excel

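The above code removes duplicate rows and resets the index so that it runs from 0 to 6. It does not lowercase the text or strip whitespace; those transformations need their own steps, for example with str.lower() and str.strip() on the 'Skills' column.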
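
Normalizing the text before removing duplicates can matter, because values that differ only in case or surrounding whitespace are otherwise treated as distinct. The sketch below uses a small made-up frame (not the article's dataset) to show stripping, lowercasing, and then de-duplicating.

```python
import pandas as pd

# Hypothetical exploded data with inconsistent case and whitespace.
exploded = pd.DataFrame({
    "Name": ["John", "John", "John"],
    "Skills": ["Python", " python ", "SQL"],
})

# Strip surrounding whitespace and lowercase before dropping duplicates.
cleaned = exploded.assign(Skills=exploded["Skills"].str.strip().str.lower())
cleaned = cleaned.drop_duplicates().reset_index(drop=True)
print(cleaned)
```

Without the normalization step, "Python" and " python " would both survive drop_duplicates().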

Step 5: Export Data

Finally, we can export the cleaned and transformed data to a file or database.

# Export data
data.to_csv('cleaned_data.csv', index=False)

The above code exports the cleaned and transformed data to a CSV file.
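
To preview what will be written without creating a file, to_csv() can be called without a path, in which case it returns the CSV text as a string. The sketch below uses a small stand-in frame and reads the text back to confirm that the round trip preserves the rows.

```python
import io

import pandas as pd

# Small stand-in for the cleaned data.
data = pd.DataFrame({
    "Name": ["John", "John", "Alice"],
    "Skills": ["Python", "SQL", "Java"],
})

# With no path argument, to_csv returns the CSV content as a string.
csv_text = data.to_csv(index=False)
print(csv_text)

# Read the text back to check that nothing was lost.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```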

Conclusion

In this article, we have explored how to split text in a column into multiple rows using Pandas. This is a useful technique for preprocessing and analyzing unstructured text data. By splitting text into multiple rows, we can easily manipulate and transform the data as required. Pandas provides a simple and efficient way to split text in a column into multiple rows using the str.split() and explode() functions. We hope this article has been helpful in your data preprocessing journey.

