How to Split Text in a Column into Multiple Rows using Pandas
As a data scientist or software engineer, you may often find yourself working with large datasets that require cleaning and transformation. One common task is splitting text in a column into multiple rows. This is a crucial step in data preprocessing, especially when dealing with unstructured data such as free text. In this article, we will explore how to split text in a column into multiple rows using Pandas, a popular data manipulation library in Python.
What is Pandas?
Pandas is an open-source library built on top of the Python programming language for data manipulation and analysis. It provides data structures and functions for efficiently handling and processing large datasets. Pandas is widely used in data science, machine learning, and finance industries, among others.
What is Text Splitting in Pandas?
Text splitting is a process of dividing a string into multiple parts based on a delimiter. In Pandas, text splitting is commonly used to split a column containing text data into multiple rows. This is useful when a single cell in a dataset contains multiple values, such as tags, keywords, or categories. By splitting the text into multiple rows, we can easily analyze and manipulate the data.
How to Split Text in a Column into Multiple Rows
We can split text in a column into multiple rows using the str.split() function in Pandas. This function splits a string into a list of substrings based on a specified delimiter. We can then use the explode() function to convert the list into multiple rows.
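As a minimal sketch of the two calls working together, the snippet below applies them to a single Series. The `tags` Series and its values are made up for illustration and are not part of the article's dataset.

```python
import pandas as pd

# Hypothetical one-column example: each cell holds comma-separated tags.
tags = pd.Series(["red, green", "blue"], name="tag")

# str.split turns each string into a list of substrings...
as_lists = tags.str.split(", ")

# ...and explode gives every list element its own row,
# repeating the original index label for each element.
as_rows = as_lists.explode()

print(as_rows.tolist())
```

The first cell contributes two rows and the second contributes one, so three values come out of two original cells.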
The following steps demonstrate how to split text in a column into multiple rows using Pandas:
Step 1: Import Libraries and Load Data
First, we need to import the required libraries and load the dataset. In this example, we will use a sample dataset containing a column of text data. Let’s say we have the following DataFrame:
    Name                         Skills
0   John  Python, SQL, Machine Learning
1  Alice               Java, JavaScript
2    Bob           Data Analysis, Excel
import pandas as pd
# Load data
data = pd.read_csv('data.csv')
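If you do not have a data.csv file at hand, the same sample can be built directly in memory. This is only a stand-in for the file load above, assuming the file holds exactly the three rows shown.

```python
import pandas as pd

# Stand-in for pd.read_csv('data.csv'): the same three rows built in memory.
data = pd.DataFrame(
    {
        "Name": ["John", "Alice", "Bob"],
        "Skills": [
            "Python, SQL, Machine Learning",
            "Java, JavaScript",
            "Data Analysis, Excel",
        ],
    }
)

print(data)
```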
Step 2: Split Text into a List
Next, we need to split the text in the column into a list of substrings. We can use the str.split() function to achieve this.
# Split text into a list
data['Skills'] = data['Skills'].str.split(', ')
The above code splits the text in the 'Skills' column into a list of substrings, using a comma followed by a space as the delimiter so that the resulting items carry no leading whitespace.
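The sample values put a space after each comma, so the choice of delimiter matters. The sketch below, using a one-row Series made up for illustration, shows what a bare comma leaves behind and two possible ways to avoid it.

```python
import pandas as pd

skills = pd.Series(["Python, SQL, Machine Learning"])

# Splitting on a bare comma keeps the space that followed it.
raw = skills.str.split(",")
print(raw[0])  # ['Python', ' SQL', ' Machine Learning']

# Option 1: include the space in the delimiter.
by_comma_space = skills.str.split(", ")

# Option 2: split on the comma, then strip each piece. This also
# copes with inconsistent spacing such as "Python,SQL ,  Excel".
stripped = skills.str.split(",").apply(lambda parts: [p.strip() for p in parts])

print(by_comma_space[0])
print(stripped[0])
```

Option 1 is shorter, while option 2 is the safer choice when the spacing in the source data is not uniform.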
Step 3: Convert List into Multiple Rows
Finally, we need to convert the list of substrings into multiple rows. We can use the explode() function to achieve this.
# Convert list into multiple rows
data = data.explode('Skills')
print(data)
Output:
    Name            Skills
0   John            Python
0   John               SQL
0   John  Machine Learning
1  Alice              Java
1  Alice        JavaScript
2    Bob     Data Analysis
2    Bob             Excel
The above code converts the list of substrings in the 'Skills' column into multiple rows.
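Note that the exploded rows keep the index label of the row they came from, which is why 0, 1, and 2 repeat in the output. As a small sketch on made-up data, the `ignore_index` parameter of explode() (available in pandas 1.1.0 and later) renumbers the rows in the same call.

```python
import pandas as pd

data = pd.DataFrame(
    {"Name": ["John", "Alice"], "Skills": [["Python", "SQL"], ["Java"]]}
)

# Default: each new row keeps the index label of the row it came from.
kept = data.explode("Skills")
print(kept.index.tolist())  # [0, 0, 1]

# ignore_index=True renumbers the rows 0..n-1 in one step.
renumbered = data.explode("Skills", ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```

Keeping the repeated labels can be useful when you later need to trace each row back to its source record.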
Step 4: Clean and Transform Data
After splitting the text in the column into multiple rows, we can clean and transform the data as required. For example, we can remove duplicates, lowercase the text, or apply a function to each row.
# Clean and transform data
data = data.drop_duplicates()
data = data.reset_index(drop=True)
print(data)
Output:
    Name            Skills
0   John            Python
1   John               SQL
2   John  Machine Learning
3  Alice              Java
4  Alice        JavaScript
5    Bob     Data Analysis
6    Bob             Excel
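The above code removes duplicate rows and resets the index so that the rows are numbered consecutively. In this example no rows are duplicated, so only the index changes. Lowercasing the text or trimming leading and trailing whitespace would require additional string operations on the 'Skills' column.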
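The other clean-up ideas mentioned in this step (lowercasing and applying a function) could look like the sketch below. The rows and the `SkillLength` column are hypothetical, chosen so that the normalization has a visible effect.

```python
import pandas as pd

# Hypothetical exploded rows with stray whitespace, mixed case, and a repeat.
data = pd.DataFrame(
    {
        "Name": ["John", "John", "John", "Alice"],
        "Skills": ["Python", " python ", "SQL", "Java"],
    }
)

# Normalize the text first so that near-duplicates become exact duplicates.
data["Skills"] = data["Skills"].str.strip().str.lower()

# Then drop repeated (Name, Skills) pairs and renumber the rows.
data = data.drop_duplicates().reset_index(drop=True)

# Apply an arbitrary function to each value, here the string length.
data["SkillLength"] = data["Skills"].apply(len)

print(data)
```

Normalizing before calling drop_duplicates() matters here: "Python" and " python " only count as the same value once whitespace and case have been removed.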
Step 5: Export Data
Finally, we can export the cleaned and transformed data to a file or database.
# Export data
data.to_csv('cleaned_data.csv', index=False)
The above code exports the cleaned and transformed data to a CSV file.
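To see what index=False produces without touching the disk, one option is to write to an in-memory buffer and read it back. This is a sketch on a two-row sample; the buffer stands in for the 'cleaned_data.csv' file.

```python
import io

import pandas as pd

data = pd.DataFrame({"Name": ["John", "John"], "Skills": ["Python", "SQL"]})

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
data.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back gives the same table, with no extra index column.
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip.columns.tolist())
```

Without index=False, the row labels would be written as an unnamed first column and would come back as an extra column on reload.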
Conclusion
In this article, we have explored how to split text in a column into multiple rows using Pandas. This is a useful technique for preprocessing and analyzing unstructured text data. By splitting text into multiple rows, we can easily manipulate and transform the data as required. Pandas provides a simple and efficient way to split text in a column into multiple rows using the str.split() and explode() functions. We hope this article has been helpful in your data preprocessing journey.