How to Combine Multiple Rows into a Single Row with Pandas
The Problem
Suppose we have a DataFrame df
that consists of multiple rows for each unique value in the id
column:
employee_id | employee_name | sales |
---|---|---|
1 | Carlos | 4 |
1 | Carlos | 1 |
2 | Dan | 3 |
3 | Samuel | 2 |
3 | Samuel | 5 |
3 | Samuel | 3 |
We want to combine the rows for each unique value in the id
column into a single row, resulting in a DataFrame that looks like this:
employee_id | employee_name | sales |
---|---|---|
1 | Carlos | 5 |
2 | Dan | 3 |
3 | Samuel | 10 |
Creating the dataFrame
First let’s create the dataframe we have above so we can apply the solutions on it:
# create dataFrame
df = pd.DataFrame({'employee_id': [1, 1, 2, 3, 3, 3],
'employee_name': ['Carlos', 'Carlos', 'Dan', 'Samuel', 'Samuel', 'Samuel'],
'sales': [4, 1, 3, 2, 5, 3],})
# view DataFrame
print(df)
The output should be the following:
employee_id employee_name sales
0 1 Carlos 4
1 1 Carlos 1
2 2 Dan 3
3 3 Samuel 2
4 3 Samuel 5
5 3 Samuel 3
Solution 1: Using groupby and aggregate
One way to combine multiple rows into a single row is to use the groupby
function to group the DataFrame by the id
column and then use the aggregate
function to apply an aggregation function to each group.
#define how to aggregate various fields
agg_functions = {'employee_name': 'first', 'sales': 'sum', }
#create new DataFrame by combining rows with same id values
df_new = df.groupby(df['employee_id']).aggregate(agg_functions)
#Display the new data frame
print(df_new)
Now the new dataFrame should look like this:
employee_id employee_name sales
1 Carlos 5
2 Dan 3
3 Samuel 10
The new DataFrame combined all of the rows in the previous DataFrame that had the same value in the employee_id and employee_name column and then calculated the sum of the values in the sales column.
Conclusion
In this article, we have explored how to combine multiple rows into a single row using pandas. By understanding these techniques, you can efficiently manipulate and analyze data in pandas, and be better equipped to tackle real-world data science and engineering challenges.
About Saturn Cloud
Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Request a demo today to learn more.
Saturn Cloud provides customizable, ready-to-use cloud environments for collaborative data teams.
Try Saturn Cloud and join thousands of users moving to the cloud without
having to switch tools.