How to GroupBy a Dataframe in Pandas and Keep Columns
As a data scientist or software engineer, you’re likely familiar with the Python programming language and its powerful data analysis library, Pandas. One of the most useful functions in Pandas is groupby()
, which allows you to group rows in a dataframe based on one or more columns. In this article, we’ll explore how to use groupby()
in Pandas to group a dataframe while keeping all of its columns.
Table of Contents
What is groupby()?
groupby()
is a function in Pandas that allows you to group rows of a dataframe based on one or more columns. Once the rows are grouped, you can perform operations on each group, such as calculating summary statistics or applying a function to the group.
For example, let’s say you have a dataframe of sales data for a retail store with columns for the date of each sale, the product sold, and the quantity sold. You could use groupby()
to group the sales data by product and then calculate the total quantity sold for each product.
import pandas as pd
sales_data = pd.read_csv('sales_data.csv')
# Group the sales data by product and sum the quantity sold
sales_by_product = sales_data.groupby('product')['quantity_sold'].sum()
In this example, sales_data.groupby('product')
groups the sales data by the ‘product’ column, and ['quantity_sold'].sum()
calculates the sum of the ‘quantity_sold’ column for each group.
How to groupby() a dataframe in Pandas and keep columns
By default, groupby()
returns a new dataframe with only the columns used to group the data and the result of the operation applied to each group. However, you can use the agg()
function to apply multiple operations to each group and return a dataframe with all of the original columns.
# Group the sales data by product and calculate the sum and mean of the quantity sold
sales_by_product = sales_data.groupby('product').agg({'quantity_sold': ['sum', 'mean']})
In this example, sales_data.groupby('product')
groups the sales data by the ‘product’ column, and agg({'quantity_sold': ['sum', 'mean']})
applies the sum and mean functions to the ‘quantity_sold’ column for each group. The result is a new dataframe with the ‘product’ column and two new columns, ‘quantity_sold_sum’ and ‘quantity_sold_mean’.
If you want to keep all of the original columns in the grouped dataframe, you can use the transform()
function instead of agg()
. transform()
applies a function to each group and returns a dataframe with the same shape as the original dataframe.
# Group the sales data by product and calculate the sum of the quantity sold for each group
# while keeping all of the original columns
sales_data['quantity_sold_sum'] = sales_data.groupby('product')['quantity_sold'].transform(lambda x: x.sum())
In this example, sales_data.groupby(‘product’) groups the sales data by the ‘product’ column, and transform(lambda x: x[‘quantity_sold’].sum()) applies the sum function to the ‘quantity_sold’ column for each group. The result is a new dataframe with all of the original columns and a new column, ‘quantity_sold_sum’, with the sum of the ‘quantity_sold’ column for each group.
Conclusion
groupby()
is a powerful function in Pandas that allows you to group rows of a dataframe based on one or more columns and perform operations on each group. By default, groupby()
returns a new dataframe with only the columns used to group the data and the result of the operation applied to each group. However, you can use the agg()
function to apply multiple operations to each group and return a dataframe with all of the original columns. If you want to keep all of the original columns in the grouped dataframe, you can use the transform()
function instead of agg()
. With these tools, you can easily group and analyze data in Pandas.
About Saturn Cloud
Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Request a demo today to learn more.
Saturn Cloud provides customizable, ready-to-use cloud environments for collaborative data teams.
Try Saturn Cloud and join thousands of users moving to the cloud without
having to switch tools.