# How to GroupBy a Dataframe in Pandas and Keep Columns

As a data scientist or software engineer, you’re likely familiar with the Python programming language and its powerful data analysis library, Pandas. One of the most useful functions in Pandas is `groupby()`

, which allows you to group rows in a dataframe based on one or more columns. In this article, we’ll explore how to use `groupby()`

in Pandas to group a dataframe while keeping all of its columns.

## Table of Contents

## What is groupby()?

`groupby()`

is a function in Pandas that allows you to group rows of a dataframe based on one or more columns. Once the rows are grouped, you can perform operations on each group, such as calculating summary statistics or applying a function to the group.

For example, let’s say you have a dataframe of sales data for a retail store with columns for the date of each sale, the product sold, and the quantity sold. You could use `groupby()`

to group the sales data by product and then calculate the total quantity sold for each product.

```
import pandas as pd
sales_data = pd.read_csv('sales_data.csv')
# Group the sales data by product and sum the quantity sold
sales_by_product = sales_data.groupby('product')['quantity_sold'].sum()
```

In this example, `sales_data.groupby('product')`

groups the sales data by the ‘product’ column, and `['quantity_sold'].sum()`

calculates the sum of the ‘quantity_sold’ column for each group.

## How to groupby() a dataframe in Pandas and keep columns

By default, `groupby()`

returns a new dataframe with only the columns used to group the data and the result of the operation applied to each group. However, you can use the `agg()`

function to apply multiple operations to each group and return a dataframe with all of the original columns.

```
# Group the sales data by product and calculate the sum and mean of the quantity sold
sales_by_product = sales_data.groupby('product').agg({'quantity_sold': ['sum', 'mean']})
```

In this example, `sales_data.groupby('product')`

groups the sales data by the ‘product’ column, and `agg({'quantity_sold': ['sum', 'mean']})`

applies the sum and mean functions to the ‘quantity_sold’ column for each group. The result is a new dataframe with the ‘product’ column and two new columns, ‘quantity_sold_sum’ and ‘quantity_sold_mean’.

If you want to keep all of the original columns in the grouped dataframe, you can use the `transform()`

function instead of `agg()`

. `transform()`

applies a function to each group and returns a dataframe with the same shape as the original dataframe.

```
# Group the sales data by product and calculate the sum of the quantity sold for each group
# while keeping all of the original columns
sales_data['quantity_sold_sum'] = sales_data.groupby('product')['quantity_sold'].transform(lambda x: x.sum())
```

In this example, sales_data.groupby(‘product’) groups the sales data by the ‘product’ column, and transform(lambda x: x[‘quantity_sold’].sum()) applies the sum function to the ‘quantity_sold’ column for each group. The result is a new dataframe with all of the original columns and a new column, ‘quantity_sold_sum’, with the sum of the ‘quantity_sold’ column for each group.

## Conclusion

`groupby()`

is a powerful function in Pandas that allows you to group rows of a dataframe based on one or more columns and perform operations on each group. By default, `groupby()`

returns a new dataframe with only the columns used to group the data and the result of the operation applied to each group. However, you can use the `agg()`

function to apply multiple operations to each group and return a dataframe with all of the original columns. If you want to keep all of the original columns in the grouped dataframe, you can use the `transform()`

function instead of `agg()`

. With these tools, you can easily group and analyze data in Pandas.

#### About Saturn Cloud

Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Join today and get 150 hours of free compute per month.