# How to Group By and Aggregate on Multiple Columns in Pandas

In this blog, we’ll explore the essential task of working with large datasets for data scientists or software engineers. Handling extensive data often requires grouping and aggregating information based on multiple columns. Python’s widely used data analysis library, Pandas, offers robust tools for these tasks. The focus of this article will be on demonstrating the process of grouping by and aggregating data across multiple columns using Pandas.

As a data scientist or software engineer, working with large datasets is a common task. In such cases, grouping and aggregating data based on multiple columns is often necessary. Pandas is a popular data analysis library in Python that provides powerful tools for working with data. In this article, we will discuss how to group by and aggregate on multiple columns in Pandas.

## What is Grouping and Aggregation?

Grouping is the process of dividing data into smaller subsets based on one or more criteria. Aggregation is the process of summarizing or calculating statistics for each subset. For example, if we have a dataset of sales data for a company, we may want to group the data by product type and region, and then calculate the total revenue for each group.

Let’s consider the following sales data:

``````      Product Region Quarter  Revenue
0   Product C   West      Q2   123969
1   Product A   West      Q2    93001
2   Product C  North      Q3   126552
3   Product C  North      Q2    73897
4   Product A   West      Q3   118148
5   Product A  South      Q4    73483
6   Product C  South      Q3    98555
7   Product B  North      Q4    67159
8   Product C   West      Q4   130077
9   Product C  North      Q1    85920
10  Product C  North      Q3   142067
11  Product C   East      Q1   117121
12  Product A   East      Q3   119479
13  Product C   East      Q3   139475
14  Product B  South      Q1    69457
15  Product A   West      Q1   116557
16  Product B   West      Q3   127189
17  Product B   West      Q2   128953
18  Product B   West      Q4   102995
19  Product B   East      Q1    90757
``````

## Grouping and Aggregating in Pandas

Pandas provides the `groupby()` method to group data based on one or more columns. Once the data is grouped, we can apply various aggregation functions such as `sum()`, `mean()`, `max()`, `min()`, `count()`, etc. to calculate statistics for each group.

To group data by multiple columns in Pandas, we simply pass a list of column names to the `groupby()` method. For example, if we have a dataset of sales data with columns `Product`, `Region`, `Quarter`, and `Revenue`, and we want to group the data by `Product` and `Region` columns, we can write:

``````import pandas as pd

grouped_data = sales_data.groupby(['Product', 'Region'])
``````

This will group the data by `Product` and `Region` columns. The `groupby()` method returns a `DataFrameGroupBy` object, which is a special type of pandas object that allows us to apply aggregation functions to each group.

## Aggregating Data

Once we have grouped the data, we can apply various aggregation functions to calculate statistics for each group. For example, we can calculate the total revenue for each group using the `sum()` function as follows:

``````revenue_by_group = grouped_data['Revenue'].sum()
print(revenue_by_group)
``````

Output:

``````Product    Region
Product A  East      119479
South      73483
West      327706
Product B  East       90757
North      67159
South      69457
West      359137
Product C  East      256596
North     428436
South      98555
West      254046
Name: Revenue, dtype: int64
``````

This will calculate the sum of `Revenue` column for each group. The result will be a pandas series with a hierarchical index based on the grouping columns. We can reset the index to convert the series to a pandas dataframe as follows:

``````revenue_by_group = revenue_by_group.reset_index()
print(revenue_by_group)
``````

This will convert the `revenue_by_group` series to a pandas dataframe with columns `Product`, `Region`, and `Revenue`.

Output:

``````      Product Region  Revenue
0   Product A   East   119479
1   Product A  South    73483
2   Product A   West   327706
3   Product B   East    90757
4   Product B  North    67159
5   Product B  South    69457
6   Product B   West   359137
7   Product C   East   256596
8   Product C  North   428436
9   Product C  South    98555
10  Product C   West   254046
``````

We can apply multiple aggregation functions at once using the `agg()` method. For example, if we want to calculate the total revenue, average revenue, and number of sales for each group, we can write:

``````agg_data = grouped_data['Revenue'].agg(['sum', 'mean', 'count'])
print(agg_data)
``````

This will calculate the sum, mean, and count of `Revenue` column for each group. The result will be a pandas dataframe with columns `Product`, `Region`, `sum`, `mean`, and `count`.

Output:

``````                     sum           mean  count
Product   Region
Product A East    119479  119479.000000      1
South    73483   73483.000000      1
West    327706  109235.333333      3
Product B East     90757   90757.000000      1
North    67159   67159.000000      1
South    69457   69457.000000      1
West    359137  119712.333333      3
Product C East    256596  128298.000000      2
North   428436  107109.000000      4
South    98555   98555.000000      1
West    254046  127023.000000      2
``````

## Filtering Data

We can filter the data based on the aggregated values using the `filter()` method. For example, if we want to filter the data to include only those groups where the total revenue is greater than 100,000, we can write:

``````filtered_data = grouped_data.filter(lambda x: x['Revenue'].sum() > 100000)
print(filtered_data)
``````

This will filter the data to include only those groups where the sum of `Revenue` column is greater than 100,000.

Output:

``````      Product Region Quarter  Revenue
0   Product C   West      Q2   123969
1   Product A   West      Q2    93001
2   Product C  North      Q3   126552
3   Product C  North      Q2    73897
4   Product A   West      Q3   118148
8   Product C   West      Q4   130077
9   Product C  North      Q1    85920
10  Product C  North      Q3   142067
11  Product C   East      Q1   117121
12  Product A   East      Q3   119479
13  Product C   East      Q3   139475
15  Product A   West      Q1   116557
16  Product B   West      Q3   127189
17  Product B   West      Q2   128953
18  Product B   West      Q4   102995
``````

## Best Practices

1. Data Cleaning:

• Before grouping and aggregating, clean the data by handling missing values, outliers, and irrelevant entries to ensure accurate analysis.
2. Data Exploration:

• Understand the data distribution and characteristics before applying grouping and aggregation. Use descriptive statistics to gain insights into the dataset.
3. Efficient Grouping:

• For large datasets, choose the grouping columns wisely to avoid unnecessary computation. Grouping by categorical columns is generally more efficient.
4. Code Modularity:

• Break down complex grouping and aggregation tasks into modular functions for better readability and maintainability.
5. Handling Multi-level Index:

• Be aware of multi-level index issues when applying multiple aggregation functions. Use `reset_index()` to flatten the DataFrame if needed.
6. Error Handling:

• Implement try-except blocks when reading files to handle potential errors, providing informative messages for troubleshooting.

## Conclusion

In this article, we discussed how to group by and aggregate on multiple columns in Pandas. We learned how to group data by multiple columns using the `groupby()` method, how to apply aggregation functions to calculate statistics for each group, how to apply multiple aggregation functions at once using the `agg()` method, and how to filter the data based on the aggregated values using the `filter()` method. Pandas provides powerful tools for working with data, and grouping and aggregating is an important technique for data analysis.