How to Group By and Aggregate on Multiple Columns in Pandas
As a data scientist or software engineer, you regularly work with large datasets, and grouping and aggregating data based on multiple columns is often necessary. Pandas is a popular data analysis library in Python that provides powerful tools for working with data. In this article, we will discuss how to group by and aggregate on multiple columns in Pandas.
Table of Contents
- What is Grouping and Aggregation?
- Grouping and Aggregating in Pandas
- Aggregating Data
- Filtering Data
- Best Practices
- Conclusion
What is Grouping and Aggregation?
Grouping is the process of dividing data into smaller subsets based on one or more criteria. Aggregation is the process of summarizing or calculating statistics for each subset. For example, if we have a dataset of sales data for a company, we may want to group the data by product type and region, and then calculate the total revenue for each group.
Let’s consider the following sales data:
Product Region Quarter Revenue
0 Product C West Q2 123969
1 Product A West Q2 93001
2 Product C North Q3 126552
3 Product C North Q2 73897
4 Product A West Q3 118148
5 Product A South Q4 73483
6 Product C South Q3 98555
7 Product B North Q4 67159
8 Product C West Q4 130077
9 Product C North Q1 85920
10 Product C North Q3 142067
11 Product C East Q1 117121
12 Product A East Q3 119479
13 Product C East Q3 139475
14 Product B South Q1 69457
15 Product A West Q1 116557
16 Product B West Q3 127189
17 Product B West Q2 128953
18 Product B West Q4 102995
19 Product B East Q1 90757
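If you want to follow along without the CSV file used below, the same kind of DataFrame can be built directly in pandas. This is a minimal sketch reproducing only the first few rows shown above:
import pandas as pd

# A small slice of the sales data shown above (rows 0-4 only)
sales_data = pd.DataFrame({
    'Product': ['Product C', 'Product A', 'Product C', 'Product C', 'Product A'],
    'Region':  ['West', 'West', 'North', 'North', 'West'],
    'Quarter': ['Q2', 'Q2', 'Q3', 'Q2', 'Q3'],
    'Revenue': [123969, 93001, 126552, 73897, 118148],
})
print(sales_data)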
Grouping and Aggregating in Pandas
Pandas provides the groupby() method to group data based on one or more columns. Once the data is grouped, we can apply various aggregation functions such as sum(), mean(), max(), min(), and count() to calculate statistics for each group.
To group data by multiple columns in Pandas, we simply pass a list of column names to the groupby() method. For example, if we have a dataset of sales data with columns Product, Region, Quarter, and Revenue, and we want to group the data by the Product and Region columns, we can write:
import pandas as pd

# Load the sales data and group it by the Product and Region columns
sales_data = pd.read_csv('sales_data.csv')
grouped_data = sales_data.groupby(['Product', 'Region'])
This groups the data by the Product and Region columns. The groupby() method returns a DataFrameGroupBy object, a special type of pandas object that allows us to apply aggregation functions to each group.
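Before aggregating, you can inspect the grouped object directly. The following is a small sketch, assuming the sales_data and grouped_data objects created above, that counts the rows in each group and pulls out a single group:
# Number of rows in each (Product, Region) group
print(grouped_data.size())

# All original rows belonging to one specific group
print(grouped_data.get_group(('Product A', 'West')))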
Aggregating Data
Once we have grouped the data, we can apply various aggregation functions to calculate statistics for each group. For example, we can calculate the total revenue for each group using the sum() function as follows:
# Sum the Revenue column within each (Product, Region) group
revenue_by_group = grouped_data['Revenue'].sum()
print(revenue_by_group)
Output:
Product Region
Product A East 119479
South 73483
West 327706
Product B East 90757
North 67159
South 69457
West 359137
Product C East 256596
North 428436
South 98555
West 254046
Name: Revenue, dtype: int64
This calculates the sum of the Revenue column for each group. The result is a pandas Series with a hierarchical index based on the grouping columns. We can reset the index to convert the Series to a DataFrame as follows:
# Flatten the hierarchical index back into regular columns
revenue_by_group = revenue_by_group.reset_index()
print(revenue_by_group)
This converts the revenue_by_group Series to a DataFrame with columns Product, Region, and Revenue.
Output:
Product Region Revenue
0 Product A East 119479
1 Product A South 73483
2 Product A West 327706
3 Product B East 90757
4 Product B North 67159
5 Product B South 69457
6 Product B West 359137
7 Product C East 256596
8 Product C North 428436
9 Product C South 98555
10 Product C West 254046
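As an alternative, you can skip the reset_index() step by passing as_index=False to groupby(). This is a minimal sketch, using the same sales_data, that produces the same flat result in one step:
# as_index=False keeps Product and Region as ordinary columns
revenue_flat = sales_data.groupby(['Product', 'Region'], as_index=False)['Revenue'].sum()
print(revenue_flat)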
We can apply multiple aggregation functions at once using the agg() method. For example, if we want to calculate the total revenue, average revenue, and number of sales for each group, we can write:
# Apply several aggregation functions to Revenue in a single call
agg_data = grouped_data['Revenue'].agg(['sum', 'mean', 'count'])
print(agg_data)
This calculates the sum, mean, and count of the Revenue column for each group. The result is a DataFrame indexed by Product and Region, with columns sum, mean, and count.
Output:
sum mean count
Product Region
Product A East 119479 119479.000000 1
South 73483 73483.000000 1
West 327706 109235.333333 3
Product B East 90757 90757.000000 1
North 67159 67159.000000 1
South 69457 69457.000000 1
West 359137 119712.333333 3
Product C East 256596 128298.000000 2
North 428436 107109.000000 4
South 98555 98555.000000 1
West 254046 127023.000000 2
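If you prefer descriptive column names rather than the raw function names, named aggregation is another option. Here is a small sketch; the names total_revenue, avg_revenue, and num_sales are just illustrative choices:
# Named aggregation gives each output column an explicit name
agg_named = sales_data.groupby(['Product', 'Region']).agg(
    total_revenue=('Revenue', 'sum'),
    avg_revenue=('Revenue', 'mean'),
    num_sales=('Revenue', 'count'),
).reset_index()
print(agg_named)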
Filtering Data
We can filter the data based on the aggregated values using the filter() method. For example, if we want to keep only the rows from groups whose total revenue is greater than 100,000, we can write:
# Keep only the rows belonging to groups whose total revenue exceeds 100,000
filtered_data = grouped_data.filter(lambda x: x['Revenue'].sum() > 100000)
print(filtered_data)
This filters the data to include only those groups where the sum of the Revenue column is greater than 100,000.
Output:
Product Region Quarter Revenue
0 Product C West Q2 123969
1 Product A West Q2 93001
2 Product C North Q3 126552
3 Product C North Q2 73897
4 Product A West Q3 118148
8 Product C West Q4 130077
9 Product C North Q1 85920
10 Product C North Q3 142067
11 Product C East Q1 117121
12 Product A East Q3 119479
13 Product C East Q3 139475
15 Product A West Q1 116557
16 Product B West Q3 127189
17 Product B West Q2 128953
18 Product B West Q4 102995
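The same result can also be obtained with a boolean mask built from transform(), which broadcasts each group's total back onto the original rows. A minimal sketch, again assuming the sales_data DataFrame from above:
# transform('sum') returns each group's total aligned to every original row
group_totals = sales_data.groupby(['Product', 'Region'])['Revenue'].transform('sum')
filtered_alt = sales_data[group_totals > 100000]
print(filtered_alt)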
Best Practices
Data Cleaning:
- Before grouping and aggregating, clean the data by handling missing values, outliers, and irrelevant entries to ensure accurate analysis.
Data Exploration:
- Understand the data distribution and characteristics before applying grouping and aggregation. Use descriptive statistics to gain insights into the dataset.
Efficient Grouping:
- For large datasets, choose the grouping columns wisely to avoid unnecessary computation. Grouping by categorical columns is generally more efficient.
Code Modularity:
- Break down complex grouping and aggregation tasks into modular functions for better readability and maintainability.
Handling Multi-level Index:
- Be aware of multi-level index issues when applying multiple aggregation functions. Use reset_index() to flatten the DataFrame if needed.
Error Handling:
- Implement try-except blocks when reading files to handle potential errors, providing informative messages for troubleshooting (a sketch is shown below).
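For example, a minimal sketch of defensive file reading, using the sales_data.csv file assumed throughout this article:
import pandas as pd

try:
    sales_data = pd.read_csv('sales_data.csv')
except FileNotFoundError:
    print("Could not find sales_data.csv; check the file path.")
except pd.errors.ParserError as exc:
    print(f"Could not parse sales_data.csv: {exc}")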
Conclusion
In this article, we discussed how to group by and aggregate on multiple columns in Pandas. We learned how to group data by multiple columns using the groupby() method, how to apply aggregation functions to calculate statistics for each group, how to apply multiple aggregation functions at once using the agg() method, and how to filter the data based on the aggregated values using the filter() method. Pandas provides powerful tools for working with data, and grouping and aggregating is an important technique for data analysis.