How to Group By and Aggregate on Multiple Columns in Pandas

In this blog, we’ll explore the essential task of working with large datasets for data scientists or software engineers. Handling extensive data often requires grouping and aggregating information based on multiple columns. Python’s widely used data analysis library, Pandas, offers robust tools for these tasks. The focus of this article will be on demonstrating the process of grouping by and aggregating data across multiple columns using Pandas.

As a data scientist or software engineer, working with large datasets is a common task. In such cases, grouping and aggregating data based on multiple columns is often necessary. Pandas is a popular data analysis library in Python that provides powerful tools for working with data. In this article, we will discuss how to group by and aggregate on multiple columns in Pandas.

Table of Contents

  1. What is Grouping and Aggregation?
  2. Grouping and Aggregating in Pandas
  3. Aggregating Data
  4. Filtering Data
  5. Best Practices
  6. Conclusion

What is Grouping and Aggregation?

Grouping is the process of dividing data into smaller subsets based on one or more criteria. Aggregation is the process of summarizing or calculating statistics for each subset. For example, if we have a dataset of sales data for a company, we may want to group the data by product type and region, and then calculate the total revenue for each group.

Let’s consider the following sales data:

      Product Region Quarter  Revenue
0   Product C   West      Q2   123969
1   Product A   West      Q2    93001
2   Product C  North      Q3   126552
3   Product C  North      Q2    73897
4   Product A   West      Q3   118148
5   Product A  South      Q4    73483
6   Product C  South      Q3    98555
7   Product B  North      Q4    67159
8   Product C   West      Q4   130077
9   Product C  North      Q1    85920
10  Product C  North      Q3   142067
11  Product C   East      Q1   117121
12  Product A   East      Q3   119479
13  Product C   East      Q3   139475
14  Product B  South      Q1    69457
15  Product A   West      Q1   116557
16  Product B   West      Q3   127189
17  Product B   West      Q2   128953
18  Product B   West      Q4   102995
19  Product B   East      Q1    90757

Grouping and Aggregating in Pandas

Pandas provides the groupby() method to group data based on one or more columns. Once the data is grouped, we can apply various aggregation functions such as sum(), mean(), max(), min(), count(), etc. to calculate statistics for each group.

To group data by multiple columns in Pandas, we simply pass a list of column names to the groupby() method. For example, if we have a dataset of sales data with columns Product, Region, Quarter, and Revenue, and we want to group the data by Product and Region columns, we can write:

import pandas as pd

sales_data = pd.read_csv('sales_data.csv')

grouped_data = sales_data.groupby(['Product', 'Region'])

This will group the data by Product and Region columns. The groupby() method returns a DataFrameGroupBy object, which is a special type of pandas object that allows us to apply aggregation functions to each group.

Aggregating Data

Once we have grouped the data, we can apply various aggregation functions to calculate statistics for each group. For example, we can calculate the total revenue for each group using the sum() function as follows:

revenue_by_group = grouped_data['Revenue'].sum()
print(revenue_by_group)

Output:

Product    Region
Product A  East      119479
           South      73483
           West      327706
Product B  East       90757
           North      67159
           South      69457
           West      359137
Product C  East      256596
           North     428436
           South      98555
           West      254046
Name: Revenue, dtype: int64

This will calculate the sum of Revenue column for each group. The result will be a pandas series with a hierarchical index based on the grouping columns. We can reset the index to convert the series to a pandas dataframe as follows:

revenue_by_group = revenue_by_group.reset_index()
print(revenue_by_group)

This will convert the revenue_by_group series to a pandas dataframe with columns Product, Region, and Revenue.

Output:

      Product Region  Revenue
0   Product A   East   119479
1   Product A  South    73483
2   Product A   West   327706
3   Product B   East    90757
4   Product B  North    67159
5   Product B  South    69457
6   Product B   West   359137
7   Product C   East   256596
8   Product C  North   428436
9   Product C  South    98555
10  Product C   West   254046

We can apply multiple aggregation functions at once using the agg() method. For example, if we want to calculate the total revenue, average revenue, and number of sales for each group, we can write:

agg_data = grouped_data['Revenue'].agg(['sum', 'mean', 'count'])
print(agg_data)

This will calculate the sum, mean, and count of Revenue column for each group. The result will be a pandas dataframe with columns Product, Region, sum, mean, and count.

Output:

                     sum           mean  count
Product   Region                              
Product A East    119479  119479.000000      1
          South    73483   73483.000000      1
          West    327706  109235.333333      3
Product B East     90757   90757.000000      1
          North    67159   67159.000000      1
          South    69457   69457.000000      1
          West    359137  119712.333333      3
Product C East    256596  128298.000000      2
          North   428436  107109.000000      4
          South    98555   98555.000000      1
          West    254046  127023.000000      2

Filtering Data

We can filter the data based on the aggregated values using the filter() method. For example, if we want to filter the data to include only those groups where the total revenue is greater than 100,000, we can write:

filtered_data = grouped_data.filter(lambda x: x['Revenue'].sum() > 100000)
print(filtered_data)

This will filter the data to include only those groups where the sum of Revenue column is greater than 100,000.

Output:

      Product Region Quarter  Revenue
0   Product C   West      Q2   123969
1   Product A   West      Q2    93001
2   Product C  North      Q3   126552
3   Product C  North      Q2    73897
4   Product A   West      Q3   118148
8   Product C   West      Q4   130077
9   Product C  North      Q1    85920
10  Product C  North      Q3   142067
11  Product C   East      Q1   117121
12  Product A   East      Q3   119479
13  Product C   East      Q3   139475
15  Product A   West      Q1   116557
16  Product B   West      Q3   127189
17  Product B   West      Q2   128953
18  Product B   West      Q4   102995

Best Practices

  1. Data Cleaning:

    • Before grouping and aggregating, clean the data by handling missing values, outliers, and irrelevant entries to ensure accurate analysis.
  2. Data Exploration:

    • Understand the data distribution and characteristics before applying grouping and aggregation. Use descriptive statistics to gain insights into the dataset.
  3. Efficient Grouping:

    • For large datasets, choose the grouping columns wisely to avoid unnecessary computation. Grouping by categorical columns is generally more efficient.
  4. Code Modularity:

    • Break down complex grouping and aggregation tasks into modular functions for better readability and maintainability.
  5. Handling Multi-level Index:

    • Be aware of multi-level index issues when applying multiple aggregation functions. Use reset_index() to flatten the DataFrame if needed.
  6. Error Handling:

    • Implement try-except blocks when reading files to handle potential errors, providing informative messages for troubleshooting.

Conclusion

In this article, we discussed how to group by and aggregate on multiple columns in Pandas. We learned how to group data by multiple columns using the groupby() method, how to apply aggregation functions to calculate statistics for each group, how to apply multiple aggregation functions at once using the agg() method, and how to filter the data based on the aggregated values using the filter() method. Pandas provides powerful tools for working with data, and grouping and aggregating is an important technique for data analysis.


About Saturn Cloud

Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Request a demo today to learn more.