# Calculating Averages of Multiple Columns Ignoring NaN A Guide for Data Scientists

In this blog, we will delve into a common task for data scientists – the calculation of averages in data analysis. Handling large datasets with missing values adds complexity to this operation. Here, we’ll investigate the methodology of computing averages across multiple columns while disregarding NaN values, leveraging the robust capabilities of the pandas and numpy libraries.

As a data scientist, one of the most common operations you will perform in your data analysis work is calculating averages. However, when dealing with large datasets that contain missing data, calculating averages can be tricky. In this article, we’ll explore how to calculate averages of multiple columns while ignoring NaN values using the powerful pandas and numpy libraries.

## What is NaN?

NaN stands for “Not a Number,” and it is a value used in computer programming to represent undefined or unrepresentable values. In pandas and numpy, NaN is used to represent missing or null values in a dataset. NaN values can occur for various reasons, such as data entry errors, incomplete data, or data corruption.

## Calculating Averages of Multiple Columns

To calculate averages of multiple columns in pandas, we can use the `mean()` function. However, if the dataset contains NaN values, the `mean()` function will return NaN for any column that contains at least one NaN value. This can be problematic if we want to calculate the average of multiple columns while ignoring NaN values.

Fortunately, pandas provides a simple solution to this problem: the `mean()` function has an optional parameter called `skipna` that we can set to `True` to ignore NaN values when calculating the average. Here’s an example:

``````import pandas as pd

# create a sample dataframe with NaN values
df = pd.DataFrame({
'A': [1, 2, 3, 4, 5, 6],
'B': [7, 8, 9, 10, 11, 12],
'C': [13, 14, None, 16, 17, 18],
'D': [19, None, 21, 22, 23, 24],
'E': [25, 26, 27, 28, None, 30]
})

# calculate the average of columns A, B, and C, ignoring NaN values
avg = df[['A', 'B', 'C']].mean(skipna=True)

# print the result
print(avg)
``````

Output:

``````A    3.5
B    9.5
C    15.0
dtype: float64
``````

In this example, we created a sample dataframe `df` that contains NaN values in columns C, D, and E. We then selected columns A, B, and C using the `[['A', 'B', 'C']]` syntax and calculated the average using the `mean()` function with `skipna=True`. The result is a pandas Series object that contains the average of each column.

## Calculating Averages of Multiple Columns using Numpy

Numpy is another powerful library that is widely used in data science and provides a powerful set of functions for numerical computing. Numpy provides a function called `nanmean()` that can be used to calculate the average of multiple columns while ignoring NaN values.

Here’s an example:

``````import numpy as np

# create a sample numpy array with NaN values
df = pd.DataFrame({
'A': [1, 2, 3, 4, 5, 6],
'B': [7, 8, 9, 10, 11, 12],
'C': [13, 14, None, 16, 17, 18],
'D': [19, None, 21, 22, 23, 24],
'E': [25, 26, 27, 28, None, 30]
})
# calculate the average of columns 0, 1, and 2, ignoring NaN values
avg = np.nanmean(df[['A', 'B', 'C']], axis=0)

# print the result
print(avg)
``````

Output:

``````[ 3.5  9.5 15.6]
``````

In this example, we created a sample numpy array `arr` that contains NaN values in columns 1, 3, and 4. We then selected columns 0, 1, and 2 using the `[:, :3]` syntax and calculated the average using the `nanmean()` function with `axis=0` to calculate the average along the columns. The result is a numpy array that contains the average of each column.

## Calculating Averages of Multiple Columns using Custom Function

The custom function method involves creating a user-defined function tailored to calculate column-wise averages while handling NaN values in a customized manner.

``````import numpy as np

def custom_mean(column):
values = [val for val in column if not np.isnan(val)]
return sum(values) / len(values) if len(values) > 0 else np.nan

# Assuming df is your DataFrame
avg = df.apply(custom_mean, axis=0)
print(avg)
``````

Output:

``````A     3.5
B     9.5
C    15.6
D    21.8
E    27.2
dtype: float64
``````

In the provided example, the `custom_mean` function iterates through each column, excluding NaN values, and computes the mean by summing the valid values and dividing by their count. This approach offers flexibility for specialized requirements but may be less efficient than built-in methods, especially for large datasets.

## Common Errors and Solutions

### Error 1: Handling NaN Incorrectly

• Description: Including NaN values in the calculation can skew results.

• Solution: Use the `skipna` parameter in Pandas or `nanmean` in numpy with the appropriate axis.

``````# Pandas example
column_means = df.mean(skipna=True)

# numpy example
column_means = np.nanmean(data, axis=0)
``````

### Error 2: Incorrect Data Types

• Description: Data types incompatible with mean calculations.

• Solution: Ensure all data is of a numeric type before calculating averages.

``````# Convert DataFrame columns to numeric
df = df.apply(pd.to_numeric, errors='coerce')
``````

### Error 3: Inconsistent Column Lengths

• Description: Columns with different lengths can lead to unexpected results.

• Solution: Verify and handle inconsistencies in column lengths.

``````# Check and handle inconsistent lengths
if len(set(df.apply(len))) > 1:
# Handle inconsistency, e.g., fill with NaN or truncate columns
``````

## Conclusion

Calculating averages of multiple columns while ignoring NaN values is a common operation in data analysis. In this article, we explored how to accomplish this task using pandas and numpy libraries. By setting the `skipna=True` parameter in the `mean()` function in pandas or using the `nanmean()` function in numpy, we can easily calculate the average of multiple columns while ignoring NaN values. Incorporating these techniques into your data analysis workflow will help you to avoid errors caused by NaN values and ensure that your results are accurate.

#### About Saturn Cloud

Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Request a demo today to learn more.