How to Fill Missing Values of One Column from Another Column in Pandas

In this blog post, explore essential techniques for handling missing values in datasets as a data scientist or software engineer. Address the challenges posed by missing data in your analysis by learning how to effectively fill gaps using information from other columns in a dataset, using Pandas—an indispensable data analysis library in Python.

As a data scientist or software engineer, you may often encounter datasets with missing values. These missing values can cause problems in your analysis, and it’s crucial to handle them appropriately. One common technique is to fill missing values using information from other columns in the dataset. In this blog post, we’ll discuss how to fill missing values of one column from another column in pandas, a popular data analysis library in Python.

The Problem: Missing Values

Before we dive into the solution, let’s understand the problem of missing values. In a dataset, missing values can occur due to various reasons such as data entry errors, equipment failure, or simply because the data was not collected. Missing values can be represented by various symbols such as NaN, NA, or None.

In pandas, missing values are represented by NaN (Not a Number) by default. Let’s consider a simple example of a dataset containing information about employees:

Employee_IDNameAgeSalary
001Alice285000
002Bob32NaN
003ClaireNaN6000
004David355500

In this dataset, we can see that there are missing values in the “Salary” and “Age” columns. Our goal is to fill these missing values using information from other columns.

The Solution: Fill Missing Values from Another Column

To fill missing values of one column from another column in pandas, we can use the fillna() method. This method allows us to replace missing values with a specified value or with values from another column.

Let’s consider the example of the employee dataset again. We want to fill the missing values in the “Salary” column with the median salary of the employees. We can use the following code to achieve this:

import pandas as pd

# Read the dataset
df = pd.read_csv('employee_data.csv')

# Calculate the median salary
median_salary = df['Salary'].median()

# Fill missing values in the "Salary" column with the median salary
df['Salary'] = df['Salary'].fillna(median_salary)

# Show df
print(df)

Output:

   Employee_ID    Name   Age  Salary
0            1   Alice  28.0  5000.0
1            2     Bob  32.0  5500.0
2            3  Claire   NaN  6000.0
3            4   David  35.0  5500.0

In this code, we first read the dataset using the read_csv() method. We then calculate the median salary of the employees using the median() method. Finally, we use the fillna() method to fill the missing values in the “Salary” column with the median salary.

Similarly, we can fill missing values in the “Age” column with the average age of the employees using the following code:

# Calculate the average age
average_age = df['Age'].mean()

# Fill missing values in the "Age" column with the average age
df['Age'] = df['Age'].fillna(average_age)

# Show df
print(df)

Output:

   Employee_ID    Name        Age  Salary
0            1   Alice  28.000000  5000.0
1            2     Bob  32.000000  5500.0
2            3  Claire  31.666667  6000.0
3            4   David  35.000000  5500.0

In this code, we calculate the average age of the employees using the mean() method and fill the missing values in the “Age” column with the average age using the fillna() method.

Some other common methods

Forward Fill

df_ffill = df.ffill()
print("\nForward fill:\n", df_ffill)

Forward fill copies the previous non-NaN value to fill NaN, ensuring continuity in the dataset.

Output:

Forward fill:
    Employee_ID    Name   Age  Salary
0            1   Alice  28.0  5000.0
1            2     Bob  32.0  5000.0
2            3  Claire  32.0  6000.0
3            4   David  35.0  5500.0

Backward Fill

df_bfill = df.bfill()
print("\nBackward fill:\n", df_bfill)

Backward fill copies the next non-NaN value to fill NaN, maintaining a smooth transition in the dataset.

Output:

Backward fill:
    Employee_ID    Name   Age  Salary
0            1   Alice  28.0  5000.0
1            2     Bob  32.0  6000.0
2            3  Claire  35.0  6000.0
3            4   David  35.0  5500.0

Interpolate NaN Values

df_interpolated = df.interpolate()
print("\nInterpolate NaN values:\n", df_interpolated)

Interpolation estimates missing values based on the values present before and after, providing a more nuanced approach.

Output:

Interpolate NaN values:
    Employee_ID    Name   Age  Salary
0            1   Alice  28.0  5000.0
1            2     Bob  32.0  5500.0
2            3  Claire  33.5  6000.0
3            4   David  35.0  5500.0

Conclusion

In this blog post, we discussed how to fill missing values of one column from another column in pandas. We saw that missing values can cause problems in data analysis and how to handle them appropriately using the fillna() method. We also saw an example of how to fill missing values in a dataset containing information about employees using information from other columns. Furthermore, we also explored some other methods that could fit your specific problem.

Handling missing values is just one of many data preprocessing techniques that a data scientist or software engineer should be familiar with. In a real-world scenario, datasets can be much larger and complex, and missing values can be more challenging to handle. However, with the right tools and techniques, we can effectively handle missing values and perform accurate data analysis.


About Saturn Cloud

Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Request a demo today to learn more.