How to Fill Missing Values of One Column from Another Column in Pandas
As a data scientist or software engineer, you may often encounter datasets with missing values. These missing values can cause problems in your analysis, and it’s crucial to handle them appropriately. One common technique is to fill missing values using information from other columns in the dataset. In this blog post, we’ll discuss how to fill missing values of one column from another column in pandas, a popular data analysis library in Python.
The Problem: Missing Values
Before we dive into the solution, let’s understand the problem of missing values. In a dataset, missing values can occur due to various reasons such as data entry errors, equipment failure, or simply because the data was not collected. Missing values can be represented by various symbols such as NaN, NA, or None.
In pandas, missing values are represented by NaN
(Not a Number) by default. Let’s consider a simple example of a dataset containing information about employees:
Employee_ID | Name | Age | Salary |
---|---|---|---|
001 | Alice | 28 | 5000 |
002 | Bob | 32 | NaN |
003 | Claire | NaN | 6000 |
004 | David | 35 | 5500 |
In this dataset, we can see that there are missing values in the “Salary” and “Age” columns. Our goal is to fill these missing values using information from other columns.
The Solution: Fill Missing Values from Another Column
To fill missing values of one column from another column in pandas, we can use the fillna()
method. This method allows us to replace missing values with a specified value or with values from another column.
Let’s consider the example of the employee dataset again. We want to fill the missing values in the “Salary” column with the median salary of the employees. We can use the following code to achieve this:
import pandas as pd
# Read the dataset
df = pd.read_csv('employee_data.csv')
# Calculate the median salary
median_salary = df['Salary'].median()
# Fill missing values in the "Salary" column with the median salary
df['Salary'] = df['Salary'].fillna(median_salary)
# Show df
print(df)
Output:
Employee_ID Name Age Salary
0 1 Alice 28.0 5000.0
1 2 Bob 32.0 5500.0
2 3 Claire NaN 6000.0
3 4 David 35.0 5500.0
In this code, we first read the dataset using the read_csv()
method. We then calculate the median salary of the employees using the median()
method. Finally, we use the fillna()
method to fill the missing values in the “Salary” column with the median salary.
Similarly, we can fill missing values in the “Age” column with the average age of the employees using the following code:
# Calculate the average age
average_age = df['Age'].mean()
# Fill missing values in the "Age" column with the average age
df['Age'] = df['Age'].fillna(average_age)
# Show df
print(df)
Output:
Employee_ID Name Age Salary
0 1 Alice 28.000000 5000.0
1 2 Bob 32.000000 5500.0
2 3 Claire 31.666667 6000.0
3 4 David 35.000000 5500.0
In this code, we calculate the average age of the employees using the mean()
method and fill the missing values in the “Age” column with the average age using the fillna()
method.
Some other common methods
Forward Fill
df_ffill = df.ffill()
print("\nForward fill:\n", df_ffill)
Forward fill copies the previous non-NaN value to fill NaN, ensuring continuity in the dataset.
Output:
Forward fill:
Employee_ID Name Age Salary
0 1 Alice 28.0 5000.0
1 2 Bob 32.0 5000.0
2 3 Claire 32.0 6000.0
3 4 David 35.0 5500.0
Backward Fill
df_bfill = df.bfill()
print("\nBackward fill:\n", df_bfill)
Backward fill copies the next non-NaN value to fill NaN, maintaining a smooth transition in the dataset.
Output:
Backward fill:
Employee_ID Name Age Salary
0 1 Alice 28.0 5000.0
1 2 Bob 32.0 6000.0
2 3 Claire 35.0 6000.0
3 4 David 35.0 5500.0
Interpolate NaN Values
df_interpolated = df.interpolate()
print("\nInterpolate NaN values:\n", df_interpolated)
Interpolation estimates missing values based on the values present before and after, providing a more nuanced approach.
Output:
Interpolate NaN values:
Employee_ID Name Age Salary
0 1 Alice 28.0 5000.0
1 2 Bob 32.0 5500.0
2 3 Claire 33.5 6000.0
3 4 David 35.0 5500.0
Conclusion
In this blog post, we discussed how to fill missing values of one column from another column in pandas. We saw that missing values can cause problems in data analysis and how to handle them appropriately using the fillna()
method. We also saw an example of how to fill missing values in a dataset containing information about employees using information from other columns. Furthermore, we also explored some other methods that could fit your specific problem.
Handling missing values is just one of many data preprocessing techniques that a data scientist or software engineer should be familiar with. In a real-world scenario, datasets can be much larger and complex, and missing values can be more challenging to handle. However, with the right tools and techniques, we can effectively handle missing values and perform accurate data analysis.
About Saturn Cloud
Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Request a demo today to learn more.
Saturn Cloud provides customizable, ready-to-use cloud environments for collaborative data teams.
Try Saturn Cloud and join thousands of users moving to the cloud without
having to switch tools.