How to Compare Two Pandas Dataframes for Differences
As a data scientist or software engineer, you will often need to compare two pandas dataframes to identify differences in data. This is an important task in data analysis as it helps to identify anomalies or changes in data that could affect the accuracy of your analysis. In this article, we will explore how to compare two pandas dataframes for differences.
Understanding Pandas Dataframes
Before we dive into the process of comparing two pandas dataframes, let us first understand what pandas dataframes are. Pandas is a popular Python library used for data manipulation and analysis. It provides a powerful data structure called a dataframe, which is essentially a two-dimensional table with labeled axes (rows and columns).
Each column in a pandas dataframe represents a variable, while each row represents an observation. Dataframes can be created from a variety of sources, including CSV files, SQL databases, and Excel spreadsheets.
Comparing Two Pandas Dataframes
To compare two pandas dataframes, we need to consider two scenarios:
- Comparing dataframes with the same shape and column names
- Comparing dataframes with different shapes or column names
Comparing Dataframes with the Same Shape and Column Names
When comparing two dataframes with the same shape and column names, we can use the equals() method provided by pandas. It returns a single boolean value indicating whether the two dataframes contain the same elements.
import pandas as pd
# create two dataframes with same shape and column names
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# compare dataframes
print(df1.equals(df2)) # True
Output:
True
In the above example, we created two dataframes, df1 and df2, with the same shape and column names. We then used the equals() method to compare them and printed the result, which is True.
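Two details of equals() are easy to miss: it treats NaN values in the same position as equal, and it requires the corresponding columns to have the same dtype. The following is a minimal sketch of both behaviors; the dataframe names (with_nan, ints, floats) are made up for this illustration.

```python
import numpy as np
import pandas as pd

# NaN values in the same position count as equal for equals(),
# even though an element-wise == comparison reports them as different.
with_nan = pd.DataFrame({'A': [1.0, np.nan], 'B': [3.0, 4.0]})
same_with_nan = with_nan.copy()
nan_equal = with_nan.equals(same_with_nan)
elementwise_all = bool((with_nan == same_with_nan).all().all())

# Same numbers but different dtypes (int64 vs float64):
# equals() returns False although every value matches element-wise.
ints = pd.DataFrame({'A': [1, 2, 3]})
floats = pd.DataFrame({'A': [1.0, 2.0, 3.0]})
dtype_equal = ints.equals(floats)
values_equal = bool((ints == floats).all().all())

print(nan_equal, elementwise_all)  # True False
print(dtype_equal, values_equal)   # False True
```

If equals() unexpectedly returns False, checking df1.dtypes against df2.dtypes is a reasonable first step.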
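Comparing Dataframes Using the compare() Method
The equals() method only tells us whether two dataframes match. To see where they differ, we can use the compare() method provided by pandas (available since version 1.1.0). Like equals() in the previous example, it is applied here to dataframes with the same shape; in fact, compare() requires both dataframes to have identical row and column labels.
The compare() method compares two dataframes element-wise and returns a dataframe containing only the differences. By default, rows and columns in which every value is equal are dropped, and each remaining column is split into two sub-columns, self and other, holding the values from the calling dataframe and from the dataframe passed as the argument, respectively. Cells that are equal in both dataframes appear as NaN.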
import pandas as pd
# create two dataframes with the same shape and column names but different values
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 7], 'C': [10, 11, 12]})
# compare dataframes
diff = df1.compare(df2)
# print differences
print(diff)
In the above example, we created two dataframes, df1 and df2, with the same shape and column names but different values in columns B and C. We then used the compare() method to compare them, stored the differences in the variable diff, and printed the result.
The output of the above code will be:
     B          C
  self other self other
0  NaN   NaN    7    10
1  NaN   NaN    8    11
2  6.0   7.0    9    12
The output shows the differences between the two dataframes. The rows carry the index labels of the input dataframes, and the top-level column labels carry their column names. Under each column name, self holds the value from df1 and other holds the value from df2. Cells that are equal in both dataframes are shown as NaN, which is also why column B is displayed with floating-point values. Column A does not appear at all because its values are identical in both dataframes.
In the above output, we can see that column B differs only in row 2 (6 in df1, 7 in df2), while column C differs in all three rows (0, 1, and 2).
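The shape of the result can be adjusted with the keep_shape, keep_equal, and align_axis parameters of compare(). The sketch below reuses the df1 and df2 values from the example above; the variable names full, full_with_values, and stacked are chosen only for this illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 7], 'C': [10, 11, 12]})

# Keep every row and column, including those without any difference.
# Equal cells are still shown as NaN.
full = df1.compare(df2, keep_shape=True)

# Keep the full shape and show the equal values instead of NaN.
full_with_values = df1.compare(df2, keep_shape=True, keep_equal=True)

# Stack self/other as alternating rows instead of side-by-side sub-columns.
stacked = df1.compare(df2, align_axis=0)

print(full)
print(full_with_values)
print(stacked)
```

With keep_shape=True, column A is included in the result even though it has no differences, which can be convenient when the result needs to line up with the original dataframes.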
Note: compare() requires both dataframes to have identical row and column labels, not just the same number of records. If the labels do not match, pandas raises a ValueError. The exact wording depends on the pandas version; in recent releases it reads:
ValueError: Can only compare identically-labeled (both index and columns) DataFrame objects
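When the two dataframes do not share the same labels, one possible workaround (a sketch, not the only approach) is to restrict both dataframes to the rows and columns they have in common before calling compare(), and to report the remaining labels separately. The data and variable names below are assumptions made for this illustration.

```python
import pandas as pd

# df2 has one row fewer and one extra column compared to df1.
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2], 'B': [4, 9], 'C': [7, 8]})

# Calling compare() directly fails because the labels differ.
try:
    df1.compare(df2)
    mismatch_error = None
except ValueError as err:
    mismatch_error = str(err)

# Restrict both dataframes to their shared row and column labels.
common_rows = df1.index.intersection(df2.index)
common_cols = df1.columns.intersection(df2.columns)
diff = df1.loc[common_rows, common_cols].compare(df2.loc[common_rows, common_cols])

# Labels present in only one of the dataframes.
rows_only_in_df1 = df1.index.difference(df2.index)
cols_only_in_df2 = df2.columns.difference(df1.columns)

print(mismatch_error)
print(diff)
print(list(rows_only_in_df1), list(cols_only_in_df2))
```

This assumes that rows with the same index label are meant to be the same record. If the records are identified by a key column instead, setting that column as the index with set_index() first is a reasonable preparation step.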
Conclusion
Comparing two pandas dataframes is an essential task in data analysis. In this article, we learned how to compare two pandas dataframes for differences using two methods: equals() and compare().
The equals() method gives a quick yes-or-no answer to whether two dataframes contain the same data. The compare() method shows where two dataframes with the same shape and labels differ, returning a dataframe that contains only the differing values. Dataframes with different shapes or column names cannot be passed to compare() directly: it raises a ValueError unless their row and column labels match.
By following the techniques described in this article, you can easily compare two pandas dataframes and identify any differences in the data.