How to Compare Two Pandas Dataframes for Differences

This post covers practical techniques for comparing pandas dataframes, so you can pinpoint discrepancies and changes in your data before they affect the accuracy of your analysis.

As a data scientist or software engineer, you will often need to compare two pandas dataframes to identify differences in data. This is an important task in data analysis as it helps to identify anomalies or changes in data that could affect the accuracy of your analysis. In this article, we will explore how to compare two pandas dataframes for differences.

Understanding Pandas Dataframes

Before we dive into the process of comparing two pandas dataframes, let us first understand what pandas dataframes are. Pandas is a popular Python library used for data manipulation and analysis. It provides a powerful data structure called a dataframe, which is essentially a two-dimensional table with labeled axes (rows and columns).

Each column in a pandas dataframe represents a variable, while each row represents an observation. Dataframes can be created from a variety of sources, including CSV files, SQL databases, and Excel spreadsheets.
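As a minimal illustration of that structure, the sketch below builds a small dataframe in memory and inspects its labeled axes. The column names and values are made up for the example; in practice the data would more likely come from a reader such as pd.read_csv().

```python
import pandas as pd

# Hypothetical sample data: each key is a column (variable),
# each list position is a row (observation).
df = pd.DataFrame({'name': ['Ann', 'Bob', 'Cy'], 'age': [31, 45, 27]})

shape = df.shape            # (number of rows, number of columns)
columns = list(df.columns)  # column labels
index = list(df.index)      # row labels (a default 0..n-1 range here)

print(shape, columns, index)
```

The row and column labels matter for everything that follows, because pandas comparisons match cells by label rather than by position.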

Comparing Two Pandas Dataframes

To compare two pandas dataframes, we need to consider two scenarios:

  1. Comparing dataframes with the same shape and column names
  2. Comparing dataframes with different shapes or column names

Comparing Dataframes with the Same Shape and Column Names

When comparing two dataframes with the same shape and column names, we can use the equals() method provided by pandas. It returns a single boolean value indicating whether the two dataframes have the same shape, labels, and elements, with matching dtypes in each column.

import pandas as pd

# create two dataframes with same shape and column names
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# compare dataframes
print(df1.equals(df2)) # True

Output:

True

In the above example, we created two dataframes df1 and df2 with the same shape and column names. We then used the equals() function to compare the two dataframes and print the result, which is True.
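equals() is stricter than it may first appear, so a short sketch of its edge cases may help. The example below uses made-up data and suggests that two dataframes holding the same numbers with different dtypes are not considered equal, while missing values in the same positions are. pandas.testing.assert_frame_equal is shown as one option when a descriptive error is preferable to a bare False; its check_dtype parameter relaxes the dtype check.

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

ints = pd.DataFrame({'A': [1, 2, 3]})
floats = pd.DataFrame({'A': [1.0, 2.0, 3.0]})

# Same values, different dtypes (int64 vs float64): not equal.
same_values_other_dtype = ints.equals(floats)

# NaN in the same position on both sides counts as equal for equals().
left = pd.DataFrame({'A': [1.0, np.nan]})
right = pd.DataFrame({'A': [1.0, np.nan]})
nan_equal = left.equals(right)

# assert_frame_equal raises an AssertionError describing any mismatch;
# check_dtype=False ignores the int/float difference, so this passes.
assert_frame_equal(ints, floats, check_dtype=False)

print(same_values_other_dtype, nan_equal)
```

If a dtype mismatch is not meaningful for your data, converting both sides with astype() before calling equals() is another possible approach.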

Comparing Dataframes Using the compare() Method

equals() only tells us whether two dataframes match. When two dataframes have the same row and column labels but may hold different values, the compare() method provided by pandas shows exactly where they differ.

The compare() method (available since pandas 1.1) compares two dataframes element-wise and returns a dataframe containing the differences. By default, the result keeps only the rows and columns in which at least one value differs. Each such column is split into two sub-columns, self and other, holding the values from the calling dataframe and the argument dataframe; cells where the two values are equal are shown as NaN.

import pandas as pd

# create two dataframes with the same shape and labels but different values
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 7], 'C': [10, 11, 12]})

# compare dataframes
diff = df1.compare(df2)

# print differences
print(diff)

In the above example, we created two dataframes df1 and df2 with the same shape and labels but different values in columns B and C. We then used the compare() method to compare the two dataframes and stored the differences in the variable diff. Finally, we printed the differences using the print() function.

The output of the above code will be:

     B          C      
  self other self other
0  NaN   NaN    7    10
1  NaN   NaN    8    11
2  6.0   7.0    9    12

The output shows the differences between the two dataframes. The rows are the index labels where at least one value differs, and the top-level columns are the columns that contain a difference; column A is omitted because it is identical in both dataframes. The self and other sub-columns hold the values from df1 and df2, respectively. NaN marks a cell where the two dataframes agree, which is also why column B is displayed as floats.

In the above output, we can see that column B differs only in row 2 (6 in df1, 7 in df2), while column C differs in all three rows (0, 1, and 2).
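compare() also accepts a few optional parameters that change the layout of the result. The sketch below, using a smaller made-up pair of dataframes, shows keep_shape, keep_equal, and align_axis as they are documented for pandas 1.1 and later; treat it as an illustration rather than a complete tour of the method.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 7]})

# Default: only rows and columns that contain a difference are kept.
diff = df1.compare(df2)

# keep_shape=True keeps every row and column; keep_equal=True shows
# matching values instead of NaN.
full = df1.compare(df2, keep_shape=True, keep_equal=True)

# align_axis=0 stacks self/other as rows instead of side-by-side columns.
stacked = df1.compare(df2, align_axis=0)

print(diff)
print(full)
print(stacked)
```

The default layout tends to be easiest to read for a handful of differences, while keep_shape=True can be convenient when the result needs to line up with the original dataframes.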

Note: compare() requires both DataFrames to have identical row and column labels, not just the same number of records. If the labels differ, pandas raises a ValueError stating that only identically-labeled objects can be compared. The exact wording varies by pandas version, for example:

ValueError: Can only compare identically-labeled DataFrame objects
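When the two dataframes really do have different shapes or columns, one possible approach is to first find the rows that exist on only one side and then restrict both dataframes to their shared labels before calling compare(). The sketch below assumes a hypothetical key column named id that identifies each record; the data and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical data: df2 has a different set of rows and an extra column.
df1 = pd.DataFrame({'id': [1, 2, 3], 'value': [10, 20, 30]})
df2 = pd.DataFrame({'id': [2, 3, 4], 'value': [20, 35, 40],
                    'note': ['a', 'b', 'c']})

# Rows present in only one dataframe, matched on the shared key 'id'.
# indicator=True adds a '_merge' column: left_only, right_only, or both.
merged = df1.merge(df2[['id']], on='id', how='outer', indicator=True)
only_in_df1 = merged.loc[merged['_merge'] == 'left_only', 'id'].tolist()
only_in_df2 = merged.loc[merged['_merge'] == 'right_only', 'id'].tolist()

# Restrict both dataframes to their shared rows and columns so that
# compare() sees identical labels on both sides.
left = df1.set_index('id')
right = df2.set_index('id')
common_rows = left.index.intersection(right.index)
common_cols = left.columns.intersection(right.columns)
diff = left.loc[common_rows, common_cols].compare(
    right.loc[common_rows, common_cols]
)

print(only_in_df1, only_in_df2)
print(diff)
```

This reports added or removed records separately from changed values, which is often what a data audit needs. It assumes the key values are unique in each dataframe.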

Conclusion

Comparing two pandas dataframes is an essential task in data analysis. In this article, we learned how to compare two pandas dataframes for differences, first by checking whether they are equal and then by locating the individual values that differ.

We learned that the equals() method provided by pandas returns a single boolean telling us whether two dataframes match. To see exactly which values differ, we can use the compare() method, which returns a dataframe containing the differences. compare() requires both dataframes to have identical row and column labels, so dataframes with different shapes or column names must be brought to a common set of labels first.

By following the techniques described in this article, you can easily compare two pandas dataframes and identify any differences in the data.

