How to Find All Duplicate Rows in a Pandas Dataframe

As a data scientist or software engineer, you may often come across a scenario where you have to identify and remove duplicate rows from a pandas dataframe before performing any analysis. Duplicate rows in a dataframe can cause inaccurate results, and therefore, it becomes crucial to identify and remove them. In this article, we will discuss how to find all duplicate rows in a pandas dataframe.

What is a Pandas Dataframe?

A pandas dataframe is a two-dimensional labeled data structure with columns of potentially different types. It is similar to a spreadsheet, or an SQL table, or a dictionary of Series objects, where each column is a Series. A dataframe can hold data of different types including integers, floats, strings, and even complex objects.

What are Duplicate Rows in a Pandas Dataframe?

Duplicate rows are rows that have exactly the same values across all columns. In other words, if two or more rows in a dataframe share identical values in every column, they are considered duplicates.

How to Find Duplicate Rows in a Pandas Dataframe?

To find duplicate rows in a pandas dataframe, we can use the duplicated() function. The duplicated() function returns a boolean series that indicates which rows are duplicate rows. We can then filter the dataframe using this boolean series to get all the duplicate rows.

Here is an example:

import pandas as pd

# create a sample dataframe
df = pd.DataFrame({
    'Name': ['John', 'John', 'Mary', 'Peter', 'John'],
    'Age': [25, 25, 30, 35, 25],
    'City': ['New York', 'Chicago', 'Los Angeles', 'Houston', 'Chicago']
})

# find duplicate rows
duplicate_rows = df.duplicated()

# print duplicate rows
print(duplicate_rows)

Output:

0    False
1    False
2    False
3    False
4     True
dtype: bool

In the above example, we created a sample dataframe with five rows and three columns and used the duplicated() function to find duplicate rows. The output shows that row 4 is a duplicate: it has the same values as row 1 in every column. Note that row 1 itself is reported as False, because duplicated() marks only occurrences that come after the first one.
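Because duplicated() defaults to keep='first', the first occurrence of each duplicated group is not flagged. If you want to retrieve every row that has a duplicate anywhere in the dataframe (which is what "all duplicate rows" usually means), you can pass keep=False and use the resulting series as a boolean mask. A short sketch using the same sample dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'John', 'Mary', 'Peter', 'John'],
    'Age': [25, 25, 30, 35, 25],
    'City': ['New York', 'Chicago', 'Los Angeles', 'Houston', 'Chicago']
})

# keep=False marks every member of a duplicate group,
# not just the occurrences after the first
all_duplicates = df[df.duplicated(keep=False)]
print(all_duplicates)
```

Here both row 1 and row 4 are returned, since they form a duplicate pair.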

How to Find Duplicate Rows Based on Specific Columns?

Sometimes, we may want to find duplicate rows based on specific columns only. We can do this by passing a list of column names to the subset parameter of the duplicated() function.

Here is an example:

import pandas as pd

# create a sample dataframe
df = pd.DataFrame({
    'Name': ['John', 'John', 'Mary', 'Peter', 'John'],
    'Age': [25, 25, 30, 35, 25],
    'City': ['New York', 'Chicago', 'Los Angeles', 'Houston', 'Chicago']
})

# find duplicate rows based on Name and City columns
duplicate_rows = df.duplicated(subset=['Name', 'City'])

# print duplicate rows
print(duplicate_rows)

Output:

0    False
1    False
2    False
3    False
4     True
dtype: bool

In the above example, we passed the list ['Name', 'City'] to the subset parameter, so only the values in those two columns are compared. The output shows that row 4 is a duplicate based on its Name and City values, even though other columns are ignored in the comparison.
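The boolean series only tells you which positions are duplicates; to see the rows themselves, index the dataframe with it. A minimal sketch continuing the example above:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'John', 'Mary', 'Peter', 'John'],
    'Age': [25, 25, 30, 35, 25],
    'City': ['New York', 'Chicago', 'Los Angeles', 'Houston', 'Chicago']
})

# boolean mask: True where the (Name, City) pair appeared in an earlier row
mask = df.duplicated(subset=['Name', 'City'])

# select the duplicate rows themselves
print(df[mask])
```

This prints only row 4, the row whose Name/City pair was already seen in row 1.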

How to Remove Duplicate Rows from a Pandas Dataframe?

To remove duplicate rows from a pandas dataframe, we can use the drop_duplicates() function. By default, drop_duplicates() keeps the first occurrence of each duplicated group and removes the rest, comparing either all columns or a subset of columns you specify.

Here is an example:

import pandas as pd

# create a sample dataframe
df = pd.DataFrame({
    'Name': ['John', 'John', 'Mary', 'Peter', 'John'],
    'Age': [25, 25, 30, 35, 25],
    'City': ['New York', 'Chicago', 'Los Angeles', 'Houston', 'Chicago']
})

# remove duplicate rows based on all columns
df.drop_duplicates(inplace=True)

# print dataframe after removing duplicate rows
print(df)

Output:

    Name  Age         City
0   John   25     New York
1   John   25      Chicago
2   Mary   30  Los Angeles
3  Peter   35      Houston

In the above example, we used the drop_duplicates() function to remove duplicate rows based on all columns. Row 4, which duplicated row 1, was dropped, while the first occurrence was kept. The inplace=True parameter modifies the original dataframe in place instead of returning a new one.
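Like duplicated(), drop_duplicates() accepts a subset argument, along with a keep argument that chooses which occurrence survives: 'first' (the default), 'last', or False to drop every member of a duplicate group. A short sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'John', 'Mary', 'Peter', 'John'],
    'Age': [25, 25, 30, 35, 25],
    'City': ['New York', 'Chicago', 'Los Angeles', 'Houston', 'Chicago']
})

# keep only the last occurrence of each (Name, City) pair
deduped = df.drop_duplicates(subset=['Name', 'City'], keep='last')
print(deduped)
```

Here row 1 is dropped and row 4 is kept, because keep='last' retains the final occurrence of the duplicated Name/City pair.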

Conclusion

In this article, we discussed how to find all duplicate rows in a pandas dataframe. We learned that duplicate rows are rows that have the same values in all columns. We also learned how to find duplicate rows based on specific columns and how to remove duplicate rows from a pandas dataframe. Following these steps can help you ensure that your data is clean and accurate, which is crucial for any data analysis or machine learning project.


About Saturn Cloud

Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Request a demo today to learn more.