How to Drop Duplicates of One Column Based on Value in Another Column Using Python and Pandas
As a data scientist or software engineer, you may come across situations where you need to remove duplicates of one column in a Pandas DataFrame based on the value in another column. This can be a critical task when analyzing large datasets, especially when duplicate data can skew your results. Fortunately, Python and Pandas provide a straightforward way to drop duplicates based on a specific column using the drop_duplicates() method.
In this article, we will explain how to drop duplicates of one column based on the value in another column using Python and Pandas. We will walk you through the entire process, from loading your data to removing duplicates. By the end of this article, you will know how to perform this task efficiently and accurately.
What is Pandas?
Before we dive into the specifics of dropping duplicates based on a specific column, let’s first understand what Pandas is and what it can do for us as data scientists and software engineers.
Pandas is an open-source data manipulation library for Python. It provides easy-to-use data structures and data analysis tools for handling and analyzing data. Pandas is particularly useful for working with tabular data, such as spreadsheets and databases. With Pandas, you can easily load, clean, and transform data to prepare it for analysis. Pandas is also widely used for data visualization and statistical analysis.
Loading Data into Pandas
The first step in dropping duplicates based on a specific column is to load your data into a Pandas DataFrame. Pandas can read data from a wide range of sources, including CSV files, Excel spreadsheets, SQL databases, and more.
For the purposes of this article, we will assume that you have a CSV file containing your data. To load this data into a Pandas DataFrame, you can use the read_csv() function. Here is an example:
import pandas as pd
# Load data from CSV file
data = pd.read_csv('data.csv')
print(data)
Output:
Title ISBN Store
0 The Great Gatsby 9780743273565 Brooklyn Comics
1 To Kill a Mockingbird 9780061120084 Seattle Books
2 The Catcher in the Rye 9780316769174 N.Y. Comics
3 The Great Gatsby 9780743273565 LA Books
In this example, we have imported the Pandas library and used the read_csv() function to load our data from a CSV file named data.csv. The resulting DataFrame is stored in the data variable. As we can see, the DataFrame contains a duplicate: The Great Gatsby appears in both row 0 and row 3.
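Before dropping anything, it can help to see which rows Pandas considers duplicates. As a quick sketch (building the same example DataFrame inline rather than reading data.csv), the duplicated() method returns a boolean mask marking each row whose subset values have already appeared earlier in the DataFrame:

```python
import pandas as pd

# Same example data as above, constructed inline
data = pd.DataFrame({
    'Title': ['The Great Gatsby', 'To Kill a Mockingbird',
              'The Catcher in the Rye', 'The Great Gatsby'],
    'ISBN': [9780743273565, 9780061120084, 9780316769174, 9780743273565],
    'Store': ['Brooklyn Comics', 'Seattle Books', 'N.Y. Comics', 'LA Books'],
})

# True for every row whose Title appeared in an earlier row
mask = data.duplicated(subset=['Title'])
print(mask.tolist())  # [False, False, False, True]

# Show only the rows flagged as duplicates
print(data[mask])
```

Only row 3 is flagged, because its Title was already seen in row 0; inspecting this mask first lets you verify what drop_duplicates() is about to remove.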
Dropping Duplicates Based on a Specific Column
Now that we have loaded our data into a Pandas DataFrame, we can drop duplicates based on a specific column using the drop_duplicates() method. This method allows us to specify one or more columns to use as the basis for dropping duplicates. To drop duplicates based on a specific column, we pass the name of the column to the subset parameter of the drop_duplicates() method. Here is an example:
# Drop duplicates based on a specific column
data_drop = data.drop_duplicates(subset=['Title'])
print(data_drop)
Output:
Title ISBN Store
0 The Great Gatsby 9780743273565 Brooklyn Comics
1 To Kill a Mockingbird 9780061120084 Seattle Books
2 The Catcher in the Rye 9780316769174 N.Y. Comics
In this example, we have passed the column name Title (as a single-element list) to the subset parameter of the drop_duplicates() method. This tells Pandas to drop duplicates based on the values in that column, so the second The Great Gatsby row (row 3) is removed.
Dropping Duplicates Based on Multiple Columns
In some cases, you may need to drop duplicates based on multiple columns. To do this, you can simply pass a list of column names to the subset parameter of the drop_duplicates() method. Here is an example:
# Drop duplicates based on multiple columns
data_drop = data.drop_duplicates(subset=['Title', 'ISBN'])
print(data_drop)
Output:
Title ISBN Store
0 The Great Gatsby 9780743273565 Brooklyn Comics
1 To Kill a Mockingbird 9780061120084 Seattle Books
2 The Catcher in the Rye 9780316769174 N.Y. Comics
In this example, we have passed a list of column names to the subset parameter of the drop_duplicates() method. This tells Pandas to treat rows as duplicates only when the values in both columns match.
Keeping the First or Last Occurrence of Duplicates
By default, the drop_duplicates() method keeps the first occurrence of each set of duplicates and removes the rest. However, you can control which occurrence survives using the keep parameter. To keep the first occurrence of duplicates, set keep='first' (the default). To keep the last occurrence, set keep='last'. You can also pass keep=False to remove every occurrence of a duplicate. Here are examples of keep='first' and keep='last':
# Keep the first occurrence of duplicates
data_first = data.drop_duplicates(subset=['Title'], keep='first')
print(data_first)
Output:
Title ISBN Store
0 The Great Gatsby 9780743273565 Brooklyn Comics
1 To Kill a Mockingbird 9780061120084 Seattle Books
2 The Catcher in the Rye 9780316769174 N.Y. Comics
# Keep the last occurrence of duplicates
data_last = data.drop_duplicates(subset=['Title'], keep='last')
print(data_last)
Output:
Title ISBN Store
1 To Kill a Mockingbird 9780061120084 Seattle Books
2 The Catcher in the Rye 9780316769174 N.Y. Comics
3 The Great Gatsby 9780743273565 LA Books
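Combining keep with sorting lets you choose which duplicate survives based on the value in another column. The sketch below assumes a hypothetical Price column (not part of the example data above) and keeps the cheapest listing of each Title by sorting on Price before dropping duplicates:

```python
import pandas as pd

# Hypothetical data: the same Title stocked at two stores at different prices
data = pd.DataFrame({
    'Title': ['The Great Gatsby', 'To Kill a Mockingbird', 'The Great Gatsby'],
    'Store': ['Brooklyn Comics', 'Seattle Books', 'LA Books'],
    'Price': [12.99, 9.99, 10.49],
})

# Sort so the cheapest row for each Title comes first, keep that row,
# then restore the original row order
cheapest = (
    data.sort_values('Price')
        .drop_duplicates(subset=['Title'], keep='first')
        .sort_index()
)
print(cheapest)
```

Here the LA Books copy of The Great Gatsby (10.49) survives instead of the Brooklyn Comics copy (12.99), because sorting put it first within its group of duplicates.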
Conclusion
Removing duplicates based on a specific column in Pandas is a common task for data scientists and software engineers alike. With the drop_duplicates() method, it is easy to drop duplicates based on the values in one or more columns in just a few lines of code.
In this article, we have walked you through the entire process of loading your data into a Pandas DataFrame and dropping duplicates based on a specific column. We have also shown you how to drop duplicates based on multiple columns and how to keep the first or last occurrence of duplicates.
By following the steps outlined in this article, you can ensure that your data is clean and accurate, allowing you to perform more accurate analyses and make better-informed decisions.
About Saturn Cloud
Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Request a demo today to learn more.
Saturn Cloud provides customizable, ready-to-use cloud environments for collaborative data teams.
Try Saturn Cloud and join thousands of users moving to the cloud without having to switch tools.