How to Get Unique Values in Multiple Columns using Pandas
As a data scientist or software engineer, working with large datasets is a common task. Often, you may need to find unique values in multiple columns to perform various data analysis tasks. In this article, we will explore how to utilize Pandas to get unique values in multiple columns.
Table of Contents
- What is Pandas?
- Why Unique Values Matter
- Getting Started
- Getting Unique Values in a Single Column
- Getting Unique Values in Multiple Columns
- Getting Unique Values in Multiple Columns with Conditions
- Common Errors and How to Handle Them
- Conclusion
What is Pandas?
Pandas is a popular open-source data manipulation library in Python. It provides flexible data structures such as DataFrames and Series, making it easy to work with tabular data. With Pandas, you can perform various operations on your data, including filtering, grouping, merging, and transforming.
Why Unique Values Matter
Understanding unique values in your dataset is crucial for tasks such as data cleaning, exploration, and preparation for further analysis. Eliminating duplicates aids in obtaining accurate insights and preventing skewed results.
Getting Started
Before we dive into how to get unique values in multiple columns using Pandas, let’s first create a sample dataset that we will use throughout this article. We will create a DataFrame with three columns: Name
, Age
, and City
.
import pandas as pd
data = {'Name': ['John', 'Jane', 'Mike', 'Sarah', 'David', 'Kate'],
'Age': [25, 30, 35, 25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)
The resulting DataFrame df
looks like this:
Name Age City
0 John 25 New York
1 Jane 30 Los Angeles
2 Mike 35 Chicago
3 Sarah 25 New York
4 David 30 Los Angeles
5 Kate 35 Chicago
Getting Unique Values in a Single Column
Before we get into getting unique values in multiple columns, let’s first explore how to get unique values in a single column. We can use the unique()
method of a Pandas Series to get unique values in a column. Let’s see an example:
unique_cities = df['City'].unique()
print(unique_cities)
The output of the above code will be:
['New York' 'Los Angeles' 'Chicago']
The unique()
method returns a NumPy array of unique values in the City
column.
Getting Unique Values in Multiple Columns
Now that we know how to get unique values in a single column, let’s explore how to get unique values in multiple columns. Suppose we want to get unique combinations of Age
and City
columns. We can use the drop_duplicates()
method of a Pandas DataFrame to achieve this.
unique_age_city = df[['Age', 'City']].drop_duplicates()
print(unique_age_city)
The output of the above code will be:
Age City
0 25 New York
1 30 Los Angeles
2 35 Chicago
The drop_duplicates()
method removes duplicate rows from the DataFrame, and the [['Age', 'City']]
syntax selects only the Age
and City
columns. The resulting DataFrame contains unique combinations of Age
and City
columns.
Getting Unique Values in Multiple Columns with Conditions
Sometimes, we may need to get unique values in multiple columns based on certain conditions. For example, suppose we want to get unique combinations of Age
and City
columns where the Age
is greater than or equal to 30. We can use the loc[]
method of a Pandas DataFrame to achieve this.
unique_age_city_above_30 = df.loc[df['Age'] >= 30, ['Age', 'City']].drop_duplicates()
print(unique_age_city_above_30)
The output of the above code will be:
Age City
1 30 Los Angeles
2 35 Chicago
The df.loc[df['Age'] >= 30, ['Age', 'City']]
syntax selects only the rows where the Age
is greater than or equal to 30 and the Age
and City
columns. The resulting DataFrame contains unique combinations of Age
and City
columns where the Age
is greater than or equal to 30.
Common Errors and How to Handle Them
Error 1: Missing Data
Handle missing data using dropna()
before applying the unique values extraction method.
Error 2: Data Types Mismatch
Ensure consistent data types across columns, converting if necessary, to prevent unexpected behavior.
Error 3: Incorrect Syntax
Verify the syntax of the method being used, especially when combining functions. Common issues include incorrect column names or improper lambda functions.
Conclusion
In this article, we explored how to utilize Pandas to get unique values in multiple columns. We learned how to get unique values in a single column using the unique()
method of a Pandas Series and how to get unique values in multiple columns using the drop_duplicates()
method of a Pandas DataFrame. We also learned how to get unique values in multiple columns with conditions using the loc[]
method of a Pandas DataFrame. With these techniques, you can easily perform data analysis tasks and gain insights from your data.
About Saturn Cloud
Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Request a demo today to learn more.
Saturn Cloud provides customizable, ready-to-use cloud environments for collaborative data teams.
Try Saturn Cloud and join thousands of users moving to the cloud without
having to switch tools.