How to Get Unique Values in Multiple Columns using Pandas

In this blog, we’ll delve into the common task faced by data scientists and software engineers: working with large datasets. Frequently, the need arises to identify unique values across multiple columns for various data analysis purposes. The focus of this article is to demonstrate the utilization of Pandas for obtaining unique values in multiple columns.

As a data scientist or software engineer, working with large datasets is a common task. Often, you may need to find unique values in multiple columns to perform various data analysis tasks. In this article, we will explore how to utilize Pandas to get unique values in multiple columns.

Table of Contents

  1. What is Pandas?
  2. Why Unique Values Matter
  3. Getting Started
  4. Getting Unique Values in a Single Column
  5. Getting Unique Values in Multiple Columns
  6. Getting Unique Values in Multiple Columns with Conditions
  7. Common Errors and How to Handle Them
  8. Conclusion

What is Pandas?

Pandas is a popular open-source data manipulation library in Python. It provides flexible data structures such as DataFrames and Series, making it easy to work with tabular data. With Pandas, you can perform various operations on your data, including filtering, grouping, merging, and transforming.

Why Unique Values Matter

Understanding unique values in your dataset is crucial for tasks such as data cleaning, exploration, and preparation for further analysis. Eliminating duplicates aids in obtaining accurate insights and preventing skewed results.

Getting Started

Before we dive into how to get unique values in multiple columns using Pandas, let’s first create a sample dataset that we will use throughout this article. We will create a DataFrame with three columns: Name, Age, and City.

import pandas as pd

data = {'Name': ['John', 'Jane', 'Mike', 'Sarah', 'David', 'Kate'],
        'Age': [25, 30, 35, 25, 30, 35],
        'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles', 'Chicago']}

df = pd.DataFrame(data)

The resulting DataFrame df looks like this:

    Name    Age    City
0   John    25     New York
1   Jane    30     Los Angeles
2   Mike    35     Chicago
3   Sarah   25     New York
4   David   30     Los Angeles
5   Kate    35     Chicago

Getting Unique Values in a Single Column

Before we get into getting unique values in multiple columns, let’s first explore how to get unique values in a single column. We can use the unique() method of a Pandas Series to get unique values in a column. Let’s see an example:

unique_cities = df['City'].unique()

print(unique_cities)

The output of the above code will be:

['New York' 'Los Angeles' 'Chicago']

The unique() method returns a NumPy array of unique values in the City column.

Getting Unique Values in Multiple Columns

Now that we know how to get unique values in a single column, let’s explore how to get unique values in multiple columns. Suppose we want to get unique combinations of Age and City columns. We can use the drop_duplicates() method of a Pandas DataFrame to achieve this.

unique_age_city = df[['Age', 'City']].drop_duplicates()

print(unique_age_city)

The output of the above code will be:

   Age         City
0   25     New York
1   30  Los Angeles
2   35      Chicago

The drop_duplicates() method removes duplicate rows from the DataFrame, and the [['Age', 'City']] syntax selects only the Age and City columns. The resulting DataFrame contains unique combinations of Age and City columns.

Getting Unique Values in Multiple Columns with Conditions

Sometimes, we may need to get unique values in multiple columns based on certain conditions. For example, suppose we want to get unique combinations of Age and City columns where the Age is greater than or equal to 30. We can use the loc[] method of a Pandas DataFrame to achieve this.

unique_age_city_above_30 = df.loc[df['Age'] >= 30, ['Age', 'City']].drop_duplicates()

print(unique_age_city_above_30)

The output of the above code will be:

   Age         City
1   30  Los Angeles
2   35      Chicago

The df.loc[df['Age'] >= 30, ['Age', 'City']] syntax selects only the rows where the Age is greater than or equal to 30 and the Age and City columns. The resulting DataFrame contains unique combinations of Age and City columns where the Age is greater than or equal to 30.

Common Errors and How to Handle Them

Error 1: Missing Data

Handle missing data using dropna() before applying the unique values extraction method.

Error 2: Data Types Mismatch

Ensure consistent data types across columns, converting if necessary, to prevent unexpected behavior.

Error 3: Incorrect Syntax

Verify the syntax of the method being used, especially when combining functions. Common issues include incorrect column names or improper lambda functions.

Conclusion

In this article, we explored how to utilize Pandas to get unique values in multiple columns. We learned how to get unique values in a single column using the unique() method of a Pandas Series and how to get unique values in multiple columns using the drop_duplicates() method of a Pandas DataFrame. We also learned how to get unique values in multiple columns with conditions using the loc[] method of a Pandas DataFrame. With these techniques, you can easily perform data analysis tasks and gain insights from your data.


About Saturn Cloud

Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Request a demo today to learn more.