How to Convert Categorical Data to Numerical Data with Pandas

As a data scientist or software engineer, you may encounter datasets that contain categorical data. Categorical data is data that is divided into groups or categories, such as colors, types of fruit, or educational levels. To perform certain types of analyses, this data must first be converted to numerical form. In this post, we will explore how to use Pandas, a popular Python library for data manipulation and analysis, to convert categorical data to numerical data.

What is Pandas?

Pandas is an open-source Python library that is designed for data manipulation and analysis. It provides tools for reading and writing data, as well as powerful data structures for working with tabular data. Pandas is widely used in the data science community and is a popular choice for data analysis tasks.

Converting Categorical Data to Numerical Data

Converting categorical data to numerical data is an important step in many data analysis tasks. There are several ways to do this with a Pandas DataFrame. This post covers three: two that are built into Pandas and one that uses scikit-learn.

Method 1: Using the cat.codes Attribute

The easiest way to convert categorical data to numerical data in Pandas is to use the cat.codes attribute. This attribute is available on columns with the category data type and returns an integer code for each value, namely the position of that value's category in the column's list of categories.

Here is an example of how to use the cat.codes attribute to convert categorical data to numerical data:

import pandas as pd

# Create a DataFrame with categorical data
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red', 'Green']})

# Convert categorical data to numerical data using cat.codes
df['Color'] = df['Color'].astype('category')
df['Color_Codes'] = df['Color'].cat.codes

# View the converted DataFrame
print(df)

The output of this code would be:

   Color  Color_Codes
0    Red            2
1   Blue            0
2  Green            1
3    Red            2
4  Green            1

In this example, we created a DataFrame with a column ‘Color’ that contains categorical data. We then converted this column to a categorical data type using the astype() method. Finally, we used the cat.codes attribute to create a new column ‘Color_Codes’ with numerical representations of each category. Because Pandas sorts the categories it infers from strings, ‘Blue’ receives code 0, ‘Green’ code 1, and ‘Red’ code 2.
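To check which code belongs to which category, you can look at the cat.categories index alongside the codes. The following is a minimal sketch rather than part of the original example: the names `colors` and `code_to_category` are illustrative, and a missing value has been added to show how it is encoded.

```python
import pandas as pd

# Illustrative data; the None entry is added here to show missing-value handling
colors = pd.Series(['Red', 'Blue', None, 'Green', 'Red'], dtype='category')

# Categories inferred from the data are stored in sorted order
print(colors.cat.categories.tolist())  # ['Blue', 'Green', 'Red']

# Each code is a position in the categories index, so enumerate() gives a lookup
code_to_category = dict(enumerate(colors.cat.categories))
print(code_to_category)  # {0: 'Blue', 1: 'Green', 2: 'Red'}

# Missing values are encoded as -1 instead of a category position
print(colors.cat.codes.tolist())  # [2, 0, -1, 1, 2]
```

Keeping a lookup like `code_to_category` makes it possible to translate the codes back into labels later, for example when reporting results.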

Method 2: Using the replace() Method

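Another way to convert categorical data to numerical data in Pandas is to use the replace() method. This method replaces each category with a numerical value that you specify in a dictionary. Unlike cat.codes, the mapping is chosen by you, which is useful when the categories have a meaningful order, such as educational levels.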

Here is an example of how to use the replace() method to convert categorical data to numerical data:

import pandas as pd

# Create a DataFrame with categorical data
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red', 'Green']})

# Convert categorical data to numerical data using replace
df['Color'] = df['Color'].replace({'Red': 0, 'Blue': 1, 'Green': 2})

# View the converted DataFrame
print(df)

The output of this code would be:

   Color
0      0
1      1
2      2
3      0
4      2

In this example, we created a DataFrame with a column ‘Color’ that contains categorical data. We then used the replace() method to replace each category with a specified numerical value.
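A closely related option is the Series.map() method, which looks up each value in a dictionary. This is a hedged sketch, not part of the original example: some recent Pandas releases warn about the automatic type conversion that replace() performs when strings are swapped for numbers, and map() avoids relying on that behavior. The names `color_map` and `unmapped` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red', 'Green']})

# Illustrative mapping, identical to the dictionary passed to replace() above
color_map = {'Red': 0, 'Blue': 1, 'Green': 2}

# map() returns a new Series; the original column is kept for comparison
df['Color_Codes'] = df['Color'].map(color_map)
print(df['Color_Codes'].tolist())  # [0, 1, 2, 0, 2]

# Values missing from the dictionary become NaN instead of passing through
unmapped = pd.Series(['Red', 'Purple']).map(color_map)
print(unmapped.isna().tolist())  # [False, True]
```

One difference to keep in mind: replace() leaves values that are not in the dictionary unchanged, while map() turns them into NaN, which makes unexpected categories easier to spot.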

Method 3: Using the LabelEncoder Class

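A third way to convert categorical data to numerical data is to use the LabelEncoder class, which comes from scikit-learn rather than Pandas but works directly on a Pandas column. This class is part of the sklearn.preprocessing module and encodes labels as integers from 0 to n_classes - 1. Note that the scikit-learn documentation describes LabelEncoder as intended for target labels rather than input features; for feature columns, the same module provides OrdinalEncoder.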

Here is an example of how to use the LabelEncoder class to convert categorical data to numerical data:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Create a DataFrame with categorical data
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red', 'Green']})

# Convert categorical data to numerical data using LabelEncoder
le = LabelEncoder()
df['Color'] = le.fit_transform(df['Color'])

# View the converted DataFrame
print(df)

The output of this code would be:

   Color
0      2
1      0
2      1
3      2
4      1

In this example, we created a DataFrame with a column ‘Color’ that contains categorical data. We then used the LabelEncoder class from the sklearn.preprocessing module to overwrite the ‘Color’ column with numerical representations of each category. As with cat.codes, the labels are sorted before codes are assigned, so the output matches Method 1.
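Because the example above overwrites the original column, it helps to know how to recover the labels. The sketch below, which assumes scikit-learn is installed, uses the fitted encoder's classes_ attribute and inverse_transform() method; the variable names `codes` and `restored` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red', 'Green']})

# Fit the encoder and keep the codes in a separate variable
le = LabelEncoder()
codes = le.fit_transform(df['Color'])

# classes_ lists the learned labels; a label's position is its code
print(list(le.classes_))  # ['Blue', 'Green', 'Red']

# inverse_transform() converts the integer codes back to the original labels
restored = le.inverse_transform(codes)
print(list(restored))  # ['Red', 'Blue', 'Green', 'Red', 'Green']
```

Keeping the fitted `le` object around also lets you apply the same encoding to new data with le.transform().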

Conclusion

In this post, we explored how to convert categorical data to numerical data in a Pandas DataFrame. We discussed three methods: the cat.codes attribute, the replace() method, and scikit-learn's LabelEncoder class. The first and third assign codes automatically in sorted order, while replace() lets you choose the mapping yourself. By using these methods, you can prepare categorical columns for analyses that require numerical input.

