How to Convert Categorical Data to Numerical Data with Pandas
As a data scientist or software engineer, you will often encounter datasets that contain categorical data: values that fall into a fixed set of groups, such as colors, types of fruit, or education levels. Many analyses and most machine learning algorithms expect numeric input, so these values have to be converted to numbers first. In this post, we will explore how to use Pandas, a popular Python library for data manipulation and analysis, to convert categorical data to numerical data.
What is Pandas?
Pandas is an open-source Python library that is designed for data manipulation and analysis. It provides tools for reading and writing data, as well as powerful data structures for working with tabular data. Pandas is widely used in the data science community and is a popular choice for data analysis tasks.
Converting Categorical Data to Numerical Data
Converting categorical data to numerical data is an important step in many data analysis tasks. In Pandas, there are several ways to convert categorical data to numerical data, including the following:
Method 1: Using the cat.codes Attribute
The easiest way to convert categorical data to numerical data in Pandas is to use the cat.codes attribute. This attribute is available on columns with the category dtype and returns the integer code of each value's category.
Here is an example of how to use the cat.codes attribute to convert categorical data to numerical data:
import pandas as pd
# Create a DataFrame with categorical data
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red', 'Green']})
# Convert categorical data to numerical data using cat.codes
df['Color'] = df['Color'].astype('category')
df['Color_Codes'] = df['Color'].cat.codes
# View the converted DataFrame
print(df)
The output of this code would be:
   Color  Color_Codes
0    Red            2
1   Blue            0
2  Green            1
3    Red            2
4  Green            1
In this example, we created a DataFrame with a column ‘Color’ that contains categorical data. We then converted this column to the category dtype using the astype() method. Finally, we used the cat.codes attribute to create a new column ‘Color_Codes’ holding the integer code of each category. By default the categories are sorted, which is why Blue, Green, and Red receive the codes 0, 1, and 2.
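The codes returned by cat.codes follow the order of the categories, so for ordered data such as education levels it can help to declare that order explicitly. The sketch below is a minimal illustration under assumed data: the `levels` Series and the `order` list are hypothetical and not part of the example above. It also shows that missing values receive the code -1.

```python
import pandas as pd

# Hypothetical example data: education levels with one missing value
levels = pd.Series(["High School", "PhD", "Bachelor", None, "Bachelor"])

# Declare the category order explicitly so the codes follow it
order = ["High School", "Bachelor", "PhD"]
dtype = pd.CategoricalDtype(categories=order, ordered=True)

# Convert to the ordered categorical dtype and read the integer codes
codes = levels.astype(dtype).cat.codes

# Missing values are encoded as -1
print(codes.tolist())  # [0, 2, 1, -1, 1]
```

If the -1 code would be misleading in a later step, one option is to fill or drop the missing values before encoding.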
Method 2: Using the replace() Method
Another way to convert categorical data to numerical data in Pandas is to use the replace() method. This method replaces each category with a specified numerical value.
Here is an example of how to use the replace() method to convert categorical data to numerical data:
import pandas as pd
# Create a DataFrame with categorical data
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red', 'Green']})
# Convert categorical data to numerical data using replace
df['Color'] = df['Color'].replace({'Red': 0, 'Blue': 1, 'Green': 2})
# View the converted DataFrame
print(df)
The output of this code would be:
   Color
0      0
1      1
2      2
3      0
4      2
In this example, we created a DataFrame with a column ‘Color’ that contains categorical data. We then used the replace() method to replace each category with a specified numerical value. Unlike cat.codes, this approach lets you choose which number each category receives.
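One thing to watch with a hand-written mapping is a category that is missing from it: replace() leaves such values untouched, so the column ends up mixing strings and numbers. A possible alternative, sketched below, is the Series.map() method, which returns NaN for anything not in the mapping and so makes gaps easy to detect. The ‘Purple’ value and the ‘Color_Num’ column name are assumptions added for illustration.

```python
import pandas as pd

# "Purple" is a hypothetical extra category that the mapping does not cover
df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Red", "Purple"]})

# Same mapping as in the replace() example
color_map = {"Red": 0, "Blue": 1, "Green": 2}

# map() returns NaN for any value missing from the mapping
df["Color_Num"] = df["Color"].map(color_map)

# Collect the categories that were not covered by the mapping
unmapped = df.loc[df["Color_Num"].isna(), "Color"].unique().tolist()

print(df)
print("Unmapped categories:", unmapped)
```

Because of the NaN, the new column is stored as floating point; once every category is covered by the mapping, it can be cast to an integer type with astype(int).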
Method 3: Using the LabelEncoder Class
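A third way to convert categorical data to numerical data is to use the LabelEncoder class alongside Pandas. This class is part of the sklearn.preprocessing module of scikit-learn, not of Pandas itself, and encodes labels as integers from 0 to one less than the number of distinct labels. The scikit-learn documentation describes it as intended for target labels rather than input features, but it also works on a single DataFrame column.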
Here is an example of how to use the LabelEncoder class to convert categorical data to numerical data:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Create a DataFrame with categorical data
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red', 'Green']})
# Convert categorical data to numerical data using LabelEncoder
le = LabelEncoder()
df['Color'] = le.fit_transform(df['Color'])
# View the converted DataFrame
print(df)
The output of this code would be:
   Color
0      2
1      0
2      1
3      2
4      1
In this example, we created a DataFrame with a column ‘Color’ that contains categorical data. We then used the LabelEncoder class from the sklearn.preprocessing module to overwrite the ‘Color’ column with a numerical representation of each category. As with cat.codes, the labels are numbered in sorted order.
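If you would rather not add scikit-learn as a dependency just for this step, the pandas function pd.factorize() can produce a comparable encoding. The sketch below is an alternative rather than part of the LabelEncoder example: it assumes sort=True to get sorted-order codes, and the ‘Color_Codes’ column name is chosen for illustration. It also shows how the returned labels can be used to reverse the encoding.

```python
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Red", "Green"]})

# sort=True numbers the categories in sorted order
# (the default numbers them in order of first appearance)
codes, uniques = pd.factorize(df["Color"], sort=True)
df["Color_Codes"] = codes

# uniques holds the label for each code, so the encoding can be reversed
decoded = uniques.take(codes)

print(df)
print(list(uniques))  # ['Blue', 'Green', 'Red']
print(list(decoded))  # ['Red', 'Blue', 'Green', 'Red', 'Green']
```

Keeping `uniques` around plays a similar role to keeping the fitted encoder: it is what lets you translate codes back into labels later.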
Conclusion
In this post, we explored how to use Pandas, a popular Python library for data manipulation and analysis, to convert categorical data to numerical data. We discussed three methods: the cat.codes attribute, the replace() method, and the LabelEncoder class from scikit-learn. By using these methods, you can prepare categorical columns for analyses and models that require numeric input.