How to Set dtypes by Column in Pandas DataFrame

This post shows how to set data types for specific columns in a pandas DataFrame, a technique that can reduce memory usage and make code more efficient when working with large datasets.

As a data scientist or software engineer working with large datasets, you may often encounter the need to set data types for specific columns in your pandas DataFrame. This is important as it helps optimize memory usage and make your code more efficient. In this article, we will discuss how to set dtypes by column in a pandas DataFrame and explore some common scenarios where this technique can be useful.

What are dtypes in Pandas?

In pandas, data types are referred to as dtypes. Each column in a DataFrame can have a different dtype, depending on the type of data it contains. Common dtypes in pandas include int64, float64, object (typically used for strings and mixed values), datetime64, bool, and category.

By default, pandas will try to infer the dtype of each column based on the data it contains. However, this can sometimes result in unexpected or incorrect dtypes. Therefore, it is important to manually set dtypes when working with large datasets or when the data types are known in advance.
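As a minimal sketch of how inference can go wrong, the example below reads a small in-memory CSV twice: once with inferred dtypes and once with a dtype mapping passed to read_csv(). The column names zip_code and count and the sample values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample data: zip codes that start with a zero
csv_text = "zip_code,count\n02134,3\n10001,5\n"

# With inference, zip_code is parsed as an integer and the leading zero is lost
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred["zip_code"].tolist())  # [2134, 10001]

# With an explicit mapping, zip_code stays a string and count uses a smaller integer
typed = pd.read_csv(io.StringIO(csv_text), dtype={"zip_code": str, "count": "int32"})
print(typed["zip_code"].tolist())  # ['02134', '10001']
print(typed["count"].dtype)  # int32
```

Declaring dtypes at load time also avoids a second pass over the data to fix them afterwards.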

How to Set dtypes by Column in Pandas DataFrame

To set dtypes by column in a pandas DataFrame, we can use the astype() method. This method allows us to cast the data in a column to a specific dtype.

Suppose we have a DataFrame df with columns A, B, and C. We want to set the dtype of column A to int, column B to float, and column C to str.

import pandas as pd

# create a sample DataFrame
df = pd.DataFrame({
    'A': ['1', '2', '3'],
    'B': ['2.0', '3.0', '4.0'],
    'C': ['apple', 'banana', 'cherry']
})

# set dtypes for each column
df['A'] = df['A'].astype(int)
df['B'] = df['B'].astype(float)
df['C'] = df['C'].astype(str)

print(df.dtypes)

Output:

A      int64
B    float64
C     object
dtype: object

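In the above code, we first create a sample DataFrame df with columns A, B, and C, all of which initially hold strings. We then call astype() on each column, passing the dtype we want to cast to. Note that column C is reported as object even though we cast it to str: pandas has traditionally stored Python strings in object columns, although newer versions may report a dedicated string dtype instead.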
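The same casts can also be written as a single call, since DataFrame.astype() accepts a dictionary that maps column names to dtypes. The sketch below also shows pd.to_numeric() with errors='coerce', which may be a better fit when a column contains values that cannot be parsed; the raw Series is a made-up example.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['1', '2', '3'],
    'B': ['2.0', '3.0', '4.0'],
    'C': ['apple', 'banana', 'cherry'],
})

# Cast several columns at once; columns not listed in the mapping are left unchanged
df = df.astype({'A': 'int64', 'B': 'float64'})
print(df.dtypes)  # A is int64, B is float64

# astype() raises an error on unparseable values; to_numeric can turn them into NaN instead
raw = pd.Series(['1', '2', 'n/a'])  # hypothetical dirty column
clean = pd.to_numeric(raw, errors='coerce')
print(clean.isna().tolist())  # [False, False, True]
```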

Common Scenarios where Setting dtypes by Column is Useful

1. Converting Categorical Variables to Numeric

Categorical variables are variables that can take on a limited number of values. Examples include gender (male or female), education level (high school, college, graduate), and job title (manager, director, CEO).

By default, pandas stores string-valued categorical variables with the object dtype, although it also provides a dedicated category dtype. Some machine learning algorithms require numeric input, so we need to convert categorical variables to numbers before feeding them into our models.

To convert a categorical variable to numeric, we can cast it to the category dtype with astype() and then read its integer codes:

# create a sample DataFrame
df = pd.DataFrame({
    'gender': ['male', 'female', 'male', 'female', 'male', 'male', 'female'],
    'education': ['high school', 'college', 'college', 'graduate', 'high school', 'graduate', 'college'],
    'salary': [50000, 60000, 70000, 80000, 90000, 100000, 110000]
})

# convert categorical variables to numeric
df['gender'] = df['gender'].astype('category').cat.codes
df['education'] = df['education'].astype('category').cat.codes
print(df)

Output:

   gender  education  salary
0       1          2   50000
1       0          0   60000
2       1          0   70000
3       0          1   80000
4       1          2   90000
5       1          1  100000
6       0          0  110000

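In the above code, we first create a sample DataFrame df with columns gender, education, and salary. We then use the astype() method to convert the gender and education columns to the category dtype, and the cat.codes attribute to replace each category with its integer code. Codes are assigned in the sorted order of the category labels, which is why female becomes 0 and male becomes 1, and why college, graduate, and high school become 0, 1, and 2. We can confirm the resulting dtypes: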

print(df.dtypes)

Output:

gender        int8
education     int8
salary       int64
dtype: object
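Because the default codes follow alphabetical order, they do not reflect the natural ranking of education levels. One possible way to control the mapping, sketched below, is to pass an explicit pd.CategoricalDtype with the categories listed in the desired order. The variable names levels and edu_dtype are illustrative.

```python
import pandas as pd

education = pd.Series(['high school', 'college', 'college', 'graduate'])

# List the categories from lowest to highest so the codes follow that ranking
levels = ['high school', 'college', 'graduate']
edu_dtype = pd.CategoricalDtype(categories=levels, ordered=True)

codes = education.astype(edu_dtype).cat.codes
print(codes.tolist())  # [0, 1, 1, 2]

# Missing values are given the code -1
with_missing = pd.Series(['college', None]).astype(edu_dtype).cat.codes
print(with_missing.tolist())  # [1, -1]
```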

2. Optimizing Memory Usage

Large datasets can consume a lot of memory, which can slow down our code and make it inefficient. One way to optimize memory usage is to set the dtypes for each column in our DataFrame.

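For example, suppose we have a DataFrame df with columns A, B, C, D, and E. We know that column A holds non-negative integers no larger than 255, and column B holds non-negative integers no larger than 65535. These are exactly the ranges that uint8 and uint16 can represent, so we can use those dtypes instead of the default int64 to save memory. It is worth checking the range first, because casting to a type that is too small can silently corrupt values.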

# create a sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'B': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
    'C': ['apple', 'banana', 'cherry', 'date', 'elderberry', 'fig', 'grape', 'honeydew', 'indian gooseberry', 'jackfruit'],
    'D': [1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10.0],
    'E': [True, False, True, False, True, False, True, False, True, False]
})

# set dtypes for columns A and B
df['A'] = df['A'].astype('uint8')
df['B'] = df['B'].astype('uint16')

In the above code, we first create a sample DataFrame df with columns A, B, C, D, and E. We then use the astype() method to set the dtypes of columns A and B to uint8 and uint16, respectively, so each value takes 1 or 2 bytes instead of the 8 bytes used by int64.
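To see the effect, we can compare memory_usage() before and after the cast. The sketch below rebuilds only columns A and B with an explicit int64 dtype, checks that the values fit the target ranges, and then downcasts.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': np.arange(1, 11, dtype='int64'),           # 1, 2, ..., 10
    'B': np.arange(100, 1001, 100, dtype='int64'),  # 100, 200, ..., 1000
})
before = df.memory_usage(index=False)

# Confirm that every value fits in the target type before casting
assert df['A'].between(0, np.iinfo(np.uint8).max).all()
assert df['B'].between(0, np.iinfo(np.uint16).max).all()

small = df.astype({'A': 'uint8', 'B': 'uint16'})
after = small.memory_usage(index=False)

print(before.to_dict())  # {'A': 80, 'B': 80}
print(after.to_dict())   # {'A': 10, 'B': 20}

# Alternatively, let pandas pick the smallest unsigned type that fits the data
auto = pd.to_numeric(df['A'], downcast='unsigned')
print(auto.dtype)  # uint8
```

The savings are only a few bytes here, but they scale linearly with the number of rows.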

Conclusion

In this article, we discussed how to set dtypes by column in a pandas DataFrame. We explored some common scenarios where this technique can be useful, including converting categorical variables to numeric and optimizing memory usage.

Setting dtypes by column can reduce memory usage and make our code more efficient. By setting dtypes explicitly instead of relying on inference, we also avoid surprises from incorrectly inferred types, which matters most with large datasets.

