
How to Fix Python Pandas Error Tokenizing Data

As a data scientist or software engineer, you may often come across data that needs to be cleaned and processed before analysis. One popular tool for data processing is Python's Pandas library, which is widely used in the data science community. However, when using Pandas to read in a CSV file, you may encounter an error that says “Error tokenizing data”. In this blog post, we will explain what this error means and provide steps on how to fix it.

Table of Contents

  1. Understanding the Error
  2. Fixing the Error
  3. Conclusion

Understanding the Error

When Pandas tries to read in a CSV file using the read_csv() function, it splits the data into individual rows and columns based on the delimiter specified in the sep parameter (which defaults to a comma). However, if the data in the CSV file is not properly formatted, Pandas may encounter issues with splitting the data into rows and columns.

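The “Error tokenizing data” message is raised as a pandas.errors.ParserError and typically looks like “Error tokenizing data. C error: Expected 3 fields in line 3, saw 4”. It usually indicates that a row could not be split into the number of fields Pandas expected. This can be due to a number of reasons, such as:

  • There are missing or extra values in a row that are not properly represented, so the row has a different number of fields than the header.
  • The delimiter used in the file is not consistent throughout the file, or differs from the one passed in sep.
  • There are unbalanced quotes in the CSV file.
  • There are special characters in the CSV file that are not properly encoded.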
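
As a minimal sketch of how the error arises, the snippet below feeds read_csv() an in-memory CSV in which one row has more fields than the header, and catches the resulting exception. The inline data and variable names are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical data: the header declares 3 columns, but the second data row has 4 fields.
raw = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"

try:
    pd.read_csv(io.StringIO(raw))
    error_message = None
except pd.errors.ParserError as exc:
    # Keep the message so it can be inspected or logged.
    error_message = str(exc)

print(error_message)
```

The message names the line number and the expected and actual field counts, which is usually the quickest pointer to the offending row.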

We will be using the following sample CSV file, which has a missing Age value and quoted Location values that contain commas:

Name,Age,Location
John,25,New York
Jane,30,"Los Angeles, CA"
Sam,,Chicago
Alex,28,"San Francisco, CA"
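
For reference, here is a sketch of how Pandas handles this sample with its default settings. It reads the sample from an in-memory string rather than a file on disk, an assumption made so the snippet is self-contained.

```python
import io

import pandas as pd

# The sample file from above, held in memory instead of on disk.
sample = '''Name,Age,Location
John,25,New York
Jane,30,"Los Angeles, CA"
Sam,,Chicago
Alex,28,"San Francisco, CA"
'''

df = pd.read_csv(io.StringIO(sample))

print(df.shape)                # number of rows and columns parsed
print(df["Age"].isna().sum())  # count of missing Age values
print(df.loc[1, "Location"])   # quoted value keeps its embedded comma
```

With the defaults, the empty Age is read as NaN and the commas inside quotes stay within their fields, so neither feature triggers a tokenizing error on its own.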

Fixing the Error

Here are some steps you can take to fix the “Error tokenizing data” issue in Pandas:

Step 1: Check the CSV file for issues

The first step is to check the CSV file for any obvious issues. Open the file in a text editor or spreadsheet software and look for any missing values or inconsistencies in the delimiter used. You can also try opening the file in a different program to see if it can be read properly.
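
For larger files, a manual check can be scripted. The sketch below uses Python's standard csv module to list rows whose field count differs from the header; find_irregular_rows is a hypothetical helper name, and the inline data is invented for illustration.

```python
import csv
import io


def find_irregular_rows(lines, delimiter=","):
    """Return (line_number, field_count) for rows whose field count differs from the header."""
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader)
    expected = len(header)
    irregular = []
    for row in reader:
        # Blank lines come back as empty lists and are ignored here.
        if row and len(row) != expected:
            # reader.line_num is the 1-based number of the last physical line read.
            irregular.append((reader.line_num, len(row)))
    return irregular


# Hypothetical in-memory data; for a real file, pass an open file object,
# e.g. open('filename.csv', newline='').
data = io.StringIO("a,b,c\n1,2,3\n4,5,6,7\n8,9\n")
print(find_irregular_rows(data))
```

Rows with too few fields are reported as well. Pandas fills those with NaN rather than raising, but they are often worth a look.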

Step 2: Specify the delimiter

If the file uses a delimiter other than a comma, you can specify it explicitly with the sep parameter of the read_csv() function. For example, if the delimiter is a semicolon, you can use the following code:

import pandas as pd

# sep=';' tells Pandas to split fields on semicolons instead of commas
df = pd.read_csv('filename.csv', sep=';')
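
If the delimiter is not known in advance, one option is to let Pandas detect it. The sketch below passes sep=None together with engine="python", which makes the parser sniff the delimiter from the data. The semicolon-delimited text is a made-up example.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited data.
raw = "name;age;city\nJohn;25;New York\nJane;30;Los Angeles\n"

# sep=None requires the Python engine, which sniffs the delimiter from the data.
df = pd.read_csv(io.StringIO(raw), sep=None, engine="python")

print(list(df.columns))
```

Sniffing is a heuristic and the Python engine is slower than the default C engine, so an explicit sep is preferable once the delimiter is known.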

Step 3: Use the correct encoding

If there are special characters in the CSV file that are not properly encoded, you can specify the encoding type in the read_csv() function. For example, if the file is encoded in UTF-8, you can use the following code:

import pandas as pd

df = pd.read_csv('filename.csv', encoding='utf-8')
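
When the encoding is unknown, a common approach is to try a short list of candidates. The sketch below is one way to do that. read_csv_with_fallback is a hypothetical helper, the candidate list is an assumption, and the bytes are invented for illustration. Note that a wrong encoding usually surfaces as a UnicodeDecodeError rather than a tokenizing error.

```python
import io

import pandas as pd


def read_csv_with_fallback(raw_bytes, encodings=("utf-8", "latin-1")):
    """Try each encoding in turn and return (DataFrame, encoding_used)."""
    for encoding in encodings:
        try:
            # A fresh buffer per attempt so every read starts at byte 0.
            return pd.read_csv(io.BytesIO(raw_bytes), encoding=encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of the encodings worked: {encodings}")


# Hypothetical data written in Latin-1, so the accented bytes are not valid UTF-8.
raw_bytes = "name,city\nRené,Montréal\n".encode("latin-1")
df, used = read_csv_with_fallback(raw_bytes)
print(used, df.loc[0, "city"])
```

Because latin-1 can decode any byte sequence, it never fails. It works as a last resort but may produce wrong characters if the file is actually in another encoding.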

Step 4: Skip rows with errors

If there are only a few rows in the CSV file that have issues, you can skip those rows and continue reading in the rest of the file. You can do this by using the on_bad_lines parameter in the read_csv() function, available in pandas 1.3 and later, where it replaced the deprecated error_bad_lines parameter. For example:

import pandas as pd

df = pd.read_csv('filename.csv', on_bad_lines='skip')

This will skip any rows with errors and continue reading in the rest of the file.
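
Skipping silently can hide data loss, so it may help to record what was dropped. The sketch below passes a callable to on_bad_lines, which requires pandas 1.4 or later and engine="python". record_bad_line and the inline data are hypothetical names chosen for illustration.

```python
import io

import pandas as pd

# Hypothetical data: the second data row has one field too many.
raw = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"

skipped = []


def record_bad_line(fields):
    """Remember the offending row, then return None so Pandas drops it."""
    skipped.append(fields)
    return None


# A callable for on_bad_lines needs pandas 1.4+ and engine="python".
df = pd.read_csv(io.StringIO(raw), on_bad_lines=record_bad_line, engine="python")

print(len(df), skipped)
```

The callable receives each bad row as a list of strings, so the skipped rows can be reviewed or repaired afterwards.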

Step 5: Fix unbalanced quotes

If there are unbalanced quotes in the CSV file, you can use the quoting parameter in the read_csv() function to specify how quotes should be handled. For example, if the quotes are double quotes ("), you can use the following code:

import csv

import pandas as pd

# csv.QUOTE_NONE comes from Python's standard csv module
df = pd.read_csv('filename.csv', quoting=csv.QUOTE_NONE, quotechar='"')

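This will tell Pandas to treat quotes as regular characters and not as field enclosures. Keep in mind that delimiters inside quoted values are then treated as field separators, so a value such as "Los Angeles, CA" is split into two fields.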
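
To illustrate the unbalanced-quote case, the sketch below uses invented data in which a field opens a quote that is never closed. It first attempts a default read, then reads the same data with quoting disabled.

```python
import csv
import io

import pandas as pd

# Hypothetical data: the first comment opens a quote that is never closed.
raw = 'id,comment\n1,"unclosed\n2,ok\n'

try:
    pd.read_csv(io.StringIO(raw))
    default_failed = False
except pd.errors.ParserError:
    # The default parser reaches the end of the data while still inside the quoted field.
    default_failed = True

# With quoting disabled, the stray quote is kept as part of the value.
df = pd.read_csv(io.StringIO(raw), quoting=csv.QUOTE_NONE)

print(default_failed, df.loc[0, "comment"])
```

The stray quote character remains in the data after loading, so a follow-up cleanup such as stripping quote characters from the affected column may still be needed.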

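If we combine all of the above in a single read_csv() call for our df, the sample file loads without errors. Note that the two rows whose Location contains a comma are skipped, because with quoting disabled each of them splits into four fields.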

df = pd.read_csv('filename.csv', sep=',', encoding='utf-8', on_bad_lines='skip', quoting=csv.QUOTE_NONE, quotechar='"')

Output:

       Name   Age  Location
0      John  25.0  New York
1       Sam   NaN   Chicago
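
The result above can be reproduced with the sketch below, which reads the sample from an in-memory string (an assumption made to keep the snippet self-contained) and reports how many rows survived. The encoding argument is left out here because the data is already a decoded string.

```python
import csv
import io

import pandas as pd

sample = '''Name,Age,Location
John,25,New York
Jane,30,"Los Angeles, CA"
Sam,,Chicago
Alex,28,"San Francisco, CA"
'''

df = pd.read_csv(
    io.StringIO(sample),
    sep=",",
    on_bad_lines="skip",     # drop rows with too many fields
    quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary data
    quotechar='"',
)

data_rows = len(sample.strip().splitlines()) - 1  # lines after the header
print(f"kept {len(df)} of {data_rows} rows")
print(df["Name"].tolist())
```

Comparing the kept and total row counts makes it clear that this combination trades completeness for a read that does not fail, so it is worth checking whether the dropped rows matter before relying on it.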

Conclusion

The “Error tokenizing data” message in Pandas can be frustrating, but it is usually an indication of issues with the formatting of the CSV file. By following the steps outlined in this blog post, you can fix the error and continue processing your data with Pandas. Remember to always check your data for issues before trying to read it into Pandas, and to specify any necessary parameters in the read_csv() function to ensure that Pandas can properly parse the data.
