How to Fix Python Pandas Error Tokenizing Data

As a data scientist or software engineer, you may often come across data that needs to be cleaned and processed before analysis. One popular tool for data processing is Python's Pandas library, which is widely used in the data science community. However, when using Pandas to read in a CSV file, you may encounter an error that says “Error tokenizing data”. In this blog post, we will explain what this error means and provide steps on how to fix it.

Table of Contents

  1. Understanding the Error
  2. Fixing the Error
  3. Conclusion

Understanding the Error

When Pandas tries to read in a CSV file using the read_csv() function, it splits the data into individual rows and columns based on the delimiter specified in the sep parameter (which defaults to a comma). However, if the data in the CSV file is not properly formatted, Pandas may encounter issues with splitting the data into rows and columns.

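The “Error tokenizing data” message usually indicates that there is an issue with how the data is formatted in the CSV file. Most often the parser found a row with more fields than it expected, and the full message reads something like “Expected 3 fields in line 3, saw 4”. This can be due to a number of reasons, such as:

  • A row has extra fields, for example because a value contains a delimiter that is not quoted.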
  • The delimiter used in the file is not consistent throughout the file.
  • There are unbalanced quotes in the CSV file.
  • There are special characters in the CSV file that are not properly encoded.
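
To make this concrete, here is a minimal sketch that reproduces the error from an in-memory string instead of a file. The data is hypothetical (it is not the sample file used in this post): the third line contains an unquoted comma, so it has four fields where the header declares three.

```python
import io

import pandas as pd

# Hypothetical data: line 3 has an unquoted comma, so it splits into 4 fields.
raw = "Name,Age,Location\nJohn,25,New York\nJane,30,Los Angeles, CA\n"

try:
    pd.read_csv(io.StringIO(raw))
except pd.errors.ParserError as exc:
    # The message names the offending line and the field counts.
    message = str(exc)
    print(message)
```

The line number in the message is usually the quickest pointer to the row that needs attention.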

We will use the following sample CSV file, which has a missing value and quoted fields that contain commas:

Name,Age,Location
John,25,New York
Jane,30,"Los Angeles, CA"
Sam,,Chicago
Alex,28,"San Francisco, CA"

Fixing the Error

Here are some steps you can take to fix the “Error tokenizing data” issue in Pandas:

Step 1: Check the CSV file for issues

The first step is to check the CSV file for any obvious issues. Open the file in a text editor or spreadsheet software and look for any missing values or inconsistencies in the delimiter used. You can also try opening the file in a different program to see if it can be read properly.
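
For larger files, scanning by eye is impractical. As a rough sketch, the check can be automated with Python's built-in csv module by comparing each row's field count to the header. The helper name find_ragged_rows and the sample string are assumptions made for illustration, not part of Pandas.

```python
import csv
import io


def find_ragged_rows(handle, delimiter=","):
    """Return (line_number, field_count) for rows whose width differs from the header."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    ragged = []
    for row in reader:
        # csv.reader yields [] for completely blank lines; ignore those.
        if row and len(row) != len(header):
            ragged.append((reader.line_num, len(row)))
    return ragged


# Hypothetical data with one malformed line (an unquoted comma on line 3).
raw = "Name,Age,Location\nJohn,25,New York\nJane,30,Los Angeles, CA\nSam,,Chicago\n"
problems = find_ragged_rows(io.StringIO(raw))
print(problems)
```

For a file on disk, the same function can be given a handle from open(path, newline=""), which is how the csv module expects files to be opened.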

Step 2: Specify the delimiter

If the file uses a delimiter other than a comma, you can specify the delimiter explicitly with the sep parameter of the read_csv() function. For example, if the delimiter is a semicolon, you can use the following code:

import pandas as pd

df = pd.read_csv('filename.csv', sep=';')
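
If you do not know the delimiter in advance, one option is to let Pandas guess it. The sketch below passes sep=None together with engine="python", which asks the Python parsing engine to detect the separator using the standard library's csv.Sniffer. The semicolon-separated string is made-up data, and the guess is a heuristic, so it is worth checking the resulting columns.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited data.
raw = "Name;Age;Location\nJohn;25;New York\nSam;;Chicago\n"

# sep=None is only supported by the Python engine, which sniffs the delimiter.
df = pd.read_csv(io.StringIO(raw), sep=None, engine="python")
print(df.columns.tolist())
```

The Python engine is slower than the default C engine, so once the delimiter is known it is usually better to pass it explicitly with sep.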

Step 3: Use the correct encoding

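If there are special characters in the CSV file that are not properly encoded, you can specify the encoding type in the read_csv() function. Pandas already assumes UTF-8 by default, so this parameter matters most when the file uses a different encoding, such as latin-1. For example, to state the UTF-8 encoding explicitly, you can use the following code: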

import pandas as pd

df = pd.read_csv('filename.csv', encoding='utf-8')
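
When the encoding is unknown, a common pattern is to try a short list of candidates in order. The following is a sketch under assumptions: the helper read_csv_with_fallback and the candidate list are hypothetical, and the bytes are made-up latin-1 data that is not valid UTF-8.

```python
import io

import pandas as pd


def read_csv_with_fallback(data, encodings=("utf-8", "latin-1")):
    """Try each encoding in turn; return the DataFrame and the encoding that worked."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(io.BytesIO(data), encoding=enc), enc
        except UnicodeDecodeError as exc:
            last_error = exc
    raise last_error


# Hypothetical file contents, encoded as latin-1.
raw = "Name,City\nJosé,Málaga\n".encode("latin-1")
df, used = read_csv_with_fallback(raw)
print(used)
```

Keep in mind that latin-1 can decode any byte sequence, so it never fails. If it is the wrong guess, the file loads but accented characters come out garbled, so inspect a few values after loading.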

Step 4: Skip rows with errors

If there are only a few rows in the CSV file that have issues, you can skip those rows and continue reading in the rest of the file. You can do this by using the on_bad_lines parameter in the read_csv() function. It was added in pandas 1.3 and replaces the older error_bad_lines parameter, which was removed in pandas 2.0. For example:

import pandas as pd

df = pd.read_csv('filename.csv', on_bad_lines='skip')

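This will skip any rows that have too many fields and continue reading in the rest of the file. The skipped rows are dropped silently, so it is worth checking the row count afterwards.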
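
If you would rather keep a record of what was dropped, on_bad_lines also accepts a function. This is a sketch that assumes pandas 1.4 or later and engine="python", which the callable form requires. The function name remember_bad_line and the sample string are illustrative only.

```python
import io

import pandas as pd

# Hypothetical data: line 3 has an unquoted comma and therefore 4 fields.
raw = "Name,Age,Location\nJohn,25,New York\nJane,30,Los Angeles, CA\nSam,,Chicago\n"

bad_rows = []


def remember_bad_line(fields):
    """Record the already-split fields of a bad line, then drop the line."""
    bad_rows.append(fields)
    return None  # returning None tells pandas to skip this line


df = pd.read_csv(io.StringIO(raw), on_bad_lines=remember_bad_line, engine="python")
print(len(df), bad_rows)
```

The collected rows can then be written to a log or repaired by hand and appended later.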

Step 5: Fix unbalanced quotes

If there are unbalanced quotes in the CSV file, you can use the quoting parameter in the read_csv() function to specify how quotes should be handled. For example, if the quotes are double quotes ("), you can use the following code:

import csv
import pandas as pd

df = pd.read_csv('filename.csv', quoting=csv.QUOTE_NONE, quotechar='"')

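This will tell Pandas to treat quotes as regular characters and not as markers that enclose a field. As a side effect, any delimiter inside a quoted value is then treated as a real field separator.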
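
The sketch below shows the effect on a made-up example where one field opens a quote that is never closed. With the default settings the parser keeps reading until the end of the input looking for the closing quote and fails. With quoting=csv.QUOTE_NONE the stray quote is kept as part of the value.

```python
import csv
import io

import pandas as pd

# Hypothetical data: the quote before Chicago is never closed.
raw = 'Name,Age,Location\nJohn,25,New York\nSam,28,"Chicago\nAlex,31,Boston\n'

default_failed = False
try:
    pd.read_csv(io.StringIO(raw))
except pd.errors.ParserError:
    default_failed = True

# Treat the quote character as ordinary text.
df = pd.read_csv(io.StringIO(raw), quoting=csv.QUOTE_NONE)
print(default_failed, df["Location"].tolist())
```

The stray quote remains in the data, so a follow-up cleanup such as df["Location"].str.strip('"') may still be needed.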

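If we add all of the above together in a single read_csv() call (with both csv and pandas imported), the sample file loads without errors. Note that the result is not lossless: because quoting=csv.QUOTE_NONE treats the commas inside the quoted Location values as delimiters, the Jane and Alex rows end up with four fields and are skipped by on_bad_lines='skip'.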

df = pd.read_csv('filename.csv', sep=',', encoding='utf-8', on_bad_lines='skip', quoting=csv.QUOTE_NONE, quotechar='"')

Output:

       Name   Age  Location
0      John  25.0  New York
1       Sam   NaN   Chicago

Conclusion

The “Error tokenizing data” message in Pandas can be frustrating, but it is usually an indication of issues with the formatting of the CSV file. By following the steps outlined in this blog post, you can fix the error and continue processing your data with Pandas. Remember to always check your data for issues before trying to read it into Pandas, and to specify any necessary parameters in the read_csv() function to ensure that Pandas can properly parse the data.

