How to Fix Python Pandas Error Tokenizing Data
Understanding the Error
When Pandas reads a CSV file with the read_csv() function, it splits the data into rows and columns based on the delimiter specified in the sep parameter (which defaults to a comma). If the file is not properly formatted, for example when a row contains more fields than the rows before it, the parser cannot split that row into the expected columns and raises an error.
The “Error tokenizing data” message usually indicates that there is an issue with how the data is formatted in the CSV file. This can be due to a number of reasons, such as:
- Some rows contain more fields than the header, or values are missing together with their delimiters, so the columns no longer line up.
- The delimiter used in the file is not consistent throughout the file.
- There are unbalanced quotes in the CSV file.
- There are special characters in the CSV file that are not properly encoded.
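To make the failure concrete, here is a minimal sketch that reproduces the error. It assumes a small in-memory CSV (the data is made up for illustration) in which one row has an unquoted comma and therefore one field more than the header.

```python
import io

import pandas as pd

# Hypothetical data: the third line has four fields, one more than the header.
raw = "Name,Age,Location\nJohn,25,New York\nJane,30,Los Angeles,CA\n"

try:
    pd.read_csv(io.StringIO(raw))
    message = None
except pd.errors.ParserError as exc:
    # pandas reports tokenizing problems as ParserError
    message = str(exc)

print(message)
```

The message should name the line where the field count stopped matching, which is usually the quickest pointer to the problem row.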
We will use the following sample CSV file, which contains an empty value and quoted fields with embedded commas:
Name,Age,Location
John,25,New York
Jane,30,"Los Angeles, CA"
Sam,,Chicago
Alex,28,"San Francisco, CA"
Fixing the Error
Here are some steps you can take to fix the “Error tokenizing data” issue in Pandas:
Step 1: Check the CSV file for issues
The first step is to check the CSV file for any obvious issues. Open the file in a text editor or spreadsheet software and look for any missing values or inconsistencies in the delimiter used. You can also try opening the file in a different program to see if it can be read properly.
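For larger files, checking by eye is impractical. The sketch below is one possible way to automate the check using only Python's standard csv module; the helper name find_irregular_rows and the sample text are assumptions for illustration, not part of Pandas.

```python
import csv
import io
from collections import Counter


def find_irregular_rows(text, delimiter=","):
    """Return (most common field count, [(row number, field count), ...])."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    # Treat the most common field count as the expected one.
    counts = Counter(len(row) for row in rows if row)
    expected = counts.most_common(1)[0][0]
    # Row numbers are 1-based and count parsed rows, not physical lines.
    irregular = [
        (number, len(row))
        for number, row in enumerate(rows, start=1)
        if row and len(row) != expected
    ]
    return expected, irregular


sample = (
    "Name,Age,Location\n"
    "John,25,New York\n"
    'Jane,30,"Los Angeles, CA"\n'
    "Sam,,Chicago\n"
    "Alex,28,San Francisco,CA\n"  # unquoted comma: four fields
)

expected, irregular = find_irregular_rows(sample)
print(expected, irregular)
```

For a file on disk, the same helper could be fed the result of reading the file as text; rows it reports are the ones worth inspecting first.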
Step 2: Specify the delimiter
If the file uses a delimiter other than a comma, Pandas will not split the columns correctly unless you specify the delimiter explicitly with the sep parameter of the read_csv() function. For example, if the delimiter is a semicolon, you can use the following code:
import pandas as pd
df = pd.read_csv('filename.csv', sep=';')
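When the delimiter is not known in advance, one option is to let Pandas guess it. The sketch below assumes a small semicolon-separated sample held in memory (the data is made up for illustration). Passing sep=None together with engine='python' asks Pandas to detect the delimiter with Python's csv.Sniffer.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated data.
raw = "Name;Age;Location\nJohn;25;New York\nSam;;Chicago\n"

# sep=None requires the slower Python engine, which sniffs the delimiter.
df = pd.read_csv(io.StringIO(raw), sep=None, engine="python")

print(df.columns.tolist())
```

Detection is a heuristic, so it is worth checking the resulting column names before relying on it.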
Step 3: Use the correct encoding
If there are special characters in the CSV file that are not properly encoded, you can specify the encoding type in the read_csv() function. For example, if the file is encoded in UTF-8, you can use the following code:
import pandas as pd
df = pd.read_csv('filename.csv', encoding='utf-8')
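If the encoding is unknown, one possible approach is to try a short list of candidates in order. The sketch below assumes Latin-1 bytes held in memory; the helper name read_with_fallback and the candidate list are illustrative choices, not a Pandas feature. In many cases a wrong encoding surfaces as a UnicodeDecodeError rather than a tokenizing error, which is what the helper catches.

```python
import io

import pandas as pd

# Hypothetical file contents, encoded as Latin-1 rather than UTF-8.
data = "Name,City\nZoé,Montréal\n".encode("latin-1")


def read_with_fallback(raw_bytes, encodings=("utf-8", "latin-1")):
    """Try each encoding in turn and return (DataFrame, encoding used)."""
    for enc in encodings:
        try:
            return pd.read_csv(io.BytesIO(raw_bytes), encoding=enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


df, used = read_with_fallback(data)
print(used)
```

Latin-1 can decode any byte sequence, so it never fails; it belongs last in the list, and the decoded text should still be checked for garbled characters.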
Step 4: Skip rows with errors
If there are only a few rows in the CSV file that have issues, you can skip those rows and continue reading in the rest of the file. You can do this with the on_bad_lines parameter in the read_csv() function, available in pandas 1.3 and later. It replaces the older error_bad_lines parameter, which was removed in pandas 2.0. For example:
import pandas as pd
df = pd.read_csv('filename.csv', on_bad_lines='skip')
This will skip any rows with errors and continue reading in the rest of the file.
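Silently dropping rows can hide data problems, so it may help to keep a record of what was skipped. The sketch below assumes pandas 1.4 or later, where on_bad_lines also accepts a callable when engine='python' is used. The function name remember_bad_line and the sample data are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical data: the Jane row has an unquoted comma and four fields.
raw = (
    "Name,Age,Location\n"
    "John,25,New York\n"
    "Jane,30,Los Angeles,CA\n"
    "Sam,,Chicago\n"
)

skipped = []


def remember_bad_line(fields):
    """Record the offending row, then return None so pandas drops it."""
    skipped.append(fields)
    return None


df = pd.read_csv(
    io.StringIO(raw),
    on_bad_lines=remember_bad_line,
    engine="python",  # callables are only supported by the Python engine
)

print(len(df), skipped)
```

The collected rows can then be reviewed or repaired separately instead of being lost.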
Step 5: Fix unbalanced quotes
If there are unbalanced quotes in the CSV file, you can use the quoting parameter in the read_csv() function to specify how quotes should be handled. For example, to turn off the special handling of double quotes ("), you can use the following code:
import csv
import pandas as pd
df = pd.read_csv('filename.csv', quoting=csv.QUOTE_NONE, quotechar='"')
This tells Pandas to treat quote characters as ordinary data rather than as markers that enclose a field. Note that any delimiter inside a quoted field is then treated as a real delimiter.
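If we combine all of the above in a single read_csv() call, the sample file loads without raising the error. Because quoting=csv.QUOTE_NONE makes the commas inside the quoted Location values act as delimiters, the rows for Jane and Alex end up with four fields and are skipped by on_bad_lines='skip', so only two rows remain.
import csv
df = pd.read_csv('filename.csv', sep=',', encoding='utf-8', on_bad_lines='skip', quoting=csv.QUOTE_NONE, quotechar='"')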
Output:
   Name   Age  Location
0  John  25.0  New York
1   Sam   NaN   Chicago
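For comparison, the sketch below reads the sample data twice from an in-memory string: once with default settings and once with quote handling disabled and bad lines skipped. It is meant only to illustrate how these two options interact on this particular sample.

```python
import csv
import io

import pandas as pd

sample = (
    "Name,Age,Location\n"
    "John,25,New York\n"
    'Jane,30,"Los Angeles, CA"\n'
    "Sam,,Chicago\n"
    'Alex,28,"San Francisco, CA"\n'
)

# Default quoting keeps the commas inside the quoted fields.
default_df = pd.read_csv(io.StringIO(sample))

# With quoting disabled, the quoted rows gain a fourth field and are skipped.
no_quote_df = pd.read_csv(
    io.StringIO(sample),
    quoting=csv.QUOTE_NONE,
    on_bad_lines="skip",
)

print(len(default_df), len(no_quote_df))
```

On well-formed quoted data like this, the default settings keep all four rows, so disabling quoting is best reserved for files where the quotes themselves are broken.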
Conclusion
The “Error tokenizing data” message in Pandas can be frustrating, but it is usually an indication of issues with the formatting of the CSV file. By following the steps outlined in this blog post, you can fix the error and continue processing your data with Pandas. Remember to always check your data for issues before trying to read it into Pandas, and to specify any necessary parameters in the read_csv() function to ensure that Pandas can properly parse the data.