Pandas CSV error: Error tokenizing data. C error: EOF inside string starting at line

Tokenizing Error

Recently, I burned about 3 hours trying to load a large CSV file into Python Pandas using the read_csv function, only to consistently run into the following error:

ParserError                               Traceback (most recent call last)
<ipython-input-6-b51ad8562823> in <module>()
...
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.read (pandas\_libs\parsers.c:10862)()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory (pandas\_libs\parsers.c:11138)()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_rows (pandas\_libs\parsers.c:11884)()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows (pandas\_libs\parsers.c:11755)()

pandas\_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error (pandas\_libs\parsers.c:28765)()

Error tokenizing data. C error: EOF inside string starting at line XXXX


There was an erroneous character about 5000 lines into the CSV file that prevented the Pandas CSV parser from reading the entire file. Excel had no problems opening the file, and no amount of saving/re-saving/changing encodings was working. Manually removing the offending line worked, but ultimately, another character 6000 lines further into the file caused the same issue.
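The error message hints at a likely cause: the C tokenizer reached the end of the file while it was still inside a quoted field, which usually points to an unmatched quote character somewhere earlier. The sketch below is one rough way to find suspect lines before loading the file. It assumes the quote character is a double quote and that no quoted field legitimately spans several lines; the helper name find_unbalanced_quote_lines is made up for illustration.

```python
import io


def find_unbalanced_quote_lines(lines, quotechar='"'):
    """Return 1-based numbers of lines with an odd count of quote characters.

    Assumption: quoted fields never span multiple lines, so every
    well-formed line contains an even number of quote characters.
    """
    suspects = []
    for lineno, line in enumerate(lines, start=1):
        if line.count(quotechar) % 2 == 1:
            suspects.append(lineno)
    return suspects


# Tiny in-memory sample; with a real file, pass an open file handle instead.
sample = io.StringIO('id,comment\n1,"fine"\n2,"never closed\n3,"also fine"\n')
print(find_unbalanced_quote_lines(sample))  # [3]
```

A line flagged this way is only a candidate. If the data contains quoted fields with embedded newlines, the check will report false positives.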

The solution was to use the parameter engine='python' in the read_csv function call. The Pandas CSV parser can use two different “engines” to parse CSV files: Python or C (the default).

pandas.read_csv(filepath_or_buffer, sep=',', delimiter=None,
                header='infer', names=None,
                index_col=None, usecols=None, squeeze=False,
                ..., engine=None, ...)

engine : {'c', 'python'}, optional

Parser engine to use. The C engine is faster while the python engine is currently more feature-complete.

The Pandas documentation describes the Python engine as slower but more feature-complete. In this case, however, the Python engine solved the problem without a massive time impact (the file had approximately 200,000 rows).
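Since the C engine is the faster of the two, one option is to keep it as the first attempt and switch to the Python engine only when parsing fails. The following is a minimal sketch of that idea, not a guaranteed fix: the helper name read_csv_with_fallback is hypothetical, and whether the Python engine can read a particular malformed file depends on the file.

```python
import pandas as pd


def read_csv_with_fallback(path, **kwargs):
    """Try the default C engine first; retry with the Python engine on ParserError.

    `path` should be a file path (not an already-open handle) so that the
    second attempt starts reading from the beginning of the file again.
    Do not pass `engine` in kwargs; the helper chooses it.
    """
    try:
        return pd.read_csv(path, **kwargs)
    except pd.errors.ParserError:
        # The C tokenizer gave up; the slower Python engine may still cope.
        return pd.read_csv(path, engine="python", **kwargs)
```

Usage would look like df = read_csv_with_fallback("my_file.csv"), where my_file.csv is a placeholder name. If the Python engine also raises a ParserError, the file itself most likely needs repairing.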

UnicodeDecodeError

The other big problem with reading in CSV files that I come across is the error:

“UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position XX: invalid start byte”

Character encoding issues can usually be fixed by opening the CSV file in Sublime Text and using “Save with Encoding” to choose UTF-8. Adding encoding='utf8' to the pandas.read_csv call then allows Pandas to open the CSV file without trouble. This error appears regularly for me with files saved in Excel.

encoding : str, default None

Encoding to use for UTF when reading/writing (ex. 'utf-8'). List of Python standard encodings
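An alternative to re-saving the file is to tell Pandas which encoding the file already uses. Byte 0x96 is an en dash in Windows-1252 (cp1252), a common encoding for files that originate on Windows, so that is a reasonable guess for this particular message. The sketch below tries a short list of encodings in order. The helper name read_csv_try_encodings and the default candidate list are assumptions for illustration, not part of Pandas.

```python
import pandas as pd


def read_csv_try_encodings(path, encodings=("utf-8", "cp1252"), **kwargs):
    """Return the DataFrame from the first encoding that decodes the file.

    Assumption: the file is in one of the candidate encodings. Successful
    decoding does not prove the guess is right, so inspect the result.
    """
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding, **kwargs)
        except UnicodeDecodeError:
            # Wrong guess for this file; move on to the next candidate.
            continue
    raise ValueError(f"None of the encodings {list(encodings)} could decode {path!r}")
```

Order matters here: UTF-8 goes first because it rejects most text that is not UTF-8, whereas single-byte encodings such as cp1252 accept almost any byte sequence and can silently produce wrong characters.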

22 Comments


Mbuotidem Isaac

6 years ago

Thank you for this article! Both errors occurred on my data set and your solutions fixed it for me. You saved me the 3 hours.

James

6 years ago

Thank you, I’m a complete n00b to python and the above article was perfect.

# works like a charm
, engine='python')

Hugh Wright

6 years ago

Thanks, this was driving me mad! Even turned my ad-blocker off for you 🙂

Reshma

6 years ago

Hi Shane,

I tried encoding the CSV in Sublime Text and changed engine="python"
I still get an error 🙁
ParserError: Expected 1 fields in line 34, saw 2

Is there something I need to change about my code?

This is my code:
data = pd.read_csv('Technical_Assessment_Package_Data_Analyst/survey_data.csv', encoding='utf8', engine="python")

Thanks in advance.

Shane

Author

Reply to Reshma

6 years ago

Hi Reshma,

Sounds like the code is okay – but the error is mentioning that your CSV data is ill formatted – specifically on line 34 – you should have a look – have you an extra comma or an unescaped string in line 34?

Shane

Reshma

Reply to Shane

6 years ago

Thanks for the reply Shane.
But I figured a way out. Not sure how it helped, but it did.
I first saved the file in the UTF-8 CSV format in Excel itself, and then encoded the new file again in Sublime.

Thanks again.

Bill

5 years ago

Thanks – saved a lot of time – thanks

shabbir

5 years ago

Good info.

Ganesh

5 years ago

engine=python Life saver.. Thanks a lot .. saved me lot of time.

Ganesh M Bhat

5 years ago

Hi, thanks a lot. This worked for me but gave me another error: “ParserError: unexpected end of data”. Can you please tell me why this is happening? My line is as follows: `df = pd.read_csv("test_view.csv", engine='python')`

Lisa

Reply to Ganesh M Bhat

4 years ago

Did you figure this issue out? I’ve had the same experience with what is a very clean dataset.

Christina Boididou (@CMpoi)

5 years ago

Thanks, you saved my time!

Vent Poli

5 years ago

Helped a ton. You are the kind of hero the internet needs

Niya

4 years ago

Thank you so much, it helped me a lot!!!

Last edited 4 years ago by Niya

govind soni

3 years ago

data = pd.read_csv("file name", engine='python', error_bad_lines=False)  # on pandas 1.3+, use on_bad_lines='skip' instead
Try this line as well if you still get an error.

Last edited 3 years ago by govind soni

Jeshurun

2 years ago

Thank you so much man. I was getting really frustrated. You are my hero.

Franco

2 years ago

Thank you so much

Naveed

2 years ago

Step 1: Install Sublime Text. The website link is: https://www.sublimetext.com/
Step 2: After installing Sublime Text, open your dataset in it.
Step 3: Click File menu -> Save with Encoding -> UTF-8, then close the dataset.
Step 4: Upload your dataset to Jupyter Notebook.
Step 5: pd.read_csv('csv file name', sep=';', engine='python', error_bad_lines=False)

Aaron McAdie

2 years ago

Thank you so much Shane! This saved me a ton of banging my head against the wall. Hope you enjoy the coffee 😉

Saloni

1 year ago

I had been searching for a solution for the past hour, and engine='python' worked like magic for me.

Andres

1 year ago

Thanks

Samay

1 year ago

Thanks it is working
