Add ability to process bad lines for read_csv

CSV files can contain errors, for example:

    NAME,PAT
    Peter,cat
    really bad line
    Fedor,cat

To skip really bad lines, there is the error_bad_lines=False parameter.
Another example, where a field contains the delimiter but is not quoted:

    NAME,PAT
    Peter,cat
    Ira,cat,dog
    Fedor,cat

With quotes, it would look like this:

    NAME,PAT
    Peter,cat
    Ira,"cat,dog"
    Fedor,cat

So this line is easy to fix if you know that the first field does not contain extra separators.
See also the issue about extra trailing delimiters: #2886
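
To illustrate the "first field is clean" case above, here is a minimal sketch using only the standard csv module. The helper name merge_surplus_into_last is hypothetical and not part of pandas; it assumes that every column except the last one is free of stray separators, so any surplus fields must belong to the last column.

```python
import csv
import io

def merge_surplus_into_last(items, n_columns, sep=","):
    """Join every field from the last expected column onward into one field.

    Hypothetical helper: assumes only the last column can contain
    unquoted separators.
    """
    if len(items) <= n_columns:
        return list(items)
    return items[:n_columns - 1] + [sep.join(items[n_columns - 1:])]

raw = "NAME,PAT\nPeter,cat\nIra,cat,dog\nFedor,cat\n"
rows = list(csv.reader(io.StringIO(raw)))
header, body = rows[0], rows[1:]

# Repair each data row against the header width.
fixed = [merge_surplus_into_last(row, len(header)) for row in body]
```

Here the row for Ira is repaired to two fields, the second being "cat,dog", while the well-formed rows pass through unchanged.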

A more real-life example:

    82,52,29,11,2,2013-08-02 00:00:00,,,gen,,FDP, employee,0,1,gen,,,0
    55,69,36,19,2,2013-10-28 00:00:00,,,gen,,FDP employee,0,1,gen,,,0

The two lines differ in FDP employee versus FDP, employee: the extra comma gives the first line one field too many. So it would be great to be able to process such bad lines with a custom handler.

My proposal is to add an additional parameter, process_bad_lines, to read_csv.
For example, if I want to fix a line:

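    def bad_line_handler(items):
        '''probably ugly example, but let's imagine that `FDP, employee` is half of our data'''
        # items is the list of fields from the bad line
        fdp_index = items.index('FDP')
        # replace the two split fields with the single intended value
        return items[:fdp_index] + ['FDP, employee'] + items[fdp_index + 2:]

    # proposed usage; process_bad_lines is the new parameter suggested here
    pd.read_csv(file, process_bad_lines=bad_line_handler)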

error_bad_lines and warn_bad_lines can work as before, but the parser would first try to repair the bad line with the process_bad_lines handler.
If the handler returns None, it is probably better to just skip the line without raising an exception (this is probably more flexible); to keep the current behavior, the handler can simply return the items argument unchanged. Alternatively, None could be treated the same as a bad line, in which case it is better to raise exceptions from inside the process_bad_lines handler.
I cannot always create the CSV file with quotes myself; often somebody has already sent me a bad CSV.
I can pre-process the file, but that takes more time and work; see for example http://stackoverflow.com/questions/14550441/problems-reading-csv-file-with-commas-and-characters-in-pandas
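
The handler semantics described above can be sketched outside of pandas as a pre-processing step. This is only an illustration of the proposed behavior, not how read_csv works: read_rows and fdp_handler are hypothetical names, the expected column count is passed in by hand, and a None return is taken to mean "skip this line".

```python
import csv
import io

def read_rows(text, n_columns, handler, sep=","):
    """Parse CSV text and pass rows with a wrong field count to handler.

    handler(items) returns a repaired list of fields, or None to skip
    the line. Exceptions raised by the handler propagate to the caller.
    """
    rows = []
    for items in csv.reader(io.StringIO(text), delimiter=sep):
        if len(items) != n_columns:
            items = handler(items)
            if items is None:
                continue  # handler asked to drop this line
            if len(items) != n_columns:
                raise ValueError("handler returned a row of the wrong length")
        rows.append(items)
    return rows

def fdp_handler(items):
    # Same idea as bad_line_handler, but gives up (returns None)
    # when the line does not contain the field we know how to repair.
    if "FDP" not in items:
        return None
    i = items.index("FDP")
    return items[:i] + ["FDP, employee"] + items[i + 2:]

text = (
    "55,69,36,19,2,2013-10-28 00:00:00,,,gen,,FDP employee,0,1,gen,,,0\n"
    "82,52,29,11,2,2013-08-02 00:00:00,,,gen,,FDP, employee,0,1,gen,,,0\n"
    "really bad line\n"
)
rows = read_rows(text, n_columns=17, handler=fdp_handler)
```

With this input the second line is repaired to 17 fields and the third is dropped. The repaired rows could then be handed to a DataFrame constructor, at the cost of the extra pre-processing pass mentioned above.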

Some additions:
For example, if there are not many string fields, I can assume that one of the strings contains the separator:

    int,int,int,int,int,datetime,,,str,,str,str,int,int,str,,,int
    int,int,int,int,int,datetime,,,str,,str ,int,int,str,,,int

But this can work badly when there are many string fields:

    int,int,int,int,int,datetime,str,str,str,str,str,str,int,int,str,,,int
    int,int,int,int,int,datetime, str,str,str,str,str ,int,int,str,,,int

However, it would also be great to have default methods that fix such lines by concatenating the leftmost string fields:

    int,int,int,int,int,datetime,str,str,str,str,str,str,int,int,str,,,int
    int,int,int,int,int,datetime, STR ,str,str,str,str,int,int,str,,,int

    # for example with next syntax
    pd.read_csv(file, process_bad_lines='try_concat_left')

or the rightmost string fields:

    int,int,int,int,int,datetime,str,str,str,str,str,str,int,int,str,,,int
    int,int,int,int,int,datetime,str,str,str,str, STR ,int,int,str,,,int

    # for example with next syntax
    pd.read_csv(file, process_bad_lines='try_concat_right')
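
One possible reading of these two strategies, as a rough sketch: the surplus fields are collapsed into a single column, either the first or the last column of the run of string fields. The function names mirror the proposed option strings, but the signatures (the expected column count and the position of the string run passed in explicitly) are assumptions made for illustration only.

```python
def merge_at(items, n_columns, pos, sep=","):
    """Collapse the surplus fields into the single column at index pos."""
    surplus = len(items) - n_columns
    if surplus <= 0:
        return list(items)  # nothing to repair
    merged = sep.join(items[pos:pos + surplus + 1])
    return items[:pos] + [merged] + items[pos + surplus + 1:]

def try_concat_left(items, n_columns, first_str, sep=","):
    # Assume the stray separator was in the first string column of the run.
    return merge_at(items, n_columns, first_str, sep)

def try_concat_right(items, n_columns, last_str, sep=","):
    # Assume the stray separator was in the last string column of the run.
    return merge_at(items, n_columns, last_str, sep)

# Expected layout: int, str, str, str, str, int (6 columns, strings at 1..4).
items = "1,x,y,z,w,v,9".split(",")
left = try_concat_left(items, n_columns=6, first_str=1)
right = try_concat_right(items, n_columns=6, last_str=4)
```

Both calls return six fields; they differ only in which neighbours get glued together ("x,y" on the left, "w,v" on the right), which is exactly why this guess becomes unreliable when there are many string columns.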

and also removing extra trailing delimiters:

    pd.read_csv(file, process_bad_lines='skip_right_delimiters')
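
As a sketch of what such a default could do (the function name and signature are hypothetical, chosen to match the proposed option string), trailing empty fields are dropped only while the line is still longer than expected, so legitimately empty last columns survive:

```python
def skip_right_delimiters(items, n_columns):
    """Drop empty trailing fields until the row has n_columns fields."""
    items = list(items)
    while len(items) > n_columns and items[-1] == "":
        items.pop()
    return items

too_long = "a,b,c,,,".split(",")
trimmed = skip_right_delimiters(too_long, n_columns=3)
kept_empty = skip_right_delimiters(too_long, n_columns=4)
```

If the row is still too long after trimming (the surplus is not made of empty fields), it is returned as-is and would remain a bad line for the usual error_bad_lines handling.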

Add ability to process bad lines for read_csv #5686

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Add ability to process bad lines for read_csv #5686

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions