-
- Notifications
You must be signed in to change notification settings - Fork 19.3k
Description
CSV files can contains some errors, for example:
NAME,PAT Peter,cat really bad line Fedor,cat to skip really bad lines exist error_bad_lines=False parameter.
Another example without quotes and delimiters in field:
NAME,PAT Peter,cat Ira,cat,dog Fedor,cat Which with quotes will look like this:
NAME,PAT Peter,cat Ira,"cat,dog" Fedor,cat So it easy fix this line if know that first field not contain extra separators.
Also extra trailing delimiters issue: #2886
More real life example:
82,52,29,11,2,2013-08-02 00:00:00,,,gen,,FDP, employee,0,1,gen,,,0 55,69,36,19,2,2013-10-28 00:00:00,,,gen,,FDP employee,0,1,gen,,,0 There are difference for FDP employee and FDP, employee. So it will be grate to have ability process this bad lines with own handler.
My proposition add additional parameter process_bad_lines for read_csv.
For example, if I want fix line:
def bad_line_handler(items): '''probably ugly example, but lets imagine that `FDP, employee` is half of our data''' fdp_index = items.index('FDP') return items[:fdp_index] + ['FDP, employee'] + items[fdp_index + 2:] pd.read_csv(file, process_bad_lines=bad_line_handler) error_bad_lineandwarn_bad_linecan work as before but at first once try replace bad string withprocess_bad_lineshandler.- if
process_bad_lineswill returnNonewhen probably better just skip this line without exceptions (probably it more flexible), to store compatibility just return unchangeditemsparameter. OtherwiseNonecan be equal to bad line and better raise exceptions fromprocess_bad_lineshandler. - not always I can already create CSV file with quotes, somebody already send me bad CSV.
- I can pre-process file but it will take more time and work, see for example http://stackoverflow.com/questions/14550441/problems-reading-csv-file-with-commas-and-characters-in-pandas
Some additions:
For example I have no much string fields and can assume that one of strings contains separator:
int,int,int,int,int,datetime,,,str,,str,str,int,int,str,,,int int,int,int,int,int,datetime,,,str,,str ,int,int,str,,,int But it can work bad for many strings:
int,int,int,int,int,datetime,str,str,str,str,str,str,int,int,str,,,int int,int,int,int,int,datetime, str,str,str,str,str ,int,int,str,,,int However it also be grate have default methods to fix this strings with concatenating left strings:
int,int,int,int,int,datetime,str,str,str,str,str,str,int,int,str,,,int int,int,int,int,int,datetime, STR ,str,str,str,str,int,int,str,,,int # for example with next syntax pd.read_csv(file, process_bad_lines='try_concat_left') or right strings:
int,int,int,int,int,datetime,str,str,str,str,str,str,int,int,str,,,int int,int,int,int,int,datetime,str,str,str,str, STR ,int,int,str,,,int # for example with next syntax pd.read_csv(file, process_bad_lines='try_concat_right') and also removing extra trailing delimiters:
pd.read_csv(file, process_bad_lines='skip_right_delimiters')