-
- Notifications
You must be signed in to change notification settings - Fork 19.4k
ENH: Add arrow engine to to_csv #54171
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/io/formats/csvs.py Outdated
| quoting_style=pa_quoting, | ||
| ) | ||
| # pa_csv.write_csv(table, handle, write_options) | ||
| pa_csv.write_csv(table, self.filepath_or_buffer, write_options) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't decided yet whether to let pyarrow figure out how to do file handling on its own or use get_handle.
get_handle doesn't work out of the box due to pyarrow only working on binary handles IIRC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to try to convert the handle to a binary one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How to set specific encoding for pyarrow's write_csv?
pandas/core/generic.py Outdated
| path_or_buf: None = ..., | ||
| sep: str = ..., | ||
| na_rep: str = ..., | ||
| engine: str = "python", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would need to be placed at the end to be backward compat with positional argument calls I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sidenote: we should make most of those arguments keyword-only, like we did for read_ functions (after a deprecation cycle, I think)
pandas/io/formats/csvs.py Outdated
| # Convert index to column and rename name to empty string | ||
| # since we serialize the index as basically a column with no name | ||
| # TODO: this won't work for multi-indexes | ||
| obj = self.obj.reset_index(names=[""]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we explicitly set the name to None after the reset_index?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't want None, but actually this empty string (for compatibility with current to_csv).
But, we should only rename to "" if the index has no name itself, i.e. only when the name was originally None.
So I think we can do something like:
| obj = self.obj.reset_index(names=[""]) | |
| new_names = [label if label is not None else "" for label in self.obj.index.names] | |
| obj = self.obj.reset_index(names=new_names) |
and then that should also work fine for MultiIndex?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the modifications. I tested it and it seems to work (just a small issue with pyarrow quoting all strings).
What if the MultiIndex doesn't have names, though?
IIRC, pyarrow doesn't allow duplicate column names, so the trick wouldn't work anymore (can't have two "" columns).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @lithomas1 , how to make sure that pyarrow and pandas have the same quoting results? In pyarrow 12.0.1, setting quoting_style to needed(the default value) will quote all strings, but by default pandas will only quote string if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is a known issue with pyarrow, please read my bottom comment.
#54171 (comment)
| # Map quoting arg to pyarrow equivalents | ||
| pa_quoting = None | ||
| if self.quoting == csvlib.QUOTE_MINIMAL: | ||
| pa_quoting = "needed" | ||
| elif self.quoting == csvlib.QUOTE_ALL: | ||
| # TODO: Is this a 1-1 mapping? | ||
| # This doesn't quote nulls, check if Python does this | ||
| pa_quoting = "all_valid" | ||
| elif self.quoting == csvlib.QUOTE_NONE: | ||
| pa_quoting = "none" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
quoting_style was only added in pyarrow 11.0 (apache/arrow#14722), we can only pass this argument to WriteOptions below for 11.0+
Based on the above, the default should be the same ("needed"), so that should still work fine for older versions of pyarrow.
| OK, this is down to 34 failing tests now (out of 87). I think I'll xfail all the failing tests and get this merge-ready for now soonish. Most of the things failing are just subtle differences in the way Pyarrow/Python CSV engine write things. |
| pre-commit ci fix |
| pre-commit.ci autofix |
for more information, see https://pre-commit.ci
| I don't think https://github.com/pandas-dev/pandas/pull/54171/files#r1267066476 was addressed otherwise looks good |
| Just a merge conflict otherwise looks good |
|
| This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
| Since this feature has not been updated for a long time, I am wondering if it is still being worked on. |
| I think there is just a typing error left that I haven't resolved, I'll add an ignore if it's still there. |
| | ||
| class TestToCSV: | ||
| def test_to_csv_with_single_column(self): | ||
| @xfail_pyarrow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like we've been doing with the read_csv test recently, if the test fails due to an exception from pandas and not pyarrow we should check for that exception
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated most of these.
Since some test multiple things in a test, I left the xfail_pyarrow on some since sometimes it would fail on an earlier assert due to mismatch between the pyarrow and python output.
| if self.quotechar is not None and self.quotechar != '"': | ||
| raise ValueError('The pyarrow engine only supports " as a quotechar.') | ||
| | ||
| unsupported_options = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should escapechar be added here? It doesn't look supported by pyarrow.csv.WriteOptions.
| Looks like this PR has gotten stale so closing for now. Feel free to reopen when you have time to circle back |
doc/source/whatsnew/vX.X.X.rstfile if fixing a bug or adding a new feature.