Skip to content

Conversation

@lithomas1
Copy link
Contributor

@lithomas1 lithomas1 commented Jul 17, 2023

@lithomas1 lithomas1 added Enhancement IO CSV read_csv, to_csv Arrow pyarrow functionality labels Jul 17, 2023
quoting_style=pa_quoting,
)
# pa_csv.write_csv(table, handle, write_options)
pa_csv.write_csv(table, self.filepath_or_buffer, write_options)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't decided yet whether to let pyarrow figure out how to do file handling on its own or use get_handle.

get_handle doesn't work out of the box due to pyarrow only working on binary handles IIRC.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to try to convert the handle to a binary one?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How to set specific encoding for pyarrow's write_csv?

path_or_buf: None = ...,
sep: str = ...,
na_rep: str = ...,
engine: str = "python",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would need to be placed at the end to be backward compat with positional argument calls I think

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sidenote: we should make most of those arguments keyword-only, like we did for read_ functions (after a deprecation cycle, I think)

# Convert index to column and rename name to empty string
# since we serialize the index as basically a column with no name
# TODO: this won't work for multi-indexes
obj = self.obj.reset_index(names=[""])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we explicitly set the name to None after the reset_index?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want None, but actually this empty string (for compatibility with current to_csv).

But, we should only rename to "" if the index has no name itself, i.e. only when the name was originally None.

So I think we can do something like:

Suggested change
obj = self.obj.reset_index(names=[""])
new_names = [label if label is not None else "" for label in self.obj.index.names]
obj = self.obj.reset_index(names=new_names)

and then that should also work fine for MultiIndex?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the modifications. I tested it and it seems to work (just a small issue with pyarrow quoting all strings).

What if the MultiIndex doesn't have names, though?
IIRC, pyarrow doesn't allow duplicate column names, so the trick wouldn't work anymore (can't have two "" columns).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @lithomas1 , how to make sure that pyarrow and pandas have the same quoting results? In pyarrow 12.0.1, setting quoting_style to needed(the default value) will quote all strings, but by default pandas will only quote string if needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a known issue with pyarrow, please read my bottom comment.
#54171 (comment)

Comment on lines 270 to 279
# Map quoting arg to pyarrow equivalents
pa_quoting = None
if self.quoting == csvlib.QUOTE_MINIMAL:
pa_quoting = "needed"
elif self.quoting == csvlib.QUOTE_ALL:
# TODO: Is this a 1-1 mapping?
# This doesn't quote nulls, check if Python does this
pa_quoting = "all_valid"
elif self.quoting == csvlib.QUOTE_NONE:
pa_quoting = "none"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quoting_style was only added in pyarrow 11.0 (apache/arrow#14722), we can only pass this argument to WriteOptions below for 11.0+

Based on the above, the default should be the same ("needed"), so that should still work fine for older versions of pyarrow.

@lithomas1
Copy link
Contributor Author

OK, this is down to 34 failing tests now (out of 87).

I think I'll xfail all the failing tests and get this merge-ready for now soonish.

Most of the things failing are just subtle differences in the way Pyarrow/Python CSV engine write things.
(One particularly annoying thing is that pyarrow quotes strings even if the quotes are technically not needed. It doesn't affect roundtripping I don't think, but it makes a bunch of the tests fail).

@lithomas1 lithomas1 marked this pull request as ready for review August 3, 2023 18:37
@lithomas1
Copy link
Contributor Author

pre-commit ci fix

@lithomas1
Copy link
Contributor Author

pre-commit.ci autofix

@mroeschke
Copy link
Member

I don't think https://github.com/pandas-dev/pandas/pull/54171/files#r1267066476 was addressed otherwise looks good

@mroeschke
Copy link
Member

Just a merge conflict otherwise looks good

@mroeschke
Copy link
Member

 /home/runner/work/pandas/pandas/pandas/io/formats/csvs.py:320:17 - error: Argument of type "IO[AnyStr@_save]" cannot be assigned to parameter "csvfile" of type "SupportsWrite[str]" in function "writer" "IO[AnyStr@_save]" is incompatible with protocol "SupportsWrite[str]" "write" is an incompatible type Type "(__s: AnyStr@_save, /) -> int" cannot be assigned to type "(__s: _T_contra@SupportsWrite, /) -> object" Parameter 1: type "_T_contra@SupportsWrite" cannot be assigned to type "AnyStr@_save" Type "str" cannot be assigned to type "AnyStr@_save" (reportGeneralTypeIssues) 
@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Sep 18, 2023
@SysuJayce
Copy link

Since this feature has not been updated for a long time, I am wondering if it is still being worked on.

@lithomas1
Copy link
Contributor Author

I think there is just a typing error left that I haven't resolved, I'll add an ignore if it's still there.

@lithomas1 lithomas1 removed the Stale label Nov 22, 2023

class TestToCSV:
def test_to_csv_with_single_column(self):
@xfail_pyarrow
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like we've been doing with the read_csv test recently, if the test fails due to an exception from pandas and not pyarrow we should check for that exception

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated most of these.

Since some test multiple things in a test, I left the xfail_pyarrow on some since sometimes it would fail on an earlier assert due to mismatch between the pyarrow and python output.

if self.quotechar is not None and self.quotechar != '"':
raise ValueError('The pyarrow engine only supports " as a quotechar.')

unsupported_options = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should escapechar be added here? It doesn't look supported by pyarrow.csv.WriteOptions.

@mroeschke
Copy link
Member

Looks like this PR has gotten stale so closing for now. Feel free to reopen when you have time to circle back

@mroeschke mroeschke closed this Apr 23, 2024
@swt2c swt2c mentioned this pull request Sep 5, 2025
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Arrow pyarrow functionality Enhancement IO CSV read_csv, to_csv

5 participants