ENH: Add arrow engine to to_csv #54171

lithomas1 · 2023-07-17T21:38:31Z

closes ENH/PERF: provide pyarrow engine option for to_csv #53618
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

lithomas1 · 2023-07-17T21:42:39Z

pandas/io/formats/csvs.py

+ quoting_style=pa_quoting,
+ )
+ # pa_csv.write_csv(table, handle, write_options)
+ pa_csv.write_csv(table, self.filepath_or_buffer, write_options)


I haven't decided yet whether to let pyarrow handle files on its own or to use get_handle.

get_handle doesn't work out of the box because pyarrow only works on binary handles, IIRC.

Would it be possible to try to convert the handle to a binary one?

How would one set a specific encoding for pyarrow's write_csv?
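One possible angle on the binary-handle and encoding questions above: text handles created by `open()` or `io.TextIOWrapper` expose their underlying byte stream as `.buffer`, so a bytes-only writer could target that instead. The sketch below uses only the standard library. `as_binary_handle` and `transcoding_handle` are hypothetical helper names, not pandas or pyarrow APIs, and the second helper assumes the writer emits UTF-8 bytes in complete sequences per `write()` call.

```python
import codecs
import io


def as_binary_handle(handle):
    """Return a handle that accepts bytes, unwrapping a text wrapper if possible."""
    if isinstance(handle, io.TextIOBase):
        buffer = getattr(handle, "buffer", None)
        if buffer is None:
            # e.g. io.StringIO has no underlying byte stream to fall back on
            raise ValueError("text handle has no underlying binary buffer")
        handle.flush()  # push pending text before writing bytes directly
        return buffer
    return handle


def transcoding_handle(binary_handle, encoding):
    """Wrap a binary handle so UTF-8 bytes written to it are stored in `encoding`."""
    # Assumption: each write() carries complete UTF-8 sequences, because the
    # recoder decodes every chunk independently.
    return codecs.EncodedFile(
        binary_handle, data_encoding="utf-8", file_encoding=encoding
    )


# A text handle is unwrapped so raw bytes can be appended after the text.
raw = io.BytesIO()
text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
text.write("header\n")
as_binary_handle(text).write(b"a,b\n")

# UTF-8 bytes are re-encoded to latin-1 on their way to the target.
latin = io.BytesIO()
transcoding_handle(latin, "latin-1").write("caf\u00e9\n".encode("utf-8"))
```

Whether pyarrow accepts such wrapped objects as a sink is not established in this thread, so this only illustrates the handle conversion itself.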

mroeschke · 2023-07-18T16:57:43Z

pandas/core/generic.py

 path_or_buf: None = ...,
 sep: str = ...,
 na_rep: str = ...,
+ engine: str = "python",


I think this would need to be placed at the end to stay backward compatible with positional-argument calls.

Side note: we should make most of those arguments keyword-only, as we did for the read_ functions (after a deprecation cycle, I think).
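To make the positional-argument concern concrete, here is a small sketch with simplified stand-in functions (hypothetical names, far fewer parameters than the real `to_csv`). It shows how a parameter inserted in the middle silently shifts existing positional calls, while a trailing keyword-only parameter does not.

```python
def to_csv_old(path_or_buf=None, sep=",", na_rep=""):
    """Stand-in for the signature before the change."""
    return {"sep": sep, "na_rep": na_rep}


def to_csv_engine_in_middle(path_or_buf=None, engine="python", sep=",", na_rep=""):
    """Stand-in where the new parameter is inserted before existing ones."""
    return {"sep": sep, "na_rep": na_rep, "engine": engine}


def to_csv_engine_at_end(path_or_buf=None, sep=",", na_rep="", *, engine="python"):
    """Stand-in where the new parameter is appended and keyword-only."""
    return {"sep": sep, "na_rep": na_rep, "engine": engine}


# The same positional call that worked before the change:
old = to_csv_old(None, ";", "NA")
broken = to_csv_engine_in_middle(None, ";", "NA")  # ";" is now taken as engine
safe = to_csv_engine_at_end(None, ";", "NA")       # meaning is unchanged
```

In the first variant the caller gets no error at all, only wrong values, which is why appending the argument is the safer choice.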


mroeschke · 2023-07-18T16:59:31Z

pandas/io/formats/csvs.py

+ # Convert index to column and rename name to empty string
+ # since we serialize the index as basically a column with no name
+ # TODO: this won't work for multi-indexes
+ obj = self.obj.reset_index(names=[""])


Can we explicitly set the name to None after the reset_index?

We don't want None, but rather this empty string (for compatibility with the current to_csv).

However, we should only rename to "" if the index itself has no name, i.e. only when the name was originally None.

So I think we can do something like:

Suggested change

- obj = self.obj.reset_index(names=[""])
+ new_names = [label if label is not None else "" for label in self.obj.index.names]
+ obj = self.obj.reset_index(names=new_names)

and then that should also work fine for MultiIndex?

Thanks for the modifications. I tested it and it seems to work (just a small issue with pyarrow quoting all strings).

What if the MultiIndex doesn't have names, though?
IIRC, pyarrow doesn't allow duplicate column names, so the trick wouldn't work anymore (can't have two "" columns).
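The name-filling logic and the duplicate concern raised above can be sketched without pandas, working on plain lists of index names. `fill_index_names` and `duplicated` are hypothetical helpers for illustration. Whether pyarrow actually rejects duplicate column names is the commenter's recollection, not something this sketch verifies.

```python
from collections import Counter


def fill_index_names(names):
    """Replace missing (None) index level names with an empty string."""
    return [name if name is not None else "" for name in names]


def duplicated(labels):
    """Return the labels that occur more than once, in first-seen order."""
    counts = Counter(labels)
    return [label for label, n in counts.items() if n > 1]


single = fill_index_names([None])               # unnamed single index
named = fill_index_names(["id", None])          # partially named MultiIndex
unnamed_multi = fill_index_names([None, None])  # fully unnamed MultiIndex

# Column labels after a reset_index-style step: index levels, then data columns.
clash = duplicated(unnamed_multi + ["value"])
```

For the fully unnamed two-level case, both levels become `""`, so the resulting column labels collide, which is exactly the situation the comment asks about.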

Hello @lithomas1, how can we make sure that pyarrow and pandas produce the same quoting results? In pyarrow 12.0.1, setting quoting_style to "needed" (the default value) quotes all strings, whereas pandas by default only quotes strings when needed.

Yes, this is a known issue with pyarrow, please read my bottom comment.
#54171 (comment)


jorisvandenbossche · 2023-07-19T09:34:08Z

pandas/io/formats/csvs.py

+ # Map quoting arg to pyarrow equivalents
+ pa_quoting = None
+ if self.quoting == csvlib.QUOTE_MINIMAL:
+ pa_quoting = "needed"
+ elif self.quoting == csvlib.QUOTE_ALL:
+ # TODO: Is this a 1-1 mapping?
+ # This doesn't quote nulls, check if Python does this
+ pa_quoting = "all_valid"
+ elif self.quoting == csvlib.QUOTE_NONE:
+ pa_quoting = "none"


quoting_style was only added in pyarrow 11.0 (apache/arrow#14722), so we can only pass this argument to WriteOptions below for 11.0+.

Based on the above, the default should be the same ("needed"), so that should still work fine for older versions of pyarrow.
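A minimal sketch of the version gating discussed here, using only the standard library. It maps the `csv` quoting constants to the pyarrow style strings from the diff and only emits a `quoting_style` keyword when the (caller-supplied) version tuple is at least 11.0. `write_options_kwargs` is a hypothetical helper, and raising for non-default quoting on older versions is an assumption of this sketch, not something decided in the thread.

```python
import csv

# Mapping taken from the diff above; QUOTE_NONNUMERIC has no entry there.
_PA_QUOTING = {
    csv.QUOTE_MINIMAL: "needed",
    csv.QUOTE_ALL: "all_valid",
    csv.QUOTE_NONE: "none",
}


def write_options_kwargs(quoting, pa_version):
    """Build keyword arguments for a WriteOptions-like constructor.

    pa_version is a (major, minor) tuple supplied by the caller.
    """
    if quoting not in _PA_QUOTING:
        raise ValueError(f"quoting={quoting!r} has no pyarrow equivalent")
    style = _PA_QUOTING[quoting]
    if pa_version >= (11, 0):
        return {"quoting_style": style}
    if style != "needed":
        # Assumption: refuse rather than silently fall back to the default.
        raise ValueError("quoting_style requires pyarrow >= 11.0")
    # Older versions: rely on the default, which the thread says matches "needed".
    return {}


new_kwargs = write_options_kwargs(csv.QUOTE_ALL, (12, 0))
old_kwargs = write_options_kwargs(csv.QUOTE_MINIMAL, (10, 0))
```

Returning a kwargs dict keeps the version check in one place, so the call site can unpack it regardless of which pyarrow is installed.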


lithomas1 · 2023-07-23T00:24:17Z

OK, this is down to 34 failing tests now (out of 87).

I think I'll xfail all the failing tests and get this merge-ready soon.

Most of the failures are just subtle differences in the way the pyarrow and Python CSV engines write things.
(One particularly annoying thing is that pyarrow quotes strings even when the quotes are technically not needed. I don't think it affects roundtripping, but it makes a bunch of the tests fail.)
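The roundtripping point can be illustrated with the standard library alone. The sketch below does not use pyarrow. It uses `csv.QUOTE_ALL` as a stand-in for "quotes that are not strictly needed" and shows that the written text differs from the minimally quoted form while both parse back to the same rows.

```python
import csv
import io


def dump(rows, quoting):
    """Write rows to a CSV string with the given quoting mode."""
    buf = io.StringIO(newline="")
    csv.writer(buf, quoting=quoting).writerows(rows)
    return buf.getvalue()


def load(text):
    """Parse a CSV string back into a list of rows."""
    return list(csv.reader(io.StringIO(text, newline="")))


rows = [["name", "city"], ["alice", "paris"]]
minimal = dump(rows, csv.QUOTE_MINIMAL)  # no quotes: nothing requires them
quoted = dump(rows, csv.QUOTE_ALL)       # every field is quoted

same_text = minimal == quoted
same_rows = load(minimal) == load(quoted) == rows
```

This is why tests that compare exact output strings fail across engines even though the data survives a write and read cycle.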


lithomas1 · 2023-08-04T18:53:50Z

pre-commit ci fix

lithomas1 · 2023-08-04T18:54:26Z

pre-commit.ci autofix

for more information, see https://pre-commit.ci

mroeschke · 2023-08-04T21:25:34Z

I don't think https://github.com/pandas-dev/pandas/pull/54171/files#r1267066476 was addressed; otherwise looks good.

mroeschke · 2023-08-08T16:41:32Z

Just a merge conflict; otherwise looks good.

mroeschke · 2023-08-10T22:26:51Z

 /home/runner/work/pandas/pandas/pandas/io/formats/csvs.py:320:17 - error: Argument of type "IO[AnyStr@_save]" cannot be assigned to parameter "csvfile" of type "SupportsWrite[str]" in function "writer"
   "IO[AnyStr@_save]" is incompatible with protocol "SupportsWrite[str]"
     "write" is an incompatible type
       Type "(__s: AnyStr@_save, /) -> int" cannot be assigned to type "(__s: _T_contra@SupportsWrite, /) -> object"
         Parameter 1: type "_T_contra@SupportsWrite" cannot be assigned to type "AnyStr@_save"
           Type "str" cannot be assigned to type "AnyStr@_save" (reportGeneralTypeIssues)

github-actions · 2023-09-18T00:05:22Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

SysuJayce · 2023-11-15T03:47:26Z

Since this feature has not been updated for a long time, I am wondering if it is still being worked on.

lithomas1 · 2023-11-15T13:18:07Z

I think there is just a typing error left that I haven't resolved; I'll add an ignore if it's still there.
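For context on the typing error quoted earlier in the thread: `csv.writer` expects an object whose `write` accepts `str`, and a type checker cannot prove that for a handle typed as `IO[AnyStr]`. One alternative to an ignore comment is an explicit `cast`, sketched below with a hypothetical `save_rows` function. This is not the code in `csvs.py`, and a cast has no runtime effect, so it is only appropriate when the handle is known to be in text mode.

```python
import csv
import io
from typing import IO, AnyStr, List, cast


def save_rows(handle: IO[AnyStr], rows: List[List[str]]) -> None:
    """Write rows as CSV to a handle that the caller guarantees is text-mode."""
    # Narrow the type for the checker only; nothing changes at runtime.
    text_handle = cast("IO[str]", handle)
    csv.writer(text_handle).writerows(rows)


buf = io.StringIO(newline="")
save_rows(buf, [["a", "b"], ["1", "2"]])
```

Compared with a blanket ignore, the cast documents the assumption at the exact place where it is made.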


mroeschke · 2023-11-23T00:13:31Z

pandas/tests/io/formats/test_to_csv.py

+
 class TestToCSV:
- def test_to_csv_with_single_column(self):
+ @xfail_pyarrow


As we've been doing with the read_csv tests recently, if a test fails due to an exception raised by pandas rather than pyarrow, we should check for that exception.

Updated most of these.

Since some tests check multiple things, I left xfail_pyarrow on some of them, because they would sometimes fail on an earlier assert due to a mismatch between the pyarrow and python output.
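The testing pattern requested here, asserting the specific exception instead of marking the whole test as an expected failure, might look roughly like the following. To stay self-contained this sketch uses a hypothetical `to_csv_stub` and a hand-rolled `assert_raises_match`. In the pandas test suite the equivalent check would normally be written with `pytest.raises(..., match=...)`.

```python
import re


def to_csv_stub(engine="python", quotechar='"'):
    """Hypothetical stand-in for a writer that rejects an unsupported option."""
    if engine == "pyarrow" and quotechar != '"':
        raise ValueError('The pyarrow engine only supports " as a quotechar.')
    return "a,b\n"


def assert_raises_match(exc_type, pattern, func, *args, **kwargs):
    """Fail unless func raises exc_type with a message matching pattern."""
    try:
        func(*args, **kwargs)
    except exc_type as err:
        assert re.search(pattern, str(err)), f"{err!s} does not match {pattern!r}"
    else:
        raise AssertionError(f"{exc_type.__name__} was not raised")


def check_quotechar(engine):
    """One test body that covers both engines explicitly."""
    if engine == "pyarrow":
        # The failure comes from our own validation, so assert on it directly.
        assert_raises_match(
            ValueError, "only supports", to_csv_stub, engine=engine, quotechar="'"
        )
        return
    assert to_csv_stub(engine=engine, quotechar="'") == "a,b\n"


check_quotechar("python")
check_quotechar("pyarrow")
```

An explicit check like this keeps failing loudly if the error message or behavior changes, which a blanket xfail would hide.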

rohanjain101 · 2024-03-12T13:45:37Z

pandas/io/formats/csvs.py

+ if self.quotechar is not None and self.quotechar != '"':
+ raise ValueError('The pyarrow engine only supports " as a quotechar.')
+
+ unsupported_options = [


Should escapechar be added here? It doesn't look like it is supported by pyarrow.csv.WriteOptions.
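A rough sketch of how such an unsupported-options check could cover escapechar as well. The contents of the real `unsupported_options` list are truncated in the diff above, so the table below is hypothetical and contains only the option under discussion. `check_pyarrow_options` is likewise an illustrative name, not the function in `csvs.py`.

```python
# Hypothetical table: option name -> default value that is considered "unset".
HYPOTHETICAL_UNSUPPORTED = {"escapechar": None}


def check_pyarrow_options(quotechar='"', **options):
    """Raise ValueError if an option is set that the pyarrow writer cannot honor."""
    if quotechar is not None and quotechar != '"':
        raise ValueError('The pyarrow engine only supports " as a quotechar.')
    for name, default in HYPOTHETICAL_UNSUPPORTED.items():
        if options.get(name, default) != default:
            raise ValueError(
                f"The {name!r} option is not supported with the pyarrow engine."
            )


check_pyarrow_options(quotechar='"', escapechar=None)  # passes silently

try:
    check_pyarrow_options(escapechar="\\")
except ValueError as err:
    message = str(err)
```

Keeping the defaults next to the option names lets the check distinguish "explicitly set" from "left at its default" without special-casing each option.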

mroeschke · 2024-04-23T17:57:40Z

Looks like this PR has gotten stale, so closing for now. Feel free to reopen when you have time to circle back.

ENH: Add arrow engine to to_csv

d0e7d86

lithomas1 added Enhancement IO CSV read_csv, to_csv Arrow pyarrow functionality labels Jul 17, 2023

lithomas1 commented Jul 17, 2023


lithomas1 mentioned this pull request Jul 17, 2023

ENH/PERF: provide pyarrow engine option for to_csv #53618

Open

mroeschke reviewed Jul 18, 2023


mroeschke reviewed Jul 18, 2023


jorisvandenbossche reviewed Jul 19, 2023


lithomas1 added 2 commits July 22, 2023 13:40

Merge branch 'main' of https://github.com/pandas-dev/pandas into arro…

4b7f880

…w-to-csv

pass more

8328120

lithomas1 added 2 commits August 3, 2023 11:13

Merge branch 'main' of https://github.com/pandas-dev/pandas into arro…

f988f0d

…w-to-csv

xfail everything

a889ebf

lithomas1 marked this pull request as ready for review August 3, 2023 18:37

lithomas1 added 4 commits August 3, 2023 12:55

revert unintentional change

1f7ffea

fix typing and tests

faeed4c

green everything?

47d48f1

Merge branch 'main' into arrow-to-csv

9a8d250

lithomas1 requested review from jorisvandenbossche and mroeschke August 4, 2023 18:54

[pre-commit.ci] auto fixes from pre-commit.com hooks

ae9f87c

for more information, see https://pre-commit.ci

move option to end

c49309c

Merge branch 'main' into arrow-to-csv

74be30c

Merge branch 'main' into arrow-to-csv

d08991c

github-actions bot added the Stale label Sep 18, 2023

Merge branch 'main' into arrow-to-csv

08d9cf5

Merge branch 'main' into arrow-to-csv

8689109

Update csvs.py

da13091

lithomas1 removed the Stale label Nov 22, 2023

lithomas1 added 3 commits November 22, 2023 12:18

Update csvs.py

6345ab5

Merge branch 'main' into arrow-to-csv

3948072

green and move whatsnew

bde1a2b