ENH: Arrow backed string array - implement factorize() method without casting to objects #38007

simonjayhawkins · 2020-11-22T20:06:58Z

xref #35169 (comment), follow-on to #35259

This is moreless copy/paste from https://github.com/xhochy/fletcher with a slight tidy for initial review feedback

still to do

benchmarking
return type for chunked array with more than 1 chunk
maybe update tests.

we don't have failing tests, but the return type should be Tuple[np.ndarray, ExtensionArray] while we have return factorize(np_array, na_sentinel=na_sentinel) if more than 1 chunk and the return type of pd.factorize is Tuple[np.ndarray, Union[np.ndarray, ABCIndex]] (mypy doesn't report this as an error since np.ndarray resolves to Any)

since we don't have failing tests, we probably should paramatrize data for the base extension tests with an array with one chunk and a array with multiple chunks.

cc @jorisvandenbossche @xhochy

jorisvandenbossche

we probably should paramatrize data for the base extension tests with an array with one chunk and a array with multiple chunks.

That sounds as a good idea, yes (not sure how easy it is to do, since there are multiple data fixtures)

jorisvandenbossche · 2020-11-23T07:39:10Z

pandas/core/arrays/string_arrow.py

- return cls._from_sequence(values)
+ @doc(ExtensionArray.factorize)
+ def factorize(self, na_sentinel: int = -1) -> Tuple[np.ndarray, ExtensionArray]:
+ if self._data.num_chunks == 1:


Nowadays, dictionary_encode works fine for ChunkedArrays as well, so I am not sure this if statement is actually needed.

Tooks a stab at that in fletcher to let CI verify this assumption: Seems to work with pyarrow 0.17-2.0 xhochy/fletcher#206

github-actions · 2020-12-24T00:24:24Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

jreback · 2020-12-24T20:39:08Z

this should yield a perf improvement yes? can you add an asv for this.

jorisvandenbossche · 2020-12-28T13:50:46Z

There is an existing benchmark in algorithms.py::Factorize.time_factorize() which has a "string" case -> can rename this to "str", and add an actual "string" dtype version.

jreback · 2020-12-28T17:11:14Z

pandas/core/arrays/string_arrow.py

+ if indices.dtype.kind == "f":
+ indices[np.isnan(indices)] = na_sentinel
+ indices = indices.astype(int)
+ if not is_int64_dtype(indices):


you can just do
indices = indices.astype(np.int64, copy=False)

jreback · 2020-12-28T17:11:23Z

pandas/core/arrays/string_arrow.py

+ indices = encoded.indices.to_pandas()
+ if indices.dtype.kind == "f":
+ indices[np.isnan(indices)] = na_sentinel
+ indices = indices.astype(int)


int -> np.int64

github-actions · 2021-01-28T00:17:55Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

jreback · 2021-02-11T00:25:25Z

status here?

simonjayhawkins · 2021-02-16T11:19:23Z

status here?

benchmarks added but pyarrow not added to env. this is OK for local testing with asv dev. I'm not sure whether we want to add pyarrow to asv env.

extensions tests updated for arrays with more than 1 chunk. good news is that we now see the factorize failures

these will be fixed by incorporating latest changes from fletcher next and other comments addressed.

jorisvandenbossche

Looks good to me!

jorisvandenbossche · 2021-02-18T15:25:29Z

asv_bench/benchmarks/algorithms.py

 from pandas._libs import lib

 import pandas as pd
+from pandas.core.arrays.string_arrow import ArrowStringDtype


Can you do this in a try/except? (we need to be able to still run the benchmarks with slightly older pandas version that might not have this import available)

jorisvandenbossche · 2021-02-18T15:30:12Z

pandas/core/arrays/string_arrow.py

+ ).to_pandas()
+ if indices.dtype.kind == "f":
+ indices[np.isnan(indices)] = na_sentinel
+ indices = indices.astype(np.int64, copy=False)


Wondering, is the int64 needed here? (pyarrow will typically use int32 as default I think)

I suppose that we always return int64 from factorize for the indices. Short-term, casting to int64 might be best then (to ensure nothing else breaks because of not doing that), but long term we should maybe check if internally we require int64 or would be fine with int32 as well.

Wondering, is the int64 needed here? (pyarrow will typically use int32 as default I think)

refactor in 0023f08 partially to address comments

but yes, we seem to be getting an int32 from pyarrow

also we could maybe work with numpy arrays here directly for the indices instead of pandas Series?

jorisvandenbossche · 2021-02-18T15:37:56Z

@simonjayhawkins did you check what difference it gives in performance for the benchmark case compared to object dtype? (just that single case, no need to run the full suite)

simonjayhawkins · 2021-02-19T16:44:31Z

with changes in this PR to compare with String(Index)

[ 25.00%] ··· algorithms.Factorize.time_factorize ok [ 25.00%] ··· ======== ======= ==================== ========== unique sort dtype -------- ------- -------------------- ---------- True True int 3.66±0ms True True uint 4.55±0ms True True float 15.1±0ms True True string 59.8±0ms True True datetime64[ns] 182±0μs True True datetime64[ns, tz] 185±0μs True True Int64 5.07±0ms True True boolean 900±0μs True True string_arrow 63.4±0ms True False int 3.04±0ms True False uint 2.78±0ms True False float 2.49±0ms True False string 8.48±0ms True False datetime64[ns] 182±0μs True False datetime64[ns, tz] 183±0μs True False Int64 3.19±0ms True False boolean 603±0μs True False string_arrow 7.00±0ms False True int 8.14±0ms False True uint 8.45±0ms False True float 20.9±0ms False True string 75.7±0ms False True datetime64[ns] 10.8±0ms False True datetime64[ns, tz] 9.70±0ms False True Int64 9.70±0ms False True boolean 3.04±0ms False True string_arrow 68.8±0ms False False int 6.43±0ms False False uint 6.28±0ms False False float 8.13±0ms False False string 20.4±0ms False False datetime64[ns] 8.89±0ms False False datetime64[ns, tz] 6.56±0ms False False Int64 5.61±0ms False False boolean 2.53±0ms False False string_arrow 11.6±0ms ======== ======= ==================== ==========

without factorize to compare with _from_factorized

[ 25.00%] ··· algorithms.Factorize.time_factorize ok [ 25.00%] ··· ======== ======= ==================== ========== unique sort dtype -------- ------- -------------------- ---------- True True int 3.25±0ms True True uint 4.36±0ms True True float 14.9±0ms True True string 59.6±0ms True True datetime64[ns] 174±0μs True True datetime64[ns, tz] 175±0μs True True Int64 3.49±0ms True True boolean 858±0μs True True string_arrow 73.9±0ms True False int 1.81±0ms True False uint 1.50±0ms True False float 2.45±0ms True False string 7.45±0ms True False datetime64[ns] 179±0μs True False datetime64[ns, tz] 174±0μs True False Int64 2.80±0ms True False boolean 584±0μs True False string_arrow 16.5±0ms False True int 8.34±0ms False True uint 9.37±0ms False True float 21.1±0ms False True string 75.0±0ms False True datetime64[ns] 10.7±0ms False True datetime64[ns, tz] 11.8±0ms False True Int64 8.35±0ms False True boolean 3.27±0ms False True string_arrow 125±0ms False False int 7.07±0ms False False uint 6.18±0ms False False float 8.71±0ms False False string 23.1±0ms False False datetime64[ns] 7.16±0ms False False datetime64[ns, tz] 6.87±0ms False False Int64 6.39±0ms False False boolean 2.95±0ms False False string_arrow 74.2±0ms ======== ======= ==================== ==========

simonjayhawkins · 2021-03-02T13:06:45Z

@jorisvandenbossche @jreback anything else to be done here?

jorisvandenbossche

Thanks for the ping. Can you merge latest master to be sure?

simonjayhawkins · 2021-03-02T16:22:40Z

Can you merge latest master to be sure?

greenish.

jorisvandenbossche · 2021-03-02T16:51:48Z

Thanks!

simonjayhawkins added 3 commits November 21, 2020 19:54

moreless copy/paste from fletcher

c53a3c2

use docstring from base class

b7d0ab8

remove redundant type check

154496a

simonjayhawkins added ExtensionArray Extending pandas with custom dtypes or arrays. Strings String extension data type and string data labels Nov 22, 2020

jorisvandenbossche reviewed Nov 23, 2020

View reviewed changes

jorisvandenbossche mentioned this pull request Nov 23, 2020

Plan for a native string dtype #35169

Closed

jorisvandenbossche changed the title ~~Arrow backed string array - implement factorize() method (instead of converting to objects through _values_for_factorize)~~ ENH: Arrow backed string array - implement factorize() method without casting to objects Nov 23, 2020

github-actions bot added the Stale label Dec 24, 2020

jorisvandenbossche removed the Stale label Dec 24, 2020

jreback requested changes Dec 28, 2020

View reviewed changes

github-actions bot added the Stale label Jan 28, 2021

jorisvandenbossche removed the Stale label Jan 28, 2021

Merge remote-tracking branch 'upstream/master' into factorize

c545970

simonjayhawkins added 4 commits February 15, 2021 10:46

Merge remote-tracking branch 'upstream/master' into factorize

6e3aac8

ignore new mypy error

73c7de9

update algorithms.Factorize.time_factorize

42ca9c3

test for arrays with 2 chunks

a251537

simonjayhawkins added 4 commits February 18, 2021 11:13

Merge remote-tracking branch 'upstream/master' into factorize

dbc8253

fix failing test_factorize_equivalence

ea59c38

fix failing test_factorize_empty

7d98727

address dtype comment

0023f08

simonjayhawkins added this to the 1.3 milestone Feb 18, 2021

simonjayhawkins requested a review from jorisvandenbossche February 18, 2021 15:16

simonjayhawkins requested a review from jreback February 18, 2021 15:16

jorisvandenbossche reviewed Feb 18, 2021

View reviewed changes

simonjayhawkins added 2 commits February 19, 2021 16:46

move ArrowStringDtype import inside try/except

6a28414

Merge remote-tracking branch 'upstream/master' into factorize

c4db20d

jorisvandenbossche approved these changes Mar 2, 2021

View reviewed changes

Merge remote-tracking branch 'upstream/master' into factorize

88ab4f4

jorisvandenbossche merged commit 7be41ca into pandas-dev:master Mar 2, 2021

simonjayhawkins deleted the factorize branch March 2, 2021 17:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

ENH: Arrow backed string array - implement factorize() method without casting to objects #38007

ENH: Arrow backed string array - implement factorize() method without casting to objects #38007

Uh oh!

simonjayhawkins commented Nov 22, 2020 •

edited

Loading

jorisvandenbossche left a comment

jorisvandenbossche Nov 23, 2020

xhochy Nov 23, 2020

github-actions bot commented Dec 24, 2020

jreback commented Dec 24, 2020

jorisvandenbossche commented Dec 28, 2020

jreback Dec 28, 2020

jreback Dec 28, 2020

github-actions bot commented Jan 28, 2021

jreback commented Feb 11, 2021

simonjayhawkins commented Feb 16, 2021 •

edited

Loading

jorisvandenbossche left a comment

jorisvandenbossche Feb 18, 2021

jorisvandenbossche Feb 18, 2021

simonjayhawkins Feb 18, 2021

jorisvandenbossche commented Feb 18, 2021 •

edited

Loading

simonjayhawkins commented Feb 19, 2021

simonjayhawkins commented Mar 2, 2021

jorisvandenbossche left a comment

simonjayhawkins commented Mar 2, 2021

jorisvandenbossche commented Mar 2, 2021

Labels

4 participants

Uh oh!

ENH: Arrow backed string array - implement factorize() method without casting to objects #38007

ENH: Arrow backed string array - implement factorize() method without casting to objects #38007

Uh oh!

Conversation

simonjayhawkins commented Nov 22, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

jorisvandenbossche left a comment

Choose a reason for hiding this comment

jorisvandenbossche Nov 23, 2020

Choose a reason for hiding this comment

xhochy Nov 23, 2020

Choose a reason for hiding this comment

github-actions bot commented Dec 24, 2020

jreback commented Dec 24, 2020

jorisvandenbossche commented Dec 28, 2020

jreback Dec 28, 2020

Choose a reason for hiding this comment

jreback Dec 28, 2020

Choose a reason for hiding this comment

github-actions bot commented Jan 28, 2021

jreback commented Feb 11, 2021

simonjayhawkins commented Feb 16, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

jorisvandenbossche left a comment

Choose a reason for hiding this comment

jorisvandenbossche Feb 18, 2021

Choose a reason for hiding this comment

jorisvandenbossche Feb 18, 2021

Choose a reason for hiding this comment

simonjayhawkins Feb 18, 2021

Choose a reason for hiding this comment

jorisvandenbossche commented Feb 18, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

simonjayhawkins commented Feb 19, 2021

simonjayhawkins commented Mar 2, 2021

jorisvandenbossche left a comment

Choose a reason for hiding this comment

simonjayhawkins commented Mar 2, 2021

jorisvandenbossche commented Mar 2, 2021

Labels

4 participants

simonjayhawkins commented Nov 22, 2020 •

edited

Loading

simonjayhawkins commented Feb 16, 2021 •

edited

Loading

jorisvandenbossche commented Feb 18, 2021 •

edited

Loading