Skip to content

Conversation

@simonjayhawkins
Copy link
Member

@simonjayhawkins simonjayhawkins commented Nov 22, 2020

xref #35169 (comment), follow-on to #35259

This is moreless copy/paste from https://github.com/xhochy/fletcher with a slight tidy for initial review feedback

still to do

  • benchmarking
  • return type for chunked array with more than 1 chunk
  • maybe update tests.

we don't have failing tests, but the return type should be Tuple[np.ndarray, ExtensionArray] while we have return factorize(np_array, na_sentinel=na_sentinel) if more than 1 chunk and the return type of pd.factorize is Tuple[np.ndarray, Union[np.ndarray, ABCIndex]] (mypy doesn't report this as an error since np.ndarray resolves to Any)

since we don't have failing tests, we probably should paramatrize data for the base extension tests with an array with one chunk and a array with multiple chunks.

cc @jorisvandenbossche @xhochy

@simonjayhawkins simonjayhawkins added ExtensionArray Extending pandas with custom dtypes or arrays. Strings String extension data type and string data labels Nov 22, 2020
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we probably should paramatrize data for the base extension tests with an array with one chunk and a array with multiple chunks.

That sounds as a good idea, yes (not sure how easy it is to do, since there are multiple data fixtures)

return cls._from_sequence(values)
@doc(ExtensionArray.factorize)
def factorize(self, na_sentinel: int = -1) -> Tuple[np.ndarray, ExtensionArray]:
if self._data.num_chunks == 1:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nowadays, dictionary_encode works fine for ChunkedArrays as well, so I am not sure this if statement is actually needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tooks a stab at that in fletcher to let CI verify this assumption: Seems to work with pyarrow 0.17-2.0 xhochy/fletcher#206

@jorisvandenbossche jorisvandenbossche changed the title Arrow backed string array - implement factorize() method (instead of converting to objects through _values_for_factorize) ENH: Arrow backed string array - implement factorize() method without casting to objects Nov 23, 2020
@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@jreback
Copy link
Contributor

jreback commented Dec 24, 2020

this should yield a perf improvement yes? can you add an asv for this.

@jorisvandenbossche
Copy link
Member

There is an existing benchmark in algorithms.py::Factorize.time_factorize() which has a "string" case -> can rename this to "str", and add an actual "string" dtype version.

if indices.dtype.kind == "f":
indices[np.isnan(indices)] = na_sentinel
indices = indices.astype(int)
if not is_int64_dtype(indices):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just do
indices = indices.astype(np.int64, copy=False)

indices = encoded.indices.to_pandas()
if indices.dtype.kind == "f":
indices[np.isnan(indices)] = na_sentinel
indices = indices.astype(int)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

int -> np.int64

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@jreback
Copy link
Contributor

jreback commented Feb 11, 2021

status here?

@simonjayhawkins
Copy link
Member Author

simonjayhawkins commented Feb 16, 2021

status here?

benchmarks added but pyarrow not added to env. this is OK for local testing with asv dev. I'm not sure whether we want to add pyarrow to asv env.

extensions tests updated for arrays with more than 1 chunk. good news is that we now see the factorize failures

these will be fixed by incorporating latest changes from fletcher next and other comments addressed.

@simonjayhawkins simonjayhawkins added this to the 1.3 milestone Feb 18, 2021
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

from pandas._libs import lib

import pandas as pd
from pandas.core.arrays.string_arrow import ArrowStringDtype
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you do this in a try/except? (we need to be able to still run the benchmarks with slightly older pandas version that might not have this import available)

).to_pandas()
if indices.dtype.kind == "f":
indices[np.isnan(indices)] = na_sentinel
indices = indices.astype(np.int64, copy=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering, is the int64 needed here? (pyarrow will typically use int32 as default I think)

I suppose that we always return int64 from factorize for the indices. Short-term, casting to int64 might be best then (to ensure nothing else breaks because of not doing that), but long term we should maybe check if internally we require int64 or would be fine with int32 as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering, is the int64 needed here? (pyarrow will typically use int32 as default I think)

refactor in 0023f08 partially to address comments

but yes, we seem to be getting an int32 from pyarrow

also we could maybe work with numpy arrays here directly for the indices instead of pandas Series?

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Feb 18, 2021

@simonjayhawkins did you check what difference it gives in performance for the benchmark case compared to object dtype? (just that single case, no need to run the full suite)

@simonjayhawkins
Copy link
Member Author

with changes in this PR to compare with String(Index)

[ 25.00%] ··· algorithms.Factorize.time_factorize ok [ 25.00%] ··· ======== ======= ==================== ========== unique sort dtype -------- ------- -------------------- ---------- True True int 3.66±0ms True True uint 4.55±0ms True True float 15.1±0ms True True string 59.8±0ms True True datetime64[ns] 182±0μs True True datetime64[ns, tz] 185±0μs True True Int64 5.07±0ms True True boolean 900±0μs True True string_arrow 63.4±0ms True False int 3.04±0ms True False uint 2.78±0ms True False float 2.49±0ms True False string 8.48±0ms True False datetime64[ns] 182±0μs True False datetime64[ns, tz] 183±0μs True False Int64 3.19±0ms True False boolean 603±0μs True False string_arrow 7.00±0ms False True int 8.14±0ms False True uint 8.45±0ms False True float 20.9±0ms False True string 75.7±0ms False True datetime64[ns] 10.8±0ms False True datetime64[ns, tz] 9.70±0ms False True Int64 9.70±0ms False True boolean 3.04±0ms False True string_arrow 68.8±0ms False False int 6.43±0ms False False uint 6.28±0ms False False float 8.13±0ms False False string 20.4±0ms False False datetime64[ns] 8.89±0ms False False datetime64[ns, tz] 6.56±0ms False False Int64 5.61±0ms False False boolean 2.53±0ms False False string_arrow 11.6±0ms ======== ======= ==================== ========== 

without factorize to compare with _from_factorized

[ 25.00%] ··· algorithms.Factorize.time_factorize ok [ 25.00%] ··· ======== ======= ==================== ========== unique sort dtype -------- ------- -------------------- ---------- True True int 3.25±0ms True True uint 4.36±0ms True True float 14.9±0ms True True string 59.6±0ms True True datetime64[ns] 174±0μs True True datetime64[ns, tz] 175±0μs True True Int64 3.49±0ms True True boolean 858±0μs True True string_arrow 73.9±0ms True False int 1.81±0ms True False uint 1.50±0ms True False float 2.45±0ms True False string 7.45±0ms True False datetime64[ns] 179±0μs True False datetime64[ns, tz] 174±0μs True False Int64 2.80±0ms True False boolean 584±0μs True False string_arrow 16.5±0ms False True int 8.34±0ms False True uint 9.37±0ms False True float 21.1±0ms False True string 75.0±0ms False True datetime64[ns] 10.7±0ms False True datetime64[ns, tz] 11.8±0ms False True Int64 8.35±0ms False True boolean 3.27±0ms False True string_arrow 125±0ms False False int 7.07±0ms False False uint 6.18±0ms False False float 8.71±0ms False False string 23.1±0ms False False datetime64[ns] 7.16±0ms False False datetime64[ns, tz] 6.87±0ms False False Int64 6.39±0ms False False boolean 2.95±0ms False False string_arrow 74.2±0ms ======== ======= ==================== ========== 
@simonjayhawkins
Copy link
Member Author

@jorisvandenbossche @jreback anything else to be done here?

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the ping. Can you merge latest master to be sure?

@simonjayhawkins
Copy link
Member Author

Can you merge latest master to be sure?

greenish.

@jorisvandenbossche jorisvandenbossche merged commit 7be41ca into pandas-dev:master Mar 2, 2021
@jorisvandenbossche
Copy link
Member

Thanks!

@simonjayhawkins simonjayhawkins deleted the factorize branch March 2, 2021 17:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ExtensionArray Extending pandas with custom dtypes or arrays. Strings String extension data type and string data

4 participants