ENH: Arrow backed string array - implement factorize() method without casting to objects #38007
Changes from 12 commits
c53a3c2 b7d0ab8 154496a c545970 6e3aac8 73c7de9 42ca9c3 a251537 dbc8253 ea59c38 7d98727 0023f08 6a28414 c4db20d 88ab4f4
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | @@ -6,6 +6,7 @@ | |
| Any, | ||
| Optional, | ||
| Sequence, | ||
| Tuple, | ||
| Type, | ||
| Union, | ||
| ) | ||
| | @@ -20,6 +21,7 @@ | |
| Dtype, | ||
| NpDtype, | ||
| ) | ||
| from pandas.util._decorators import doc | ||
| from pandas.util._validators import validate_fillna_kwargs | ||
| | ||
| from pandas.core.dtypes.base import ExtensionDtype | ||
| | @@ -273,9 +275,22 @@ def __len__(self) -> int: | |
| """ | ||
| return len(self._data) | ||
| | ||
| @classmethod | ||
| def _from_factorized(cls, values, original): | ||
| return cls._from_sequence(values) | ||
| @doc(ExtensionArray.factorize) | ||
| def factorize(self, na_sentinel: int = -1) -> Tuple[np.ndarray, ExtensionArray]: | ||
| encoded = self._data.dictionary_encode() | ||
| indices = pa.chunked_array( | ||
| [c.indices for c in encoded.chunks], type=encoded.type.index_type | ||
| ).to_pandas() | ||
| if indices.dtype.kind == "f": | ||
| indices[np.isnan(indices)] = na_sentinel | ||
| indices = indices.astype(np.int64, copy=False) | ||
Member: Wondering, is the I suppose that we always return
Member Author: Refactored in 0023f08, partially to address the comments, but yes, we seem to be getting an int32 from pyarrow. Also, could we maybe work with numpy arrays directly for the indices here, instead of a pandas Series?
| | ||
| if encoded.num_chunks: | ||
| uniques = type(self)(encoded.chunk(0).dictionary) | ||
| else: | ||
| uniques = type(self)(pa.array([], type=encoded.type.value_type)) | ||
| | ||
| return indices.values, uniques | ||
| | ||
| @classmethod | ||
| def _concat_same_type(cls, to_concat) -> ArrowStringArray: | ||
| | ||
Can you do this in a try/except? (We still need to be able to run the benchmarks with slightly older pandas versions that might not have this import available.)
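One way the guarded import could look is sketched below. The import path, the benchmark class name `FactorizeArrowString`, and its methods are assumptions for illustration; the actual benchmark file is not shown here. The sketch relies on asv's convention that a benchmark is skipped when `setup` raises `NotImplementedError`.

```python
try:
    # Assumed import path; older pandas versions may not provide it.
    from pandas.core.arrays.string_arrow import ArrowStringArray
except ImportError:
    # Fall back to a marker so the module still imports cleanly.
    ArrowStringArray = None


class FactorizeArrowString:
    # Hypothetical asv benchmark class, for illustration only.

    def setup(self):
        if ArrowStringArray is None:
            # asv treats NotImplementedError in setup as "skip this benchmark".
            raise NotImplementedError
        values = ["a", "b", None, "a"] * 1000
        self.array = ArrowStringArray._from_sequence(values)

    def time_factorize(self):
        self.array.factorize()
```

With this shape, the benchmark suite still imports on a pandas version that lacks the class, and only the affected benchmark is skipped.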