-
- Notifications
You must be signed in to change notification settings - Fork 19.4k
BUG: Add fillna at the beginning of _where not to fill NA. #60729 #60772
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 4 commits
bc9a942 558569f bbbc720 e2f32cb 6fd8986 d2d5f62 475f2d1 db30b58 55fe420 cb94cf7 71e442e b6bd3af 8bac997 89bc1b4 f154cf5 eed6121 9ac81f0 7e3fd3a 8c5ffff 5516517 2437ce2 9556aa4 b64b8a7 9574746 c073c0b 4eea08e bbc5612 0851593 98fb602 915b8a7 7611f59 88b0530 044d0a9 601f6c9 4528a54 acf805a 3239d2f f7a1e5a f3ed54d 16dda53 dfa96ac e557c05 7e0d4e0 85f35e9 e2df2d0 0526b02 ffd15b7 17b83cb 350b75d e7f8fa7 506f39d 8b0c385 23934ec 9cd462e File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | @@ -9674,6 +9674,13 @@ def _where( | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| if axis is not None: | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| axis = self._get_axis_number(axis) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| # We should not be filling NA. See GH#60729 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| # make sure we are boolean | |
| fill_value = bool(inplace) | |
| cond = cond.fillna(fill_value) | |
| cond = cond.infer_objects() |
Could you explain in more detail what you mean by propagate for all type? Do you mean we need to keep NA as it is even after this line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @WillAyd,
I've done some further investigations on this, but I still belive the current code is the simplest way to make the missing values propagate.
If we want to let NA propagate without calling fillna() here, there might be too many code changes needed. See below codes :
- Need to change the below code so that we don't fill the missing values when caller is
where()ormask(). If we don't,fillna()will fill them withinplace.
Lines 9695 to 9698 in f1441b2
| # make sure we are boolean | |
| fill_value = bool(inplace) | |
| cond = cond.fillna(fill_value) | |
| cond = cond.infer_objects() |
- Need to change the below code as well since
to_numpy()will fill the missing value usinginplacewhen cond is a DataFrame.
Lines 9703 to 9716 in f1441b2
| if not isinstance(cond, ABCDataFrame): | |
| # This is a single-dimensional object. | |
| if not is_bool_dtype(cond): | |
| raise TypeError(msg.format(dtype=cond.dtype)) | |
| else: | |
| for _dt in cond.dtypes: | |
| if not is_bool_dtype(_dt): | |
| raise TypeError(msg.format(dtype=_dt)) | |
| if cond._mgr.any_extension_types: | |
| # GH51574: avoid object ndarray conversion later on | |
| cond = cond._constructor( | |
| cond.to_numpy(dtype=bool, na_value=fill_value), | |
| **cond._construct_axes_dict(), | |
| ) |
- Since
extract_bool_array()fills the missing values using argna_value=FalseatEABackedBlock.where(), we might need to find every single NA index from cond before we call this function(using isna() for example) and then implement additional behaviour to make those values propagate atExtensionArray._where().
pandas/pandas/core/internals/blocks.py
Lines 1664 to 1668 in f1441b2
| def where(self, other, cond) -> list[Block]: | |
| arr = self.values.T | |
| cond = extract_bool_array(cond) | |
pandas/pandas/core/array_algos/putmask.py
Lines 116 to 127 in f1441b2
| def extract_bool_array(mask: ArrayLike) -> npt.NDArray[np.bool_]: | |
| """ | |
| If we have a SparseArray or BooleanArray, convert it to ndarray[bool]. | |
| """ | |
| if isinstance(mask, ExtensionArray): | |
| # We could have BooleanArray, Sparse[bool], ... | |
| # Except for BooleanArray, this is equivalent to just | |
| # np.asarray(mask, dtype=bool) | |
| mask = mask.to_numpy(dtype=bool, na_value=False) | |
| mask = np.asarray(mask, dtype=bool) | |
| return mask |
If _where() is trying to fill the missing values for cond anyway, I think we don't necessarily have to disfavour the current code change. Could you give me some feedback?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this trying to fill missing values when NaN is the missing value indicator? I don't think that is right either - the missing values should propogate for all types.
By filling in the missing values on cond with True, the missing value in the caller propagates. It's not filling in this missing values on cond that then fails to properly propagate the caller's missing value.
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also do fillna on L9704 below. Can these be combined?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, what other types besides ndarray and NDFrame get here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, what other types besides ndarray and NDFrame get here?
Hi @rhshadrach, thanks for the review!
I've tried to find if there are any other types that possibly can get here, but I couldn't find any.
According to the documentation, cond should be one of these : bool Series/DataFrame, array-like, or callable.
And array-like such as list/tuple would be converted to NDFrame/np.ndarray via below codes.
In case we input list/tuple to mask():
or In case we input callable(a function that returns list or tuple) to mask():
(pandas/core/generic.py > NDFrame.mask())
Lines 10096 to 10098 in 57fd502
| # see gh-21891 | |
| if not hasattr(cond, "__invert__"): | |
| cond = np.array(cond) |
In case we input list/tuple to where():
or In case we input scalar to 'mask()' or 'where():
or In case we input callable(a function that returns list or tuple) to where():
(pandas/core/generic.py > NDFrame._where())
Lines 9712 to 9717 in 57fd502
| else: | |
| if not hasattr(cond, "shape"): | |
| cond = np.asanyarray(cond) | |
| if cond.shape != self.shape: | |
| raise ValueError("Array conditional must be same shape as self") | |
| cond = self._constructor(cond, **self._construct_axes_dict(), copy=False) |
Please let me know if I'm missing anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @rhshadrach, sorry for the confusion. I just realized that cond could be either list or tuple when we input a callable to where(). Will revise the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rhshadrach,
I've tried to combine this code with fillna(inplace) at L9732 as you said, but it seems this would result in some test failures since align() at L9722 sometimes returns an ndarray with full of np.nan, and then cond is supposed to be filled with inplace(=False) by L9732. And several tests is current expecting this behaviour as it is. For example, at tests.frame.indexing.test_mask.test_mask_stringdtype[Series] :
tests.frame.indexing.test_mask.test_mask_stringdtype[Series]
def test_mask_stringdtype(frame_or_series): # GH 40824 obj = DataFrame( {"A": ["foo", "bar", "baz", NA]}, index=["id1", "id2", "id3", "id4"], dtype=StringDtype(), ) filtered_obj = DataFrame( {"A": ["this", "that"]}, index=["id2", "id3"], dtype=StringDtype() ) expected = DataFrame( {"A": [NA, "this", "that", NA]}, index=["id1", "id2", "id3", "id4"], dtype=StringDtype(), ) if frame_or_series is Series: obj = obj["A"] filtered_obj = filtered_obj["A"] expected = expected["A"] filter_ser = Series([False, True, True, False]) result = obj.mask(filter_ser, filtered_obj) tm.assert_equal(result, expected)result >>> id1 foo id2 bar id3 baz id4 <NA> Name: A, dtype: string expected >>> id1 <NA> id2 this id3 that id4 <NA> Name: A, dtype: stringI suspect the behaviour of align() at L9722 is not desirable because current code will re-initialize the cond for these cases and fill cond with False regardless of cond given by users as below.
obj >>> id1 foo id2 bar id3 baz id4 <NA> Name: A, dtype: string filtered_obj >>> id2 this id3 that Name: A, dtype: string filter_ser = pd.Series([False, True, True, False]) filter_ser_2 = pd.Series([False, False, False, False]) filter_ser_3 = pd.Series([True, True, True, True]) result = obj.mask(filter_ser, filtered_obj) # Should return ["foo", "this", "that", pd.NA]. But this test is currently expecthing to be [pd.NA, "this", "that", pd.NA] result >>> id1 <NA> id2 this id3 that id4 <NA> Name: A, dtype: string result_2 = obj.mask(filter_ser_2, filtered_obj) # Should return ["foo", "bar", "baz", pd.NA] result_2 >>> id1 <NA> id2 this id3 that id4 <NA> Name: A, dtype: string result_3 = obj.mask(filter_ser_3, filtered_obj) # Should reutrn ["pd.NA, "this", "that", pd.NA] result_3 >>> id1 <NA> id2 this id3 that id4 <NA> Name: A, dtype: stringI think I'd better open another issue regarding this, but for now, I suppose we'd best to leave fillna() as it is, not combining with the below one. Could you please let me know what you think about this?
failing tests
tests.frame.indexing.test_getitem.TestGetitemBooleanMask.test_getitem_boolean_series_with_duplicate_columns
tests.frame.indexing.test_indexing.TestDataFrameIndexing.test_setitem_cast
tests.frame.indexing.test_mask.test_mask_stringdtype[Series]
tests.frame.indexing.test_mask.test_mask_where_dtype_timedelta
tests.frame.indexing.test_where.test_where_bool_comparison
tests.frame.indexing.test_where.test_where_string_dtype[Series]
tests.frame.indexing.test_where.TestDataFrameIndexingWhere.test_where_alignment[float_string]
tests.frame.indexing.test_where.TestDataFrameIndexingWhere.test_where_alignment[mixed_int]
tests.frame.indexing.test_where.TestDataFrameIndexingWhere.test_where_bug
tests.frame.indexing.test_where.TestDataFrameIndexingWhere.test_where_invalid
tests.frame.indexing.test_where.TestDataFrameIndexingWhere.test_where_ndframe_align
tests.frame.indexing.test_where.TestDataFrameIndexingWhere.test_where_none
tests.indexing.multiindex.test_setitem.TestMultiIndexSetItem.test_frame_getitem_setitem_multislice
tests.indexing.test_indexing.TestMisc.test_no_reference_cycle
tests.series.indexing.test_mask.test_mask_casts
tests.series.indexing.test_where.test_where_error
tests.series.indexing.test_where.test_where_setitem_invalid
Uh oh!
There was an error while loading. Please reload this page.