BUG: Add fillna at the beginning of _where not to fill NA. #60729 #60772
base: main
pandas/core/generic.py Outdated
| if axis is not None: | ||
| axis = self._get_axis_number(axis) | ||
| | ||
| # We should not be filling NA. See GH#60729 |
Is this trying to fill missing values when NaN is the missing value indicator? I don't think that is right either - the missing values should propagate for all types. We may just be missing coverage for the NaN case (which should be added to the test).
Thanks for the feedback, @WillAyd .
I thought we could make the values propagate by filling cond with True, since _where() ultimately keeps the values of self wherever cond is True.
Even if I don't fill those values here, _where would call fillna() using inplace in the code below. That's also why the result varies depending on whether inplace=True or not.
Lines 9695 to 9698 in e3b2de8
| # make sure we are boolean | |
| fill_value = bool(inplace) | |
| cond = cond.fillna(fill_value) | |
| cond = cond.infer_objects() |
Could you explain in more detail what you mean by propagate for all types? Do you mean we need to keep NA as it is even after this line?
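To make the inplace dependence concrete, here is a minimal sketch (not the pandas implementation) of what the quoted fill_value = bool(inplace) line implies for a cond that contains NA. The Series and variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical condition with one missing entry.
cond = pd.Series([True, pd.NA, False], dtype="boolean")

# Mirrors `fill_value = bool(inplace)` followed by `cond.fillna(fill_value)`.
filled_not_inplace = cond.fillna(bool(False))  # inplace=False
filled_inplace = cond.fillna(bool(True))       # inplace=True

# The NA slot resolves to a different boolean in each case.
print(filled_not_inplace.tolist())
print(filled_inplace.tolist())
```

The second element is False in one case and True in the other, which is consistent with the result differing between inplace=True and inplace=False.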
Hi @WillAyd,
I've done some further investigation on this, but I still believe the current code is the simplest way to make the missing values propagate.
If we want to let NA propagate without calling fillna() here, too many code changes might be needed. See the code below:
- Need to change the code below so that we don't fill the missing values when the caller is where() or mask(). If we don't, fillna() will fill them with inplace.
Lines 9695 to 9698 in f1441b2
| # make sure we are boolean | |
| fill_value = bool(inplace) | |
| cond = cond.fillna(fill_value) | |
| cond = cond.infer_objects() |
- Need to change the code below as well, since to_numpy() will fill the missing values using inplace when cond is a DataFrame.
Lines 9703 to 9716 in f1441b2
| if not isinstance(cond, ABCDataFrame): | |
| # This is a single-dimensional object. | |
| if not is_bool_dtype(cond): | |
| raise TypeError(msg.format(dtype=cond.dtype)) | |
| else: | |
| for _dt in cond.dtypes: | |
| if not is_bool_dtype(_dt): | |
| raise TypeError(msg.format(dtype=_dt)) | |
| if cond._mgr.any_extension_types: | |
| # GH51574: avoid object ndarray conversion later on | |
| cond = cond._constructor( | |
| cond.to_numpy(dtype=bool, na_value=fill_value), | |
| **cond._construct_axes_dict(), | |
| ) |
- Since extract_bool_array() fills the missing values using the argument na_value=False at EABackedBlock.where(), we might need to find every NA index in cond before we call this function (using isna(), for example) and then implement additional behaviour to make those values propagate at ExtensionArray._where().
pandas/pandas/core/internals/blocks.py
Lines 1664 to 1668 in f1441b2
| def where(self, other, cond) -> list[Block]: | |
| arr = self.values.T | |
| cond = extract_bool_array(cond) | |
pandas/pandas/core/array_algos/putmask.py
Lines 116 to 127 in f1441b2
| def extract_bool_array(mask: ArrayLike) -> npt.NDArray[np.bool_]: | |
| """ | |
| If we have a SparseArray or BooleanArray, convert it to ndarray[bool]. | |
| """ | |
| if isinstance(mask, ExtensionArray): | |
| # We could have BooleanArray, Sparse[bool], ... | |
| # Except for BooleanArray, this is equivalent to just | |
| # np.asarray(mask, dtype=bool) | |
| mask = mask.to_numpy(dtype=bool, na_value=False) | |
| mask = np.asarray(mask, dtype=bool) | |
| return mask |
If _where() is trying to fill the missing values for cond anyway, I think we don't necessarily have to disfavour the current code change. Could you give me some feedback?
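As a rough illustration of the last bullet, the sketch below calls to_numpy(dtype=bool, na_value=False) on a nullable boolean array, the same call that appears in the quoted extract_bool_array(). The array and variable names are hypothetical, and recording the NA positions with isna() beforehand is only one possible way to keep that information.

```python
import pandas as pd

# Hypothetical nullable-boolean mask with one missing entry.
mask = pd.array([True, pd.NA, False], dtype="boolean")

# Record where the mask is missing before the information is lost.
na_positions = mask.isna()

# Same conversion as in the quoted extract_bool_array(): NA becomes False.
as_bool = mask.to_numpy(dtype=bool, na_value=False)

print(as_bool)
print(na_positions)
```

After the conversion, the NA slot is indistinguishable from a genuine False, so any propagation logic would have to rely on something like na_positions.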
Is this trying to fill missing values when NaN is the missing value indicator? I don't think that is right either - the missing values should propagate for all types.
By filling in the missing values on cond with True, the missing value in the caller propagates. It is the failure to fill in the missing values on cond that prevents the caller's missing value from propagating properly.
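A small sketch of this argument, using the public where() API rather than the internal _where(): the data mirrors the Series in the PR's test_mask_na test, while the comparison, the replacement value -99, and the variable names are assumptions made for illustration. The NA entries in cond are filled explicitly before the call, so the example does not depend on how pandas itself treats NA in cond.

```python
import pandas as pd

caller = pd.Series([None, 1, 2, None, 3, 4, None], dtype="Int64")

# Comparing a nullable Series yields NA wherever the caller is NA.
cond = caller > 2

# Fill NA in cond with True: where() keeps the caller's value when cond is True.
filled = cond.fillna(True).to_numpy(dtype=bool)

result = caller.where(filled, -99)
print(result)
```

The positions that were NA in the caller are kept as NA in the result, while only the entries that failed the comparison are replaced.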
Co-authored-by: WillAyd <will_ayd@innobi.io>
| FYI, it seems this has already been discussed at #53124 (comment) |
Co-authored-by: Xiao Yuan <yuanx749@gmail.com>
rhshadrach left a comment
I think this looks like the right approach, but a question on if we can simplify here.
pandas/core/generic.py Outdated
| if isinstance(cond, np.ndarray): | ||
| cond = np.array(cond) | ||
| cond[np.isnan(cond)] = True | ||
| elif isinstance(cond, NDFrame): | ||
| cond = cond.fillna(True) |
We also do fillna on L9704 below. Can these be combined?
Also, what other types besides ndarray and NDFrame get here?
Also, what other types besides ndarray and NDFrame get here?
Hi @rhshadrach, thanks for the review!
I tried to find other types that could reach this point, but I couldn't find any.
According to the documentation, cond should be one of these: bool Series/DataFrame, array-like, or callable.
Array-likes such as list/tuple are converted to NDFrame/np.ndarray by the code below.
When a list/tuple is passed to mask(),
or when a callable (a function that returns a list or tuple) is passed to mask():
(pandas/core/generic.py > NDFrame.mask())
Lines 10096 to 10098 in 57fd502
| # see gh-21891 | |
| if not hasattr(cond, "__invert__"): | |
| cond = np.array(cond) |
When a list/tuple is passed to where(),
or when a scalar is passed to mask() or where(),
or when a callable (a function that returns a list or tuple) is passed to where():
(pandas/core/generic.py > NDFrame._where())
Lines 9712 to 9717 in 57fd502
| else: | |
| if not hasattr(cond, "shape"): | |
| cond = np.asanyarray(cond) | |
| if cond.shape != self.shape: | |
| raise ValueError("Array conditional must be same shape as self") | |
| cond = self._constructor(cond, **self._construct_axes_dict(), copy=False) |
Please let me know if I'm missing anything.
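The two hasattr checks quoted above can be sketched in isolation as follows. normalize_cond is a hypothetical helper written only for this illustration; it is not a pandas function and it omits the shape validation and the self._constructor() step.

```python
import numpy as np


def normalize_cond(cond):
    """Hypothetical helper mirroring the two quoted hasattr checks."""
    # mask(): objects without __invert__ (e.g. list, tuple) become ndarrays.
    if not hasattr(cond, "__invert__"):
        cond = np.array(cond)
    # _where(): anything still lacking a shape (e.g. a scalar) becomes an ndarray.
    if not hasattr(cond, "shape"):
        cond = np.asanyarray(cond)
    return cond


for value in ([True, False], (True, False), True):
    converted = normalize_cond(value)
    print(type(value).__name__, "->", type(converted).__name__, converted.shape)
```

In this sketch a list or tuple is caught by the first check, while a plain Python bool (which does define __invert__) is only caught by the second.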
Hi @rhshadrach, sorry for the confusion. I just realized that cond could be either list or tuple when we input a callable to where(). Will revise the code.
@rhshadrach,
I've tried to combine this code with fillna(inplace) at L9732 as you said, but this results in some test failures, since align() at L9722 sometimes returns an ndarray full of np.nan, and cond is then supposed to be filled with inplace (=False) by L9732. Several tests currently expect this behaviour as it is. For example, tests.frame.indexing.test_mask.test_mask_stringdtype[Series]:
tests.frame.indexing.test_mask.test_mask_stringdtype[Series]
def test_mask_stringdtype(frame_or_series):
    # GH 40824
    obj = DataFrame(
        {"A": ["foo", "bar", "baz", NA]},
        index=["id1", "id2", "id3", "id4"],
        dtype=StringDtype(),
    )
    filtered_obj = DataFrame(
        {"A": ["this", "that"]}, index=["id2", "id3"], dtype=StringDtype()
    )
    expected = DataFrame(
        {"A": [NA, "this", "that", NA]},
        index=["id1", "id2", "id3", "id4"],
        dtype=StringDtype(),
    )
    if frame_or_series is Series:
        obj = obj["A"]
        filtered_obj = filtered_obj["A"]
        expected = expected["A"]

    filter_ser = Series([False, True, True, False])
    result = obj.mask(filter_ser, filtered_obj)

    tm.assert_equal(result, expected)

result
>>>
id1     foo
id2     bar
id3     baz
id4    <NA>
Name: A, dtype: string

expected
>>>
id1    <NA>
id2    this
id3    that
id4    <NA>
Name: A, dtype: string

I suspect the behaviour of align() at L9722 is not desirable, because the current code re-initializes cond in these cases and fills it with False regardless of the cond the user passed, as shown below.

obj
>>>
id1     foo
id2     bar
id3     baz
id4    <NA>
Name: A, dtype: string

filtered_obj
>>>
id2    this
id3    that
Name: A, dtype: string

filter_ser = pd.Series([False, True, True, False])
filter_ser_2 = pd.Series([False, False, False, False])
filter_ser_3 = pd.Series([True, True, True, True])

result = obj.mask(filter_ser, filtered_obj)
# Should return ["foo", "this", "that", pd.NA], but this test currently expects [pd.NA, "this", "that", pd.NA]
result
>>>
id1    <NA>
id2    this
id3    that
id4    <NA>
Name: A, dtype: string

result_2 = obj.mask(filter_ser_2, filtered_obj)
# Should return ["foo", "bar", "baz", pd.NA]
result_2
>>>
id1    <NA>
id2    this
id3    that
id4    <NA>
Name: A, dtype: string

result_3 = obj.mask(filter_ser_3, filtered_obj)
# Should return [pd.NA, "this", "that", pd.NA]
result_3
>>>
id1    <NA>
id2    this
id3    that
id4    <NA>
Name: A, dtype: string

I think I'd better open another issue regarding this, but for now I suppose it would be best to leave fillna() as it is and not combine it with the one below. Could you please let me know what you think about this?
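The alignment effect described above can be reproduced in isolation with the public Series.align(). This is a sketch under one assumption: that the align() call at L9722 is a right join of cond against the caller, which the thread does not show. The data mirrors the Series case of the test quoted above.

```python
import pandas as pd

obj = pd.Series(
    ["foo", "bar", "baz", pd.NA],
    index=["id1", "id2", "id3", "id4"],
    dtype="string",
)

# cond carries a default RangeIndex (0..3), which shares no labels with obj.
filter_ser = pd.Series([False, True, True, False])

# Assumption: right join against the caller, standing in for the internal align().
aligned = filter_ser.align(obj, join="right")[0]

print(aligned)
print(aligned.isna().all())
```

Because no labels match, every entry of the aligned cond is missing, so a later fill with False discards whatever the user passed in filter_ser.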
failing tests
tests.frame.indexing.test_getitem.TestGetitemBooleanMask.test_getitem_boolean_series_with_duplicate_columns
tests.frame.indexing.test_indexing.TestDataFrameIndexing.test_setitem_cast
tests.frame.indexing.test_mask.test_mask_stringdtype[Series]
tests.frame.indexing.test_mask.test_mask_where_dtype_timedelta
tests.frame.indexing.test_where.test_where_bool_comparison
tests.frame.indexing.test_where.test_where_string_dtype[Series]
tests.frame.indexing.test_where.TestDataFrameIndexingWhere.test_where_alignment[float_string]
tests.frame.indexing.test_where.TestDataFrameIndexingWhere.test_where_alignment[mixed_int]
tests.frame.indexing.test_where.TestDataFrameIndexingWhere.test_where_bug
tests.frame.indexing.test_where.TestDataFrameIndexingWhere.test_where_invalid
tests.frame.indexing.test_where.TestDataFrameIndexingWhere.test_where_ndframe_align
tests.frame.indexing.test_where.TestDataFrameIndexingWhere.test_where_none
tests.indexing.multiindex.test_setitem.TestMultiIndexSetItem.test_frame_getitem_setitem_multislice
tests.indexing.test_indexing.TestMisc.test_no_reference_cycle
tests.series.indexing.test_mask.test_mask_casts
tests.series.indexing.test_where.test_where_error
tests.series.indexing.test_where.test_where_setitem_invalid
| | ||
| def test_mask_na(): | ||
| # We should not be filling pd.NA. See GH#60729 | ||
| series = Series([None, 1, 2, None, 3, 4, None], dtype=Int64Dtype()) |
Can you also add a test for arrow. Can parametrize with e.g.
@pytest.mark.parametrize("dtype", ["Int64", "int64[pyarrow]"])
Thanks, just added a test for pyarrow.
Co-authored by: rhshadrach <rhshadrach@gmail.com>
rhshadrach left a comment
Looking good!
pandas/core/generic.py Outdated
| cond[isna(cond)] = True | ||
| elif isinstance(cond, NDFrame): | ||
| cond = cond.fillna(True) | ||
| elif isinstance(cond, (list, tuple)): |
The docs for where state that cond can be "list-like", so we should be using is_list_like instead of this condition. However, can you instead move this section so that it's combined with the if block on L9714 immediately below.
I've changed the code, but it seems this causes test failures. I'll convert the status to draft and revise the code.
Just confirmed that all the tests passed. Thanks!
| from pandas import ( | ||
| Series, | ||
| ) |
Can you revert this change?
Code reverted. Thanks!
| if dtype == "int64[pyarrow]": | ||
| pytest.importorskip("pyarrow") |
Instead, can you change the parametrization to be:
pytest.param("int64[pyarrow]", marks=td.skip_if_no("pyarrow"))
Thanks for the suggestion. I just changed the code.
| Many thanks for your feedback, @rhshadrach . All the changes you requested are now reflected. Could you review the changes? |
pandas/core/generic.py Outdated
| if not hasattr(cond, "shape"): | ||
| cond = np.asanyarray(cond) | ||
| else: | ||
| cond = np.array(cond) |
| cond = np.array(cond) | |
| cond = extract_array(cond, extract_numpy=True) |
Could we avoid the copy here somehow?
I think we might be able to call fillna(True) right below the self._constructor() since cond will become an NDFrame there.
Hi @mroeschke, I confirmed that all the checks passed. Could you review the code change?
Co-authored-by: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>
| Hi @mroeschke , Thanks for the suggestion! But it seems some tests fail when we changed the code |
| # align the cond to same shape as myself | ||
| cond = common.apply_if_callable(cond, self) | ||
| if isinstance(cond, NDFrame): | ||
| cond = cond.fillna(True) |
Yea, that makes sense to me. But I believe that's only necessary to fix the bug mentioned above and not necessarily for this PR. If that is the case, I think it should be handled separately.
pandas/core/generic.py Outdated
| if all( | ||
| x is NA or isinstance(x, (np.bool_, bool)) or x is np.nan | ||
| for x in cond.flatten() | ||
| ): |
I'm not following why cond = cond.astype(bool) would be incorrect here. bool(np.nan) gives True, and so while users may be surprised by pandas replacing np.nan in such a case, I believe it is technically correct. pd.NA on the other hand doesn't allow conversion to Boolean and propagates in comparisons (unlike nan), so I think we should special case this.
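The difference between np.nan and pd.NA described here can be checked directly. This is a minimal sketch with made-up variable names; it only exercises plain bool(), comparison, and ndarray.astype(bool), not the _where() code path.

```python
import numpy as np
import pandas as pd

# np.nan is truthy, so an object array containing it casts cleanly to bool.
nan_cond = np.array([True, np.nan, False], dtype=object)
nan_as_bool = nan_cond.astype(bool)
print(bool(np.nan), nan_as_bool)

# pd.NA refuses conversion to a Python bool.
try:
    bool(pd.NA)
    na_raised = False
except TypeError:
    na_raised = True
print(na_raised)

# pd.NA propagates through comparisons, whereas np.nan compares as False.
print(pd.NA == 1, np.nan == 1)
```

This is why a blanket astype(bool) is workable for NaN but pd.NA needs to be special-cased, as the comment suggests.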
pandas/core/generic.py Outdated
| else: | ||
| if not hasattr(cond, "shape"): | ||
| cond = np.asanyarray(cond) | ||
| cond = np.asanyarray(cond, dtype=object) |
no way to avoid this?
pandas/core/generic.py Outdated
| # see gh-21891 | ||
| if not hasattr(cond, "__invert__"): | ||
| cond = np.array(cond) | ||
| cond = np.array(cond, dtype=object) |
again, no way to avoid this?
pandas/core/generic.py Outdated
| | ||
| if isinstance(cond, np.ndarray): | ||
| if all( | ||
| x is NA or isinstance(x, (np.bool_, bool)) or x is np.nan |
lib.is_bool for the isinstance checks
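A sketch of what this suggestion could look like, using pandas.api.types.is_bool, which I believe is the publicly exposed counterpart of lib.is_bool. The helper name only_bools_or_missing is hypothetical and is not part of the PR.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_bool


def only_bools_or_missing(values):
    """Hypothetical check: every element is a boolean, pd.NA, or np.nan."""
    return all(is_bool(x) or x is pd.NA or x is np.nan for x in values)


print(only_bools_or_missing([True, np.bool_(False), pd.NA, np.nan]))
print(only_bools_or_missing([True, 1]))
```

is_bool accepts both Python bool and np.bool_ but rejects integers, so it replaces the isinstance(x, (np.bool_, bool)) test in the diff with a single call.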
| My intuition here is that we want |
| Sorry for the late response, @jbrockmendel . Do you mean we have to raise when |
That's what I would do, yes |
… into add-mask-fillna
doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.
Added fillna at the beginning of _where so that we can fill pd.NA.
Since this is my first PR, please correct me if I'm mistaken. Thanks!