-
- Notifications
You must be signed in to change notification settings - Fork 19.3k
ENH: Provide an errors parameter to fillna #15653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| Try it out yourself: pd.DataFrame({'a': [1,2,np.nan], 'b': [np.nan, 2, 3]}).fillna({'a': 'foo', 'b': 'bar'}, errors='raise') pd.DataFrame({'a': [1,2,np.nan], 'b': [np.nan, 2, 3]}).fillna({'a': 'foo', 'b': 'bar'}, errors='force') pd.DataFrame({'a': [1,2,np.nan], 'b': [np.nan, 2, 3]}).fillna({'a': 'foo', 'b': 'bar'}, errors='ignore')
|
| @jreback Two API questions:
|
| This fails seemingly every
|
| Yeah, I think |
pandas/core/generic.py Outdated
| limit=None, downcast=None, errors='force'): | ||
| inplace = validate_bool_kwarg(inplace, 'inplace') | ||
| | ||
| from pandas import Series |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you won't do any of this here, this will all be done in pandas\core\missing.py
| if errors == 'force' or errors == 'raise': | ||
| # we cannot coerce the underlying object, so | ||
| # make an ObjectBlock | ||
| return self.to_object_block(mgr=mgr).fillna(original_value, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here, this should all be done in missing
pandas/types/missing.py Outdated
| return np.nan | ||
| | ||
| | ||
| def valid_fill_value(value, dtype, is_period=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-> validate_fill_value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This just returns a bool, the action verb validate implies to me raising an error or processing the data somehow. Like in e.g. maybe_upcast. I can rename this, but I suggest reserving validate_fill_value as the name for the fillna sub-impl I'm moving to core/missing.py per your review.
pandas/types/missing.py Outdated
| ---------- | ||
| value : scalar | ||
| dtype: string / dtype | ||
| is_period : bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
take this parameter out, you can skip period validation for now if its not working
pandas/core/missing.py Outdated
| return result | ||
| | ||
| | ||
| def maybe_fill(obj, value, errors): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is better as validate_fill_value, just raising if its an error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would work if we had coerce and raise modes only, but with the addition of ignore we have to flag the non-conformal column so that the fill routine can skip it. Hence why I modified this method to return a bool.
How else would you implement ignore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cleanest way I can think of is writing a separate maybe_ignore method which factors that bit of logic out. But, you will have to run a is_valid_fill_value operation twice. Maybe that's OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a non-idempotent (probably not the right word for this) function that either raises or returns a flag depending on the mode.
For the fillna use-case alone one way to avoid this combination would be to add an if-else to the method body, and split this logic into two separate methods in missing (so e.g. if mode is errors do validate_fill_value, otherwise do continue_to_fill).
This would be fine for fillna alone. But, I'm trying to keep an eye out also for extending this raise/ignore/coerce stuff to other methods as well in some later PRs (#15533), and obviously you can't go about inserting this same if-else block everywhere all willy-nilly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
having a function that takes an argument, then returns (with possibly modified), and potentially raises is just fine as well. You can always raise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
validation functions on the other hand will generally just return None OR raise. returning a boolean is a bit odd actually. You want simple as possible for things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback I agree with you about the function signature, will do.
The trouble is that we need some mechanism for telling the underlying function (fillna) to short-circuit. So we would need e.g.
try: missing.validate_fill_value(self, value) except ValueError: if errors == 'ignore': returnI don't know of any way of implementing that in a separate function without passing a flag somewhere. If we want to wrap this whole sequence of actions in one function call, that would have to be part of the method signature.
But, looking at this again, I think it'd actually be best to just keep this logic in fillna for now, and figure out to generalize it more broadly (so we don't have to rewrite this same bit in shift et. al.) in a different PR later.
pandas/core/missing.py Outdated
| | ||
| def maybe_fill(obj, value, errors): | ||
| """ | ||
| fillna error coercion routine, returns whether or not to continue. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls add a doc-string
pandas/core/missing.py Outdated
| fillna error coercion routine, returns whether or not to continue. | ||
| """ | ||
| from pandas import Series | ||
| if isinstance(obj, Series): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use ABCSeries (import at top)
pandas/core/missing.py Outdated
| from pandas import Series | ||
| if isinstance(obj, Series): | ||
| if errors is None or errors == 'coerce': | ||
| return True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
errors must be coerce, raise, ignore, not is not acceptable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, I provided coerce as the default kwarg and passed that on through the fillna method stack. However whilst Sparse and Categorical series accept an errors param they raise immediately if it is not None (because it's not implemented/doesn't make sense for them). So the default (coerce) was causing those routines to raise—and a lot of test failures.
I remade the default param to None and introduced this modified check to address this concern.
How should this be handled instead?
pandas/types/missing.py Outdated
| return np.nan | ||
| | ||
| | ||
| def valid_fill_value(value, dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-> is_valid_fill_value
ebec92c to 54c8406 Compare pandas/core/missing.py Outdated
| _ensure_float64) | ||
| | ||
| is_float_dtype, is_datetime64_dtype, | ||
| is_datetime64tz_dtype, is_integer_dtype, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leave the formatting as is
pandas/core/missing.py Outdated
| Fillna error coercion routine. | ||
| Construct a piecewise polynomial in the Bernstein basis, compatible |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The dangers of copy-paste.
pandas/core/missing.py Outdated
| parsed as a sequence of sub-Series later on. | ||
| value : object | ||
| The value to be used as a fill for the object. | ||
| errors : {'raise', 'coerce', or 'ignore'} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not a parameter
| I'm investigating the five tests in the suite that the current code still doesn't pass, and I've hit something I can't explain originating in At a high level the bug is the fact that with the following bit:
While this PR gives: This difference seems to be emitting from line 268 in In In this PR: But how is this possible? This PR only adds a bit of if-else logic to the And the current master seems to pass all tests, including this one...huh. I wouldn't call this an edge case, the old behavior definitely needs to be preserved. Maybe I'm missing something, going to have to circle back to this later. |
| @ResidentMario this is related to this bug: #15613 I think with the default |
| inplace=inplace, | ||
| downcast=False) | ||
| if errors == 'coerce' or errors == 'raise': | ||
| # we cannot coerce the underlying object, so |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
raise shoulnd't coerce.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we should have already caught and raised in this case, no need to have this need. Seems I made a bit of a mess while figuring out a working code path.
| downcast=False, | ||
| errors='ignore') | ||
| else: # errors == 'ignore' | ||
| return self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is tricky for datetimes, they auto-infer.
| Thanks for the clarification, this is seems to be the same as your comment. Going to need to fiddle with it to find a workaround. A more minimal example: pd.Series([pd.Timestamp('2015-01-01'), pd.NaT]).fillna("?")Fills with |
| Should probably write the tests first, this got picked up because a test happened to depend on this behavior, maybe there are other cases also like this. |
| Looking at the tests, I see three dissatisfactory cases:
The second one needs to be solved, the first and last are limitations of |
| Looks like it was simpler than I had thought. The problem is that to keep To the |
87854e3 to 5010d5c Compare | Does this perhaps also fix the problem reported in #16112? |
| No, trying an error mode on a sparse matrix will raise an error for now. |
|
Suppose we have: import pandas as pd from io import StringIO res = pd.read_csv(StringIO(""",dates1,dates2,floatcol,intcol,stringcol 0,1970-01-01 00:00:00.000000000,1970-01-01 00:00:00.000000000,,, 1,1970-01-01 00:00:00.000000000,,,, 2,1970-01-01 00:00:00.000000000,1970-01-01 00:00:00.000000000,,, 3,1970-01-01 00:00:00.000000000,,,, 4,1970-01-01 00:00:00.000000000,,,, 5,1970-01-01 00:00:00.000000000,1970-01-01 00:00:00.000000000,,, 6,1970-01-01 00:00:00.000000000,1970-01-01 00:00:00.000000000,,, 7,1970-01-01 00:00:00.000000000,,,, 8,1970-01-01 00:00:00.000000000,1970-01-01 00:00:00.000000001,,, 9,1970-01-01 00:00:00.000000001,1970-01-01 00:00:00.000000001,,,"""), index_col=0)On master: On the feature branch: On master even though the first column is already full and doesn't receive any boolean fills, the column Since I now force this operation to be done column-by-column, But, this test relies on the old behavior, because it then attempts: res.fillna(True).astype(bool)Which fails with: Because while I suspect the other remaining failures are similar. |
| On master: import pandas as pd from io import StringIO res = pd.read_csv(StringIO(""",dates1,dates2 0,1970-01-01 00:00:00.000000000,1970-01-01 00:00:00.000000000 1,1970-01-01 00:00:00.000000000,"""), index_col=0) res['dates1'] = pd.to_datetime(res['dates1']) res.astype(bool)Raises: |
a1a39e1 to a30fe18 Compare a30fe18 to ee1f011 Compare | @ResidentMario your last is correct. if you need to do booleans with that you have to fillna first. |
| @jreback Unsure what you mean above, can you clarify? |
| @ResidentMario I mean your example above is exactly as expected an not an error. |
| Returning to this after a long absence. @jreback's solution works in the reproduction case, but can't be used in implementation. The problem is that it uses The brute-force way of resolving this, mentioned at #16155, is to upcast the But I'm just not seeing any hooks I can use to get in there and fix just this corner case specifically, either. I'm stumped. Welcoming suggestions from others. This is the last remaining test failure in this PR (assuming the driver failures are benign). There's still work to do with checking the impact on performance and going over the tests again, but this is otherwise reasonably close to a finish. |
| closing as stale. this is a bit easier in current master FYI. |
git diff upstream/master | flake8 --diff