ENH: Add na_value argument to DataFrame.to_numpy #33857

dsaxton · 2020-04-29T01:28:37Z

closes ENH: Add na_value to DataFrame.to_numpy #33820
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

simonjayhawkins

Thanks @dsaxton. Maybe add some tests for error messages and mixed-dtype DataFrames?

pandas/tests/base/test_conversion.py

simonjayhawkins · 2020-04-29T10:03:23Z

pandas/core/frame.py

         result = np.array(self.values, dtype=dtype, copy=copy)
+
+        if na_value is not lib.no_default:
+            result[isna(result)] = na_value


>>> df = pd.DataFrame({"A": [1, 2, None]})
>>> df.to_numpy(na_value=pd.NA)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\simon\pandas\pandas\core\frame.py", line 1349, in to_numpy
    result[isna(result)] = na_value
TypeError: float() argument must be a string or a number, not 'NAType'

Do we want a more user-friendly error message here, or should we perhaps validate na_value?

That will be difficult in general, I think, since it depends on which values NumPy accepts when setting into the array.
For this specific case, pd.NA is basically only possible for object dtype, so that could be checked specifically. It is also not the typical use case for to_numpy, though: you usually want to get rid of pd.NA, since it doesn't work nicely with NumPy.

Now, for an extension array, you already get a slightly better error message:

In [18]: pd.array([1, 2, None]).to_numpy(float, na_value=pd.NA)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-18-3c4196341619> in <module>
----> 1 pd.array([1, 2, None]).to_numpy(float, na_value=pd.NA)

~/scipy/pandas/pandas/core/arrays/masked.py in to_numpy(self, dtype, copy, na_value)
    145             ):
    146                 raise ValueError(
--> 147                     f"cannot convert to '{dtype}'-dtype NumPy array "
    148                     "with missing values. Specify an appropriate 'na_value' "
    149                     "for this dtype."

ValueError: cannot convert to '<class 'float'>'-dtype NumPy array with missing values. Specify an appropriate 'na_value' for this dtype.
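As a small illustration of the error above and of what an "appropriate na_value" can look like, here is a sketch that assumes a pandas version with nullable integer arrays (1.0 or later). The variable names are illustrative only and do not come from the PR.

```python
import numpy as np
import pandas as pd

# Nullable integer array with one missing value
arr = pd.array([1, 2, None], dtype="Int64")

# np.nan can be stored in a float array, so this conversion succeeds
as_float = arr.to_numpy(dtype=float, na_value=np.nan)

# pd.NA cannot be stored in a float array, so pandas raises ValueError
try:
    arr.to_numpy(dtype=float, na_value=pd.NA)
    raised = False
except ValueError:
    raised = True
```

The point is that the choice of na_value has to be compatible with the requested dtype; pd.NA only fits into an object-dtype result.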

pandas/core/frame.py

jorisvandenbossche · 2020-04-29T11:24:09Z

jorisvandenbossche · 2020-04-30T19:38:36Z

So I think you need to deal with this deeper in the internals, as I mentioned before. That is where the custom pandas logic for deciding the resulting dtype lives (which does indeed differ from NumPy's result dtype).

jorisvandenbossche · 2020-04-30T19:39:59Z

Check BlockManager.as_array: it already handles the case of a single Block (to keep the no-copy behaviour there).

I think we can expand that method to support na_value.
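For orientation, this is roughly what the feature under discussion looks like from the user's side. It is a minimal sketch that assumes pandas 1.1 or later, where DataFrame.to_numpy accepts na_value; the frame and the fill value are made up for illustration.

```python
import numpy as np
import pandas as pd

# A float DataFrame with one missing value
df = pd.DataFrame({"A": [1.0, np.nan], "B": [3.0, 4.0]})

# Replace missing values while converting to a NumPy array
filled = df.to_numpy(na_value=-1.0)
```

The original DataFrame still holds its missing value afterwards; only the returned array is filled.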

dsaxton · 2020-05-04T23:06:46Z

pandas/tests/frame/test_api.py

 assert df.values.base is arr
 assert df.to_numpy(copy=False).base is arr
- assert df.to_numpy(copy=True).base is None
+ assert df.to_numpy(copy=True).base is not arr


This had to change because now the output array is a view, but only on the transpose of itself (which is a copy of the original data)
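The situation described here can be reproduced with plain NumPy. The sketch below is not the pandas code path itself; it only assumes that the result is built by copying the transposed data and transposing back, which is what the comment suggests.

```python
import numpy as np

# Stand-in for the original data owned by the DataFrame
arr = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Copy of the transposed data: a new array that owns its memory
copied_t = arr.T.copy()

# Transposing back gives a view, so .base is set, but it points at the copy
out = copied_t.T
```

Because `out` is a view on the copy, `out.base` is not None even though no memory is shared with `arr`, which is why the test asserts `is not arr` rather than `is None`.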

jreback · 2020-05-09T19:29:00Z

pandas/core/internals/managers.py

             arr = np.empty(self.shape, dtype=float)
             return arr.transpose() if transpose else arr
 
+        should_copy = copy or na_value is not lib.no_default


Why is this needed?

It's to avoid making a second copy in the cases where we can detect that one is unnecessary. For example, with multiple dtypes (going through _interleave), the constructed array is never the original values, so an additional copy is not needed.

Then I would just make this

copy = copy or na_value is not lib.no_default

and add a line of explanation.
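The reason a copy is forced when na_value is given can be shown with a simplified stand-alone sketch. Note that `to_array_sketch` and `_NO_DEFAULT` are hypothetical names standing in for the real method and for lib.no_default, and the sketch assumes float input only.

```python
import numpy as np

_NO_DEFAULT = object()  # hypothetical stand-in for pandas' lib.no_default


def to_array_sketch(values, copy=False, na_value=_NO_DEFAULT):
    """Return values as an array, optionally filling NaN with na_value."""
    # Filling writes into the array, so always copy when na_value is given;
    # otherwise the caller's data would be mutated in place.
    copy = copy or na_value is not _NO_DEFAULT
    arr = np.array(values, copy=True) if copy else np.asarray(values)
    if na_value is not _NO_DEFAULT:
        arr[np.isnan(arr)] = na_value
    return arr


data = np.array([[1.0, np.nan], [3.0, 4.0]])
filled = to_array_sketch(data, na_value=0.0)
same = to_array_sketch(data)
```

Here `filled` is a modified copy while `data` keeps its NaN, and the call without na_value returns the input unchanged, without copying.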

jreback · 2020-05-09T19:29:44Z

pandas/core/internals/managers.py

             # always be object dtype. Some callers seem to want the
             # DatetimeArray (previously DTI)
             arr = self.blocks[0].get_values(dtype=object)
+        elif self._is_single_block and self.blocks[0].is_extension:


You might be able to remove the block above (is_datetimetz), as that is an extension block already.

Indeed, the DatetimeArray.to_numpy behaviour seems correct for the default (having object dtype with timestamps instead of datetime64)
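The default behaviour referred to here can be checked directly. This is a small sketch with made-up data, assuming a reasonably recent pandas version.

```python
import pandas as pd

# Timezone-aware datetimes are backed by a DatetimeArray
ser = pd.Series(pd.date_range("2020-01-01", periods=2, tz="UTC"))

# With no dtype given, to_numpy returns an object array of Timestamps,
# which preserves the timezone information
as_objects = ser.array.to_numpy()
```

Converting to datetime64 instead would drop the timezone, which is why object dtype is the default for tz-aware data.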

pandas/core/internals/managers.py

jreback · 2020-05-13T15:51:36Z

thanks @dsaxton

ENH: Add na_value argument to DataFrame.to_numpy

054d74e

simonjayhawkins reviewed Apr 29, 2020

simonjayhawkins added the Enhancement label Apr 29, 2020

jorisvandenbossche reviewed Apr 29, 2020

dsaxton added 4 commits April 29, 2020 07:37

Add some tests

cafbf5f

Issue num

34b9b9f

A little better

9a2bbd6

Merge remote-tracking branch 'upstream/master' into dataframe-to-numpy

f7dc246

jreback requested changes Apr 30, 2020

pandas/core/frame.py (outdated, resolved)

jorisvandenbossche reviewed Apr 30, 2020

pandas/core/frame.py (outdated, resolved)

dsaxton added 4 commits May 1, 2020 14:17

as_array

5eb8bb2

Merge remote-tracking branch 'upstream/master' into dataframe-to-numpy

ec2f729

Black

b48b6c9

More black

d1a60e8

jorisvandenbossche reviewed May 1, 2020

pandas/core/internals/managers.py (outdated, resolved)

jorisvandenbossche reviewed May 1, 2020

pandas/core/internals/managers.py (resolved)

dsaxton added 4 commits May 1, 2020 20:53

to_numpy for ExtensionBlock

09fdf51

Merge remote-tracking branch 'upstream/master' into dataframe-to-numpy

f5db15a

dtype hack

89e8930

Merge remote-tracking branch 'upstream/master' into dataframe-to-numpy

d24b976

dsaxton commented May 2, 2020

pandas/core/frame.py (outdated, resolved)

jreback requested changes May 2, 2020

pandas/core/frame.py (outdated, resolved)

dsaxton added 5 commits May 3, 2020 08:51

reshape

bec3889

Merge remote-tracking branch 'upstream/master' into dataframe-to-numpy

a20f116

Check for NA first

02405a1

Otherwise this can raise for an all integer DataFrame without any missing values

Cast non-extension single block

055413f

Test nit

ae088e4

jorisvandenbossche reviewed May 4, 2020

pandas/core/frame.py (outdated, resolved)

pandas/core/frame.py (outdated, resolved)

pandas/core/internals/managers.py (outdated, resolved)

dsaxton and others added 3 commits May 4, 2020 07:10

Update pandas/core/frame.py

d78ba29

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

side effect test

9c87e00

Merge remote-tracking branch 'upstream/master' into dataframe-to-numpy

df5b683

jorisvandenbossche reviewed May 4, 2020

pandas/tests/base/test_conversion.py (outdated, resolved)

dsaxton and others added 4 commits May 4, 2020 07:30

Update pandas/tests/base/test_conversion.py

ae2b34a

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

Copy

bcb69c5

Copy less

c3a7a55

Merge remote-tracking branch 'upstream/master' into dataframe-to-numpy

b5ec43f

dsaxton commented May 4, 2020

jorisvandenbossche reviewed May 5, 2020

pandas/core/internals/managers.py (resolved)

dsaxton added 2 commits May 5, 2020 09:32

should_copy

f3e45d7

Merge remote-tracking branch 'upstream/master' into dataframe-to-numpy

e54cc28

jorisvandenbossche reviewed May 5, 2020

pandas/core/internals/managers.py (outdated, resolved)

Update pandas/core/internals/managers.py

491a5ae

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

dsaxton mentioned this pull request May 6, 2020

CLN: Use to_numpy directly in cov / corr #34031

Merged

jreback requested changes May 9, 2020

dsaxton added 3 commits May 9, 2020 17:00

Rename and comment

4ecccff

Merge remote-tracking branch 'upstream/master' into dataframe-to-numpy

142c808

Don't special case datetimetz

8d42fd4

jorisvandenbossche approved these changes May 11, 2020

Merge remote-tracking branch 'upstream/master' into dataframe-to-numpy

c2228bf

jreback added this to the 1.1 milestone May 13, 2020

jreback added the ExtensionArray and Missing-data labels May 13, 2020

jreback approved these changes May 13, 2020

jreback merged commit 4bd6905 into pandas-dev:master May 13, 2020

dsaxton deleted the dataframe-to-numpy branch May 13, 2020 15:57

simonjayhawkins mentioned this pull request Jul 29, 2020

REGR: DataFrame.to_numpy(dtype=str) raises RuntimeError in pandas 1.1.0 #35455

Closed
