Skip to content

Conversation

@dsaxton
Copy link
Contributor

@dsaxton dsaxton commented Apr 29, 2020

Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dsaxton maybe some tests for error messages and mixed dataframes

result = np.array(self.values, dtype=dtype, copy=copy)

if na_value is not lib.no_default:
result[isna(result)] = na_value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

>>> df = pd.DataFrame({"A": [1, 2, None]}) >>> df.to_numpy(na_value=pd.NA) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "C:\Users\simon\pandas\pandas\core\frame.py", line 1349, in to_numpy result[isna(result)] = na_value TypeError: float() argument must be a string or a number, not 'NAType' >>> 

do we want a more user friendly message, or perhaps validate the fill_value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That will be difficult in general, I think (since it depends on the behaviour of numpy on what it accepts as value to set to the array).
For this specific case, pd.NA is basically only possible for object dtype (so that could be checked specifically. Although it is also not the typical use case for to_numpy, since you typically want to get rid of pd.NA, since that doesn't work nicely with numpy).

Now, for an extension array, you already get a slightly better error message:

In [18]: pd.array([1, 2, None]).to_numpy(float, na_value=pd.NA) --------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-18-3c4196341619> in <module> ----> 1 pd.array([1, 2, None]).to_numpy(float, na_value=pd.NA) ~/scipy/pandas/pandas/core/arrays/masked.py in to_numpy(self, dtype, copy, na_value) 145 ): 146 raise ValueError( --> 147 f"cannot convert to '{dtype}'-dtype NumPy array " 148 "with missing values. Specify an appropriate 'na_value' " 149 "for this dtype." ValueError: cannot convert to '<class 'float'>'-dtype NumPy array with missing values. Specify an appropriate 'na_value' for this dtype. 
result = np.array(self.values, dtype=dtype, copy=copy)

if na_value is not lib.no_default:
result[isna(result)] = na_value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That will be difficult in general, I think (since it depends on the behaviour of numpy on what it accepts as value to set to the array).
For this specific case, pd.NA is basically only possible for object dtype (so that could be checked specifically. Although it is also not the typical use case for to_numpy, since you typically want to get rid of pd.NA, since that doesn't work nicely with numpy).

Now, for an extension array, you already get a slightly better error message:

In [18]: pd.array([1, 2, None]).to_numpy(float, na_value=pd.NA) --------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-18-3c4196341619> in <module> ----> 1 pd.array([1, 2, None]).to_numpy(float, na_value=pd.NA) ~/scipy/pandas/pandas/core/arrays/masked.py in to_numpy(self, dtype, copy, na_value) 145 ): 146 raise ValueError( --> 147 f"cannot convert to '{dtype}'-dtype NumPy array " 148 "with missing values. Specify an appropriate 'na_value' " 149 "for this dtype." ValueError: cannot convert to '<class 'float'>'-dtype NumPy array with missing values. Specify an appropriate 'na_value' for this dtype. 
@dsaxton

This comment has been minimized.

@jorisvandenbossche
Copy link
Member

So I think you need to deal with this deeper into the internals, as I mentioned before. It is there that the custom pandas logic to decide on the resulting dtype is done (which is indeed different from numpy's result dtype)

@jorisvandenbossche
Copy link
Member

Check BlockManager.as_array -> this also already handles the case of a single Block (to keep the no copy behaviour there).

I think we can expand that method to support na_value

assert df.values.base is arr
assert df.to_numpy(copy=False).base is arr
assert df.to_numpy(copy=True).base is None
assert df.to_numpy(copy=True).base is not arr
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This had to change because now the output array is a view, but only on the transpose of itself (which is a copy of the original data)

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
arr = np.empty(self.shape, dtype=float)
return arr.transpose() if transpose else arr

should_copy = copy or na_value is not lib.no_default
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this needed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's to avoid making a second copy in certain cases where we can detect this (eg with multiple dtypes (going through _interleave), the constructed array is never the original values, so an additional copy is not needed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then i would just make this

copy = copy or na_value is not lib.no_default 

and add an explanation line

# always be object dtype. Some callers seem to want the
# DatetimeArray (previously DTI)
arr = self.blocks[0].get_values(dtype=object)
elif self._is_single_block and self.blocks[0].is_extension:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you might be able to remove the block aboev (is_datetimetz) as that is an extension block already

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, the DatetimeArray.to_numpy behaviour seems correct for the default (having object dtype with timestamps instead of datetime64)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice 👍

@jreback jreback added this to the 1.1 milestone May 13, 2020
@jreback jreback added ExtensionArray Extending pandas with custom dtypes or arrays. Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels May 13, 2020
@jreback jreback merged commit 4bd6905 into pandas-dev:master May 13, 2020
@jreback
Copy link
Contributor

jreback commented May 13, 2020

thanks @dsaxton

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate

4 participants