Description
Pandas version checks
- I have checked that the issue still exists on the latest versions of the docs on main here
Location of the documentation
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html
https://pandas.pydata.org/docs/reference/api/pandas.Series.astype.html
Documentation problem
The documentation for DataFrame.astype states that dtype can be a dict mapping from column label to new type, or a scalar type. However, dtype can also be a pd.Series whose index contains a subset of the dataframe's column labels (possibly in a different order) and whose values are dtypes. For example:
    import pandas as pd

    df = pd.DataFrame([[1, 2]])
    print(f"original dtypes:\n{df.dtypes}")
    new_df = df.astype(pd.Series(["string", "object"], index=df.columns[::-1]))
    print(f"new dtypes:\n{new_df.dtypes}")
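For comparison with the documented dict form, here is a minimal sketch that checks whether the same mapping gives the same result when passed as a pd.Series and as a dict. The variable names are illustrative only, and the behaviour may vary between pandas versions.

```python
import pandas as pd

df = pd.DataFrame([[1, 2]])

# Same label -> dtype mapping, once as a Series and once as a plain dict.
dtype_series = pd.Series(["string", "object"], index=df.columns[::-1])
dtype_dict = dtype_series.to_dict()

from_series = df.astype(dtype_series)
from_dict = df.astype(dtype_dict)

# Compare the resulting dtypes by their string names.
series_dtypes = [str(t) for t in from_series.dtypes]
dict_dtypes = [str(t) for t in from_dict.dtypes]
print(series_dtypes == dict_dtypes)
```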
If the frame has duplicate column labels, the index of the series of dtypes may still be a subset of the column labels. However, it appears that the series may have a non-unique index only if that index is exactly equal to the dataframe's columns. For example, this works:
    import pandas as pd

    df = pd.DataFrame([[1, 2, 3]], columns=[0, 0, 1])
    print(f"original dtypes:\n{df.dtypes}")
    new_df = df.astype(pd.Series(["string", "object", "float"], index=df.columns))
    print(f"new dtypes:\n{new_df.dtypes}")
but giving the series an index with all the columns in a different order
    import pandas as pd

    df = pd.DataFrame([[1, 2, 3]], columns=[0, 0, 1])
    print(f"original dtypes:\n{df.dtypes}")
    new_df = df.astype(pd.Series(["string", "object", "float"], index=df.columns[::-1]))
    print(f"new dtypes:\n{new_df.dtypes}")
raises ValueError: cannot reindex on an axis with duplicate labels:
Stack trace:

    ValueError                                Traceback (most recent call last)
    Input In [14], in <module>
          3 df = pd.DataFrame([[1, 2, 3]], columns=[0, 0, 1])
          4 print(f"original dtypes:\n{df.dtypes}")
    ----> 5 new_df = df.astype(pd.Series(["string", "object", "float"], index=df.columns[::-1]))
          6 print(f"new dtypes:\n{new_df.dtypes}")

    File /usr/local/lib/python3.9/site-packages/pandas/core/generic.py:5898, in NDFrame.astype(self, dtype, copy, errors)
       5895 from pandas import Series
       5897 dtype_ser = Series(dtype, dtype=object)
    -> 5898 dtype_ser = dtype_ser.reindex(self.columns, fill_value=None, copy=False)
       5900 results = []
       5901 for i, (col_name, col) in enumerate(self.items()):

    File /usr/local/lib/python3.9/site-packages/pandas/core/series.py:4672, in Series.reindex(self, *args, **kwargs)
       4668 raise TypeError(
       4669 "'index' passed as both positional and keyword argument"
       4670 )
       4671 kwargs.update({"index": index})
    -> 4672 return super().reindex(**kwargs)

    File /usr/local/lib/python3.9/site-packages/pandas/core/generic.py:4974, in NDFrame.reindex(self, *args, **kwargs)
       4971 return self._reindex_multi(axes, copy, fill_value)
       4973 # perform the reindex on the axes
    -> 4974 return self._reindex_axes(
       4975 axes, level, limit, tolerance, method, fill_value, copy
       4976 ).__finalize__(self, method="reindex")

    File /usr/local/lib/python3.9/site-packages/pandas/core/generic.py:4994, in NDFrame._reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy)
       4989 new_index, indexer = ax.reindex(
       4990 labels, level=level, limit=limit, tolerance=tolerance, method=method
       4991 )
       4993 axis = self._get_axis_number(a)
    -> 4994 obj = obj._reindex_with_indexers(
       4995 {axis: [new_index, indexer]},
       4996 fill_value=fill_value,
       4997 copy=copy,
       4998 allow_dups=False,
       4999 )
       5000 # If we've made a copy once, no need to make another one
       5001 copy = False

    File /usr/local/lib/python3.9/site-packages/pandas/core/generic.py:5040, in NDFrame._reindex_with_indexers(self, reindexers, fill_value, copy, allow_dups)
       5037 indexer = ensure_platform_int(indexer)
       5039 # TODO: speed up on homogeneous DataFrame objects (see _reindex_multi)
    -> 5040 new_data = new_data.reindex_indexer(
       5041 index,
       5042 indexer,
       5043 axis=baxis,
       5044 fill_value=fill_value,
       5045 allow_dups=allow_dups,
       5046 copy=copy,
       5047 )
       5048 # If we've made a copy once, no need to make another one
       5049 copy = False

    File /usr/local/lib/python3.9/site-packages/pandas/core/internals/managers.py:679, in BaseBlockManager.reindex_indexer(self, new_axis, indexer, axis, fill_value, allow_dups, copy, consolidate, only_slice, use_na_proxy)
        677 # some axes don't allow reindexing with dups
        678 if not allow_dups:
    --> 679 self.axes[axis]._validate_can_reindex(indexer)
        681 if axis >= self.ndim:
        682 raise IndexError("Requested axis not found in manager")

    File /usr/local/lib/python3.9/site-packages/pandas/core/indexes/base.py:4107, in Index._validate_can_reindex(self, indexer)
       4105 # trying to reindex on an axis with duplicates
       4106 if not self._index_as_unique and len(indexer):
    -> 4107 raise ValueError("cannot reindex on an axis with duplicate labels")

    ValueError: cannot reindex on an axis with duplicate labels
and so does passing a series of dtypes for just the first two columns (it raises the same ValueError):
    import pandas as pd

    df = pd.DataFrame([[1, 2, 3]], columns=[0, 0, 1])
    print(f"original dtypes:\n{df.dtypes}")
    new_df = df.astype(pd.Series(["string", "object"], index=df.columns[:2]))
    print(f"new dtypes:\n{new_df.dtypes}")
although passing a type for just the duplicate column label 0 converts the first two columns to the same new type:
    import pandas as pd

    df = pd.DataFrame([[1, 2, 3]], columns=[0, 0, 1])
    print(f"original dtypes:\n{df.dtypes}")
    new_df = df.astype(pd.Series(["string"], index=[0]))
    print(f"new dtypes:\n{new_df.dtypes}")
Finally, it's possible to pass a series of a single dtype to Series.astype. In that case, the one value in the series index must be the series name, e.g.:
    import pandas as pd

    s = pd.Series(1)
    s.astype(pd.Series(["int64"], index=[s.name]))
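To make the "index must be the series name" constraint concrete, here is a small sketch with a named series. The names "x" and "y" are made up for illustration. As far as I can tell, a key other than the series name raises KeyError, but that is an observation about current behaviour rather than a documented guarantee.

```python
import pandas as pd

s = pd.Series([1, 2], name="x")

# The single index label matches the series name, so the cast goes through.
converted = s.astype(pd.Series(["float64"], index=["x"]))
print(converted.dtype)

# A label that is not the series name is rejected.
try:
    s.astype(pd.Series(["float64"], index=["y"]))
    mismatch_error = None
except KeyError as exc:
    mismatch_error = exc
print(type(mismatch_error).__name__)
```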
Suggested fix for documentation
I don't know whether all of this is an intended feature, or just a bug caused by converting dict-like dtypes to Series here. If it's a feature, it should be documented in the documentation for astype for both DataFrame and Series.
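The duplicate-label cases above seem to follow from the reindex call visible in the stack trace (the dtype series is reindexed against the frame's columns). The sketch below applies Series.reindex directly to the four dtype series from the examples. try_reindex is a hypothetical helper written for this illustration, not a pandas function, and it only approximates what astype does internally.

```python
import pandas as pd

# Duplicate column labels, as in the examples above.
columns = pd.Index([0, 0, 1])


def try_reindex(dtype_ser: pd.Series) -> str:
    """Reindex a dtype series against `columns` and report the outcome."""
    try:
        dtype_ser.reindex(columns)
    except ValueError as exc:
        return f"ValueError: {exc}"
    return "ok"


outcomes = {
    # Unique index that is a subset of the column labels.
    "unique subset": try_reindex(pd.Series(["string"], index=[0], dtype=object)),
    # Non-unique index that is exactly the frame's columns.
    "identical to columns": try_reindex(
        pd.Series(["string", "object", "float"], index=columns, dtype=object)
    ),
    # Non-unique index: the same labels in reverse order.
    "reversed columns": try_reindex(
        pd.Series(["string", "object", "float"], index=columns[::-1], dtype=object)
    ),
    # Non-unique index: only the two duplicated labels.
    "first two columns": try_reindex(
        pd.Series(["string", "object"], index=columns[:2], dtype=object)
    ),
}

for case, outcome in outcomes.items():
    print(f"{case}: {outcome}")
```

If this reading is right, a non-unique dtype index only works when reindexing has nothing to do because the index already equals the columns. Every other non-unique index fails at the reindex step.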