DOC: Document that astype() for Series and Dataframe can accept a series of dtypes #46353

@mvashishtha

Pandas version checks

  • I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html

https://pandas.pydata.org/docs/reference/api/pandas.Series.astype.html

Documentation problem

The documentation for DataFrame states that dtype can be a dict mapping from column label to new type, or a scalar type. However, dtype can also be a pd.Series whose index contains a subset of the dataframe's labels (though perhaps in a different order) and whose values are dtypes. For example:

    import pandas as pd

    df = pd.DataFrame([[1, 2]])
    print(f"original dtypes:\n{df.dtypes}")
    new_df = df.astype(pd.Series(["string", "object"], index=df.columns[::-1]))
    print(f"new dtypes:\n{new_df.dtypes}")
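
To make the comparison with the documented dict form concrete, here is a minimal sketch that passes the same label-to-dtype mapping once as a dict and once as a Series. The names `as_dict` and `as_series` are illustrative only, and the behaviour may vary between pandas versions:

```python
import pandas as pd

df = pd.DataFrame([[1, 2]])  # columns are labelled 0 and 1

# The same column-label -> dtype mapping, expressed two ways.
as_dict = {1: "float64", 0: "object"}
as_series = pd.Series(as_dict)  # index holds the labels, values hold the dtypes

from_dict = df.astype(as_dict)
from_series = df.astype(as_series)

# Both calls are expected to produce the same per-column dtypes.
print(from_dict.dtypes)
print(from_series.dtypes)
```

Neither mapping lists the labels in column order, which suggests that matching is done by label rather than by position.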

In case the frame has duplicate column labels, the index of the new series of dtypes may still be a subset of the column labels. However, it appears that the only way that the new series can have a non-unique index is if that index is exactly equal to the dataframe's columns. For example, this works:

    import pandas as pd

    df = pd.DataFrame([[1, 2, 3]], columns=[0, 0, 1])
    print(f"original dtypes:\n{df.dtypes}")
    new_df = df.astype(pd.Series(["string", "object", "float"], index=df.columns))
    print(f"new dtypes:\n{new_df.dtypes}")

but giving the series an index with all the columns in a different order

    import pandas as pd

    df = pd.DataFrame([[1, 2, 3]], columns=[0, 0, 1])
    print(f"original dtypes:\n{df.dtypes}")
    new_df = df.astype(pd.Series(["string", "object", "float"], index=df.columns[::-1]))
    print(f"new dtypes:\n{new_df.dtypes}")

raises ValueError: cannot reindex on an axis with duplicate labels:

    ValueError                                Traceback (most recent call last)
    Input In [14], in <module>
          3 df = pd.DataFrame([[1, 2, 3]], columns=[0, 0, 1])
          4 print(f"original dtypes:\n{df.dtypes}")
    ----> 5 new_df = df.astype(pd.Series(["string", "object", "float"], index=df.columns[::-1]))
          6 print(f"new dtypes:\n{new_df.dtypes}")

    File /usr/local/lib/python3.9/site-packages/pandas/core/generic.py:5898, in NDFrame.astype(self, dtype, copy, errors)
       5895 from pandas import Series
       5897 dtype_ser = Series(dtype, dtype=object)
    -> 5898 dtype_ser = dtype_ser.reindex(self.columns, fill_value=None, copy=False)
       5900 results = []
       5901 for i, (col_name, col) in enumerate(self.items()):

    File /usr/local/lib/python3.9/site-packages/pandas/core/series.py:4672, in Series.reindex(self, *args, **kwargs)
       4668     raise TypeError(
       4669         "'index' passed as both positional and keyword argument"
       4670     )
       4671 kwargs.update({"index": index})
    -> 4672 return super().reindex(**kwargs)

    File /usr/local/lib/python3.9/site-packages/pandas/core/generic.py:4974, in NDFrame.reindex(self, *args, **kwargs)
       4971     return self._reindex_multi(axes, copy, fill_value)
       4973 # perform the reindex on the axes
    -> 4974 return self._reindex_axes(
       4975     axes, level, limit, tolerance, method, fill_value, copy
       4976 ).__finalize__(self, method="reindex")

    File /usr/local/lib/python3.9/site-packages/pandas/core/generic.py:4994, in NDFrame._reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy)
       4989 new_index, indexer = ax.reindex(
       4990     labels, level=level, limit=limit, tolerance=tolerance, method=method
       4991 )
       4993 axis = self._get_axis_number(a)
    -> 4994 obj = obj._reindex_with_indexers(
       4995     {axis: [new_index, indexer]},
       4996     fill_value=fill_value,
       4997     copy=copy,
       4998     allow_dups=False,
       4999 )
       5000 # If we've made a copy once, no need to make another one
       5001 copy = False

    File /usr/local/lib/python3.9/site-packages/pandas/core/generic.py:5040, in NDFrame._reindex_with_indexers(self, reindexers, fill_value, copy, allow_dups)
       5037 indexer = ensure_platform_int(indexer)
       5039 # TODO: speed up on homogeneous DataFrame objects (see _reindex_multi)
    -> 5040 new_data = new_data.reindex_indexer(
       5041     index,
       5042     indexer,
       5043     axis=baxis,
       5044     fill_value=fill_value,
       5045     allow_dups=allow_dups,
       5046     copy=copy,
       5047 )
       5048 # If we've made a copy once, no need to make another one
       5049 copy = False

    File /usr/local/lib/python3.9/site-packages/pandas/core/internals/managers.py:679, in BaseBlockManager.reindex_indexer(self, new_axis, indexer, axis, fill_value, allow_dups, copy, consolidate, only_slice, use_na_proxy)
        677 # some axes don't allow reindexing with dups
        678 if not allow_dups:
    --> 679     self.axes[axis]._validate_can_reindex(indexer)
        681 if axis >= self.ndim:
        682     raise IndexError("Requested axis not found in manager")

    File /usr/local/lib/python3.9/site-packages/pandas/core/indexes/base.py:4107, in Index._validate_can_reindex(self, indexer)
       4105 # trying to reindex on an axis with duplicates
       4106 if not self._index_as_unique and len(indexer):
    -> 4107     raise ValueError("cannot reindex on an axis with duplicate labels")

    ValueError: cannot reindex on an axis with duplicate labels

Passing a series of dtypes for just the first two columns, which share the duplicate label 0, raises the same ValueError:

    import pandas as pd

    df = pd.DataFrame([[1, 2, 3]], columns=[0, 0, 1])
    print(f"original dtypes:\n{df.dtypes}")
    new_df = df.astype(pd.Series(["string", "object"], index=df.columns[:2]))
    print(f"new dtypes:\n{new_df.dtypes}")

although passing a type for just duplicate column label 0 converts the first two columns to the same new type:

    import pandas as pd

    df = pd.DataFrame([[1, 2, 3]], columns=[0, 0, 1])
    print(f"original dtypes:\n{df.dtypes}")
    new_df = df.astype(pd.Series(["string"], index=[0]))
    print(f"new dtypes:\n{new_df.dtypes}")

Finally, it's possible to pass a series of a single dtype to Series.astype. In that case, the one value in the series index must be the series name, e.g.:

    import pandas as pd

    s = pd.Series(1)
    s.astype(pd.Series(["int64"], index=[s.name]))
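
The name-matching rule is easier to see with a named series. The sketch below uses a series named "x" (an illustrative choice, not from the report above) and compares a dtype series keyed by that name with one keyed by a different label. As far as I can tell the mismatched label leads to a KeyError, but treat that as an observation rather than documented behaviour:

```python
import pandas as pd

s = pd.Series([1, 2], name="x")

# The single index label matches the series name, so the cast goes through.
converted = s.astype(pd.Series(["float64"], index=["x"]))
print(converted.dtype)

# A label that is not the series name is rejected.
error = None
try:
    s.astype(pd.Series(["float64"], index=["y"]))
except KeyError as exc:
    error = exc
print(f"mismatched label raised: {type(error).__name__}")
```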

Suggested fix for documentation

I don't know whether all of this is an intended feature, or just a bug caused by converting dict-like dtypes to Series here. If it's a feature, it should be documented in the documentation for astype for both DataFrame and Series.
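
For whoever triages this, the duplicate-label cases above seem to follow from the `reindex` call visible in the stack trace (`dtype_ser.reindex(self.columns, ...)`). The sketch below isolates that step on a plain Series, outside of `astype`. The helper `align` is hypothetical and only mimics that one line; it is not pandas' implementation:

```python
import pandas as pd

columns = pd.Index([0, 0, 1])  # duplicate label 0, as in the frames above


def align(dtype_ser):
    """Hypothetical helper: reindex a dtype series onto the column labels."""
    try:
        return dtype_ser.reindex(columns).tolist()
    except ValueError as exc:
        return f"ValueError: {exc}"


dtypes = ["string", "object", "float"]

# Index identical to the columns: reindex has nothing to realign.
same_order = align(pd.Series(dtypes, index=columns, dtype=object))
# Duplicate labels in a different order: reindex refuses.
reversed_order = align(pd.Series(dtypes, index=columns[::-1], dtype=object))
# Only the two duplicate labels: the series index is still non-unique.
first_two = align(pd.Series(dtypes[:2], index=columns[:2], dtype=object))
# A unique index with label 0 alone: the value is repeated for both columns.
only_zero = align(pd.Series(["string"], index=[0], dtype=object))

for name, result in [
    ("same_order", same_order),
    ("reversed_order", reversed_order),
    ("first_two", first_two),
    ("only_zero", only_zero),
]:
    print(f"{name}: {result}")
```

If this reading is right, the rule is that a dtype series may have a non-unique index only when no realignment is needed, which matches the cases reported above.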

Labels: Docs, Dtype Conversions, Series