Categorical dtype doesn't survive groupby of first, max, min, value_counts etc.: unwanted coercion to object #18502

@dcolascione

Code Sample, a copy-pastable example if possible

    In [1]: df=pd.DataFrame(dict(payload=[-1,-2,-1,-2], col=pd.Categorical(["foo", "bar", "bar", "qux"], ordered=True)));df
    Out[1]:
       col  payload
    0  foo       -1
    1  bar       -2
    2  bar       -1
    3  qux       -2

    In [2]: df.groupby("payload").first().col.dtype
    Out[2]: dtype('O')
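The title also names other reductions, so a quick way to see which ones are affected on a given pandas version might look like the sketch below. It rebuilds the same frame and collects the resulting dtype of `col` for a few groupby methods. The choice of methods and the `dtypes` dict are illustrative only, and the printed dtypes will depend on the installed pandas version.

```python
import pandas as pd

# Same frame as in the sample above.
df = pd.DataFrame(
    dict(
        payload=[-1, -2, -1, -2],
        col=pd.Categorical(["foo", "bar", "bar", "qux"], ordered=True),
    )
)

# Collect the dtype of `col` after each reduction.
# min/max are only meaningful here because the categorical is ordered.
dtypes = {
    name: getattr(df.groupby("payload"), name)()["col"].dtype
    for name in ("first", "last", "min", "max")
}

for name, dtype in dtypes.items():
    print(name, dtype)
```

Any entry that prints as `object` rather than `category` shows the coercion described in this issue.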

Problem description

Grouping shouldn't coerce a categorical into object. Categorical dtypes should be preserved as long as possible for efficiency and correctness.
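Until the dtype is preserved by the groupby itself, one possible workaround is to cast the affected columns back to their original dtype after the reduction. This is only a sketch: `first_keep_categorical` is a hypothetical helper name, not a pandas API, and it assumes every value produced by `first()` is still one of the original categories.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame(
    dict(
        payload=[-1, -2, -1, -2],
        col=pd.Categorical(["foo", "bar", "bar", "qux"], ordered=True),
    )
)


def first_keep_categorical(frame, key):
    """Hypothetical helper: groupby(key).first(), then restore categorical dtypes."""
    result = frame.groupby(key).first()
    for name in result.columns:
        original = frame[name].dtype
        # Re-cast only columns that were categorical in the input frame.
        if isinstance(original, CategoricalDtype):
            result[name] = result[name].astype(original)
    return result


restored = first_keep_categorical(df, "payload")
print(restored["col"].dtype)
```

The cast costs an extra pass over the column, so it recovers correctness but not the efficiency that a dtype-preserving groupby would give.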

Expected Output

The result dtype should be CategoricalDtype(categories=['bar', 'foo', 'qux'], ordered=True), just like it is here:

    In [6]: df.groupby("payload").head().col.dtype
    Out[6]: CategoricalDtype(categories=['bar', 'foo', 'qux'], ordered=True)

Output of pd.show_versions()

    INSTALLED VERSIONS
    ------------------
    commit: None
    python: 3.4.3.final.0
    python-bits: 64
    OS: Linux
    OS-release: 4.4.0-98-generic
    machine: x86_64
    processor: x86_64
    byteorder: little
    LC_ALL: None
    LANG: en_US.UTF-8
    LOCALE: en_US.UTF-8

    pandas: 0.21.0
    pytest: 3.2.5
    pip: 9.0.1
    setuptools: 36.5.0
    Cython: 0.20.1post0
    numpy: 1.13.3
    scipy: 0.13.3
    pyarrow: None
    xarray: 0.9.6
    IPython: 6.2.0
    sphinx: None
    patsy: None
    dateutil: 2.6.1
    pytz: 2017.3
    blosc: None
    bottleneck: None
    tables: 3.1.1
    numexpr: 2.6.4
    feather: None
    matplotlib: 1.3.1
    openpyxl: None
    xlrd: None
    xlwt: None
    xlsxwriter: None
    lxml: None
    bs4: 4.2.1
    html5lib: 0.999
    sqlalchemy: 0.8.4
    pymysql: None
    psycopg2: None
    jinja2: 2.7.2
    s3fs: 0.1.2
    fastparquet: None
    pandas_gbq: None
    pandas_datareader: None
