@davidjcastrejon davidjcastrejon commented Nov 3, 2025

MultiIndex.factorize() was silently converting extension dtypes (Int64, boolean, string) to base dtypes, causing data corruption. This fix preserves extension dtypes by restoring them level-by-level after factorization.

Before:

import pandas as pd

mi = pd.MultiIndex.from_arrays([pd.array([1, 2, 3], dtype="Int64")])
codes, uniques = mi.factorize()
print(uniques.dtypes.iloc[0])  # int64 ← Lost extension dtype

x = pd.Series([1, None], dtype='Int32').to_frame(name='col')
# This is 'Int32Dtype()' as expected
print(pd.MultiIndex.from_frame(x).to_frame()['col'].dtype)
# This is float64
print(pd.MultiIndex.from_frame(x).factorize()[1].to_frame().iloc[:, 0].dtype)

After:

import pandas as pd

mi = pd.MultiIndex.from_arrays([pd.array([1, 2, 3], dtype="Int64")])
codes, uniques = mi.factorize()
print(uniques.dtypes.iloc[0])  # Int64 ← Extension dtype preserved

x = pd.Series([1, None], dtype='Int32').to_frame(name='col')
# This is 'Int32Dtype()' as expected
print(pd.MultiIndex.from_frame(x).to_frame()['col'].dtype)
# This is Int32Dtype()
print(pd.MultiIndex.from_frame(x).factorize()[1].to_frame().iloc[:, 0].dtype)

Performance Increase:
Some MultiIndex operations run ~10% faster due to better type consistency.

Benchmarks:

asv continuous -f 1.1 upstream/main HEAD -b ^multiindex_object

@davidjcastrejon changed the title from "BUG: Fix multiindex factorize extension dtypes (#62337)" to "BUG: Fix multiindex factorize extension dtypes" on Nov 3, 2025
Member

@mroeschke mroeschke left a comment

Thanks for the PR, but I think this works around the core issue: algorithms.factorize is being called on self._values, which is just a NumPy array for a MultiIndex.

I think MultiIndex would need to override factorize and use a custom implementation if any level has an ExtensionDtype.
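
To make the suggestion concrete, here is a rough sketch of one way a custom implementation could avoid the tuple array entirely: factorize on the integer level codes and reuse the existing levels. The helper name `factorize_by_codes` is hypothetical; it is not pandas API and not the code in this PR. It assumes a recent pandas in which MultiIndex levels can hold extension dtypes.

```python
import numpy as np
import pandas as pd


def factorize_by_codes(mi: pd.MultiIndex):
    """Hypothetical helper: factorize a MultiIndex via its integer level codes.

    Illustration only; not pandas API and not this PR's implementation.
    """
    # One row per entry, one column per level; values are positions into mi.levels.
    code_matrix = np.column_stack(mi.codes)

    # np.unique sorts the distinct rows; first_idx holds each row's first position.
    _, first_idx, inverse = np.unique(
        code_matrix, axis=0, return_index=True, return_inverse=True
    )
    inverse = np.asarray(inverse).reshape(-1)  # shape differs across NumPy versions

    # Renumber the distinct rows by order of first appearance, as factorize does.
    order = np.argsort(first_idx)
    rank = np.empty_like(order)
    rank[order] = np.arange(len(order))
    codes = rank[inverse]

    # take() reuses mi.levels, so each level keeps its dtype.
    uniques = mi.take(first_idx[order])
    return codes, uniques


mi = pd.MultiIndex.from_arrays(
    [pd.array([3, 1, 3, 2], dtype="Int64"), ["a", "b", "a", "c"]]
)
codes, uniques = factorize_by_codes(mi)
print(codes)
print(uniques.dtypes)
```

Because the values never pass through an object array of tuples, nothing needs to be restored afterwards. In this sketch, rows with missing values stay as ordinary unique rows, since a -1 level code is compared like any other code; whether that matches the desired factorize semantics would need to be decided separately.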
