

@jreback
Contributor

@jreback jreback commented Jan 24, 2017

ENH: allow hashing of MultiIndex

closes #12397

@jreback
Contributor Author

jreback commented Jan 24, 2017

cc @mrocklin
cc @jcrist

also adds hashing of MultiIndex.

@jreback
Contributor Author

jreback commented Jan 24, 2017

In [52]: i = pd.MultiIndex.from_tuples([(118, 472), (236, 118), (51, 204), (102, 51)])

In [53]: i
Out[53]:
MultiIndex(levels=[[51, 102, 118, 236], [51, 118, 204, 472]],
           labels=[[2, 3, 0, 1], [3, 1, 2, 0]])

In [55]: i.to_dataframe(index=False)
Out[55]:
     0    1
0  118  472
1  236  118
2   51  204
3  102   51

In [56]: from pandas.tools.hashing import hash_pandas_object

In [57]: hash_pandas_object(i.to_dataframe(index=False), index=False)
Out[57]:
0    11950414010286087598
1    11950414010286087598
2    10472907816967777234
3    10472907816967777234
dtype: uint64

In [58]: hash_pandas_object(i.to_dataframe(index=False), index=True)
Out[58]:
0    17404497957148711178
1     5195826631379738351
2    10365020365066803200
3    15157173997208611942
dtype: uint64

Odd that [57] can produce duplicates. Any thoughts, @mikegraham?
(In practice I don't think this matters, as we normally also include the index, which then makes these unique.)

But I have a case where I just want to uniquely hash the values (no index).
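For readers who want to repeat the comparison between [57] and [58], here is a minimal sketch. It assumes a recent pandas release, where the hashing helper is exposed as `pandas.util.hash_pandas_object` (rather than the `pandas.tools.hashing` path used in the session above) and where the conversion method is called `to_frame`. The hash values, and whether any duplicates appear, may differ from the output shown in this thread.

```python
import pandas as pd
from pandas.util import hash_pandas_object  # assumed modern location of the helper

# Same tuples as in the session above.
mi = pd.MultiIndex.from_tuples([(118, 472), (236, 118), (51, 204), (102, 51)])

# Turn the levels into columns, with a fresh 0..n-1 index.
frame = mi.to_frame(index=False)

# Hash each row from its values only, then from its values plus the index.
values_only = hash_pandas_object(frame, index=False)
with_index = hash_pandas_object(frame, index=True)

print(values_only)
print(with_index)

# Report whether any two rows received the same value-only hash.
print("duplicate value-only hashes:", values_only.duplicated().any())
```

The last line makes the question raised here checkable on whichever pandas version is installed, without relying on the numbers printed above.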

@codecov-io

codecov-io commented Jan 25, 2017

Current coverage is 86.30% (diff: 100%)

No coverage report found for master at ba05744.

Powered by Codecov. Last update ba05744...4a151c6

@jreback
Contributor Author

jreback commented Jan 25, 2017

@jorisvandenbossche ok with this? (I need to build on top of this for other things.)

@jorisvandenbossche
Member

Yes, looks good. Only thing I am wondering is the name. We already have Series.to_frame as well, so it would be nice to be consistent here (although I think I like to_dataframe more ..).

Also wondering whether this should be restricted to MultiIndex rather than being available on Index in general (but that can certainly go in a follow-up PR if we want that).

@jreback
Contributor Author

jreback commented Jan 25, 2017

sure, will change to .to_frame()

not sure if we should add this to Series, it doesn't really make sense there :>

@jorisvandenbossche
Member

> not sure if we should add this to Series, it doesn't really make sense there :>

I wrote Index, not Series (but maybe that's what you meant :-)). Index already has a to_series method, which is more logical for an Index, but the main motivation would be to avoid creating more API differences between single and multi index (e.g. so that generic code does not always have to check the number of levels of the index, cf. #3268).
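To illustrate the kind of generic code this point is about, here is a small sketch. It assumes a later pandas release in which `to_frame` is available on a plain `Index` as well as on `MultiIndex`, which was not the case when this PR was opened. The helper name `index_to_columns` is hypothetical.

```python
import pandas as pd


def index_to_columns(idx):
    """Return the index values as DataFrame columns.

    Hypothetical helper: it calls the same method for single-level and
    multi-level indexes, so no check on the number of levels is needed.
    """
    return idx.to_frame(index=False)


single = pd.Index([10, 20, 30], name="a")
multi = pd.MultiIndex.from_tuples([(1, "x"), (2, "y")], names=["a", "b"])

print(index_to_columns(single))  # one column named "a"
print(index_to_columns(multi))   # two columns named "a" and "b"
```

With a shared method name, the caller never has to branch on whether the index is a MultiIndex.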

@jreback
Contributor Author

jreback commented Jan 25, 2017

@jorisvandenbossche yeah that is a fair point, we can revisit, i'll create an issue.

AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
closes pandas-dev#12397

Author: Jeff Reback <jeff@reback.net>

Closes pandas-dev#15216 from jreback/to_dataframe and squashes the following commits:

b744fb5 [Jeff Reback] ENH: add MultiIndex.to_dataframe