Description
This is coming out of a discussion that has stalled #22225 (which is about adding .set_index to Series, see #21684). The discussion has shifted away from what capabilities a putative Series.set_index should have, to what capabilities df.set_index currently has.
The main issue (for @jreback) is that df.set_index takes arrays:
@jreback: There were several attempts to have DataFrame.set_index take an array as well, but these never got off the ground.
@h-vetinari: I'm not sure when, but they certainly did get off the ground:
    >>> import pandas as pd
    >>> import numpy as np
    >>> pd.__version__
    '0.23.4'
    >>>
    >>> df = pd.DataFrame(np.random.randint(0, 10, (4, 4)), columns=list('abcd'))
    >>> df.set_index(['a',            # label
    ...               df.index,       # Index
    ...               df.b ** 2,      # Series
    ...               df.b.values,    # ndarray
    ...               list('ABCD'),   # list
    ...               'c'])           # label again
                   b  d
    a   b    c
    0 0 0  2 A 1   0  2
    8 1 1  4 B 4   1  4
    3 2 25 5 C 8   5  5
    0 3 9  7 D 2   3  7

Further on:
@jreback: @h-vetinari you are confusing the purpose of .set_axis. [...] The problem with .set_index on a DataFrame with an array is that it technically can work with an array and not keys. (meaning its not unambiguous)
I don't think I am confusing them. If I want to set the .index-attribute of a Series/DataFrame, then using .set_index is the most reasonable name by far. If anything, set_axis should be a superset of set_index (and a putative set_columns), that just switches between the two based on the axis-kwarg.
More than that, the current capabilities of df.set_index are a proper superset of df.set_axis(axis=0)**, in that it's possible to fill keys with only Series/Index/ndarray/list etc.:
    >>> df.set_index(pd.Index(df.a))  # same result as Series directly below
    >>> df.set_index(df.a)
       a  b  c  d
    a
    0  0  0  1  2
    8  8  1  4  4
    3  3  5  8  5
    0  0  3  2  7
    >>> df.set_index(df.a.values)  # same result as list directly below
    >>> df.set_index([[0, 8, 3, 0]])
       a  b  c  d
    0  0  0  1  2
    8  8  1  4  4
    3  3  5  8  5
    0  0  3  2  7

** There is one caveat: lists (and only lists, out of all containers) need to be wrapped in another list, i.e. df.set_index([[0, 8, 3, 0]]) instead of df.set_index([0, 8, 3, 0]). This is the heart of the ambiguity that @jreback mentioned above (because a list is interpreted as a list of column keys).
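To make the list caveat concrete, here is a minimal sketch using a small fixed frame (the values are chosen for illustration and are not the random frame from the session above). It assumes the behaviour described in this issue: a flat list is read as a list of column keys, while a list nested inside the outer list is read as an array of index values.

```python
import pandas as pd

# Small deterministic frame, for illustration only.
df = pd.DataFrame({"a": [0, 8, 3, 0], "b": [0, 1, 5, 3]})

# Flat list of labels: interpreted as column keys, giving a MultiIndex
# built from columns "a" and "b" (both are dropped from the columns).
by_keys = df.set_index(["a", "b"])

# Nested list: the inner list is interpreted as an array of index values,
# so no column is consumed.
by_values = df.set_index([[0, 8, 3, 0]])

# Flat list of values: each element is looked up as a column key,
# which fails because 0, 8 and 3 are not column labels.
try:
    df.set_index([0, 8, 3, 0])
    flat_list_error = None
except KeyError as exc:
    flat_list_error = exc
```

The same four numbers therefore either become the index or raise a KeyError, depending only on whether they are wrapped in an extra list.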
Summing up:
- set_index is the most natural name for setting the .index-attribute
- df.set_index should be able to process list-likes (as it currently does; this is the source of the ambiguity of the list case).
- df.set_axis should be able to do everything that df.set_index does, and just switch between operating on index/columns based on the axis-kwarg (after all, index and columns are the two axes of a DF).
  - it could be considered to add a method set_columns on a DataFrame
  - The axis-kwarg of set_axis should just switch between the behaviour of set_index (i.e. dealing with keys and array-likes) and set_columns.
- Series.set_index should support the same signature as df.set_index, with the exception of the drop-keyword (which only makes sense for column labels).
- For Series, the set_index and set_axis methods should be exactly the same.
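The proposed relationship between set_axis, set_index and set_columns could be sketched roughly as follows. Both set_axis_sketch and set_columns_sketch are hypothetical helper names made up for this illustration; neither exists in pandas. The column branch is also simplified to replacing the labels with an array-like, which is narrower than the keys-and-array-likes behaviour suggested above.

```python
import pandas as pd


def set_columns_sketch(df, labels):
    """Hypothetical stand-in for a DataFrame.set_columns method."""
    out = df.copy()
    out.columns = pd.Index(labels)  # simplified: array-likes only
    return out


def set_axis_sketch(df, keys, axis=0, **kwargs):
    """Hypothetical set_axis that only dispatches on the axis argument."""
    if axis in (0, "index"):
        # Reuse everything df.set_index can already do (keys and array-likes).
        return df.set_index(keys, **kwargs)
    if axis in (1, "columns"):
        return set_columns_sketch(df, keys)
    raise ValueError(f"No axis named {axis!r}")


df = pd.DataFrame({"a": [0, 8, 3, 0], "b": [0, 1, 5, 3]})

indexed_by_key = set_axis_sketch(df, "a")                    # column key
indexed_by_array = set_axis_sketch(df, df["b"].values ** 2)  # ndarray
relabelled = set_axis_sketch(df, ["x", "y"], axis=1)         # new column labels
```

Under this sketch, the axis=0 branch is exactly df.set_index, so anything accepted there (labels, Series, Index, ndarray, nested lists) carries over unchanged.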
Since I can't tag @pandas-dev/pandas-core, here are a few individual tags: @jreback @TomAugspurger @jorisvandenbossche @gfyoung @WillAyd @jbrockmendel @jschendel @toobaz.
EDIT: Forgot to add an xref from @jreback:
@h-vetinari we had quite some discussion about this: #14829
and never reached resolution. This is an API question.
In that issue, there's discussion largely around .rename, and how to make that method more consistent. Also discussed was potentially introducing .relabel, as well as .set_columns.