-  
-   Notifications  You must be signed in to change notification settings 
- Fork 19.2k
BUG: Fix copy s.t. it always copies index/columns. #4830
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| @hayd copying will work now :) | 
   pandas/core/internals.py  Outdated    
 There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just pass ref_items=new_axes[0] if deep else None (and it will get passed to the block), the you dont' need the mod in apply
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
duh - good catch. 😄
| this works for everything but SparseSeries and SparsePanel at this point...still trying to figure them out. | 
   pandas/core/groupby.py  Outdated    
 There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback, @cpcloud or others - can't do this check any more here, because of the change in how copying works (and because more ops end up returning views rather than the same object), I think identical is okay, because it seems relatively fast. I didn't see much of a change on perf tests, but they are relatively unreliable on my computer. What do you think? I wasn't sure whether what mattered was whether it was the same object or if it was just the length...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm is is much faster than identical, even for tiny indices:
In [9]: i = Index([1, 2, 3]) In [10]: timeit i.identical(i) 1000000 loops, best of 3: 1.43 µs per loop In [11]: timeit i is i 10000000 loops, best of 3: 60.2 ns per loop Is there any way to try the is check and if that fails then use identical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cpcloud with these changes, in the entire test suite, the is check never succeeds. I was thinking of checking is on the base first, since it's a view. Would that work? Here's what I get timing wise for this:
In [3]: %timeit i is i %timeit 10000000 loops, best of 3: 95.6 ns per loop In [4]: %timeit i.base is i.base 1000000 loops, best of 3: 369 ns per loop In [5]: %timeit i.identical(i) 100000 loops, best of 3: 2.1 µs per loop Works with MultiIndex too:
In [6]: mi = MultiIndex.from_tuples(zip(range(10), range(10))) In [7]: mi.base Out[7]: array([], dtype=object) In [8]: mi2 = MultiIndex.from_tuples(zip(range(10), range(10)) ...: ) In [9]: mi2 Out[9]: MultiIndex [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (8, 8), (9, 9)] In [10]: mi2.base Out[10]: array([], dtype=object) In [11]: mi.base is mi2.base Out[11]: False In [12]: %timeit mi is mi %timeit mi.base i10000000 loops, best of 3: 95.8 ns per loop In [13]: %timeit mi.base is mi.base 1000000 loops, best of 3: 351 ns per loop In [14]: %timeit mi.identical(mi) 100000 loops, best of 3: 12.6 µs per loop In [15]: %timeit mi.base is mi2.base 1000000 loops, best of 3: 355 ns per loop Length check is not MI friendly:
In [18]: %timeit len(mi) == len(mi2) 100000 loops, best of 3: 4.96 µs per loop btw - maybe we should, at some point, optimize the length check on multiindex...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nope. base doesn't work in every case either :-(
| re identical u might try something like if a is b or a.base is b.base: ( maybe do a share memory check too) return a.identizal(b) so the actual identity case is a lot faster | 
| Problem is that bases gets changed with view. I.e.: so slight tweak on base. | 
| @jreback @cpcloud here's what I'm thinking for index, adding an   def is_(self, other): """returns True if and only if the indices actually are the same underlying array.  Generally survives through views. Doesn't care about metadata.""" if self is other: return True s_base, o_base = self.base, other.base return self is o_base or other is s_base or s_base is o_base(base calls aren't cached, which is why it needs to be setup this way) | 
   pandas/sparse/panel.py  Outdated    
 There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should these be v.copy(deep=True)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes they should
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll change.
| blah, this doesn't quite work either :-/ multiple levels of view screw up  | 
| It's going to take some thinking to decide on how best to efficiently handle flexible metadata for  | 
| i'll vbench it and report back :) | 
| if it's not good enough, we can shave off about 600-800ns per iteration by skipping the function call and using  | 
| Is this the right behavior?  | 
| @jtratner yep | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so then, should all of these places actually make views on the index? Less sure about sparse...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in this case what's the different? whether I assign an index or its view?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just whether you want to be creating a view every time. Minimal difference (if at all).
Only copies index/columns with `deep=True` on `BlockManager`. Plus some tests...yay! Change copy to make views of indices. Requires changing groupby to check whether indices are identical, rather than comparing with `is`. Plus add tests for name checks through everything. Also, fixes tests to use `is_()` rather than `is` CLN: Change groupby to use 'is_' instead
| ok | 
| yep | 
BUG: Fix copy s.t. it always copies index/columns.
Fixes #4202 (and maybe some others too).
Only copies index/columns with
deep=TrueonBlockManager. Plus sometests...yay!