@anthonyho anthonyho commented Mar 4, 2017

Added new keyword parameters to DataFrame.corrwith() that allow correlation methods other than Pearson to be used. See #9490.
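As a rough illustration of what a method option on corrwith means, here is a minimal sketch that correlates matching columns one at a time. The helper name `corrwith_by_column` is hypothetical and is not code from this PR; it assumes both frames share the same index and relies only on `Series.corr`, which already accepts a `method` argument.

```python
import numpy as np
import pandas as pd


def corrwith_by_column(left, right, method="pearson"):
    """Hypothetical helper: correlate matching columns of two DataFrames."""
    # Only columns present in both frames can be compared.
    common = left.columns.intersection(right.columns)
    return pd.Series(
        {col: left[col].corr(right[col], method=method) for col in common}
    )


rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.standard_normal((5, 4)), columns=list("ABCD"))
df2 = pd.DataFrame(rng.standard_normal((5, 4)), columns=list("ABCD"))

# Default method: Pearson, which is what corrwith computes today.
pearson = corrwith_by_column(df1, df2)

# Spearman is Pearson applied to ranks, so it can be shown without
# passing method='spearman' (which needs SciPy in Series.corr).
spearman = corrwith_by_column(df1.rank(), df2.rank())
```

Passing `method='kendall'` or `method='spearman'` to the helper would forward the choice to `Series.corr`, which is the behaviour this PR proposes to expose on `DataFrame.corrwith()` directly.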

df2 = pd.DataFrame(np.random.randn(4, 4), index=index[:4], columns=columns)
df1.corrwith(df2)
df2.corrwith(df1, axis=1)
df2.corrwith(df1, axis=1, method='kendall')
Contributor

add versionadded tag (and small comment here)


correl = num / dom
correl = Series({col: nanops.nancorr(left[col].values,
right[col].values,
Contributor

this is going to be very slow

we need to rework nancorr to do this instead

Author

@anthonyho anthonyho Mar 5, 2017

I think the new implementation (which calls nancorr, which in turn calls the numpy/scipy correlation functions) is actually significantly faster than the current implementation (which manually computes the Pearson correlation using DataFrame.mean(), DataFrame.sum(), and DataFrame.std()).

For example:

Current implementation:

>>> import pandas as pd; import timeit
>>> pd.__version__
u'0.19.2'
>>> iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
>>> timeit.timeit(lambda: iris.corrwith(iris), number=10000)
50.891642808914185
>>> timeit.timeit(lambda: iris.T.corrwith(iris.T), number=10000)
42.0677649974823

New implementation:

>>> import pandas as pd; import timeit
>>> pd.__version__
'0.19.0+539.g0b77680.dirty'
>>> iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
>>> timeit.timeit(lambda: iris.corrwith(iris, method='pearson'), number=10000)
28.622286081314087
>>> timeit.timeit(lambda: iris.T.corrwith(iris.T, method='pearson'), number=10000)
21.898916959762573

I'm pretty new to this, so please let me know if I'm missing anything here.
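For readers following the thread, the two approaches being compared can be sketched roughly as below. This is an illustration under assumptions, not the code in the diff: the function names are hypothetical, both frames are assumed to share the same index and columns, and `np.corrcoef` stands in for whatever 1-D function nancorr dispatches to.

```python
import numpy as np
import pandas as pd


def corrwith_vectorized(left, right):
    """Pearson only: whole-frame arithmetic with mean(), sum(), and std()."""
    # Propagate NaNs both ways so each column pair uses the same rows.
    left = left + right * 0
    right = right + left * 0
    ldem = left - left.mean()
    rdem = right - right.mean()
    num = (ldem * rdem).sum()
    dom = (left.count() - 1) * left.std() * right.std()
    return num / dom


def corrwith_per_column(left, right, corr_func):
    """Any method: call a 1-D correlation function once per column."""
    out = {}
    for col in left.columns.intersection(right.columns):
        x, y = left[col].to_numpy(), right[col].to_numpy()
        valid = ~(np.isnan(x) | np.isnan(y))  # drop pairs with a missing value
        out[col] = corr_func(x[valid], y[valid])
    return pd.Series(out)


def pearson(x, y):
    # Stand-in for the function a nancorr-style helper would look up.
    return np.corrcoef(x, y)[0, 1]


rng = np.random.default_rng(1)
a = pd.DataFrame(rng.standard_normal((150, 4)), columns=list("wxyz"))
b = pd.DataFrame(rng.standard_normal((150, 4)), columns=list("wxyz"))

fast = corrwith_vectorized(a, b)
loop = corrwith_per_column(a, b, pearson)
```

The per-column version makes one Python-level call per column, so its cost grows with the number of columns rather than the number of rows. A four-column frame like iris does not exercise that case.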

Contributor

look thru the benchmarks and pls add some asv as appropriate

include wide and tall data

on wide data this will be slower
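A benchmark along these lines might look like the sketch below. It follows the usual asv conventions (a class with `params`, `param_names`, `setup`, and `time_*` methods), but the class name, the frame sizes, and the seed are assumptions chosen for illustration, not values from the pandas benchmark suite.

```python
import numpy as np
import pandas as pd


class CorrWith:
    """Hypothetical asv benchmark covering tall and wide frames."""

    # asv runs every time_* method once per parameter value.
    params = ["tall", "wide"]
    param_names = ["shape"]

    def setup(self, shape):
        rng = np.random.default_rng(0)
        # Tall: many rows, few columns. Wide: few rows, many columns.
        nrows, ncols = (20_000, 10) if shape == "tall" else (20, 2_000)
        self.df1 = pd.DataFrame(rng.standard_normal((nrows, ncols)))
        self.df2 = pd.DataFrame(rng.standard_normal((nrows, ncols)))

    def time_corrwith(self, shape):
        self.df1.corrwith(self.df2)
```

Running this on both the base branch and the PR branch would show whether the wide case regresses, which the iris timings above cannot reveal.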

Contributor

jreback commented Apr 3, 2017

can you update

Contributor

jreback commented May 7, 2017

can you rebase, add some benchmarks to asv and show them.

@jreback jreback added the Numeric Operations (Arithmetic, Comparison, and Logical operations) and Enhancement labels May 7, 2017
Contributor

jreback commented Jun 10, 2017

can you rebase and update?

Contributor

jreback commented Aug 17, 2017

closing as stale

@jreback jreback closed this Aug 17, 2017