@jtratner jtratner commented Sep 7, 2014

See #7269 for more. This doesn't completely resolve the speed difference, but it gets a 10-15% improvement.

Timing improvement is there, but it's small.

In [20]: %timeit test.str.match(pattern2)
10 loops, best of 3: 170 ms per loop

In [21]: %timeit test.str.extract(pattern2)
10 loops, best of 3: 317 ms per loop

In [22]: %timeit -n try_except_str_extract2(test, pattern2)
10 loops, best of 3: 265 ms per loop

@jreback - this is very minor, but if you think this needs a release note,
where should I put it?

This passed locally for me, but if Travis fails I'll fix it up.

isinstance check takes longer but accomplishes the same thing.
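
For anyone reading along, here is a minimal sketch of the two per-element strategies being compared. These are standalone illustrations: the function names are made up for this example and this is not the actual pandas internals. Both return `None` for non-string entries such as NaN; the first checks the type up front, the second lets `re` raise `TypeError` and catches it.

```python
import re

import numpy as np


def extract_isinstance(values, pattern):
    """Check the type of every element before searching (hypothetical helper)."""
    regex = re.compile(pattern)
    out = []
    for x in values:
        if not isinstance(x, str):
            # NaN, None and other non-strings are skipped without touching the regex
            out.append(None)
            continue
        m = regex.search(x)
        out.append(m.groups() if m else None)
    return out


def extract_try_except(values, pattern):
    """Search first and only handle the failure case (hypothetical helper)."""
    regex = re.compile(pattern)
    out = []
    for x in values:
        try:
            m = regex.search(x)
        except TypeError:
            # re raises TypeError when handed a float NaN or None
            out.append(None)
            continue
        out.append(m.groups() if m else None)
    return out


values = ["xxmatchthisyy", np.nan, "nomatch", None]
pattern = r"(\w*)matchthis(\w*)"
same = extract_isinstance(values, pattern) == extract_try_except(values, pattern)
```

The try/except form avoids one function call per element when the data is mostly strings, but pays for exception handling on every non-string element.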

jreback commented Sep 7, 2014

Is there a vbench for this? (I think yes; please post results.)
Release notes go in the v0.15.0 performance section.

@jreback jreback added the Performance (memory or execution speed performance) and Strings (string extension data type and string data) labels Sep 7, 2014
@jreback jreback added this to the 0.15.0 milestone Sep 7, 2014

jreback commented Sep 7, 2014

This performs much worse on the vbench. Maybe it's worth testing for NaNs a priori: if there are a lot, use the current method; if not too many, use your method.
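
A rough sketch of that up-front NaN test, assuming a 10% cutoff (an arbitrary placeholder that would need benchmarking) and made-up strategy names. The usual reasoning is that try/except is cheap when the exception is rarely raised and expensive when it is raised often.

```python
import numpy as np
import pandas as pd

# Hypothetical cutoff; the right value would have to come from benchmarks.
NAN_FRACTION_CUTOFF = 0.1


def choose_strategy(series, cutoff=NAN_FRACTION_CUTOFF):
    """Pick a per-element strategy from the share of missing values."""
    # isna() gives a boolean mask; its mean is the fraction of missing entries
    nan_fraction = series.isna().mean()
    # Many NaNs: exceptions would fire often, so keep the explicit type check.
    # Few NaNs: the try/except path rarely raises, so it should be the cheaper one.
    return "isinstance" if nan_fraction > cutoff else "try_except"


mostly_missing = pd.Series(["a", np.nan, np.nan])
no_missing = pd.Series(["a", "b", "c"])
```

Note the mask itself costs one pass over the data, so the check only pays off if the chosen path is noticeably faster.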

master

In [8]: %timeit many.str.extract(r'(\w*)matchthis(\w*)')
10 loops, best of 3: 62.6 ms per loop

0.14.1

In [6]: %timeit many.str.extract(r'(\w*)matchthis(\w*)')
10 loops, best of 3: 37.1 ms per loop

In [1]: import string

In [2]: import itertools as IT

In [3]: def make_series(letters, strlen, size):
   ...:     return Series(
   ...:         np.fromiter(IT.cycle(letters), count=size*strlen, dtype='|S1')
   ...:         .view('|S{}'.format(strlen)))
   ...:

In [4]: many = make_series('matchthis'+string.uppercase, strlen=19, size=10000) # 31% matches

In [5]: few = make_series('matchthis'+string.uppercase*42, strlen=19, size=10000) # 1% matches
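
`string.uppercase` only exists on Python 2. A rough Python 3 equivalent of the setup above might look like the sketch below. It keeps the `make_series` name but builds `str` values instead of the fixed-width bytes (`|S19`) of the original, so timings will not be directly comparable.

```python
import itertools
import string

import pandas as pd


def make_series(letters, strlen, size):
    """Cut an endlessly repeated alphabet into `size` strings of length `strlen`."""
    total = size * strlen
    # Repeat `letters` cyclically until we have exactly size * strlen characters
    flat = "".join(itertools.islice(itertools.cycle(letters), total))
    # Slice the flat string into consecutive fixed-width chunks
    return pd.Series([flat[i:i + strlen] for i in range(0, total, strlen)])


many = make_series("matchthis" + string.ascii_uppercase, strlen=19, size=10000)
few = make_series("matchthis" + string.ascii_uppercase * 42, strlen=19, size=10000)

# Fraction of elements containing the literal "matchthis"
many_fraction = many.str.contains("matchthis").mean()
few_fraction = few.str.contains("matchthis").mean()
```

`many_fraction` and `few_fraction` should land near the 31% and 1% quoted in the comments above.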
@jreback jreback modified the milestones: 0.15.1, 0.15.0 Sep 7, 2014

jtratner commented Sep 7, 2014

Sure, good call - I didn't check for a vbench. I'll play around with it.

jtratner commented Sep 7, 2014

going to close this until I actually find a way to speed this up.

@jtratner jtratner closed this Sep 7, 2014

jtratner commented Sep 7, 2014

@jreback I think the difference there has more to do with how many matches were found than with actual speed differences (when I run the two versions with the same data, I get almost imperceptible differences - not really sure why I found differences previously).

Is there a good way to write a vbench that generates the same random data each time? Can I seed the generator in the setup phase or something?

jreback commented Sep 7, 2014

Sure, put in a np.random.seed(...).

(In fact, there is an issue to do this for all of the benches, but it needs some validation.)
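
A minimal sketch of the seeding idea, assuming a made-up helper (`make_test_series`) and an arbitrary seed value. In a benchmark, the `np.random.seed(...)` call would sit at the top of the setup code so every run sees the same data.

```python
import numpy as np
import pandas as pd


def make_test_series(size=1000, strlen=10, seed=1234):
    """Build random strings reproducibly (hypothetical helper, arbitrary seed)."""
    # Seeding the global NumPy generator makes the draws below deterministic
    np.random.seed(seed)
    letters = np.array(list("abcdefghij"))
    # One row per string, one column per character
    chars = np.random.choice(letters, size=(size, strlen))
    return pd.Series(["".join(row) for row in chars])


first = make_test_series()
second = make_test_series()
identical = first.equals(second)
```

With the seed fixed, any timing difference between two code versions comes from the code and not from how many matches the random data happened to contain.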

jtratner commented Sep 7, 2014

yeah I just saw that
