extract runs much slower than match with multiple groups #7269

@jhorowitz-coursera

The regex-group-extraction functionality of match is being replaced by extract, but extract runs much slower when multiple groups are being extracted.

Here is some test code:

    import pandas as pd
    from datetime import datetime
    from pandas.util.print_versions import show_versions

    show_versions()

    test = pd.Series(['here is some sample text' for x in range(100000)])

    def test_regex(pattern):
        now = datetime.now()
        match_result = test.str.match(pattern)
        print "Using match:", datetime.now() - now

        now = datetime.now()
        extract_result = test.str.extract(pattern)
        print "Using extract:", datetime.now() - now

    print "SINGLE GROUP"
    test_regex('.*some (.).*')

    print "MULTIPLE GROUPS"
    test_regex('.*some (.)(.).*')

On my machine (running pandas: 0.13.0rc1-64-gceec8bf), this reports:

    SINGLE GROUP
    Using match: 0:00:00.090317
    Using extract: 0:00:00.116123
    MULTIPLE GROUPS
    Using match: 0:00:00.094432
    Using extract: 0:00:15.857041
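To help tell whether the time goes into regex matching or into assembling the result, it may be useful to time a plain `re` loop as a baseline next to `str.extract`. The following is only a sketch, assuming Python 3 and a pandas version where `str.extract` accepts `expand=True`; the helpers `extract_groups` and `time_call` are hypothetical names introduced here, not part of pandas, and the series is shortened to 1000 rows to keep the run brief.

```python
import re
import time

import pandas as pd


def extract_groups(series, pattern):
    """Hypothetical baseline: collect regex groups with plain `re`, one row per string."""
    regex = re.compile(pattern)
    empty = (None,) * regex.groups  # placeholder row for strings that do not match
    rows = []
    for value in series:
        m = regex.match(value)
        rows.append(m.groups() if m else empty)
    # Build the frame once from all rows instead of growing it row by row.
    return pd.DataFrame(rows, index=series.index)


def time_call(func, *args, **kwargs):
    """Hypothetical helper: return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


sample = pd.Series(["here is some sample text"] * 1000)
pattern = ".*some (.)(.).*"

baseline, baseline_seconds = time_call(extract_groups, sample, pattern)
extracted, extract_seconds = time_call(sample.str.extract, pattern, expand=True)

print("re baseline:", baseline_seconds)
print("str.extract:", extract_seconds)
```

If the baseline stays fast while `str.extract` grows sharply with the number of groups, that would point at how the result is constructed rather than at the regex engine.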

Labels: Performance (memory or execution speed performance), Strings (string extension data type and string data)
