
PERF: Unnecessary string interning in read_csv? #61783

Closed
@jbrockmendel

Description

Going through parsers.pyx, particularly _string_box_utf8, I'm trying to figure out what the point of the hashtable check is:

    k = kh_get_strbox(table, word)

    # in the hash table
    if k != table.n_buckets:
        # this increments the refcount, but need to test
        pyval = <object>table.vals[k]
    else:
        # box it. new ref?
        pyval = PyUnicode_Decode(word, strlen(word), "utf-8", encoding_errors)

        k = kh_put_strbox(table, word, &ret)
        table.vals[k] = <PyObject *>pyval

    result[i] = pyval
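
For reference, here is a minimal pure-Python sketch of what that hashtable buys: decode each distinct byte string once and reuse the same Python string object for every repeated cell. The dict stands in for the khash strbox table; the function name and signature are illustrative, not the actual parser API.

    # Illustrative sketch only -- not the actual parsers.pyx code.
    def box_utf8_column(raw_cells, encoding_errors="strict"):
        seen = {}      # plays the role of the khash strbox table
        result = []
        for word in raw_cells:
            pyval = seen.get(word)
            if pyval is None:
                # not seen before: box it and cache the boxed object
                pyval = word.decode("utf-8", errors=encoding_errors)
                seen[word] = pyval
            result.append(pyval)   # repeated cells share one string object
        return result

For a low-cardinality column this pays for one decode and one string allocation per distinct value rather than per row.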

This was introduced back in 2012 (a9db003). I don't see a clear reason why this isn't just

    result[i] = PyUnicode_Decode(word, strlen(word), "utf-8", encoding_errors)
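
The practical difference shows up with repeated values: without the table, every cell becomes its own freshly allocated string object. A rough way to see this (a hypothetical snippet, assuming CPython behaviour):

    cells = [b"category_a"] * 100_000

    # simplified path: decode every cell unconditionally
    boxed = [w.decode("utf-8") for w in cells]
    print(len({id(s) for s in boxed}))        # one object per cell, no sharing

    # deduplicated path: one shared object per distinct value
    # (setdefault still decodes each time; the real code checks the table first)
    cache = {}
    deduped = [cache.setdefault(w, w.decode("utf-8")) for w in cells]
    print(len({id(s) for s in deduped}))      # a single shared object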

My best guess is that it involves string interning. Prior to py37, only small strings were interned; now I think most strings up to 4096 characters are interned. Under the old system the hashtable could prevent a ton of memory allocation, and my first thought was that this may no longer be the case. But interning doesn't apply to runtime-created strings like these, so deduplicating through the hashtable may well be the original reason, and if so it is still a valid one.
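
That last point is easy to check: CPython interns identifier-like string literals at compile time, but strings created at runtime, e.g. by decoding bytes the way the parser does, are not interned unless sys.intern is called explicitly. A quick illustration (not from the issue; behaviour as on current CPython):

    import sys

    a = "performance"                     # compile-time literal, interned by CPython
    b = "performance"
    print(a is b)                         # True

    c = b"performance".decode("utf-8")    # runtime-created, like the parser's values
    print(a is c)                         # False: not interned automatically
    print(a is sys.intern(c))             # True only after explicit interning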

Does anyone have a longer memory than me on this?

Labels: Needs Triage (Issue that has not been reviewed by a pandas team member), Performance (Memory or execution speed performance)
