1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.23.0.txt
@@ -557,3 +557,4 @@ Other
^^^^^

- Improved error message when attempting to use a Python keyword as an identifier in a ``numexpr`` backed query (:issue:`18221`)
- Bug in ``Series.memory_usage`` which assumes the series always has at least one element (:issue:`19368`)
3 changes: 2 additions & 1 deletion pandas/_libs/lib.pyx
@@ -67,7 +67,8 @@ def memory_usage_of_objects(ndarray[object, ndim=1] arr):
cdef int64_t s = 0

n = len(arr)
for i from 0 <= i < n:
# Hrm why was this 0 <= i < n
Contributor:

why do you think this is the issue?

Contributor:

This is ok. I think our definition of memory_usage should be slightly updated (in core/base.py), as it's meant for pure object dtypes, not sparse.

Contributor Author:

Should it be updated?

My understanding is that memory_usage for a SparseSeries should equal the memory_usage of the index plus the memory_usage of the SparseArray.
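
For illustration, a rough sketch of the relationship described above, assuming the 0.23-era SparseSeries API (.index, .sp_values); this is not pandas' actual implementation, just the two pieces the author says should be summed.

import pandas as pd

ss = pd.SparseSeries(['a', 'b', 'c'], fill_value='a')

# Memory held by the index, plus the buffer of the non-fill ("dense") values.
index_bytes = ss.index.memory_usage(deep=True)
values_bytes = ss.sp_values.nbytes
total = index_bytes + values_bytes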

The original bug is honestly an edge case, because if you have at least one dense element it doesn't fail:

import pandas as pd

ss = pd.SparseSeries(['a', 'b', 'c'], fill_value='a')
ss.memory_usage(deep=True)  #=> No segfault because the dense elements have len > 0

ss = pd.SparseSeries(['a', 'a', 'a'], fill_value='a')
ss.memory_usage(deep=True)  #=> Segfault because the dense elements have len = 0

I'm not sure core/base.py should be updated. The problem with doing that is that I would have to do a length check before handing things off to the lower-level pyx file. This fix gets rid of the possible segfault caused by accessing memory that isn't there.
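
For context, a hypothetical sketch of the alternative being discussed: a length guard in core/base.py before calling into the Cython helper. The method body and the imports are assumptions modelled on 0.23-era internals, not code from this PR.

from pandas._libs import lib
from pandas.core.dtypes.common import is_object_dtype

# Hypothetical method body for core/base.py -- a sketch, not this PR's change.
def memory_usage(self, deep=False):
    v = self.values.nbytes
    # The guard discussed above: never hand an empty object array to the
    # pyx helper, so it cannot read memory that isn't there.
    if deep and is_object_dtype(self) and len(self.values) > 0:
        v += lib.memory_usage_of_objects(self.values)
    return v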

Contributor Author:

Also, I meant to take out my thought-process comment ;). I'll do that now.

for i from 0 < i < n:
s += arr[i].__sizeof__()
return s
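
For reference, a plain-Python sketch of what this helper computes (not the Cython code itself): it sums __sizeof__() over every element, so a loop whose lower bound excludes 0 would leave out the first object.

import numpy as np

def memory_usage_of_objects_py(arr):
    # Pure-Python equivalent of the helper above: total __sizeof__ of each element.
    total = 0
    for i in range(len(arr)):  # 0 <= i < n
        total += arr[i].__sizeof__()
    return total

memory_usage_of_objects_py(np.array(['a', 'bb', 'ccc'], dtype=object))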

11 changes: 11 additions & 0 deletions pandas/tests/sparse/series/test_series.py
@@ -971,6 +971,17 @@ def test_combine_first(self):
tm.assert_sp_series_equal(result, result2)
tm.assert_sp_series_equal(result, expected)

@pytest.mark.parametrize('deep,fill_values', [([True, False],
[0, 1, np.nan, None])])
def test_memory_usage_deep(self, deep, fill_values):
for fv in fill_values:
Contributor:

use product to do this rather than have an embedded loop (a sketch follows the test below)

sparse_series = SparseSeries(fill_values, fill_value=fv)
dense_series = Series(fill_values)
sparse_usage = sparse_series.memory_usage(deep=deep)
dense_usage = dense_series.memory_usage(deep=deep)

assert sparse_usage < dense_usage
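
Below is a minimal sketch of the product-based parametrization the reviewer suggests, assuming pytest expands each (deep, fill_value) pair produced by itertools.product into its own test case, and reusing the SparseSeries/Series names already imported in test_series.py; it is one possible shape, not the committed test.

from itertools import product

import numpy as np
import pytest


@pytest.mark.parametrize('deep,fill_value',
                         product([True, False], [0, 1, np.nan, None]))
def test_memory_usage_deep(self, deep, fill_value):
    # Each combination becomes its own test case instead of an embedded loop.
    values = [0, 1, np.nan, None]
    sparse_series = SparseSeries(values, fill_value=fill_value)
    dense_series = Series(values)
    assert (sparse_series.memory_usage(deep=deep) <
            dense_series.memory_usage(deep=deep))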


class TestSparseHandlingMultiIndexes(object):
