Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this issue exists on the latest version of pandas.
- I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
Over in dask/dask#12022 (comment), I'm debugging a test failure with dask and pandas 3.x that comes down to the behavior of `DataFrame.copy(deep=True)` with an arrow-backed extension array.
The relevant method is `def copy(self) -> Self:` in pandas/pandas/core/arrays/arrow/array.py (line 1092 in 628c7fb). After `DataFrame.copy(deep=True)`, you'll still have a reference back to the original buffer. As long as the output of `.copy(deep=True)` holds that reference, even if it is the only object that does, the original buffer won't be garbage collected. Consider:

```python
import pandas as pd
import pyarrow as pa

pool = pa.default_memory_pool()
print("before", pool.bytes_allocated())  # 0

# Baseline: the buffer is released once the only DataFrame is deleted.
df = pd.DataFrame({"a": ["a", "b", "c"] * 1000})
print("df", pool.bytes_allocated())  # 27200
del df
print("df", pool.bytes_allocated())  # 0

# An empty deep copy keeps the full original buffer alive.
df2 = pd.DataFrame({"a": ["a", "b", "c"] * 1000})
clone = df2.iloc[:0].copy(deep=True)
print("df2", pool.bytes_allocated())  # 27200
del df2
print("after - clone", pool.bytes_allocated())  # 27200
```
Maybe this is fine. We can probably figure out some workaround in dask: in this case we're making an empty dataframe object as a kind of schema object, and we can probably do something other than `df.iloc[:0].copy(deep=True)`. But perhaps pandas could consider changing the behavior here.
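One possible shape for such a workaround is to build the empty frame from the dtypes alone, so that no slice of the original data is involved. This is a rough sketch, not what dask does: `empty_like` is a hypothetical helper, it assumes unique column names, and it does not preserve the original index type.

```python
import pandas as pd


def empty_like(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: zero-row frame with the same columns and dtypes.

    Each column is created from an empty list, so the new arrays are
    freshly allocated rather than sliced from the original data.
    Assumes column names are unique.
    """
    columns = {
        name: pd.Series([], dtype=dtype) for name, dtype in df.dtypes.items()
    }
    return pd.DataFrame(columns)


source = pd.DataFrame({"a": ["a", "b", "c"] * 1000, "b": range(3000)})
meta = empty_like(source)
print(meta.dtypes)
```

Dtypes that carry data of their own (for example a categorical's categories) could still reference existing arrays, so this is not a guarantee for every dtype.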
The downside is that `df.copy(deep=True)` will become more expensive and use more memory.
Installed Versions
```
In [4]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit                : 962168f06d15d1aced28b414eb82909d3c930916
python                : 3.12.8
python-bits           : 64
OS                    : Darwin
OS-release            : 24.5.0
Version               : Darwin Kernel Version 24.5.0: Tue Apr 22 19:53:27 PDT 2025; root:xnu-11417.121.6~2/RELEASE_ARM64_T6041
machine               : arm64
processor             : arm
byteorder             : little
LC_ALL                : None
LANG                  : en_US.UTF-8
LOCALE                : en_US.UTF-8

pandas                : 3.0.0.dev0+2254.g962168f06d
numpy                 : 2.4.0.dev0+git20250717.d02611a
dateutil              : 2.9.0.post0
pip                   : None
Cython                : None
sphinx                : None
IPython               : 9.4.0
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : None
bottleneck            : None
fastparquet           : None
fsspec                : 2025.7.0
html5lib              : None
hypothesis            : 6.136.1
gcsfs                 : None
jinja2                : 3.1.6
lxml.etree            : None
matplotlib            : None
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : None
psycopg2              : None
pymysql               : None
pyarrow               : 21.0.0
pyiceberg             : None
pyreadstat            : None
pytest                : 8.4.1
python-calamine       : None
pytz                  : 2025.2
pyxlsb                : None
s3fs                  : None
scipy                 : None
sqlalchemy            : None
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
xlsxwriter            : None
zstandard             : None
qtpy                  : None
pyqt5                 : None
```
Prior Performance
No response