PERF: DataFrame.copy(deep=True) returns a view on the original pyarrow buffer #61930

@TomAugspurger

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

Over in dask/dask#12022 (comment), I'm debugging a test failure with dask and pandas 3.x that comes down to the behavior of DataFrame.copy(deep=True) with an arrow-backed extension array.

In the arrow-backed extension array's copy method (def copy(self) -> Self:), we deliberately return a shallow copy of the backing array: a new object with a view on the original buffers. For correctness, this is fine: pyarrow arrays are immutable, so copying should be unnecessary. However, it does mean that after a DataFrame.copy(deep=True), you'll still have a reference back to the original buffer. Even if the output of .copy(deep=True) is the only object still referencing the original buffer, that buffer won't be garbage collected. Consider:

    import pandas as pd
    import pyarrow as pa

    pool = pa.default_memory_pool()
    print("before", pool.bytes_allocated())  # 0

    df = pd.DataFrame({"a": ["a", "b", "c"] * 1000})
    print("df", pool.bytes_allocated())  # 27200

    del df
    print("df", pool.bytes_allocated())  # 0

    df2 = pd.DataFrame({"a": ["a", "b", "c"] * 1000})
    clone = df2.iloc[:0].copy(deep=True)
    print("df2", pool.bytes_allocated())  # 27200

    del df2
    print("after - clone", pool.bytes_allocated())  # 27200

Maybe this is fine. We can probably figure out a workaround in dask: in this case we're making an empty dataframe object as a kind of schema object, and we can probably do something other than df.iloc[:0].copy(deep=True). But perhaps pandas could consider changing the behavior here.
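One possible shape for such a workaround is to build the empty schema-like frame from the dtypes alone, rather than from a slice of the data. The helper below is hypothetical (empty_like is not a pandas or dask function), and it assumes unique column names and that a default empty index is acceptable.

```python
import pandas as pd


def empty_like(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: an empty frame with the same columns and dtypes.

    Each column is created from an empty list, so the new arrays are
    not slices of (and hold no reference to) the arrays backing `df`.
    Assumes column names are unique; the index is a fresh default one.
    """
    return pd.DataFrame(
        {name: pd.Series([], dtype=dtype) for name, dtype in df.dtypes.items()}
    )


df = pd.DataFrame({"a": ["a", "b", "c"] * 1000, "b": range(3000)})
meta = empty_like(df)
print(meta.dtypes)
```

Whether this fits dask's needs is a separate question, since it drops the original index type and any per-array state that a slice would carry along.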

The downside is that df.copy(deep=True) will become more expensive and use more memory.

Installed Versions

    In [4]: pd.show_versions()

    INSTALLED VERSIONS
    ------------------
    commit                : 962168f06d15d1aced28b414eb82909d3c930916
    python                : 3.12.8
    python-bits           : 64
    OS                    : Darwin
    OS-release            : 24.5.0
    Version               : Darwin Kernel Version 24.5.0: Tue Apr 22 19:53:27 PDT 2025; root:xnu-11417.121.6~2/RELEASE_ARM64_T6041
    machine               : arm64
    processor             : arm
    byteorder             : little
    LC_ALL                : None
    LANG                  : en_US.UTF-8
    LOCALE                : en_US.UTF-8

    pandas                : 3.0.0.dev0+2254.g962168f06d
    numpy                 : 2.4.0.dev0+git20250717.d02611a
    dateutil              : 2.9.0.post0
    pip                   : None
    Cython                : None
    sphinx                : None
    IPython               : 9.4.0
    adbc-driver-postgresql: None
    adbc-driver-sqlite    : None
    bs4                   : None
    bottleneck            : None
    fastparquet           : None
    fsspec                : 2025.7.0
    html5lib              : None
    hypothesis            : 6.136.1
    gcsfs                 : None
    jinja2                : 3.1.6
    lxml.etree            : None
    matplotlib            : None
    numba                 : None
    numexpr               : None
    odfpy                 : None
    openpyxl              : None
    psycopg2              : None
    pymysql               : None
    pyarrow               : 21.0.0
    pyiceberg             : None
    pyreadstat            : None
    pytest                : 8.4.1
    python-calamine       : None
    pytz                  : 2025.2
    pyxlsb                : None
    s3fs                  : None
    scipy                 : None
    sqlalchemy            : None
    tables                : None
    tabulate              : None
    xarray                : None
    xlrd                  : None
    xlsxwriter            : None
    zstandard             : None
    qtpy                  : None
    pyqt5                 : None

Prior Performance

No response
