Wrong array write when writing chunked array from numpy string data with "F" order #3558

@dtonagel

Zarr version

v3.1.3

Numcodecs version

v0.16.3

Python Version

3.12.7, 3.13.9

Operating System

Windows

Installation

pip into virtual environment

Description

I have a pandas DataFrame storing string data with the string[pyarrow] dtype.

Converting this data to a NumPy array with either "df.values" or "df.to_numpy()" produces a NumPy "object" array (dtype == "O").

When this array is stored in a chunked zarr array, there seems to be a strange "off-by-one" error, either when writing or when reading: the array read back from zarr differs from the original array.

When the DataFrame holds normal (Python) strings instead, this does not happen, even though the resulting NumPy array looks exactly the same.
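One property that dtype, shape, and element-wise equality checks do not cover is memory layout, which the issue title points to ("F" order). The following is a minimal NumPy-only sketch of that idea; it does not use pandas or zarr, and the names `c_ordered`, `f_ordered`, and `normalized` are made up for illustration. Whether the array from the pyarrow-backed DataFrame is actually F-ordered would need to be confirmed by inspecting its `.flags`, and whether normalizing the layout avoids the problem in zarr is an untested assumption.

```python
import numpy as np

# Same incrementing labels as in the report, as an 11 x 11 object array.
labels = [f"S{i:05}" for i in range(1, 11 * 11 + 1)]
c_ordered = np.array(labels, dtype=object).reshape(11, 11)  # row-major (C) layout
f_ordered = np.asfortranarray(c_ordered)                    # column-major (F) copy

# The usual comparisons cannot tell the two arrays apart ...
same_values = bool((c_ordered == f_ordered).all())
print(same_values, c_ordered.dtype == f_ordered.dtype, c_ordered.shape == f_ordered.shape)

# ... but the layout flags can.
print(c_ordered.flags.c_contiguous, f_ordered.flags.c_contiguous)
print(f_ordered.flags.f_contiguous)

# Possible workaround (assumption, not verified against zarr):
# force a C-contiguous copy before writing.
normalized = np.ascontiguousarray(f_ordered)
print(normalized.flags.c_contiguous)
```

Applied to the reproducer, the same check would be `data.flags` versus `datapa.flags`.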

Steps to reproduce

# /// script
# requires-python = ">=3.12"
# dependencies = [
#   "zarr@git+https://github.com/zarr-developers/zarr-python.git@main",
#   "pandas",
#   "pyarrow",
# ]
# ///
#
# This script automatically imports the development branch of zarr to check for issues

import zarr
# your reproducer code
import numpy as np
import pandas as pd

# Number of rows and columns
num_rows = 11
num_cols = 11

# Generate incrementing strings
incrementing_strings = [f"S{i:05}" for i in range(1, num_rows * num_cols + 1)]
npdata = np.array(incrementing_strings, dtype="str").reshape(num_rows, num_cols)

df = pd.DataFrame(npdata)
dfpa = df.astype("string[pyarrow]")

data = df.values
datapa = dfpa.values

# According to numpy, the converted data is identical:
assert data.dtype == datapa.dtype  # dtype == 'O'
assert data.shape == datapa.shape
assert data.size == datapa.size
assert np.all(data == datapa)

# Now store both in a zarray (using the same array here but doesn't matter if I use a fresh one)
store = zarr.storage.MemoryStore()  # Or LocalStore, doesn't matter
array = zarr.create_array(
    store,
    name="s",
    shape=data.shape,
    chunks=(10,10),
    fill_value="<NA>",
    dtype=str,
)

# This works as expected
array[:,:] = data
zdata = array[:,:]
assert np.all(zdata == data)

# This doesn't
array[:,:] = datapa
zdata = array[:,:]
print(f"written={datapa[0,:4]}")  # ['S00001' 'S00002' 'S00003' 'S00004']
print(f"read={zdata[0,:4]}")  # ['S00001' 'S00012' 'S00023' 'S00034']
assert np.all(zdata == datapa)  # Fails
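The values printed by the reproducer step by 11 (S00001, S00012, S00023, S00034), which is what one would see if a chunk's elements were laid out column by column and then read back row by row. The pure-Python sketch below only illustrates that index arithmetic for the first 10 x 10 chunk; it is a hypothesis consistent with the printed output and the "F" order in the title, not a confirmed description of what zarr does internally. The names `grid`, `flat_f`, and `misread` are illustrative.

```python
num_rows, num_cols = 11, 11
chunk_rows, chunk_cols = 10, 10

# Row-major grid of labels, matching the reproducer's data.
grid = [
    [f"S{r * num_cols + c + 1:05}" for c in range(num_cols)]
    for r in range(num_rows)
]

# Flatten the first chunk column by column (Fortran / "F" order).
flat_f = [grid[r][c] for c in range(chunk_cols) for r in range(chunk_rows)]

# Reinterpret that flat buffer as if it were row-major (C order).
misread = [
    flat_f[r * chunk_cols:(r + 1) * chunk_cols]
    for r in range(chunk_rows)
]

print("written:", grid[0][:4])     # ['S00001', 'S00002', 'S00003', 'S00004']
print("misread:", misread[0][:4])  # ['S00001', 'S00012', 'S00023', 'S00034']
```

The misread first row equals the first column of the original chunk, so the effect looks like a per-chunk transposition rather than a shift by one element.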

Additional output

No response

Labels

bug (Potential issues with the zarr-python library)