Open
Labels
bug: Potential issues with the zarr-python library
Description
Zarr version
v3.1.3
Numcodecs version
v0.16.3
Python Version
3.12.7, 3.13.9
Operating System
Windows
Installation
pip into virtual environment
Description
I have a pandas DataFrame that stores string data with the `string[pyarrow]` dtype.
Converting this data to a NumPy array with either `df.values` or `df.to_numpy()` produces a NumPy object array (`dtype == "O"`).
When I store this array in a chunked zarr array, the array read back from zarr differs from the original. It looks like a strange "off-by-one" error, either when writing or when reading.
When the DataFrame holds ordinary Python strings instead, this does not happen, even though the resulting NumPy array looks exactly the same.
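One observation that may help triage: in the reproducer's output, the first row read back is `S00001, S00012, S00023, S00034`, which steps by 11, the row length. That pattern is what you would see if a column-major buffer were interpreted as row-major. The sketch below is only an illustration of that pattern in pure Python; it does not involve zarr, pandas, or NumPy, and whether a memory-layout mix-up is the actual cause is an assumption that this issue does not confirm. The names `rows`, `column_major`, and `misread` are made up for the illustration.

```python
# Pure-Python illustration of a column-major buffer read as row-major.
# Hypothesis only: this does not exercise zarr or pandas.
num_rows, num_cols = 11, 11

# Same labels as in the reproducer, laid out as a list of rows.
labels = [f"S{i:05}" for i in range(1, num_rows * num_cols + 1)]
rows = [labels[r * num_cols:(r + 1) * num_cols] for r in range(num_rows)]

# Flatten column by column (column-major order).
column_major = [rows[r][c] for c in range(num_cols) for r in range(num_rows)]

# Reinterpret that flat buffer as if it had been written row by row.
misread = [
    column_major[r * num_cols:(r + 1) * num_cols] for r in range(num_rows)
]

print(rows[0][:4])     # what was written
print(misread[0][:4])  # what a layout mix-up would return
```

If this hypothesis is worth pursuing, comparing `data.flags` and `datapa.flags` in the reproducer would show whether the two object arrays differ in memory layout despite comparing equal.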
Steps to reproduce
```python
# /// script
# requires-python = ">=3.12"
# dependencies = [
#   "zarr@git+https://github.com/zarr-developers/zarr-python.git@main",
#   "pandas",
#   "pyarrow",
# ]
# ///
#
# This script automatically imports the development branch of zarr to check for issues

import zarr
# your reproducer code
import numpy as np
import pandas as pd

# Number of rows and columns
num_rows = 11
num_cols = 11

# Generate incrementing strings
incrementing_strings = [f"S{i:05}" for i in range(1, num_rows * num_cols + 1)]
npdata = np.array(incrementing_strings, dtype="str").reshape(num_rows, num_cols)

df = pd.DataFrame(npdata)
dfpa = df.astype("string[pyarrow]")

data = df.values
datapa = dfpa.values

# According to numpy, the converted data is identical:
assert data.dtype == datapa.dtype  # dtype == 'O'
assert data.shape == datapa.shape
assert data.size == datapa.size
assert np.all(data == datapa)

# Now store both in a zarray (using the same array here but doesn't matter if I use a fresh one)
store = zarr.storage.MemoryStore()  # Or LocalStore, doesn't matter
array = zarr.create_array(
    store,
    name="s",
    shape=data.shape,
    chunks=(10, 10),
    fill_value="<NA>",
    dtype=str,
)

# This works as expected
array[:, :] = data
zdata = array[:, :]
assert np.all(zdata == data)

# This doesn't
array[:, :] = datapa
zdata = array[:, :]
print(f"written={datapa[0,:4]}")  # ['S00001' 'S00002' 'S00003' 'S00004']
print(f"read={zdata[0,:4]}")  # ['S00001' 'S00012' 'S00023' 'S00034']
assert np.all(zdata == datapa)  # Fails
```
Additional output
No response