How to maximize performance to convert a stack of netcdf files (ERA5) to a zarr store? #3409
-
The same question was asked about 7 years ago on Stack Overflow, yet no clear answer/solution was provided. Running out of ideas on how to solve this problem, I am reaching out directly to the Zarr dev community.

I have a bunch of netCDF files, each representing a year of ERA5 data for the variable of choice and weighing about 280 MB. Implementing Zarr in the subsequent workflow made a substantial improvement, so I would like to convert my entire dataset to a Zarr store on my server (256 CPUs: 2 AMD EPYC 7742 64-core processors with hyper-threading; 1 TiB of memory; a fast, local NVMe disk for scratch use; and an Nvidia Quadro RTX 6000 card with 24 GB of GPU memory and 4608 CUDA cores), so there is no lack of resources.

Using the following code to convert 10 years' worth of netCDF files to a Zarr store, I came to the conclusion that I am underusing the machine's resources.

```python
from dask.distributed import LocalCluster
import xarray as xr

cluster = LocalCluster()  # I have attempted multiple settings, not changing much

# Open 10 years' worth of data (about 3 GB), rechunk it, and store it to zarr (no compression)
ds = xr.open_mfdataset('inputs/climate/yearly/PLEV_201*.nc', parallel=True)
ds = ds.chunk(chunks={'time': 10000, 'longitude': 3, 'latitude': 3, 'level': 13})
ds.to_zarr('test.zarr', mode='w')
```

So, within the logic and options I use, would you have pointers as to what may be improved, or even some alternatives for carrying out this conversion? Thanks.

A sample of the dataset prior to rechunking:

```
In [8]: ds
Out[8]:
<xarray.Dataset> Size: 2GB
Dimensions:    (time: 87648, level: 13, latitude: 10, longitude: 8)
Coordinates:
  * time       (time) datetime64[ns] 701kB 2010-01-01 ... 2019-12-31T23:00:00
  * longitude  (longitude) float64 64B -78.44 -78.19 -77.94 ... -76.94 -76.69
  * latitude   (latitude) float64 80B -8.239 -8.489 -8.739 ... -10.24 -10.49
  * level      (level) float64 104B 400.0 450.0 500.0 ... 900.0 950.0 1e+03
Data variables:
    z          (time, level, latitude, longitude) float32 365MB dask.array<chunksize=(1, 1, 10, 8), meta=np.ndarray>
    t          (time, level, latitude, longitude) float32 365MB dask.array<chunksize=(1, 1, 10, 8), meta=np.ndarray>
    u          (time, level, latitude, longitude) float32 365MB dask.array<chunksize=(1, 1, 10, 8), meta=np.ndarray>
    v          (time, level, latitude, longitude) float32 365MB dask.array<chunksize=(1, 1, 10, 8), meta=np.ndarray>
    q          (time, level, latitude, longitude) float32 365MB dask.array<chunksize=(1, 1, 10, 8), meta=np.ndarray>
    r          (time, level, latitude, longitude) float32 365MB dask.array<chunksize=(1, 1, 10, 8), meta=np.ndarray>
Attributes:
    CDI:                     Climate Data Interface version 2.2.4 (https://mp...
    Conventions:             CF-1.7
    institution:             European Centre for Medium-Range Weather Forecasts
    GRIB_centre:             ecmf
    GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
    history:                 Tue Jul 01 22:11:48 2025: cdo -O -s -mergetime /...
    CDO:                     Climate Data Operators version 2.2.2 (https://mp...
```
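On the `LocalCluster()` settings mentioned above: here is a minimal sketch of what sizing the cluster explicitly looks like. The specific numbers are illustrative assumptions for a 256-CPU machine, not recommendations from this thread.

```python
from dask.distributed import Client, LocalCluster

# Illustrative sizing only: fewer, fatter workers can behave differently
# from the defaults for I/O-heavy rechunking workloads.
cluster = LocalCluster(
    n_workers=32,           # hypothetical worker count
    threads_per_worker=4,   # hypothetical threads per worker
    memory_limit='16GB',    # per-worker memory cap
)
client = Client(cluster)
print(client.dashboard_link)  # watch task activity while to_zarr runs
```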
-
If you're not tied to xarray, then "converting data from netcdf to zarr" boils down to "load data from $source, decompress it, re-compress it, and copy it to $dest". In the past when I needed to speed this kind of thing up, I defined a single function that performs the whole per-file conversion and mapped it over the input files in parallel.

But ultimately, why is it important that you use all of your compute resources? It seems more important that the computation finishes in a reasonable amount of time. How long was the estimated time to complete the conversion?
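A minimal sketch of that pattern, assuming one independent Zarr store per input file; the glob pattern, output naming scheme, and worker count are illustrative, not taken from the thread:

```python
from concurrent.futures import ProcessPoolExecutor
from glob import glob

import xarray as xr

def convert_one(path):
    """Load one netCDF file, then re-compress and copy it to its own Zarr store."""
    ds = xr.open_dataset(path)
    out = path.replace('.nc', '.zarr')  # hypothetical output naming
    ds.to_zarr(out, mode='w')
    return out

if __name__ == '__main__':
    files = sorted(glob('inputs/climate/yearly/PLEV_201*.nc'))
    # One process per file, up to 10 files in flight at a time (illustrative)
    with ProcessPoolExecutor(max_workers=10) as pool:
        for out in pool.map(convert_one, files):
            print('wrote', out)
```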
-
I seem to have found a few tricks to accomplish the conversion in reasonable time, notably the `.persist()` command from Dask.
Here is sample code to achieve the operation I needed in reasonable time:

```python
import xarray as xr
import pandas as pd
import dask.array

# Full time axis of the target store (four years, hourly)
tvec = pd.date_range('2011-01-01 00:00:00', '2014-12-31 23:00:00',
                     freq='1h', inclusive='both')

# Open one year to get coordinate values and attributes
ds = xr.open_mfdataset('inputs/climate/yearly/PLEV_2011.nc',
                       parallel=True, chunks='auto')

# Lazy all-zeros placeholder spanning the full time range
za = dask.array.zeros(
    (len(tvec), ds.level.shape[0], ds.latitude.shape[0], ds.longitude.shape[0]),
    dtype='float32')

dd = xr.Dataset(
    coords={
        'time': tvec,
        'longitude': ds.longitude.values,
        'latitude': ds.latitude.values,
        'level': ds.level.values,
    },
    data_vars={
        'z': (('time', 'level', 'latitude', 'longitude'), za),
        't': (('time', 'level', 'latitude', 'longitude'), za),
        'u': (('time', 'level', 'latitude', 'longitude'), za),
        'v': (('time', 'level', 'latitude', 'longitude'), za),
        'q': (('time', 'level', 'latitude', 'longitude'), za),
        'r': (('time', 'level', 'latitude', 'longitude'), za),
    },
    attrs=ds.attrs,
)

# Initialize the store with the target chunking (8760 hours = one year per time chunk)
dd = dd.chunk(chunks={'time': 8760, 'longitude': 3, 'latitude': 3, 'level': 13}).persist()
dd.to_zarr('test.zarr', mode='w', zarr_format=3)
ds = None

# Write each year into its own region of the existing store
for year in [2011, 2012, 2013, 2014]:
    ds = xr.open_mfdataset(f'inputs/climate/yearly/PLEV_{year}.nc',
                           parallel=True, chunks='auto')
    ds = ds.persist()
    ds.to_zarr('test.zarr', mode='a', region='auto', align_chunks=True)
    ds = None
```

For future reference, a forum dedicated to problems of this nature is: https://discourse.pangeo.io/
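A quick sanity check after the loop finishes; this is a sketch that simply confirms the store opens lazily with the intended chunk layout:

```python
import xarray as xr

out = xr.open_zarr('test.zarr')
print(out)           # dimensions and coordinates should span all four years
print(out.z.chunks)  # expect time chunks of 8760 and the 13/3/3 level/lat/lon chunks
```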