Zarr Configuration

Zarr Configuration#

If you are using a local file system, use zarrs-python:

import zarr
zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"})

Otherwise normal use zarr without zarrs-python (wich does not support, for example, remote stores).

zarrs Performance#

Please look at zarrs-python’s docs for more info but there are two important setting to consider:

zarr.config.set({
    "threading.max_workers": None,
    "codec_pipeline": {
        "direct_io": False
    }
})

The threading.max_workers will control how many threads are used by zarrs, and by extension, our data loader. This parameter is global and controls both the rust parallelism and the Python parallelism. If you notice thrashing or similar oversubscription behavior of threads, please open an issue.

Some linux file systems’ performance may suffer from the high level of parallelism combined with many page cache misses/evictions (i.e., when the dataset size on disk far exceeds available RAM). Concretely, once the page cache is full, if there are a high number of cache misses/evictions relative to cache hits (i.e., the on-disk dataset exceeds that of available RAM by a lot), the data loading freezes. Our experience matches that of a paper looking at mmap for databases, that a dataset size about 20X larger than available RAM memory (for the page cache) will trigger this performance degradation. For example, if you only have 32GB of RAM, a ~600GB dataset on disk will trigger this degradation. Thus, to bypass the page cache and any potential cache misses, use direct_io. If this setting is set on a system that does not support direct_io, file reading will fall back to normal buffered io.

zarr-python performance#

In this case, likely the store of interest is in the cloud. Please see zarr python’s config docs for more info but likely of most interest aside from the above mentioned threading.max_workers is

zarr.config.set({"async.concurrency": 64})

which is 64 by default. See the zarr page on concurrency for more information. Furthermore, we recommend using obstore over fsspec for performance reasons without using the zarrs pipeline.