Preshuffling Performance Considerations
Memory usage
While the preshuffler can operate out-of-core in annbatch.Loader.add_adatas(), each individual shuffled dataset is read into memory before being written to disk.
The amount of memory is configurable with the dataset_size parameter.
Upstream of that is the shuffle_chunk_size parameter, which controls the size of each contiguous on-disk block (taken from a randomized location in your input data) that is read into memory before the in-memory shuffle.
This parameter’s interaction with dask chunks is important: if your dask chunks are very large and shuffle_chunk_size is considerably smaller, more memory is needed, because serving one small block can require loading a much larger dask chunk.
For this reason, anndata.experimental.read_lazy() and anndata.experimental.read_elem_lazy()’s default chunk size of 1000 is very reasonable.
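To make the interaction concrete, here is a rough back-of-the-envelope sketch. It is not annbatch's actual memory accounting. It assumes that every dask chunk overlapping a requested block of rows is loaded in full, and it ignores the possibly shorter final chunk of a dataset. The function names rows_materialized and read_amplification are hypothetical and exist only for this illustration.

```python
def rows_materialized(start: int, shuffle_chunk_size: int, dask_chunk_rows: int) -> int:
    """Rows pulled into memory to serve one contiguous block of rows.

    Simplifying assumption (not taken from annbatch): any dask chunk that
    overlaps the block [start, start + shuffle_chunk_size) is loaded whole.
    """
    if start < 0 or shuffle_chunk_size <= 0 or dask_chunk_rows <= 0:
        raise ValueError("start must be >= 0 and both sizes must be positive")
    first_chunk = start // dask_chunk_rows
    last_chunk = (start + shuffle_chunk_size - 1) // dask_chunk_rows
    return (last_chunk - first_chunk + 1) * dask_chunk_rows


def read_amplification(start: int, shuffle_chunk_size: int, dask_chunk_rows: int) -> float:
    """Ratio of rows loaded to rows actually requested (1.0 is ideal)."""
    return rows_materialized(start, shuffle_chunk_size, dask_chunk_rows) / shuffle_chunk_size


if __name__ == "__main__":
    # A 64-row block inside 1000-row dask chunks loads one chunk.
    print(rows_materialized(0, 64, 1000))
    # The same block straddling a chunk boundary loads two chunks.
    print(rows_materialized(990, 64, 1000))
    # With 100,000-row dask chunks, the same 64 rows cost far more memory.
    print(read_amplification(0, 64, 100_000))
```

Under this model, keeping dask chunks close to (or smaller than) shuffle_chunk_size keeps the amplification near 1, which is the intuition behind preferring modest dask chunk sizes.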
Speed
Reading HDF5 files is comparatively slow because access is single-threaded (serialized by a global lock) and involves repeatedly opening and closing file handles.
In our paper [GFA+26], we showed that starting with zarr files gives a nearly 2x speedup for preshuffling.
To speed up preshuffling from HDF5 files, though, you can “virtualize” critical parts of your input datasets with virtualizarr so that they are read through zarr’s multithreaded reader, e.g.:
import anndata as ad
import annbatch
from pathlib import Path
import zarr
from virtualizarr.parsers import HDFParser
from obstore.store import LocalStore
from obspec_utils.registry import ObjectStoreRegistry
from concurrent.futures import ProcessPoolExecutor
import h5py

path = Path('path_to_anndatas')


def create_X_store(path: Path):
    # Build a virtual (manifest) store over the "X" group of one h5ad file.
    parser = HDFParser("X")
    store = LocalStore(prefix=f"{path.parent}/")
    registry = ObjectStoreRegistry({f"file://{str(path.parent)}/": store})
    manifest_store = parser(f"file://{str(path)}/", registry)
    return (path, manifest_store)


# Create the virtual stores in parallel, keyed by the h5ad path.
with ProcessPoolExecutor(max_workers=64) as executor:
    stores = dict(executor.map(create_X_store, path.glob("*.h5ad")))


def load_adata(path):
    # X is read lazily through zarr; var and obs are read eagerly with h5py.
    X = ad.experimental.read_elem_lazy(zarr.open(stores[path]))
    with h5py.File(path) as f:
        var = ad.io.read_elem(f["var"])
        obs = ad.io.read_elem(f["obs"])
    return ad.AnnData(X=X, var=var, obs=obs)


collection = annbatch.DatasetCollection("path_to_collection.zarr")
collection.add_adatas(
    adata_paths=path.glob("*.h5ad"),
    load_adata=load_adata,
    n_obs_per_chunk=64,
    dataset_size="64GB",
)
If you can open remote HDF5 files, using zarr’s internal async engine will likely accelerate I/O as well.
In the future, we hope to offer full anndata object virtualization.