annbatch

A data loader and I/O utilities for mini-batched loading of on-disk anndata files, co-developed by Lamin Labs and scverse.

annbatch lets you train models on terabyte-scale collections of AnnData files that do not fit into memory. It keeps your GPU fed with high-throughput, shuffled mini-batches by fetching data in chunks (similar to NVIDIA’s Merlin or webdataset). It also supports in-memory data.
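To make the idea of chunked fetching concrete, here is a minimal sketch in plain Python. It is not annbatch's implementation or API: the function name `chunked_shuffled_batches` and the parameters `chunk_size` and `preload_nchunks` are hypothetical, and row indices stand in for actual data reads. It assumes the general scheme is: read contiguous chunks in random order, pool a few of them in memory, shuffle the pool, and slice it into mini-batches.

```python
import random
from typing import Iterator, List


def chunked_shuffled_batches(
    n_rows: int,
    batch_size: int,
    chunk_size: int,
    preload_nchunks: int,
    seed: int = 0,
) -> Iterator[List[int]]:
    """Yield mini-batches of row indices using chunked, shuffled fetching.

    Hypothetical illustration only; names and parameters are assumptions.
    """
    rng = random.Random(seed)
    # Contiguous chunks of rows: each one would be a cheap sequential read on disk.
    chunks = [
        range(start, min(start + chunk_size, n_rows))
        for start in range(0, n_rows, chunk_size)
    ]
    rng.shuffle(chunks)  # randomize which chunks end up in the same buffer

    for i in range(0, len(chunks), preload_nchunks):
        # "Fetch" several chunks at once into an in-memory buffer.
        buffer = [row for chunk in chunks[i:i + preload_nchunks] for row in chunk]
        rng.shuffle(buffer)  # mix rows coming from different chunks
        for j in range(0, len(buffer), batch_size):
            # The last batch of each buffer may be shorter than batch_size.
            yield buffer[j:j + batch_size]


batches = list(
    chunked_shuffled_batches(n_rows=100, batch_size=8, chunk_size=10, preload_nchunks=4)
)
print(len(batches))  # 13
```

In this sketch, larger chunks favor sequential read throughput, while preloading more chunks per buffer improves how well rows are mixed within each mini-batch, at the cost of memory.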

Note

You can also use annbatch.Loader on raw zarr.Array objects via add_datasets() if your data does not fit the anndata.AnnData class cleanly, as long as the data is semantically row-oriented on disk. annbatch.Loader.__iter__() simply fetches (contiguous) data along the first (0th) axis, so an on-disk zarr array may have more than two dimensions. See the single-cell microscopy images tutorial for an example that streams a 4-dimensional image stack this way. AnnData is simply a convenient wrapper for annotated two-dimensional data.

Note that the preshuffler DatasetCollection requires AnnData inputs, and that preshuffling is highly recommended for top performance.
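The note's point about fetching along the first axis can be illustrated without any annbatch or zarr calls. In the sketch below, a nested Python list is a toy stand-in for an on-disk 4-dimensional array, and `nested_shape` and `fetch_rows` are hypothetical helpers, not part of any library. It shows that a contiguous slice along axis 0 selects whole "rows" (here, whole images) and leaves the trailing dimensions untouched.

```python
from typing import Any, List, Sequence, Tuple


def nested_shape(x: Any) -> Tuple[int, ...]:
    """Shape of a regular, non-empty nested list, e.g. (rows, channels, height, width)."""
    shape = []
    while isinstance(x, list):
        shape.append(len(x))
        x = x[0]
    return tuple(shape)


def fetch_rows(store: Sequence[Any], start: int, stop: int) -> List[Any]:
    """Read a contiguous block along axis 0; trailing axes are left as they are."""
    return list(store[start:stop])


# Toy stand-in for an on-disk image stack: 6 images, 2 channels, 3 x 3 pixels.
# Every pixel of image n holds the value n, so rows are easy to identify.
stack = [
    [[[float(n) for _ in range(3)] for _ in range(3)] for _ in range(2)]
    for n in range(6)
]

block = fetch_rows(stack, 2, 5)  # images 2, 3 and 4
print(nested_shape(stack))  # (6, 2, 3, 3)
print(nested_shape(block))  # (3, 2, 3, 3)
```

Only the size of the first axis changes between the full stack and the fetched block, which is why the number of trailing dimensions does not matter for row-wise loading.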

If you have genetics data, see cellink for info on converting to anndata.

annbatch data-loading speed compared to other dataloaders

Installation

New to annbatch? Check out the installation guide and pick the right extras.

Quickstart

A hands-on notebook: convert your .h5ad files and stream shuffled mini-batches.

Tutorials

End-to-end runnable tutorials: scRNA-seq, genetics (VCF), microscopy images, and multi-GPU training.

Single-cell RNA-seq

User guide

An in-depth tour of preprocessing, chunked loading, sampling and benchmarks.

Detailed Walkthrough

API reference

The API reference contains a detailed description of the annbatch API.

API

Discussion

Need help? Reach out on the scverse forum to get your questions answered.

https://discourse.scverse.org/

GitHub

Found a bug? Interested in contributing? Check out the source on GitHub.

https://github.com/scverse/annbatch

Citation

If you use annbatch in your work, please cite the annbatch publication:

annbatch unlocks terabyte-scale training of biological data in anndata

Gold, I., Fischer, F., Arnoldt, L., Wolf, F. A., & Theis, F. J. (2026). annbatch unlocks terabyte-scale training of biological data in anndata. arXiv. https://doi.org/10.48550/arxiv.2604.01949

@article{Gold_2026,
    author        = {Gold, I. and Fischer, F. and Arnoldt, L. and Wolf, F. A. and Theis, F. J.},
    title         = {annbatch unlocks terabyte-scale training of biological data in anndata},
    journal       = {arXiv},
    year          = {2026},
    doi           = {10.48550/arxiv.2604.01949},
    eprint        = {2604.01949},
    archivePrefix = {arXiv},
    url           = {https://doi.org/10.48550/arxiv.2604.01949}
}
