annbatch


A data loader and I/O utilities for mini-batched loading of on-disk AnnData files, co-developed by Lamin Labs and scverse.

annbatch lets you train models on terabyte-scale collections of AnnData files that do not fit into memory. It keeps your GPU fed with high-throughput, shuffled mini-batches by fetching data in chunks (similar to NVIDIA’s Merlin or WebDataset). It also supports in-memory data.
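To make the idea of chunked fetching concrete, here is a minimal, library-independent sketch. It is not annbatch's implementation and uses none of its API; the function name `chunked_shuffle_batches` and all of its parameters are hypothetical. It only illustrates the general pattern: read contiguous chunks in random order, pool several of them in memory, shuffle within the pool, and emit mini-batches.

```python
import random


def chunked_shuffle_batches(n_rows, chunk_size, chunks_per_fetch, batch_size, seed=0):
    """Yield mini-batches of row indices using chunk-level shuffling.

    Hypothetical illustration only; this is not part of the annbatch API.
    """
    rng = random.Random(seed)

    # Contiguous chunks of row indices: each chunk is a cheap sequential read.
    chunks = [
        range(start, min(start + chunk_size, n_rows))
        for start in range(0, n_rows, chunk_size)
    ]
    rng.shuffle(chunks)  # randomise the order in which chunks are fetched

    leftover = []
    for i in range(0, len(chunks), chunks_per_fetch):
        # "Fetch" several chunks at once and pool their rows in memory.
        fetched = [row for chunk in chunks[i:i + chunks_per_fetch] for row in chunk]
        buffer = leftover + fetched
        rng.shuffle(buffer)  # mix rows that came from different chunks

        # Emit as many full batches as possible; carry the remainder forward.
        n_full = len(buffer) // batch_size * batch_size
        for j in range(0, n_full, batch_size):
            yield buffer[j:j + batch_size]
        leftover = buffer[n_full:]

    if leftover:
        yield leftover  # final, possibly smaller, batch


batches = list(
    chunked_shuffle_batches(n_rows=100, chunk_size=8, chunks_per_fetch=3, batch_size=16)
)
```

In this sketch, larger chunks favour sequential read throughput, while pooling more chunks per fetch brings the batches closer to a fully random shuffle. The specific values above are arbitrary.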

Note

You can also use annbatch.Loader on raw zarr.Array objects via add_datasets() if your data does not fit cleanly into an anndata.AnnData object, as long as it is semantically row-oriented on disk. When iterating, annbatch.Loader.__iter__() fetches contiguous data along the first (0th) axis, so the on-disk zarr array may have more than two dimensions. See the single-cell microscopy images tutorial for an example that streams a 4-dimensional image stack this way. AnnData simply provides a convenient wrapper for annotated two-dimensional data.

Note that the preshuffler DatasetCollection requires AnnData inputs, and that preshuffling is highly recommended for top performance.

If you have genetics data, see cellink for info on converting to anndata.
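The row-wise fetching described in the note above can be pictured with a small sketch. It does not call annbatch or zarr; the helper names `contiguous_row_slices` and `fetched_shape` and the example shape are hypothetical. It only shows that contiguous fetching touches axis 0 and leaves all trailing dimensions intact, which is why arrays with more than two dimensions work.

```python
def contiguous_row_slices(n_rows, rows_per_fetch):
    """Yield slices that cover axis 0 in contiguous blocks (hypothetical helper)."""
    for start in range(0, n_rows, rows_per_fetch):
        yield slice(start, min(start + rows_per_fetch, n_rows))


def fetched_shape(shape, row_slice):
    """Shape of the block obtained by slicing only the first axis."""
    start, stop, _ = row_slice.indices(shape[0])
    return (stop - start, *shape[1:])


# Hypothetical 4-dimensional image stack: (cells, channels, height, width).
stack_shape = (10, 3, 64, 64)

slices = list(contiguous_row_slices(stack_shape[0], rows_per_fetch=4))
shapes = [fetched_shape(stack_shape, s) for s in slices]
```

Indexing a NumPy or zarr array with one of these slices, for example `array[slices[0]]`, returns a block of the corresponding shape.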

Figure: annbatch data-loading speed compared to other dataloaders.

Installation

New to annbatch? Check out the installation guide and pick the right extras.

Quickstart

A hands-on notebook: convert your .h5ad files and stream shuffled mini-batches.

Tutorials

End-to-end runnable tutorials: scRNA-seq, genetics (VCF), microscopy images, and multi-GPU training.

Single-cell RNA-seq

User guide

An in-depth tour of preprocessing, chunked loading, sampling and benchmarks.

Detailed Walkthrough

API reference

The API reference contains a detailed description of the annbatch API.

Discussion

Need help? Reach out on the scverse forum to get your questions answered.

https://discourse.scverse.org/

GitHub

Found a bug? Interested in contributing? Check out the source on GitHub.

https://github.com/scverse/annbatch

Citation

If you use annbatch in your work, please cite the annbatch publication:

annbatch unlocks terabyte-scale training of biological data in anndata

Gold, I., Fischer, F., Arnoldt, L., Wolf, F. A., & Theis, F. J. (2026). annbatch unlocks terabyte-scale training of biological data in anndata. arXiv. https://doi.org/10.48550/arxiv.2604.01949

@article{Gold_2026,
    author        = {Gold, I. and Fischer, F. and Arnoldt, L. and Wolf, F. A. and Theis, F. J.},
    title         = {annbatch unlocks terabyte-scale training of biological data in anndata},
    journal       = {arXiv},
    year          = {2026},
    doi           = {10.48550/arxiv.2604.01949},
    eprint        = {2604.01949},
    archivePrefix = {arXiv},
    url           = {https://doi.org/10.48550/arxiv.2604.01949}
}