About annbatch

annbatch is a data loader and a set of I/O utilities for mini-batched loading of on-disk AnnData files. It is built for training models on terabyte-scale collections of AnnData that do not fit into memory, while keeping a modern GPU fully utilized with high-throughput, shuffled mini-batches.

Why annbatch?

Most models for scRNA-seq data are small compared to models in computer vision or natural language processing, which shifts the bottleneck from compute onto the data-loading pipeline: to keep the GPU busy, data loading has to be fast. annbatch combines a chunked, block-shuffled fetching strategy with sharded, zarr-backed AnnData stores — accelerated locally by zarrs-python — to deliver order-of-magnitude faster loading than other out-of-core dataloaders. See the Detailed Walkthrough for benchmarks and details.
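The idea behind chunked, block-shuffled fetching can be sketched in a few lines of plain Python. This is an illustration only, not annbatch's implementation or API: the function `block_shuffled_batches` and its parameters (`block_size`, `blocks_per_fetch`, `batch_size`) are hypothetical names chosen for this example, and it works on row indices rather than on real on-disk data. The sketch assumes that reading a contiguous block of rows is cheap while reading scattered single rows is expensive, so it randomizes the order of blocks and then mixes rows only within the blocks fetched together.

```python
import random


def block_shuffled_batches(n_obs, block_size, blocks_per_fetch, batch_size, seed=0):
    """Yield mini-batches of row indices in a chunked, block-shuffled order.

    Hypothetical helper for illustration; not part of annbatch.
    """
    rng = random.Random(seed)

    # Split the rows into contiguous blocks; each block stands for one
    # sequential read from disk.
    blocks = [
        range(start, min(start + block_size, n_obs))
        for start in range(0, n_obs, block_size)
    ]
    rng.shuffle(blocks)  # randomize the order in which blocks are visited

    for i in range(0, len(blocks), blocks_per_fetch):
        # "Fetch" several randomly chosen blocks at once, then mix their
        # rows in memory before cutting them into mini-batches.
        rows = [r for block in blocks[i:i + blocks_per_fetch] for r in block]
        rng.shuffle(rows)
        for j in range(0, len(rows), batch_size):
            yield rows[j:j + batch_size]


batches = list(
    block_shuffled_batches(n_obs=1000, block_size=32, blocks_per_fetch=4, batch_size=64)
)
# Every row index appears exactly once per pass over the data.
seen = sorted(r for batch in batches for r in batch)
```

Under these assumptions, larger values of `block_size` favor sequential read throughput, while larger values of `blocks_per_fetch` bring each mini-batch closer to a fully random sample, at the cost of more memory per fetch.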

Ecosystem

annbatch is co-developed by Lamin Labs and scverse, and builds directly on anndata, zarr and zarrs-python.

Funding

annbatch is supported by the Chan Zuckerberg Initiative’s Essential Open Source Software for Science (EOSS) program.

scverse

annbatch is part of the scverse® project (website, governance), which is fiscally sponsored by NumFOCUS. If you like scverse and want to support our mission, please consider making a tax-deductible donation to help the project pay for developer time, professional services, travel, workshops, and a variety of other needs.

Citing annbatch

If you use annbatch in your work, please cite it; see Citing annbatch.