About annbatch#
annbatch provides a data loader and I/O utilities for mini-batched loading of on-disk
AnnData files. It is built for training models on terabyte-scale
collections of AnnData that do not fit into memory, while keeping a modern GPU fully utilized
with high-throughput, shuffled mini-batches.
Why annbatch?#
Most models for scRNA-seq data are small compared to models in computer vision or natural language
processing, which shifts the bottleneck from compute to the data-loading pipeline: to keep the
GPU busy, data loading has to be fast. annbatch combines a chunked, block-shuffled fetching
strategy with sharded, zarr-backed AnnData stores (accelerated locally by
zarrs-python) to deliver order-of-magnitude faster loading
than other out-of-core data loaders. See the Detailed Walkthrough for benchmarks and details.
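To give a feel for what "chunked, block-shuffled fetching" means, here is a minimal sketch of the general technique in plain NumPy. It is not annbatch's implementation or API: the function `block_shuffled_batches` and its parameters (`chunk_size`, `chunks_per_fetch`, `batch_size`) are hypothetical names chosen for illustration. The idea is to read contiguous chunks of rows (cheap sequential reads), visit the chunks in random order, and shuffle rows within an in-memory buffer made of several chunks.

```python
import numpy as np


def block_shuffled_batches(n_obs, chunk_size, chunks_per_fetch, batch_size, seed=0):
    """Yield arrays of row indices in block-shuffled order (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Start offsets of contiguous chunks; each chunk is one sequential read.
    starts = np.arange(0, n_obs, chunk_size)
    rng.shuffle(starts)  # randomize the order in which chunks are visited
    for i in range(0, len(starts), chunks_per_fetch):
        group = starts[i : i + chunks_per_fetch]
        # Gather the rows of several chunks into one in-memory buffer.
        rows = np.concatenate(
            [np.arange(s, min(s + chunk_size, n_obs)) for s in group]
        )
        rng.shuffle(rows)  # mix rows coming from different chunks
        for j in range(0, len(rows), batch_size):
            yield rows[j : j + batch_size]


batches = list(
    block_shuffled_batches(n_obs=1000, chunk_size=32, chunks_per_fetch=4, batch_size=64)
)
seen = np.concatenate(batches)  # every row index appears exactly once per epoch
```

In this sketch, larger chunks favor sequential read throughput, while fetching more chunks per buffer improves how well rows are mixed within each mini-batch.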
Ecosystem#
annbatch is co-developed by Lamin Labs and scverse, and builds directly on anndata, zarr and zarrs-python.
Funding#
annbatch is supported by the Chan Zuckerberg Initiative’s Essential Open Source Software for Science (EOSS) program.
scverse#
annbatch is part of the scverse® project (website, governance), which is fiscally sponsored by NumFOCUS. If you like scverse and want to support our mission, please consider making a tax-deductible donation to help the project pay for developer time, professional services, travel, workshops, and a variety of other needs.
Citing annbatch#
If you use annbatch in your work, please cite it; see Citing annbatch for details.