annbatch.DatasetCollection#

class annbatch.DatasetCollection(group, *, mode='a', is_collection_h5ad=False)#

A preshuffled collection object including functionality for creating, adding to, and loading collections shuffled by annbatch.

Attributes table#

is_empty

Whether or not there is an existing store at the group location.

Methods table#

add_adatas(adata_paths, *[, load_adata, ...])

Take AnnData paths and create or add to an on-disk set of AnnData datasets with uniform var spaces at the desired path (with dataset_size rows per dataset if running for the first time).

Attributes#

DatasetCollection.is_empty#

Whether or not there is an existing store at the group location.

Methods#

DatasetCollection.add_adatas(adata_paths, *, load_adata=<function _default_load_adata>, groupby=None, var_subset=None, n_obs_per_chunk=64, shard_size='1GB', zarr_compressor=(BloscCodec(_tunable_attrs={'typesize'}, typesize=1, cname=<BloscCname.lz4: 'lz4'>, clevel=3, shuffle=<BloscShuffle.shuffle: 'shuffle'>, blocksize=0), ), h5ad_compressor='gzip', dataset_size='20GB', shuffle_chunk_size=1000, shuffle=True, rng=None)#

Take AnnData paths and create or add to an on-disk set of AnnData datasets with uniform var spaces at the desired path (with dataset_size rows per dataset if running for the first time).

The set of AnnData datasets is collectively referred to as a “collection”, where each dataset is named dataset_i{.h5ad}. The main purpose of this function is to create shuffled, sharded zarr datasets, which is its default behavior. It can also write h5ad datasets and unshuffled datasets. By default, the var space is outer-joined on the first call, and datasets added on subsequent calls are subset to that var space; this behavior can be controlled with var_subset. A key src_path is added to obs to indicate which file each row came from. We highly recommend making your indexes unique across files; this function will call AnnData.obs_names_make_unique. Memory usage is governed by dataset_size + shuffle_chunk_size, since that many rows are read into memory before being written to disk. Once writing completes, a marker is added to the group’s attrs to note that this dataset has been shuffled by annbatch. This marker is currently for internal use only, so that datasets shuffled by an instance of this class can be recognized.
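The var-space handling described above can be pictured with plain lists of gene names. This is a minimal sketch of the idea only, not annbatch's implementation: the helpers outer_join_var_names and subset_to_var_space are hypothetical, and what happens to genes that are missing from a later file is not modelled here.

```python
def outer_join_var_names(var_name_lists):
    """Union of gene names across inputs, preserving first-seen order."""
    seen = {}
    for names in var_name_lists:
        for name in names:
            seen.setdefault(name, None)
    return list(seen)


def subset_to_var_space(var_names, var_space):
    """Keep only genes already in the collection's var space, in its order."""
    present = set(var_names)
    return [name for name in var_space if name in present]


# First call (hypothetical inputs): the var space is the outer join of all files.
first_batch = [["GeneA", "GeneB"], ["GeneB", "GeneC"]]
var_space = outer_join_var_names(first_batch)

# Later call: a new file is subset to the existing var space, so GeneD is dropped.
later_file = ["GeneC", "GeneD", "GeneA"]
kept = subset_to_var_space(later_file, var_space)
```

In this toy model, var_space ends up as GeneA, GeneB, GeneC, and the later file contributes only GeneA and GeneC.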

Parameters:
adata_paths Iterable[Group | Group | PathLike[str] | str]

Paths to the AnnData files used to create the zarr store.

load_adata Callable[[Group | Group | PathLike[str] | str], AnnData] (default: <function _default_load_adata>)

Function to customize (lazy-)loading the individual input anndata files. By default, anndata.experimental.read_lazy() is used with categoricals/nullables read into memory and (-1) chunks for obs. If you only need a subset of the input anndata files’ elems (e.g., only X and certain obs columns), you can provide a custom function here to speed up loading and harmonize your data. Beware that concatenating nullables/categoricals (i.e., what happens internally in this function if len(adata_paths) > 1) from anndata.experimental.backed.Dataset2D obs is very time-consuming; consider loading these into memory if you use this argument.

groupby str | Iterable[str] | None (default: None)

Optional obs columns to sort by within each output dataset before writing.

var_subset Iterable[str] | None (default: None)

Subset of gene names to include in the store. If None, all genes are included. Genes are subset based on the var_names attribute of the concatenated AnnData object.

n_obs_per_chunk int (default: 64)

Number of observations per zarr chunk. For dense arrays this is used directly as the first-axis chunk size. For sparse arrays it is converted to element counts using the average number of non-zero elements per row of the matrix being written.

shard_size int | str (default: '1GB')

Number of observations per zarr shard, or a size string (e.g. '1GB'). If a size string is provided, the number of observations per zarr shard is estimated automatically. String sizes are parsed using the humanfriendly package. For sparse arrays, the number of observations is converted to element counts using the average number of non-zero elements per row of the matrix being written.

zarr_compressor Iterable[BytesBytesCodec] (default: (BloscCodec(_tunable_attrs={'typesize'}, typesize=1, cname=<BloscCname.lz4: 'lz4'>, clevel=3, shuffle=<BloscShuffle.shuffle: 'shuffle'>, blocksize=0),))

Compressors to use to compress the data in the zarr store.

h5ad_compressor Literal['gzip', 'lzf'] | None (default: 'gzip')

Compressor to use to compress the data in the h5ad store. See anndata.write_h5ad.

dataset_size int | str (default: '20GB')

Number of observations to load into memory at once for shuffling / pre-processing, or a size string (e.g. '2GB', '512MB'). When a size string is provided, the observation count is derived from the estimated uncompressed bytes per row of the input data. String sizes get parsed using the humanfriendly package. The higher this number, the more memory is used, but the better the shuffling. This corresponds to the size of the dataset level shards created. Only applicable when adding datasets for the first time, otherwise ignored.

shuffle bool (default: True)

Whether to shuffle the data before writing it to the store. Ignored once the store is non-empty.

shuffle_chunk_size int (default: 1000)

How many contiguous rows to load into memory at once before shuffling. (dataset_size // shuffle_chunk_size) slices of size shuffle_chunk_size will be loaded.

rng Generator | None (default: None)

Random number generator for shuffling.
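The size-related parameters above (n_obs_per_chunk, shard_size, dataset_size) accept either observation counts or size strings. The following is a rough sketch of how such values could translate into row and element counts. It is not annbatch's code: parse_size_string is a hypothetical stand-in for the humanfriendly parser (decimal units assumed, as in humanfriendly's default), rows_for_size and sparse_elements are illustrative helpers, and the bytes-per-row figure ignores sparse index arrays.

```python
import re

# Decimal units, assumed to match humanfriendly's default (1 GB = 1000**3 bytes).
_UNITS = {"B": 1, "KB": 1000, "MB": 1000**2, "GB": 1000**3, "TB": 1000**4}


def parse_size_string(size):
    """Hypothetical stand-in for a size-string parser such as humanfriendly."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMGT]?B)\s*", size.upper())
    if match is None:
        raise ValueError(f"unrecognised size string: {size!r}")
    value, unit = match.groups()
    return int(float(value) * _UNITS[unit])


def rows_for_size(size, bytes_per_row):
    """Integers are already row counts; strings are converted via bytes per row."""
    if isinstance(size, int):
        return size
    return max(1, parse_size_string(size) // bytes_per_row)


def sparse_elements(n_obs, avg_nnz_per_row):
    """Convert a row count into a count of stored (non-zero) elements."""
    return max(1, round(n_obs * avg_nnz_per_row))


# Illustrative numbers only: 2,000 non-zero float32 values per row.
avg_nnz = 2_000
bytes_per_row = avg_nnz * 4
rows_per_dataset = rows_for_size("20GB", bytes_per_row)  # dataset_size default
rows_per_shard = rows_for_size("1GB", bytes_per_row)     # shard_size default
elements_per_chunk = sparse_elements(64, avg_nnz)        # n_obs_per_chunk default
```

Under these assumed numbers, the defaults correspond to 2,500,000 rows per dataset, 125,000 rows per shard, and 128,000 stored elements per sparse chunk. Real data will give different figures because the per-row estimates depend on the input matrices.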

Return type:

Self

Examples

>>> import anndata as ad
>>> from annbatch import DatasetCollection
>>> # create a custom load function to only keep `.X`, `.obs` and `.var` in the output store
>>> def read_lazy_x_and_obs_only(path):
...     adata = ad.experimental.read_lazy(path)
...     return ad.AnnData(
...         X=adata.X,
...         obs=adata.obs.to_memory(),
...         var=adata.var.to_memory(),
...     )
>>> datasets = [
...     "path/to/first_adata.h5ad",
...     "path/to/second_adata.h5ad",
...     "path/to/third_adata.h5ad",
... ]
>>> DatasetCollection("path/to/output/zarr_store.zarr").add_adatas(
...     datasets,
...     load_adata=read_lazy_x_and_obs_only,
... )
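To build intuition for how dataset_size and shuffle_chunk_size interact during shuffling, the sketch below plans a chunked shuffle over row indices using only the standard library. It is an assumption-laden illustration rather than annbatch's algorithm: chunked_shuffle_plan is a hypothetical helper, it uses random.Random instead of a NumPy Generator, and it assumes that about dataset_size // shuffle_chunk_size contiguous slices are gathered per output dataset.

```python
import random


def chunked_shuffle_plan(n_obs, dataset_size, shuffle_chunk_size, seed=0):
    """Return one list of shuffled row indices per output dataset."""
    rng = random.Random(seed)
    # Contiguous slices of at most shuffle_chunk_size rows each.
    slices = [
        range(start, min(start + shuffle_chunk_size, n_obs))
        for start in range(0, n_obs, shuffle_chunk_size)
    ]
    rng.shuffle(slices)  # randomise which slices end up in the same dataset
    slices_per_dataset = max(1, dataset_size // shuffle_chunk_size)
    datasets = []
    for i in range(0, len(slices), slices_per_dataset):
        rows = [r for s in slices[i : i + slices_per_dataset] for r in s]
        rng.shuffle(rows)  # randomise row order within the dataset
        datasets.append(rows)
    return datasets


# 10,000 rows, 4,000 rows per dataset, read in contiguous slices of 1,000 rows.
plan = chunked_shuffle_plan(n_obs=10_000, dataset_size=4_000, shuffle_chunk_size=1_000)
```

With these numbers the plan holds three datasets of 4,000, 4,000 and 2,000 rows, and every input row appears exactly once. Larger dataset_size values mix rows from more slices at the cost of holding more rows in memory, which is the trade-off the dataset_size parameter describes.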