annbatch.DatasetCollection
- class annbatch.DatasetCollection(group, *, mode='a', is_collection_h5ad=False)
A preshuffled collection object including functionality for creating, adding to, and loading collections shuffled by
annbatch.
Attributes table
is_empty | Whether or not there is an existing store at the group location.
Methods table
add_adatas | Take AnnData paths and create or add to an on-disk set of AnnData datasets with uniform var spaces at the desired path (with dataset_size rows per dataset if running for the first time).
Attributes
- DatasetCollection.is_empty
Whether or not there is an existing store at the group location.
Methods
- DatasetCollection.add_adatas(adata_paths, *, load_adata=<function _default_load_adata>, groupby=None, var_subset=None, n_obs_per_chunk=64, shard_size='1GB', zarr_compressor=(BloscCodec(_tunable_attrs={'typesize'}, typesize=1, cname=<BloscCname.lz4: 'lz4'>, clevel=3, shuffle=<BloscShuffle.shuffle: 'shuffle'>, blocksize=0),), h5ad_compressor='gzip', dataset_size='20GB', shuffle_chunk_size=1000, shuffle=True, rng=None)
Take AnnData paths and create or add to an on-disk set of AnnData datasets with uniform var spaces at the desired path (with dataset_size rows per dataset if running for the first time).
The set of AnnData datasets is collectively referred to as a "collection", where each dataset is called dataset_i{.h5ad}. The main purpose of this function is to create shuffled, sharded zarr datasets, which is its default behavior. It can also output h5 datasets and unshuffled datasets.
By default, the var space is outer-joined on the first call, and datasets added on subsequent calls to this function are subset to that var space. This behavior can be controlled by var_subset. A key src_path is added to obs to indicate where each row came from. We highly recommend making your indexes unique across files; this function will call AnnData.obs_names_make_unique.
Memory usage should be controlled by dataset_size + shuffle_chunk_size, because that many rows are read into memory before being written to disk.
After the dataset completes, a marker is added to the group's attrs to note that this dataset has been shuffled by annbatch. This is only for internal purposes at the moment, so that datasets shuffled by an instance of this class can be recognized.
- Parameters:
- adata_paths
Iterable[Group|Group|PathLike[str] |str] Paths to the AnnData files used to create the zarr store.
- load_adata
Callable[[Group|Group|PathLike[str] |str],AnnData] (default: <function _default_load_adata>) Function to customize (lazy-)loading of the individual input AnnData files. By default,
anndata.experimental.read_lazy() is used with categoricals/nullables read into memory and (-1) chunks for obs. If you only need a subset of the input AnnData files' elems (e.g., only X and certain obs columns), you can provide a custom function here to speed up loading and harmonize your data. Beware that concatenating nullables/categoricals from an anndata.experimental.backed.Dataset2D obs (i.e., what happens internally in this function if len(adata_paths) > 1) is very time-consuming. Consider loading these into memory if you use this argument.
- groupby
str | Iterable[str] | None (default: None) Optional
obs columns to sort by within each output dataset before writing.
- var_subset
Iterable[str] | None (default: None) Subset of gene names to include in the store. If None, all genes are included. Genes are subset based on the
var_names attribute of the concatenated AnnData object.
- n_obs_per_chunk
int (default: 64) Number of observations per zarr chunk. For dense arrays this is used directly as the first-axis chunk size. For sparse arrays it is converted to element counts using the average number of non-zero elements per row of the matrix being written.
- shard_size
int | str (default: '1GB') Number of observations per zarr shard, or a size string (e.g.
'1GB'). If a size string is provided, the number of observations per zarr shard is estimated automatically. Size strings are parsed using the humanfriendly package. For sparse arrays the number of observations is converted to element counts using the average number of non-zero elements per row of the matrix being written.
- zarr_compressor
Iterable[BytesBytesCodec] (default:(BloscCodec(_tunable_attrs={'typesize'}, typesize=1, cname=<BloscCname.lz4: 'lz4'>, clevel=3, shuffle=<BloscShuffle.shuffle: 'shuffle'>, blocksize=0),)) Compressors to use to compress the data in the zarr store.
- h5ad_compressor
Literal['gzip', 'lzf'] | None (default: 'gzip') Compressor to use to compress the data in the h5ad store. See anndata.write_h5ad.
- dataset_size
int | str (default: '20GB') Number of observations to load into memory at once for shuffling / pre-processing, or a size string (e.g.
'2GB', '512MB'). When a size string is provided, the observation count is derived from the estimated uncompressed bytes per row of the input data. Size strings are parsed using the humanfriendly package. The higher this number, the more memory is used, but the better the shuffling. This corresponds to the size of the dataset-level shards created. Only applicable when adding datasets for the first time; otherwise ignored.
- shuffle
bool (default: True) Whether to shuffle the data before writing it to the store. Ignored once the store is non-empty.
- shuffle_chunk_size
int (default: 1000) How many contiguous rows to load into memory at once before shuffling.
(dataset_size // shuffle_chunk_size) slices of size shuffle_chunk_size will be loaded.
- rng
Generator | None (default: None) Random number generator for shuffling.
- Return type:
Self
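The size-related parameters above (n_obs_per_chunk, shard_size, dataset_size) describe two conversions: a byte budget into a row count, and a row count into an element count for sparse arrays. The following is a minimal sketch of that arithmetic only, not annbatch's implementation. The helper names rows_for_byte_budget and rows_to_sparse_elements are hypothetical, the rounding rules are assumptions, and '20GB' is taken here to mean 20 * 10**9 bytes.

```python
def rows_for_byte_budget(budget_bytes: int, bytes_per_row: float) -> int:
    """Hypothetical helper: rows that fit in a byte budget, given an
    estimated uncompressed size per row (rounding down is an assumption)."""
    if bytes_per_row <= 0:
        raise ValueError("bytes_per_row must be positive")
    return max(1, int(budget_bytes // bytes_per_row))


def rows_to_sparse_elements(n_obs: int, nnz_total: int, n_rows: int) -> int:
    """Hypothetical helper: convert a row count into an approximate count of
    stored elements using the average number of non-zeros per row."""
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    avg_nnz_per_row = nnz_total / n_rows
    return max(1, round(n_obs * avg_nnz_per_row))


# Assumed example: 2,000 float32 genes per row is 8,000 bytes per dense row,
# and a '20GB' budget is taken as 20 * 10**9 bytes.
budget_rows = rows_for_byte_budget(20_000_000_000, 8_000)

# Assumed example: a sparse matrix with 10,000 rows and 1,000,000 non-zeros
# averages 100 non-zeros per row, so 64 rows map to about 6,400 elements.
chunk_elements = rows_to_sparse_elements(64, nnz_total=1_000_000, n_rows=10_000)
```

Under these assumptions, a sparser matrix yields fewer elements per chunk for the same n_obs_per_chunk, which is why the conversion depends on the matrix being written.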
Examples
>>> import anndata as ad
>>> from annbatch import DatasetCollection
# create a custom load function to only keep `.X`, `.obs` and `.var` in the output store
>>> def read_lazy_x_and_obs_only(path):
...     adata = ad.experimental.read_lazy(path)
...     return ad.AnnData(
...         X=adata.X,
...         obs=adata.obs.to_memory(),
...         var=adata.var.to_memory(),
...     )
>>> datasets = [
...     "path/to/first_adata.h5ad",
...     "path/to/second_adata.h5ad",
...     "path/to/third_adata.h5ad",
... ]
>>> DatasetCollection("path/to/output/zarr_store.zarr").add_adatas(
...     datasets,
...     load_adata=read_lazy_x_and_obs_only,
... )
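To build intuition for how dataset_size and shuffle_chunk_size interact, the sketch below simulates a chunked shuffle on plain row indices. It is an illustration under stated assumptions and not annbatch's algorithm: the function chunked_shuffle_order is hypothetical, each output dataset is assumed to be filled from dataset_size // shuffle_chunk_size contiguous slices, and the standard-library random.Random stands in for the rng argument.

```python
import random


def chunked_shuffle_order(n_obs, dataset_size, shuffle_chunk_size, seed=0):
    """Hypothetical sketch: assign row indices to output datasets by shuffling
    contiguous slices, then shuffling the rows inside each dataset."""
    rng = random.Random(seed)  # stand-in for the `rng` argument (assumption)

    # Cut the rows into contiguous slices of at most shuffle_chunk_size rows.
    slices = [
        range(start, min(start + shuffle_chunk_size, n_obs))
        for start in range(0, n_obs, shuffle_chunk_size)
    ]
    rng.shuffle(slices)  # randomize which slices end up in which dataset

    # Assumed relationship: each dataset holds this many slices.
    slices_per_dataset = max(1, dataset_size // shuffle_chunk_size)

    datasets = []
    for i in range(0, len(slices), slices_per_dataset):
        rows = [r for s in slices[i:i + slices_per_dataset] for r in s]
        rng.shuffle(rows)  # shuffle the rows held in memory for this dataset
        datasets.append(rows)
    return datasets


# 10,000 rows, 4,000 rows per dataset, slices of 1,000 contiguous rows.
example = chunked_shuffle_order(10_000, 4_000, 1_000)
```

In this sketch, a larger dataset_size mixes more slices together before writing, which mirrors the documented trade-off between memory use and shuffling quality.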