Changelog#

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[0.1.6]#

Performance#

  • New internal use of numpy.ndarray for indexing makes small chunk sizes (i.e., perfect random sampling) much more performant.

Docs#

  • New docs including a logo!

[0.1.5]#

Fixed#

  • Handle indexers for indices and data separately because their underlying chunking can differ

[0.1.4]#

Performance#

  • Preallocate buffers for in-memory handling. The concat_strategy argument no longer has any effect, as the new strategy is as memory-efficient and as fast as either of the previous strategies.

Features#

  • Added groupby support to annbatch.DatasetCollection.add_adatas() to group observations per dataset before writing collections. When appending to an existing on-disk collection, groupby columns must already exist and categorical categories must be identical to those on-disk.

[0.1.3]#

Features#

Breaking#

[0.1.2]#

Fixed#

  • torch installs cuda13 by default from version 2.11 onwards. To handle torch>=2.11 together with cupy-cuda12x, we now install cupy-cuda12x[ctk] to ensure the cuda version used matches that of cupy. For more information on this change, see the cupy docs.

[0.1.1]#

Fixed#

  • Exclude torch 2.11 on account of https://github.com/cupy/cupy/issues/9827

[0.1.0]#

Breaking#

Fixed#

Features#

  • shard_size in annbatch.DatasetCollection.add_adatas() and shard_size in annbatch.write_sharded() now accept a human-readable size string (e.g. '1GB', '512MB') in addition to an integer number of observations. When a string is provided, the observation count is derived independently for each array element from its uncompressed bytes-per-row so that every shard stays close to the target size.

  • dataset_size in annbatch.DatasetCollection.add_adatas() now accepts a human-readable size string (e.g. '20GB', '512MB') in addition to an integer number of observations. When a string is provided, the per-row byte size is estimated from the on-disk metadata of the input datasets during validation and used to derive the observation count. The default has changed from 2_097_152 to '20GB'.
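The size-string behaviour described in the two entries above can be illustrated with a minimal sketch. It is not the annbatch implementation: the helper names `parse_size` and `n_obs_for_target` are hypothetical, and the decimal unit table (1 GB = 10**9 bytes) is an assumption, since this changelog does not say whether decimal or binary units are used.

```python
import re
from typing import Union

# Assumed decimal units; annbatch may interpret "GB" differently.
_UNITS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}


def parse_size(size: str) -> int:
    """Convert a human-readable string such as '512MB' into a number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMGT]?B)\s*", size.upper())
    if match is None:
        raise ValueError(f"Unrecognized size string: {size!r}")
    value, unit = match.groups()
    return int(float(value) * _UNITS[unit])


def n_obs_for_target(target: Union[int, str], bytes_per_row: int) -> int:
    """Return the number of observations that fits the target.

    An integer is taken as an observation count already; a string is
    converted to bytes and divided by the (uncompressed) bytes per row.
    """
    if isinstance(target, int):
        return target
    return max(1, parse_size(target) // bytes_per_row)


# A dense float32 row with 2000 columns occupies 8000 bytes.
print(n_obs_for_target("1GB", 8000))
print(n_obs_for_target(2_097_152, 8000))
```

Because the count is derived from bytes per row, two array elements with different row widths would receive different observation counts for the same size string, which is the per-element behaviour the shard_size entry describes.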

[0.0.8]#

  • Loader accepts an rng argument now

[0.0.7]#

[0.0.6]#

  • Don’t concatenate all i/o-ed chunks in-memory, instead yielding from individual chunks as though they were concatenated (i.e., not a breaking change with the annbatch.abc.Sampler API). Should improve memory performance, especially for dense data.
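The general idea of reading from separate chunks as though they were concatenated can be sketched as follows. This is only an illustration of the technique using plain Python lists; the function name `take_from_chunks` is hypothetical and the actual annbatch internals are not shown in this changelog.

```python
from bisect import bisect_right
from itertools import accumulate


def take_from_chunks(chunks, indices):
    """Yield rows at global `indices` without building one concatenated sequence."""
    # offsets[i] is the global index at which chunk i starts;
    # offsets[-1] is the total number of rows across all chunks.
    offsets = [0, *accumulate(len(c) for c in chunks)]
    for idx in indices:
        if not 0 <= idx < offsets[-1]:
            raise IndexError(f"index {idx} out of range")
        # Locate the chunk whose global range contains idx.
        chunk_id = bisect_right(offsets, idx) - 1
        yield chunks[chunk_id][idx - offsets[chunk_id]]


chunks = [[0, 1, 2], [3, 4], [5, 6, 7, 8]]
print(list(take_from_chunks(chunks, [8, 0, 4, 3])))
```

Callers keep addressing rows by their position in the virtual concatenation, so an index-producing sampler does not need to change, while memory use stays bounded by the individual chunks.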

[0.0.5]#

  • Fix bug with bringing the nullable/categorical columns into memory by default

Breaking#

  • Now annbatch.Loader expects preload_nchunks * chunk_size % batch_size == 0 for simplification and efficiency.
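A quick way to check a configuration against this constraint before constructing a loader is sketched below. The helper `check_loader_sizes` is hypothetical and not part of annbatch; only the divisibility condition itself comes from the entry above.

```python
def check_loader_sizes(preload_nchunks: int, chunk_size: int, batch_size: int) -> int:
    """Return the number of batches per preloaded block, or raise if sizes do not divide evenly."""
    n_preloaded = preload_nchunks * chunk_size
    if n_preloaded % batch_size != 0:
        raise ValueError(
            f"preload_nchunks * chunk_size = {n_preloaded} "
            f"is not a multiple of batch_size = {batch_size}"
        )
    return n_preloaded // batch_size


# 32 chunks of 512 observations give 16384 rows, i.e. 4 batches of 4096.
print(check_loader_sizes(32, 512, 4096))
```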

Added#

[0.0.4]#

  • Load into memory nullables/categoricals from obs by default when shuffling (i.e., no custom load_adata argument to annbatch.DatasetCollection.add_adatas)

[0.0.3]#

Breaking#

  • Revert h5ad shuffling into one big store (i.e., go back to sharding into individual files) and add a warning that h5ad is not fully supported by annbatch. The is_collection_h5ad argument must be passed when initializing annbatch.DatasetCollection in order to use a preshuffled collection of h5ad files, whether reading or writing.

  • Renamed annbatch.types.LoaderOutput ["labels"] and ["data"] to ["obs"] and ["X"] respectively.

[0.0.2]#

Breaking#

  • ZarrSparseDataset and ZarrDenseDataset have been consolidated into annbatch.Loader

  • create_anndata_collection and add_to_collection have been moved into the annbatch.DatasetCollection.add_adatas method

  • Default reading of input data is now fully lazy in annbatch.DatasetCollection.add_adatas, and therefore the shuffle process may now be slower, although it has better memory properties. Use the load_adata argument in annbatch.DatasetCollection.add_adatas to customize this behavior.

  • Files shuffled under the old create_anndata_collection will not be recognized by annbatch.DatasetCollection and therefore are not usable with the new annbatch.Loader.use_collection API. At the moment, the file metadata we maintain is only for internal purposes - however, if you wish to migrate to be able to use annbatch.DatasetCollection in conjunction with annbatch.Loader.use_collection, the root folder of the old collection must have attrs {"encoding-type": "annbatch-preshuffled", "encoding-version": "0.1.0"} and be a zarr.Group. The subfolders (i.e., datasets) must be called dataset_([0-9]*). Otherwise you can use the annbatch.DatasetCollection.add_adatas as before.
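The migration requirements in the last entry above can be checked with a small sketch before attempting to open an old collection. It deliberately avoids zarr: the root attrs and the subfolder names are assumed to have been read beforehand with whatever tooling you use, and the helper `migration_problems` is hypothetical, not an annbatch function. It also does not verify that the root is a zarr.Group.

```python
import re

# Values quoted from the changelog entry above.
EXPECTED_ATTRS = {
    "encoding-type": "annbatch-preshuffled",
    "encoding-version": "0.1.0",
}
DATASET_NAME = re.compile(r"dataset_([0-9]*)")


def migration_problems(root_attrs, subfolder_names):
    """Return a list of human-readable problems; an empty list means the layout looks compatible."""
    problems = []
    for key, expected in EXPECTED_ATTRS.items():
        if root_attrs.get(key) != expected:
            problems.append(f"root attr {key!r} should be {expected!r}")
    for name in subfolder_names:
        # fullmatch so that names with extra prefixes or suffixes are rejected
        if DATASET_NAME.fullmatch(name) is None:
            problems.append(f"subfolder {name!r} does not match dataset_([0-9]*)")
    return problems


print(migration_problems(EXPECTED_ATTRS, ["dataset_0", "dataset_1"]))
print(migration_problems({}, ["dataset_0", "shard_1"]))
```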

Changed#

  • preload_to_gpu now depends on whether cupy is installed instead of defaulting to True

[0.0.1]#

Added#

  • First release