Changelog#
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
[0.1.6]#
Performance#
New internal use of numpy.ndarray for indexing means that small chunk sizes (i.e., perfect random sampling) are much more performant.
Docs#
New docs including a logo!
[0.1.5]#
Fixed#
Handle indexers for indices and data separately because their underlying chunking can differ.
[0.1.4]#
Performance#
Preallocate buffers for in-memory handling.
The concat_strategy argument no longer has any effect, as the new strategy is as memory-efficient and as fast as either of the previous strategies.
Features#
Added groupby support to annbatch.DatasetCollection.add_adatas() to group observations per dataset before writing collections. When appending to an existing on-disk collection, groupby columns must already exist and categorical categories must be identical to those on disk.
[0.1.3]#
Features#
Added annbatch.samplers.RandomSampler and annbatch.samplers.SequentialSampler as replacements for annbatch.ChunkSampler.
Exposed annbatch.samplers.DistributedSampler for distributed training.
Breaking#
Deprecated annbatch.ChunkSampler in favor of annbatch.samplers.RandomSampler and annbatch.samplers.SequentialSampler.
[0.1.2]#
Fixed#
To handle torch>=2.11 + cupy-cuda12x, we now install cupy-cuda12x[ctk] so that the cuda version used matches that of cupy; this is needed because torch installs cuda13 by default from this version onwards. For information on this change see the cupy docs.
[0.1.1]#
Fixed#
Exclude torch 2.11 on account of https://github.com/cupy/cupy/issues/9827
[0.1.0]#
Breaking#
Renamed annbatch.Loader.add_anndatas to annbatch.Loader.add_adatas().
Renamed annbatch.Loader.add_anndata to annbatch.Loader.add_adata().
The sparse_chunk_size, sparse_shard_size, dense_chunk_size, and dense_shard_size parameters of annbatch.write_sharded() have been replaced by n_obs_per_chunk (number of observations per chunk, automatically converted to element counts for sparse arrays) and shard_size (number of observations per shard or a size string). The corresponding parameters in annbatch.DatasetCollection.add_adatas() are n_obs_per_chunk and shard_size.
Fixed#
Formatted progress bar descriptions to be more readable.
The annbatch.DatasetCollection.add_adatas() method now accepts an rng argument.
Features#
shard_size in annbatch.DatasetCollection.add_adatas() and shard_size in annbatch.write_sharded() now accept a human-readable size string (e.g. '1GB', '512MB') in addition to an integer number of observations. When a string is provided, the observation count is derived independently for each array element from its uncompressed bytes-per-row so that every shard stays close to the target size.
dataset_size in annbatch.DatasetCollection.add_adatas() now accepts a human-readable size string (e.g. '20GB', '512MB') in addition to an integer number of observations. When a string is provided, the per-row byte size is estimated from the on-disk metadata of the input datasets during validation and used to derive the observation count. The default has changed from 2_097_152 to '20GB'.
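As a rough illustration of how a size string can be turned into an observation count, the sketch below parses the string and divides by a bytes-per-row figure. The helpers `parse_size` and `n_obs_for_target` are hypothetical names, not part of annbatch, and the use of decimal units (1 GB = 10**9 bytes) is an assumption; the library may parse and round differently.

```python
import re

# Assumption: decimal units. The convention annbatch actually uses may differ.
_UNITS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}


def parse_size(size: str) -> int:
    """Convert a string such as '512MB' into a number of bytes (hypothetical helper)."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMGT]?B)\s*", size.upper())
    if match is None:
        raise ValueError(f"Unrecognized size string: {size!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit])


def n_obs_for_target(size: str, bytes_per_row: int) -> int:
    """Number of observations whose uncompressed rows fit in the target size."""
    # Always keep at least one observation, even if a single row exceeds the target.
    return max(1, parse_size(size) // bytes_per_row)


# A 1 GB target with 4,000 uncompressed bytes per row.
print(n_obs_for_target("1GB", 4_000))
```

Because the division is done per array element, arrays with wider rows end up with fewer observations per shard for the same target size.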
[0.0.8]#
Loader now accepts an rng argument.
[0.0.7]#
Make the in-memory concatenation strategy configurable for annbatch.Loader.__iter__() via a concat_strategy argument to __init__: sparse on-disk data will be concatenated and then shuffled/yielded (faster, higher memory usage), but dense data will be shuffled and then concatenated/yielded (lower memory usage).
Downcast indices of sparse matrices if possible when writing to disk via anndata.settings.write_csr_csc_indices_with_min_possible_dtype
[0.0.6]#
Don't concatenate all i/o-ed chunks in-memory, instead yielding from individual chunks as though they were concatenated (i.e., not a breaking change with the annbatch.abc.Sampler API). This should improve memory performance, especially for dense data.
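The idea of yielding from separate chunks "as though they were concatenated" can be sketched with a cumulative-offset lookup. This is a generic illustration, not annbatch's internal code: `iter_as_concatenated` is a hypothetical function, and plain Python lists stand in for the loaded chunks.

```python
from bisect import bisect_right
from itertools import accumulate


def iter_as_concatenated(chunks, indices):
    """Yield rows for global indices without building one concatenated array."""
    # ends[i] is the global index one past the last row of chunk i.
    ends = list(accumulate(len(chunk) for chunk in chunks))
    for idx in indices:
        # Find the chunk whose global range contains idx.
        chunk_id = bisect_right(ends, idx)
        start = ends[chunk_id - 1] if chunk_id else 0
        yield chunks[chunk_id][idx - start]


chunks = [[10, 11, 12], [20, 21], [30]]
print(list(iter_as_concatenated(chunks, [5, 0, 3, 2])))  # [30, 10, 20, 12]
```

A sampler can keep producing indices into the virtual concatenation, which is why the change does not affect the sampler interface.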
[0.0.5]#
Fix bug with bringing the nullable/categorical columns into memory by default
Breaking#
annbatch.Loader now expects preload_nchunks * chunk_size % batch_size == 0 for simplification and efficiency.
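The constraint can be checked before constructing a loader. The helper below is a hypothetical convenience, not part of annbatch; it only restates the arithmetic condition given above.

```python
def check_batch_alignment(preload_nchunks: int, chunk_size: int, batch_size: int) -> bool:
    """Return True when the preloaded observations split evenly into batches."""
    # In Python, * and % share precedence and associate left to right,
    # so this is (preload_nchunks * chunk_size) % batch_size.
    return preload_nchunks * chunk_size % batch_size == 0


print(check_batch_alignment(32, 64, 128))  # True: 2048 observations form 16 full batches
print(check_batch_alignment(3, 10, 4))     # False: 30 observations leave a remainder of 2
```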
Added#
Introduced an annbatch.abc.Sampler abstract base class. Users can implement and pass any class instance that is a subclass to the batch_sampler argument of annbatch.Loader.
Exposed the older default sampling scheme as annbatch.ChunkSampler, which is used internally to match older behavior when batch_sampler isn't provided to annbatch.Loader.
[0.0.4]#
Load into memory nullables/categoricals from obs by default when shuffling (i.e., no custom load_adata argument to annbatch.DatasetCollection.add_adatas)
[0.0.3]#
Breaking#
Revert h5ad shuffling into one big store (i.e., go back to sharding into individual files) and add a warning that h5ad is not fully supported by annbatch. The is_collection_h5ad argument must be passed when initializing annbatch.DatasetCollection in order to use a preshuffled collection of h5ad files, whether reading or writing.
Renamed annbatch.types.LoaderOutput["labels"] and ["data"] to ["obs"] and ["X"] respectively.
[0.0.2]#
Breaking#
ZarrSparseDataset and ZarrDenseDataset have been consolidated into annbatch.Loader
create_anndata_collection and add_to_collection have been moved into the annbatch.DatasetCollection.add_adatas method
Default reading of input data is now fully lazy in annbatch.DatasetCollection.add_adatas, and therefore the shuffle process may now be slower, although it has better memory properties. Use the load_adata argument in annbatch.DatasetCollection.add_adatas to customize this behavior.
Files shuffled under the old create_anndata_collection will not be recognized by annbatch.DatasetCollection and therefore are not usable with the new annbatch.Loader.use_collection API. At the moment, the file metadata we maintain is only for internal purposes. However, if you wish to migrate to be able to use annbatch.DatasetCollection in conjunction with annbatch.Loader.use_collection, the root folder of the old collection must have attrs {"encoding-type": "annbatch-preshuffled", "encoding-version": "0.1.0"} and be a zarr.Group. The subfolders (i.e., datasets) must be called dataset_([0-9]*). Otherwise you can use annbatch.DatasetCollection.add_adatas as before.
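Before migrating an old collection, it may help to check the two requirements above: the root attrs and the subfolder names. The sketch below does this on plain Python values. `check_legacy_layout` is a hypothetical helper, not part of annbatch, and reading the attrs and subfolder names from the actual zarr.Group is left to the reader.

```python
import re

# Values taken from the migration note above.
EXPECTED_ATTRS = {"encoding-type": "annbatch-preshuffled", "encoding-version": "0.1.0"}
DATASET_NAME = re.compile(r"dataset_([0-9]*)")


def check_legacy_layout(root_attrs: dict, subfolders: list) -> list:
    """Return a list of human-readable problems; an empty list means the layout looks fine."""
    problems = []
    for key, expected in EXPECTED_ATTRS.items():
        if root_attrs.get(key) != expected:
            problems.append(f"root attr {key!r} should be {expected!r}")
    for name in subfolders:
        # fullmatch requires the whole folder name to follow the pattern.
        if DATASET_NAME.fullmatch(name) is None:
            problems.append(f"subfolder {name!r} does not match dataset_([0-9]*)")
    return problems


print(check_legacy_layout(EXPECTED_ATTRS, ["dataset_0", "dataset_12"]))  # []
print(check_legacy_layout({}, ["shard_0"]))
```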
Changed#
preload_to_gpu now depends on whether cupy is installed instead of defaulting to True
[0.0.1]#
Added#
First release