Single-cell microscopy images
Single-cell imaging produces a dense tensor per cell — here 5 × 128 × 128 (channels × height × width).
In a scPortrait single-cell dataset these tensors sit in obsm["single_cell_images"] as a dense n_cells × 5 × 128 × 128 array.
The key idea: the Loader fetches contiguous rows along the first axis of any zarr array — it does not care whether that array is a 2-D count matrix or a 4-D image stack.
So we shuffle the images into a sharded zarr collection and then stream them directly with add_datasets().
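As a minimal sketch of why dimensionality does not matter, the shape arithmetic below (no zarr involved) shows that reading rows [start, stop) along the first axis leaves every trailing axis intact. The helper `rows_shape` and the 2,000-gene count matrix are illustrative assumptions, not part of annbatch.

```python
def rows_shape(shape: tuple[int, ...], start: int, stop: int) -> tuple[int, ...]:
    """Shape of the block obtained by reading rows [start, stop) along axis 0."""
    stop = min(stop, shape[0])  # clamp to the array length
    return (max(stop - start, 0), *shape[1:])


counts_shape = (30, 2000)         # hypothetical 2-D count matrix: cells x genes
images_shape = (30, 5, 128, 128)  # image stack used in this tutorial

print(rows_shape(counts_shape, 0, 16))
print(rows_shape(images_shape, 0, 16))
```

The same (start, stop) pair addresses both arrays; only the trailing dimensions of the returned block differ.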
Configure zarrs
import warnings
import zarr
zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"})
for msg in ("Consolidated metadata is currently not part in the Zarr format 3 specification.*",):
    warnings.filterwarnings("ignore", message=msg)
Get the data
scPortrait ships small example single-cell image datasets (.h5sc files).
import scportrait
h5sc_paths = [str(p) for p in scportrait.data.autophagosome_h5sc()]
print(h5sc_paths)
Each file stores its images in obsm["single_cell_images"].
import h5py
with h5py.File(h5sc_paths[0], "r") as f:
    images = f["obsm"]["single_cell_images"]
    print(f"single_cell_images: {images.shape} {images.dtype} (cells × channels × H × W)")
single_cell_images: (30, 5, 128, 128) float16 (cells × channels × H × W)
Convert into a shuffled collection
We pass add_adatas() a load_adata callback that puts the image stack in obsm; the resulting AnnData needs no X at all.
add_adatas concatenates all inputs, so we prefix the cell indices with the file name to keep them unique across files — both example files reuse the same indices otherwise.
n_obs_per_chunk is the number of images read contiguously per chunk — keep it small, since each image is far larger than a row of counts.
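To make the size difference concrete, here is a rough back-of-the-envelope calculation. The image geometry comes from the example files above; the 2,000-gene float32 count row is a hypothetical comparison point, and all sizes are uncompressed.

```python
# Image geometry from the example files: 5 channels, 128 x 128 pixels, float16.
channels, height, width = 5, 128, 128
bytes_per_value = 2  # float16

bytes_per_image = channels * height * width * bytes_per_value
bytes_per_chunk = 16 * bytes_per_image  # n_obs_per_chunk=16

# Hypothetical comparison: one dense float32 row of 2,000 gene counts.
bytes_per_count_row = 2_000 * 4

print(f"one image: {bytes_per_image / 1024:.0f} KiB")
print(f"one chunk of 16 images: {bytes_per_chunk / 1024**2:.1f} MiB")
print(f"image vs. count row: {bytes_per_image / bytes_per_count_row:.1f}x")
```

Under these assumptions a 16-image chunk already holds a few MiB, which is why a small n_obs_per_chunk is a sensible starting point.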
from pathlib import Path
import anndata as ad
from annbatch import DatasetCollection
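def _load_images(path: str) -> ad.AnnData:
    # Keep the file open: the lazy image array reads from it when the collection is written.
    f = h5py.File(path, "r")
    obs = ad.io.read_elem(f["obs"])
    # Lazily expose the image stack, 16 cells per chunk with all other axes kept whole.
    images = ad.experimental.read_elem_lazy(f["obsm"]["single_cell_images"], chunks=(16, -1, -1, -1))
    # Prefix the index with the file name so cell indices stay unique across files.
    obs.index = f"{Path(path).stem}_" + obs.index.astype(str)
    return ad.AnnData(obs=obs, obsm={"single_cell_images": images})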
collection = DatasetCollection(zarr.open("images_collection.zarr", mode="w"))
collection.add_adatas(
    adata_paths=h5sc_paths,
    load_adata=_load_images,
    shuffle=True,
    n_obs_per_chunk=16,  # images read contiguously per chunk
    dataset_size=100_000,  # images per on-disk dataset
)
print("datasets in collection:", len(list(collection)))
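The file-name prefix applied to the cell indices can be checked in isolation. The sketch below uses plain Python lists with hypothetical file names and cell indices (the real index values in the example files may differ) to show how reused indices collide after concatenation and how the prefix resolves that.

```python
from pathlib import Path

# Hypothetical file names and cell indices; both files reuse the same indices.
paths = ["data/file_a.h5sc", "data/file_b.h5sc"]
cell_ids = ["0", "1", "2"]

# Concatenating without a prefix produces duplicate indices.
raw = [cid for _ in paths for cid in cell_ids]

# Prefixing with the file stem keeps every index unique.
prefixed = [f"{Path(p).stem}_{cid}" for p in paths for cid in cell_ids]

print(len(set(raw)), "unique of", len(raw))
print(len(set(prefixed)), "unique of", len(prefixed))
print(prefixed)
```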
Stream shuffled image mini-batches
The image data lives in obsm, not X, so we point add_datasets() at the obsm image arrays directly.
The loader streams them along the first axis exactly as it would a count matrix.
from annbatch import Loader
image_arrays = [group["obsm"]["single_cell_images"] for group in collection]
obs = [ad.io.read_elem(group["obs"]) for group in collection]
loader = Loader(batch_size=16, chunk_size=8, preload_nchunks=32, preload_to_gpu=False).add_datasets(
    datasets=image_arrays, obs=obs
)
batch = next(iter(loader))
images = batch["X"].cuda()
print(f"image batch: {tuple(images.shape)} {images.dtype} on {images.device}")
image batch: (16, 5, 128, 128) torch.float16 on cuda:0
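For intuition about how batch_size, chunk_size and preload_nchunks might interact, the following is a simplified sketch of chunked shuffling over first-axis row indices. It is an assumption about the general technique, not annbatch's actual implementation; the function `chunked_shuffle_batches` is hypothetical and works purely on indices, so it applies equally to count matrices and image stacks.

```python
import random


def chunked_shuffle_batches(n_rows, batch_size, chunk_size, preload_nchunks, seed=0):
    """Yield lists of row indices.

    Contiguous chunks are visited in random order, pooled preload_nchunks at a
    time, shuffled in memory, and cut into batches of batch_size.
    """
    rng = random.Random(seed)
    # Contiguous [start, stop) ranges along the first axis.
    chunks = [(s, min(s + chunk_size, n_rows)) for s in range(0, n_rows, chunk_size)]
    rng.shuffle(chunks)

    leftover = []
    for i in range(0, len(chunks), preload_nchunks):
        group = chunks[i:i + preload_nchunks]
        pool = leftover + [row for start, stop in group for row in range(start, stop)]
        rng.shuffle(pool)  # in-memory shuffle of the preloaded rows
        n_full = len(pool) // batch_size * batch_size
        for j in range(0, n_full, batch_size):
            yield pool[j:j + batch_size]
        leftover = pool[n_full:]  # carried over into the next pool
    if leftover:
        yield leftover  # final, possibly smaller batch


# 60 rows, matching the two 30-cell example files combined.
batches = list(chunked_shuffle_batches(n_rows=60, batch_size=16, chunk_size=8, preload_nchunks=32))
print([len(b) for b in batches])
```

In this sketch, disk reads stay contiguous (chunk_size rows at a time) while the in-memory pool provides the randomness; a larger preload_nchunks mixes rows from more chunks at the cost of more memory.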