Single-cell microscopy images

Single-cell imaging produces one dense tensor per cell, here 5 × 128 × 128 (channels × height × width). A scPortrait single-cell dataset stores these tensors stacked in obsm["single_cell_images"] as a single dense n_cells × 5 × 128 × 128 array.

The key idea: the Loader fetches contiguous rows along the first axis of any zarr array, whether that array is a 2-D count matrix or a 4-D image stack. So we first write the images, shuffled, into a sharded zarr collection with add_adatas() and then stream them directly with add_datasets().
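To make the first-axis point concrete, here is a minimal NumPy sketch. It is not the Loader's code: `take_rows` is a hypothetical helper and the 2,000-gene count matrix is a made-up size. It only shows that one contiguous slice along axis 0 works the same way on a 2-D and a 4-D array.

```python
import numpy as np


def take_rows(array: np.ndarray, start: int, stop: int) -> np.ndarray:
    """Return rows [start, stop) along the first axis, whatever the trailing shape."""
    return array[start:stop]


# Stand-in arrays (assumed sizes, for illustration only).
counts = np.zeros((30, 2000), dtype=np.float32)          # cells x genes
images = np.zeros((30, 5, 128, 128), dtype=np.float16)   # cells x channels x H x W

count_rows = take_rows(counts, 8, 16)
image_rows = take_rows(images, 8, 16)
print(count_rows.shape)  # (8, 2000)
print(image_rows.shape)  # (8, 5, 128, 128)
```

The trailing dimensions simply ride along with each row, which is why an image stack can be treated like a very wide count matrix.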

Configure zarrs

import warnings

import zarr

zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"})
for msg in ("Consolidated metadata is currently not part in the Zarr format 3 specification.*",):
    warnings.filterwarnings("ignore", message=msg)

Get the data

scPortrait ships small example single-cell image datasets (.h5sc files).

import scportrait

h5sc_paths = [str(p) for p in scportrait.data.autophagosome_h5sc()]
print(h5sc_paths)

Each file stores its images in obsm["single_cell_images"].

import h5py

with h5py.File(h5sc_paths[0], "r") as f:
    images = f["obsm"]["single_cell_images"]
    print(f"single_cell_images: {images.shape} {images.dtype}  (cells × channels × H × W)")

single_cell_images: (30, 5, 128, 128) float16  (cells × channels × H × W)
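For a sense of scale, the printed shape and dtype are enough to work out the memory footprint by hand. The short calculation below uses only the values shown in the output above (30 cells, 5 × 128 × 128 values per cell, 2 bytes per float16 value).

```python
import math

image_shape = (5, 128, 128)  # channels x height x width, from the output above
bytes_per_value = 2          # float16
n_cells = 30                 # cells in the first example file

bytes_per_image = math.prod(image_shape) * bytes_per_value
kib_per_image = bytes_per_image / 1024
mib_per_file = n_cells * bytes_per_image / 1024**2

print(bytes_per_image)  # 163840
print(kib_per_image)    # 160.0
print(mib_per_file)     # 4.6875
```

Each cell therefore costs 160 KiB uncompressed, so even modest row counts add up quickly.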

Convert into a shuffled collection

We pass add_adatas() a load_adata callback that puts the image stack in obsm; the AnnData needs no X at all. Because add_adatas() concatenates all inputs, and both example files use the same cell indices, we prefix each index with the file name to keep the indices unique across files. n_obs_per_chunk is the number of images read contiguously per chunk. Keep it small, since each image is far larger than a row of counts.

from pathlib import Path

import anndata as ad

from annbatch import DatasetCollection


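def _load_images(path: str) -> ad.AnnData:
    # Keep the file handle open: the lazy image array reads from it later.
    f = h5py.File(path, "r")
    obs = ad.io.read_elem(f["obs"])
    # Read the image stack lazily, 16 whole images per chunk (-1 leaves an axis unsplit).
    images = ad.experimental.read_elem_lazy(f["obsm"]["single_cell_images"], chunks=(16, -1, -1, -1))
    # Prefix cell indices with the file stem so they stay unique after concatenation.
    obs.index = f"{Path(path).stem}_" + obs.index.astype(str)
    return ad.AnnData(obs=obs, obsm={"single_cell_images": images})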


collection = DatasetCollection(zarr.open("images_collection.zarr", mode="w"))
collection.add_adatas(
    adata_paths=h5sc_paths,
    load_adata=_load_images,
    shuffle=True,
    n_obs_per_chunk=16,  # images read contiguously per chunk
    dataset_size=100_000,  # images per on-disk dataset
)
print("datasets in collection:", len(list(collection)))
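The index prefixing in _load_images can be checked in isolation. The sketch below uses two hypothetical file names and a tiny made-up obs table (`make_obs` is an illustrative helper, not part of any library) to show that concatenating unprefixed frames yields duplicate indices, while the same prefix expression keeps them unique.

```python
from pathlib import Path

import pandas as pd

# Hypothetical file names; both files number their cells from "0".
paths = ["data/file_a.h5sc", "data/file_b.h5sc"]


def make_obs() -> pd.DataFrame:
    """Build a tiny stand-in obs table with string indices "0".."2"."""
    return pd.DataFrame({"label": ["x", "y", "z"]}, index=["0", "1", "2"])


# Without a prefix, the concatenated index repeats every entry.
raw = pd.concat([make_obs() for _ in paths])

# With the file stem as a prefix, every entry is distinct.
frames = []
for path in paths:
    obs = make_obs()
    obs.index = f"{Path(path).stem}_" + obs.index.astype(str)
    frames.append(obs)
prefixed = pd.concat(frames)

print(raw.index.is_unique)       # False
print(prefixed.index.is_unique)  # True
print(prefixed.index.tolist())
```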

Stream shuffled image mini-batches

The image data lives in obsm, not X, so we point add_datasets() at the obsm image arrays directly. The loader streams them along the first axis exactly as it would a count matrix, and each batch exposes the images under the "X" key.

from annbatch import Loader

image_arrays = [group["obsm"]["single_cell_images"] for group in collection]
obs = [ad.io.read_elem(group["obs"]) for group in collection]

loader = Loader(batch_size=16, chunk_size=8, preload_nchunks=32, preload_to_gpu=False).add_datasets(
    datasets=image_arrays, obs=obs
)

batch = next(iter(loader))
images = batch["X"].cuda()
print(f"image batch: {tuple(images.shape)} {images.dtype} on {images.device}")

image batch: (16, 5, 128, 128) torch.float16 on cuda:0
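How batch_size, chunk_size, and preload_nchunks might interact can be sketched in plain NumPy. This is an illustration of the general chunked-shuffle idea under stated assumptions, not annbatch's implementation: `chunked_batches` is a hypothetical function, it assumes the loader buffers preload_nchunks contiguous chunks, shuffles within that buffer, and cuts it into batches, and it uses a small 8 × 8 stand-in stack instead of real images.

```python
import numpy as np


def chunked_batches(array, batch_size, chunk_size, preload_nchunks, seed=0):
    """Yield shuffled mini-batches built from contiguous chunks along axis 0."""
    rng = np.random.default_rng(seed)
    n_chunks = array.shape[0] // chunk_size  # a ragged tail is dropped for simplicity
    chunk_order = rng.permutation(n_chunks)
    for i in range(0, n_chunks, preload_nchunks):
        # Each chunk is one contiguous read of chunk_size rows.
        starts = chunk_order[i : i + preload_nchunks] * chunk_size
        buffer = np.concatenate([array[s : s + chunk_size] for s in starts])
        # Shuffle within the preloaded buffer, then cut it into batches.
        buffer = buffer[rng.permutation(len(buffer))]
        for j in range(0, len(buffer), batch_size):
            yield buffer[j : j + batch_size]


# Stand-in stack: image i is filled with the value i so rows can be tracked.
images = np.empty((512, 5, 8, 8), dtype=np.float16)
images[:] = np.arange(512).reshape(-1, 1, 1, 1)

batches = list(chunked_batches(images, batch_size=16, chunk_size=8, preload_nchunks=32))
print(len(batches), batches[0].shape)  # 32 (16, 5, 8, 8)

# Every row appears exactly once across all batches.
seen = np.sort(np.concatenate([b[:, 0, 0, 0] for b in batches]))
print(np.array_equal(seen, np.arange(512)))  # True
```

If the loader buffers rows as in this sketch, the settings above would hold 32 × 8 = 256 images at a time, which is about 40 MiB at 5 × 128 × 128 × 2 bytes = 160 KiB per float16 image.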