Genetics (VCF / whole-genome sequencing)

Genetics (VCF / whole-genome sequencing)#

Genotype data is a samples × variants matrix: for each individual and each genomic position, how many copies of the alternate allele they carry (0, 1, or 2). Most variants are rare, so for any given individual the row is overwhelmingly zeros — a natural fit for the sparse streaming annbatch excels at.

This guide runs the full path the annbatch paper uses for the 1000 Genomes data:

VCF ──plink2──▶ PGEN ──cellink──▶ AnnData (zarr) ──annbatch──▶ shuffled mini-batches

plink2 filters and packs the VCF into the compact PGEN format. cellink streams PGEN into an AnnData-compatible zarr store without ever loading the whole matrix into memory.

Configure zarr and anndata#

import warnings

import anndata as ad
import zarr

zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"})

# AnnData 0.13.0 defaults to the following settings
ad.settings.zarr_write_format = 3
ad.settings.auto_shard_zarr_v3 = True

for msg in ("Consolidated metadata is currently not part in the Zarr format 3 specification.*",):
    warnings.filterwarnings("ignore", message=msg)

Get the data#

We use chromosome 22 of the 1000 Genomes high-coverage WGS call set (3202 individuals). The full per-chromosome VCF is ~27 GB, so for a quick run we stream just a 1 Mb region with bcftools (conda install -c bioconda bcftools) through its remote index — a few hundred MB instead of the whole file. The pipeline below is identical for a full chromosome, or for all of them merged together.

!test -f 1000G_chr22.vcf.gz || bcftools view -r chr22:20000000-21000000 \
    https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_raw_GT_with_annot/20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chr22.recalibrated_variants.vcf.gz \
    -Oz -o 1000G_chr22.vcf.gz

VCF → PGEN with plink2#

We keep biallelic SNPs and assign each variant a stable chr:pos:ref:alt ID.

!plink2 --vcf 1000G_chr22.vcf.gz --make-pgen --max-alleles 2 --snps-only just-acgt \
        --set-all-var-ids '@:#:$r:$a' --new-id-max-allele-len 1000 missing \
        --vcf-half-call m --threads 8 --out chr22 > chr22.make_pgen.log 2>&1
!grep -E "samples .* loaded|variants remaining" chr22.log

Select rare variants#

Here we keep polymorphic variants with minor-allele frequency below 1% — the sparse, high-dimensional regime. Common variants with MAF ≥ 1% are denser and can be stored as a dense matrix. annbatch can stream dense arrays too, as shown in the Single-cell microscopy images guide.

!plink2 --pfile chr22 --maf 1e-9 --max-maf 0.0099999 --make-pgen \
        --threads 8 --out chr22_rare > chr22.rare.log 2>&1
!grep -E "variants remaining" chr22_rare.log

PGEN → AnnData with cellink#

cellink.io.stream_pgen_to_zarr() streams the PGEN into a sparse, on-disk AnnData zarr store. cellink.io.read_pgen_zarr() reads it back. We cap the number of variants here for the sake of demonstration.

import logging

from cellink.io import read_pgen_zarr, stream_pgen_to_zarr

logging.getLogger("cellink.io._pgen").setLevel(logging.WARNING)  # silence per-chunk INFO logs

stream_pgen_to_zarr(
    "chr22_rare",
    "chr22_rare.zarr",
    sparse=True,
    sparse_format="csr",
    max_variants=20_000,
    chunk_samples=3202,
    chunk_variants=16384,
)

genotypes = read_pgen_zarr("chr22_rare.zarr")
density = genotypes.X.nnz / (genotypes.shape[0] * genotypes.shape[1])
print(
    f"{genotypes.shape[0]:,} individuals × {genotypes.shape[1]:,} variants | {genotypes.X.dtype} | density {density:.2%}"
)

3,202 individuals × 20,000 variants | int8 | density 0.26%

Convert into a shuffled annbatch collection#

The cellink store is already an AnnData zarr, so we hand it straight to add_adatas(). The load_adata gives the variants their chr:pos:ref:alt IDs (the "2" column of the PGEN variant table) and keeps the sample ID in obs.

import anndata as ad
import pandas as pd

from annbatch import DatasetCollection


def _load_genotypes(path: str) -> ad.AnnData:
    a = ad.read_zarr(path)
    a.obs = a.obs[["iid"]].astype({"iid": "str"}).set_index("iid")
    a.var = pd.DataFrame(index=a.var["2"].astype("str").to_numpy())
    return a


collection = DatasetCollection(zarr.open("genetics_collection.zarr", mode="w"))
collection.add_adatas(adata_paths=["chr22_rare.zarr"], load_adata=_load_genotypes, shuffle=True)

Stream shuffled mini-batches#

Each batch is a sparse genotype matrix; densify on the GPU and cast to float for your model.

from annbatch import Loader


def _load_x(group: zarr.Group) -> ad.AnnData:
    return ad.AnnData(X=ad.io.sparse_dataset(group["X"]))


loader = Loader(batch_size=512, chunk_size=64, preload_nchunks=32, preload_to_gpu=False).use_collection(
    collection, load_adata=_load_x
)

batch = next(iter(loader))
x = batch["X"].cuda().to_dense().float()
print(f"genotype batch: {tuple(x.shape)} {x.dtype} on {x.device}")

genotype batch: (512, 20000) torch.float32 on cuda:0

Scaling up#

The full pipeline is unchanged at scale: download the whole per-chromosome VCFs instead of the 1 Mb slice, merge all chromosomes with plink2, drop the max_variants cap, and pass every per-chunk store to add_adatas. The paper scales this to 500k individuals by replicating the 1000 Genomes panel; annbatch then streams the resulting multi-terabyte collection the same way it streams the example above.