Run InSituCNV

This notebook runs the full InSituCNV workflow on a spatial transcriptomics dataset: normalize raw counts, smooth over neighbors, normalize and log-transform, add genomic positions, run infercnvpy, cluster CNV profiles, plot heatmaps and spatial CNV maps, then manually choose normal/tumor clusters and optional tumor clone labels.

The example setup uses generic terminology. Replace the paths, annotation column, and reference categories with values for your own dataset.

1. Setup

Run this notebook from the package environment:

conda activate insitucnv_env
cd InSituCNV
pip install -e .
jupyter lab

from pathlib import Path
import pandas as pd
import scanpy as sc
import insitucnv as icv

sc.settings.verbosity = 2
sc.set_figure_params(figsize=(6, 6), dpi=100)

SAMPLE_NAME = "example_sample"
DATA_PATH = Path("data/your_dataset.h5ad")
OUTPUT_DIR = Path("outputs") / SAMPLE_NAME
PLOTS_DIR = OUTPUT_DIR / "plots"

RAW_LAYER = "raw_counts"  # Change this if your raw count layer has another name.
REFERENCE_KEY = "cell_type"  # obs column containing normal/reference labels.
REFERENCE_CATEGORIES = [
    "T_cells",
    "B_cells",
    "Stromal",
]
SPATIAL_KEY = "spatial"

OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
PLOTS_DIR.mkdir(parents=True, exist_ok=True)

DATA_PATH

2. Load the AnnData object

Before running the CNV steps, check that your AnnData object has:

raw counts in adata.layers[RAW_LAYER], or raw counts in adata.X that can be copied into that layer
normal/reference labels in adata.obs[REFERENCE_KEY]
spatial coordinates in adata.obsm[SPATIAL_KEY]
a precomputed neighbor graph for smoothing

adata = sc.read_h5ad(DATA_PATH)
adata

print("layers:", list(adata.layers.keys()))
print("obsm:", list(adata.obsm.keys()))
print("obs columns:", list(adata.obs.columns))
print("reference labels:", sorted(adata.obs[REFERENCE_KEY].astype(str).unique()))

if SPATIAL_KEY not in adata.obsm:
    raise KeyError(f"adata.obsm[{SPATIAL_KEY!r}] is required for spatial plots.")

if REFERENCE_KEY not in adata.obs:
    raise KeyError(f"adata.obs[{REFERENCE_KEY!r}] is required for inferCNV reference labels.")

if RAW_LAYER not in adata.layers:
    print(f"Layer {RAW_LAYER!r} was not found. Copying current adata.X into adata.layers[{RAW_LAYER!r}].")
    adata.layers[RAW_LAYER] = adata.X.copy()

icv.pl.plot_spatial(
    adata,
    color=REFERENCE_KEY,
    spatial_key=SPATIAL_KEY,
    output_path=PLOTS_DIR / "reference_labels_spatial.png",
    point_size=15,
    title=f"{SAMPLE_NAME} reference labels",
    show=True,
)

3. Define inferCNV reference cells

The reference categories should be normal or healthy cells for your dataset. For a tissue sample this is often immune, stromal, endothelial, or other non-tumor categories, depending on your annotations.

present_reference_categories = [
    category
    for category in REFERENCE_CATEGORIES
    if category in set(adata.obs[REFERENCE_KEY].astype(str))
]

if not present_reference_categories:
    raise ValueError("None of the configured REFERENCE_CATEGORIES were found in the reference annotation column.")

present_reference_categories

4. Normalize, smooth, log-normalize, and add genomic positions

This step:

restores raw counts from adata.layers[RAW_LAYER] into adata.X
normalizes total counts into adata.layers["norm"]
smooths normalized counts over the neighbor graph into adata.layers["M"]
normalizes and log-transforms the smoothed matrix into adata.layers["log_norm"]
subsets to genes with genomic positions and adds chromosome, start, and end

adata = icv.tl.prepare_cnv_input(
    adata,
    raw_layer=RAW_LAYER,
    target_sum=1e4,
    smoothing_neighbors=100,
    normalized_layer="norm",
    smoothed_layer="M",
    log_layer="log_norm",
)

adata

adata.var[["chromosome", "start", "end"]].head()

5. Run inferCNV

Adjust window_size, step, and lfc_clip if you need different inferCNV settings for your panel or tissue.

icv.tl.run_infercnv(
    adata,
    reference_key=REFERENCE_KEY,
    reference_categories=present_reference_categories,
    input_layer="log_norm",
    window_size=60,
    step=10,
    lfc_clip=4,
    chunksize=1000,
    calculate_gene_values=True,
)

adata

6. Compute CNV neighbors and cluster at multiple resolutions

Testing several resolutions lets you compare how granular the CNV clone structure should be. The keys created here are cnv_leiden_res0.1, cnv_leiden_res0.2, and cnv_leiden_res0.3.

icv.tl.compute_cnv_neighbors(adata)

resolutions = [0.1, 0.2, 0.3]
cluster_keys = icv.tl.cluster_cnv_resolutions(
    adata,
    resolutions=resolutions,
    dendrogram=True,
)

cluster_keys

for key in cluster_keys:
    display(pd.crosstab(adata.obs[key], adata.obs[REFERENCE_KEY]))

7. Plot CNV heatmaps and spatial CNV clusters

Review these plots before assigning any cluster as normal, tumor, or a separate tumor clone.

for key in cluster_keys:
    print(key)
    icv.pl.plot_chromosome_heatmap(
        adata,
        groupby=key,
        output_path=PLOTS_DIR / f"{key}_heatmap.png",
        dendrogram=True,
        vmin=-0.4,
        vmax=0.4,
        show=True,
    )
    icv.pl.plot_spatial(
        adata,
        color=key,
        output_path=PLOTS_DIR / f"{key}_spatial.png",
        spatial_key=SPATIAL_KEY,
        point_size=1,
        title=f"{SAMPLE_NAME} {key}",
        show=True,
    )

8. Manually choose normal, tumor, and clone clusters

Use the heatmaps and spatial maps above to decide which clustering resolution to annotate. The package no longer infers a normal cluster automatically; edit the lists below after visual review.

primary_key = "cnv_leiden_res0.1"

cluster_summary = (
    adata.obs[primary_key]
    .astype(str)
    .value_counts()
    .sort_index()
    .rename_axis(primary_key)
    .to_frame("n_cells")
)
cluster_summary

# Edit these lists after inspecting the heatmap and spatial CNV plot.
normal_clusters = []  # e.g. ["0"]
tumor_clusters = []  # optional; leave empty to label every non-normal cluster as tumor
tumor_clone_clusters = []  # e.g. ["1", "2"] for tumor clusters with distinct CNV profiles

if not normal_clusters and not tumor_clusters:
    raise ValueError("Edit normal_clusters and/or tumor_clusters after reviewing the CNV plots above.")

all_clusters = set(adata.obs[primary_key].astype(str).unique())
normal_clusters_for_clones = normal_clusters or sorted(all_clusters - set(tumor_clusters))

icv.tl.assign_cnv_status(
    adata,
    cluster_key=primary_key,
    normal_clusters=normal_clusters or None,
    tumor_clusters=tumor_clusters or None,
    output_key="cnv_status",
)
icv.tl.assign_cnv_subclones(
    adata,
    cluster_key=primary_key,
    tumor_clusters=tumor_clone_clusters,
    normal_clusters=normal_clusters_for_clones,
    output_key="cnv_clone",
)

display(pd.crosstab(adata.obs[primary_key], adata.obs["cnv_status"]))
display(pd.crosstab(adata.obs[primary_key], adata.obs["cnv_clone"]))

for color, filename, title in [
    ("cnv_status", "cnv_status_spatial.png", f"{SAMPLE_NAME} manual CNV status"),
    ("cnv_clone", "cnv_clone_spatial.png", f"{SAMPLE_NAME} manual CNV clones"),
]:
    icv.pl.plot_spatial(
        adata,
        color=color,
        output_path=PLOTS_DIR / filename,
        spatial_key=SPATIAL_KEY,
        point_size=1,
        title=title,
        show=True,
    )

9. Optional: tumor-enriched CNV clustering

For some datasets, you may want to recluster tumor-enriched cells (e.g. epithelial cells in carcinomas) by CNV profile while labeling all other cells as non-tumor. Edit subset_values if your tumor-enriched annotation uses a different name.

epi_cluster_keys = icv.tl.cluster_cnv_resolutions(
    adata,
    resolutions=[0.1],
    key_prefix="epi_cnv_leiden_res",
    subset_key=REFERENCE_KEY,
    subset_values=["Epithelial"],
    subset_label_prefix="epi_",
    outside_label="non-epi",
    dendrogram=True,
)

epi_cluster_keys

for key in epi_cluster_keys:
    icv.pl.plot_spatial(
        adata,
        color=key,
        output_path=PLOTS_DIR / f"{key}_spatial.png",
        spatial_key=SPATIAL_KEY,
        point_size=1,
        title=f"{SAMPLE_NAME} {key}",
        show=True,
    )

10. Export manually selected results

Run these cells after filling in the manual normal_clusters, tumor_clusters, and optional tumor_clone_clusters lists above. The gene-level table uses adata.layers["gene_values_cnv"], which was created by infercnvpy because calculate_gene_values=True.

mean_cnv = icv.tl.export_mean_cnv_per_gene(
    adata,
    OUTPUT_DIR / f"{SAMPLE_NAME}_tumor_mean_cnv_per_gene.tsv",
    layer="gene_values_cnv",
    obs_key="cnv_status",
    obs_values=("tumor",),
)

mean_cnv.head()

icv.tl.export_cell_groups(
    adata,
    OUTPUT_DIR / f"{SAMPLE_NAME}_cnv_status_xenium_cell_groups.csv",
    group_key="cnv_status",
)
icv.tl.export_cell_groups(
    adata,
    OUTPUT_DIR / f"{SAMPLE_NAME}_cnv_clone_xenium_cell_groups.csv",
    group_key="cnv_clone",
)

adata.write(OUTPUT_DIR / f"{SAMPLE_NAME}_InSituCNV.h5ad", compression="gzip")
OUTPUT_DIR