Run InSituCNV

This notebook runs the full InSituCNV workflow on a spatial transcriptomics dataset. It normalizes raw counts, smooths them over the neighbor graph, normalizes and log-transforms the smoothed matrix, adds genomic positions, runs infercnvpy, clusters the CNV profiles, and plots heatmaps and spatial CNV maps. The final steps are manual: you choose the normal and tumor clusters and, optionally, the tumor clone labels.

The example setup uses placeholder names. Replace the paths, the annotation column, and the reference categories with the values for your own dataset.

1. Setup

Run this notebook from the package environment:

conda activate insitucnv_env
cd InSituCNV
pip install -e .
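jupyter lab

from pathlib import Path
import pandas as pd
import scanpy as sc
import insitucnv as icv
# display() is built into Jupyter; the explicit import also lets the code run as a script.
from IPython.display import display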

sc.settings.verbosity = 2
sc.set_figure_params(figsize=(6, 6), dpi=100)

SAMPLE_NAME = "example_sample"
DATA_PATH = Path("data/your_dataset.h5ad")
OUTPUT_DIR = Path("outputs") / SAMPLE_NAME
PLOTS_DIR = OUTPUT_DIR / "plots"

RAW_LAYER = "raw_counts"  # Change this if your raw count layer has another name.
REFERENCE_KEY = "cell_type"  # obs column containing normal/reference labels.
REFERENCE_CATEGORIES = [
    "T_cells",
    "B_cells",
    "Stromal",
]
SPATIAL_KEY = "spatial"

OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
PLOTS_DIR.mkdir(parents=True, exist_ok=True)

DATA_PATH

2. Load the AnnData object

Before running the CNV steps, check that your AnnData object has:

  • raw counts in adata.layers[RAW_LAYER], or raw counts in adata.X that can be copied into that layer

  • normal/reference labels in adata.obs[REFERENCE_KEY]

  • spatial coordinates in adata.obsm[SPATIAL_KEY]

  • a precomputed neighbor graph for smoothing

adata = sc.read_h5ad(DATA_PATH)
adata
print("layers:", list(adata.layers.keys()))
print("obsm:", list(adata.obsm.keys()))
print("obs columns:", list(adata.obs.columns))
# Validate the required keys before using them, so a missing key gives a clear error.
if SPATIAL_KEY not in adata.obsm:
    raise KeyError(f"adata.obsm[{SPATIAL_KEY!r}] is required for spatial plots.")

if REFERENCE_KEY not in adata.obs:
    raise KeyError(f"adata.obs[{REFERENCE_KEY!r}] is required for inferCNV reference labels.")

print("reference labels:", sorted(adata.obs[REFERENCE_KEY].astype(str).unique()))

if RAW_LAYER not in adata.layers:
    print(f"Layer {RAW_LAYER!r} was not found. Copying current adata.X into adata.layers[{RAW_LAYER!r}].")
    adata.layers[RAW_LAYER] = adata.X.copy()

icv.pl.plot_spatial(
    adata,
    color=REFERENCE_KEY,
    spatial_key=SPATIAL_KEY,
    output_path=PLOTS_DIR / "reference_labels_spatial.png",
    point_size=15,
    title=f"{SAMPLE_NAME} reference labels",
    show=True,
)
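
The checklist above also asks for a precomputed neighbor graph, which the checks in this section do not verify. Below is a minimal sketch of such a check. It assumes the graph follows scanpy's default layout (adata.uns["neighbors"] plus adata.obsp["connectivities"], as written by sc.pp.neighbors). The helper name has_neighbor_graph is hypothetical, and InSituCNV may look for the graph under different keys.

```python
from types import SimpleNamespace


def has_neighbor_graph(adata, key="neighbors", graph="connectivities"):
    """Return True if a scanpy-style neighbor graph appears to be stored.

    Assumes scanpy's default layout: parameters in adata.uns[key] and the
    graph itself in adata.obsp[graph]. Both key names are assumptions.
    """
    return key in adata.uns and graph in adata.obsp


# Stand-in objects that only mimic the two attributes inspected above.
with_graph = SimpleNamespace(uns={"neighbors": {}}, obsp={"connectivities": object()})
without_graph = SimpleNamespace(uns={}, obsp={})

print(has_neighbor_graph(with_graph))
print(has_neighbor_graph(without_graph))
```

On a real object, call has_neighbor_graph(adata). If it returns False, build the graph first, for example with sc.pp.neighbors. Whether InSituCNV expects an expression-based or a spatial graph is not stated here, so check the package documentation.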

3. Define inferCNV reference cells

The reference categories should be the normal (healthy) cells in your dataset. For a tissue sample, these are often immune, stromal, endothelial, or other non-tumor cell types, depending on your annotations.

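# Keep only the configured reference categories that actually occur in the data.
available_labels = set(adata.obs[REFERENCE_KEY].astype(str))
present_reference_categories = [
    category for category in REFERENCE_CATEGORIES if category in available_labels
]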

if not present_reference_categories:
    raise ValueError("None of the configured REFERENCE_CATEGORIES were found in the reference annotation column.")

present_reference_categories
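
A reference category that is present but contains only a handful of cells gives a noisy baseline, so it can help to look at the cell counts as well. The sketch below uses plain pandas on toy labels. The helper name reference_cell_counts is hypothetical and not part of InSituCNV.

```python
import pandas as pd


def reference_cell_counts(labels, reference_categories):
    """Count cells per reference category; categories absent from the data get 0."""
    counts = pd.Series(list(labels)).astype(str).value_counts()
    return counts.reindex(list(reference_categories), fill_value=0)


# Toy labels standing in for adata.obs[REFERENCE_KEY].
toy_labels = ["T_cells", "T_cells", "Epithelial", "Stromal", "Epithelial"]
counts = reference_cell_counts(toy_labels, ["T_cells", "B_cells", "Stromal"])
print(counts)
```

On the real data, the equivalent call would be reference_cell_counts(adata.obs[REFERENCE_KEY], REFERENCE_CATEGORIES).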

4. Normalize, smooth, log-normalize, and add genomic positions

This step:

  1. restores raw counts from adata.layers[RAW_LAYER] into adata.X

  2. normalizes total counts into adata.layers["norm"]

  3. smooths normalized counts over the neighbor graph into adata.layers["M"]

  4. normalizes and log-transforms the smoothed matrix into adata.layers["log_norm"]

  5. subsets to genes with genomic positions and adds chromosome, start, and end

adata = icv.tl.prepare_cnv_input(
    adata,
    raw_layer=RAW_LAYER,
    target_sum=1e4,
    smoothing_neighbors=100,
    normalized_layer="norm",
    smoothed_layer="M",
    log_layer="log_norm",
)

adata
adata.var[["chromosome", "start", "end"]].head()
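
To make the first four steps concrete, here is a small NumPy sketch on toy data. It is not the package's implementation. It assumes that smoothing means averaging each cell with its graph neighbors (the cell itself included) and that the second normalization uses the same target sum. The function names are hypothetical.

```python
import numpy as np


def normalize_total(counts, target_sum=1e4):
    """Scale each cell (row) so that its values sum to target_sum."""
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid division by zero for empty cells
    return counts / totals * target_sum


def smooth_over_neighbors(values, adjacency):
    """Average each cell with its graph neighbors (the cell itself included)."""
    weights = adjacency + np.eye(adjacency.shape[0])
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ values


# Toy data: 3 cells x 2 genes; cells 0 and 1 are neighbors, cell 2 is isolated.
raw = np.array([[1.0, 3.0], [2.0, 2.0], [0.0, 4.0]])
adjacency = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])

norm = normalize_total(raw)                        # analogous to layers["norm"]
smoothed = smooth_over_neighbors(norm, adjacency)  # analogous to layers["M"]
log_norm = np.log1p(normalize_total(smoothed))     # analogous to layers["log_norm"]
print(log_norm.round(3))
```

In the toy output, cells 0 and 1 end up with identical profiles because each is averaged with the other, while the isolated cell 2 keeps its own profile.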

5. Run inferCNV

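Adjust window_size, step, and lfc_clip if your panel or tissue needs different inferCNV settings. In infercnvpy, window_size is the number of genes in each running window, step keeps only every nth window, and lfc_clip is the value at which log fold changes are clipped.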

icv.tl.run_infercnv(
    adata,
    reference_key=REFERENCE_KEY,
    reference_categories=present_reference_categories,
    input_layer="log_norm",
    window_size=60,
    step=10,
    lfc_clip=4,
    chunksize=1000,
    calculate_gene_values=True,
)

adata
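
The roles of window_size, step, and lfc_clip are easier to see on a toy example. The sketch below is a deliberately simplified running-window calculation in NumPy for one chromosome. It assumes an unweighted moving average and a single reference profile. infercnvpy's actual implementation includes further steps that are not shown here, and the function name is hypothetical.

```python
import numpy as np


def windowed_log_fold_change(expr, is_reference, window_size=3, step=2, lfc_clip=4.0):
    """Simplified running-window CNV signal for genes already sorted by position.

    expr: cells x genes log-expression matrix for one chromosome.
    is_reference: boolean mask marking the reference (normal) cells.
    """
    # Log fold change of every cell against the mean reference profile.
    lfc = expr - expr[is_reference].mean(axis=0)
    lfc = np.clip(lfc, -lfc_clip, lfc_clip)
    # Unweighted moving average over window_size consecutive genes.
    kernel = np.ones(window_size) / window_size
    smoothed = np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="valid"), axis=1, arr=lfc
    )
    # Keep only every step-th window.
    return smoothed[:, ::step]


# Toy data: 2 reference cells and 1 cell with a gain over the last three genes.
expr = np.array([
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 2.0, 2.0, 2.0],
])
is_reference = np.array([True, True, False])
cnv_signal = windowed_log_fold_change(expr, is_reference)
print(cnv_signal)
```

A larger window_size averages over more genes and gives a smoother but coarser signal, while a larger step reduces the number of windows that are kept.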

6. Compute CNV neighbors and cluster at multiple resolutions

Testing several resolutions lets you compare how granular the CNV clone structure should be. The keys created here are cnv_leiden_res0.1, cnv_leiden_res0.2, and cnv_leiden_res0.3.

icv.tl.compute_cnv_neighbors(adata)

resolutions = [0.1, 0.2, 0.3]
cluster_keys = icv.tl.cluster_cnv_resolutions(
    adata,
    resolutions=resolutions,
    dendrogram=True,
)

cluster_keys
for key in cluster_keys:
    display(pd.crosstab(adata.obs[key], adata.obs[REFERENCE_KEY]))
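
Raw crosstab counts are hard to compare when clusters differ in size. The sketch below expresses the same information as the fraction of reference cells per cluster, which may help shortlist candidate normal clusters. A high fraction is only a hint and does not replace reviewing the heatmaps. The helper name is hypothetical.

```python
import pandas as pd


def reference_fraction_per_cluster(clusters, labels, reference_categories):
    """Fraction of cells in each cluster whose label is a reference category."""
    is_reference = pd.Series(list(labels)).astype(str).isin(list(reference_categories))
    cluster_ids = pd.Series(list(clusters)).astype(str).to_numpy()
    return is_reference.groupby(cluster_ids).mean()


# Toy data standing in for adata.obs[key] and adata.obs[REFERENCE_KEY].
toy_clusters = ["0", "0", "1", "1", "1"]
toy_labels = ["T_cells", "Stromal", "Epithelial", "Epithelial", "T_cells"]
fractions = reference_fraction_per_cluster(
    toy_clusters, toy_labels, ["T_cells", "B_cells", "Stromal"]
)
print(fractions)
```

On the real data, pass adata.obs[key], adata.obs[REFERENCE_KEY], and present_reference_categories.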

7. Plot CNV heatmaps and spatial CNV clusters

Review these plots before assigning any cluster as normal, tumor, or a separate tumor clone.

for key in cluster_keys:
    print(key)
    icv.pl.plot_chromosome_heatmap(
        adata,
        groupby=key,
        output_path=PLOTS_DIR / f"{key}_heatmap.png",
        dendrogram=True,
        vmin=-0.4,
        vmax=0.4,
        show=True,
    )
    icv.pl.plot_spatial(
        adata,
        color=key,
        output_path=PLOTS_DIR / f"{key}_spatial.png",
        spatial_key=SPATIAL_KEY,
        point_size=1,
        title=f"{SAMPLE_NAME} {key}",
        show=True,
    )
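
The fixed limits vmin=-0.4 and vmax=0.4 may saturate or wash out the heatmap on other datasets. One option is to derive symmetric limits from the data, as sketched below. The helper name is hypothetical. infercnvpy stores its result in adata.obsm["X_cnv"] by default; whether icv.tl.run_infercnv uses the same key is an assumption to verify on your object.

```python
import numpy as np


def symmetric_color_limit(cnv_values, percentile=99.0):
    """Return (vmin, vmax), symmetric around 0, from a percentile of |values|."""
    if hasattr(cnv_values, "toarray"):  # scipy sparse matrices
        cnv_values = cnv_values.toarray()
    magnitudes = np.abs(np.asarray(cnv_values, dtype=float))
    limit = float(np.percentile(magnitudes, percentile))
    return -limit, limit


# Toy CNV matrix: 2 cells x 3 genomic windows.
toy_cnv = np.array([[-0.2, 0.0, 0.1], [0.3, -0.1, 0.0]])
vmin, vmax = symmetric_color_limit(toy_cnv, percentile=100.0)
print(vmin, vmax)
```

A percentile below 100 keeps a few extreme windows from dominating the color scale.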

8. Manually choose normal, tumor, and clone clusters

Use the heatmaps and spatial maps above to decide which clustering resolution to annotate. The package no longer infers a normal cluster automatically; edit the lists below after visual review.

primary_key = "cnv_leiden_res0.1"

cluster_summary = (
    adata.obs[primary_key]
    .astype(str)
    .value_counts()
    .sort_index()
    .rename_axis(primary_key)
    .to_frame("n_cells")
)
cluster_summary
# Edit these lists after inspecting the heatmap and spatial CNV plot.
normal_clusters = []  # e.g. ["0"]
tumor_clusters = []  # optional; leave empty to label every non-normal cluster as tumor
tumor_clone_clusters = []  # e.g. ["1", "2"] for tumor clusters with distinct CNV profiles

if not normal_clusters and not tumor_clusters:
    raise ValueError("Edit normal_clusters and/or tumor_clusters after reviewing the CNV plots above.")

all_clusters = set(adata.obs[primary_key].astype(str).unique())
normal_clusters_for_clones = normal_clusters or sorted(all_clusters - set(tumor_clusters))

icv.tl.assign_cnv_status(
    adata,
    cluster_key=primary_key,
    normal_clusters=normal_clusters or None,
    tumor_clusters=tumor_clusters or None,
    output_key="cnv_status",
)
icv.tl.assign_cnv_subclones(
    adata,
    cluster_key=primary_key,
    tumor_clusters=tumor_clone_clusters,
    normal_clusters=normal_clusters_for_clones,
    output_key="cnv_clone",
)

display(pd.crosstab(adata.obs[primary_key], adata.obs["cnv_status"]))
display(pd.crosstab(adata.obs[primary_key], adata.obs["cnv_clone"]))

for color, filename, title in [
    ("cnv_status", "cnv_status_spatial.png", f"{SAMPLE_NAME} manual CNV status"),
    ("cnv_clone", "cnv_clone_spatial.png", f"{SAMPLE_NAME} manual CNV clones"),
]:
    icv.pl.plot_spatial(
        adata,
        color=color,
        output_path=PLOTS_DIR / filename,
        spatial_key=SPATIAL_KEY,
        point_size=1,
        title=title,
        show=True,
    )
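
Because the cluster lists are edited by hand, a typo (for example 0 instead of "0", or a cluster ID that does not exist at the chosen resolution) is easy to make. The sketch below is a small validation helper that could be run before the assignment calls. It is hypothetical and not part of InSituCNV, and treating an overlap between normal and tumor lists as an error is an assumption.

```python
def validate_cluster_selection(available, normal=(), tumor=(), clones=()):
    """Raise ValueError if the manually edited cluster lists are inconsistent.

    available: cluster IDs present in the data, as strings.
    """
    available = set(available)
    for name, chosen in [("normal", normal), ("tumor", tumor), ("clone", clones)]:
        not_strings = [c for c in chosen if not isinstance(c, str)]
        if not_strings:
            raise ValueError(f"{name} clusters must be strings, got {not_strings}")
        unknown = sorted(set(chosen) - available)
        if unknown:
            raise ValueError(f"Unknown {name} clusters: {unknown}")
    # A cluster cannot be normal and tumor (or a tumor clone) at the same time.
    overlap = sorted(set(normal) & (set(tumor) | set(clones)))
    if overlap:
        raise ValueError(f"Clusters listed as both normal and tumor: {overlap}")


# A consistent selection passes silently.
validate_cluster_selection({"0", "1", "2"}, normal=["0"], clones=["1", "2"])
```

With the notebook's variables, the call would be validate_cluster_selection(all_clusters, normal_clusters, tumor_clusters, tumor_clone_clusters).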

9. Optional: tumor-enriched CNV clustering

For some datasets, you may want to recluster tumor-enriched cells (e.g. epithelial cells in carcinomas) by CNV profile while labeling all other cells as non-tumor. Edit subset_values if your tumor-enriched annotation uses a different name.

epi_cluster_keys = icv.tl.cluster_cnv_resolutions(
    adata,
    resolutions=[0.1],
    key_prefix="epi_cnv_leiden_res",
    subset_key=REFERENCE_KEY,
    subset_values=["Epithelial"],
    subset_label_prefix="epi_",
    outside_label="non-epi",
    dendrogram=True,
)

epi_cluster_keys
for key in epi_cluster_keys:
    icv.pl.plot_spatial(
        adata,
        color=key,
        output_path=PLOTS_DIR / f"{key}_spatial.png",
        spatial_key=SPATIAL_KEY,
        point_size=1,
        title=f"{SAMPLE_NAME} {key}",
        show=True,
    )

10. Export manually selected results

Run these cells after filling in the manual normal_clusters, tumor_clusters, and optional tumor_clone_clusters lists above. The gene-level table uses adata.layers["gene_values_cnv"], which was created by infercnvpy because calculate_gene_values=True.

mean_cnv = icv.tl.export_mean_cnv_per_gene(
    adata,
    OUTPUT_DIR / f"{SAMPLE_NAME}_tumor_mean_cnv_per_gene.tsv",
    layer="gene_values_cnv",
    obs_key="cnv_status",
    obs_values=("tumor",),
)

mean_cnv.head()
icv.tl.export_cell_groups(
    adata,
    OUTPUT_DIR / f"{SAMPLE_NAME}_cnv_status_xenium_cell_groups.csv",
    group_key="cnv_status",
)
icv.tl.export_cell_groups(
    adata,
    OUTPUT_DIR / f"{SAMPLE_NAME}_cnv_clone_xenium_cell_groups.csv",
    group_key="cnv_clone",
)

adata.write(OUTPUT_DIR / f"{SAMPLE_NAME}_InSituCNV.h5ad", compression="gzip")
OUTPUT_DIR
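
For orientation, the sketch below shows what a per-gene mean over tumor cells amounts to, using NumPy and pandas on toy values. It is a guess based on the arguments passed to export_mean_cnv_per_gene above, not the package's implementation. The exported file may have a different layout, and ignoring NaN values is an assumption.

```python
import numpy as np
import pandas as pd


def mean_cnv_per_gene(gene_values, gene_names, status, keep=("tumor",)):
    """Average gene-level CNV values over the cells whose status is in keep."""
    mask = pd.Series(list(status)).astype(str).isin(list(keep)).to_numpy()
    if not mask.any():
        raise ValueError(f"No cells with status in {keep}")
    values = np.asarray(gene_values, dtype=float)[mask]
    # NaN entries are ignored (an assumption about how missing genes are handled).
    means = np.nanmean(values, axis=0)
    return pd.Series(means, index=list(gene_names), name="mean_cnv")


# Toy data: 3 cells x 2 genes, standing in for adata.layers["gene_values_cnv"].
toy_values = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
toy_status = ["normal", "tumor", "tumor"]
toy_means = mean_cnv_per_gene(toy_values, ["GENE_A", "GENE_B"], toy_status)
print(toy_means)
```

Only the two cells labeled "tumor" contribute to the means in this example.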