02. Smooth And Run inferCNV

This notebook runs the core insituCNV workflow in a step-by-step format.

The idea is that users can read each block, change a few parameters, and rerun only the relevant cells.

The expected starting point is a checked .h5ad file from Notebook 1, with:

adata.layers['raw_counts']
adata.obsm['spatial']
a valid annotation column that marks normal/reference cells

from pathlib import Path
import infercnvpy as cnv
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scanpy as sc

PROJECT_ROOT = Path('..').resolve()
import insitucnv as icv
from insitucnv.analysis import find_optimal_clustering

INPUT_PATH = PROJECT_ROOT / 'results' / 'example_workflow' / 'adata_checked_input.h5ad'
OUTPUT_DIR = PROJECT_ROOT / 'results' / 'example_workflow'
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

REFERENCE_KEY = 'cell_type'
REFERENCE_CATEGORIES = ['T_cells', 'B_cells', 'Stromal']

SMOOTHING_NEIGHBORS = 20
WINDOW_SIZE = 60
STEP = 10
LFC_CLIP = 4
CLUSTER_RESOLUTIONS = [0.1, 0.2, 0.3]

sc.settings.figdir = str(OUTPUT_DIR / 'figures')
Path(sc.settings.figdir).mkdir(parents=True, exist_ok=True)

Load The Checked Input

adata = sc.read_h5ad(INPUT_PATH)
adata

Keep Only The Reference Categories That Are Present

This makes the notebook more robust when users adapt it to a different dataset.

REFERENCE_CATEGORIES = [cat for cat in REFERENCE_CATEGORIES if cat in set(adata.obs[REFERENCE_KEY].astype(str))]
REFERENCE_CATEGORIES

Build The Expression Neighbor Graph Used For Smoothing

The smoothing step uses nearest-neighbor relationships from the expression space.

We normalize the raw counts, compute PCA and neighbors, and then smooth the normalized counts over the graph.

adata.X = adata.layers['raw_counts'].copy()
sc.pp.normalize_total(adata)
adata.layers['norm'] = adata.X.copy()
sc.pp.log1p(adata)
sc.pp.pca(adata)
sc.pp.neighbors(adata, n_neighbors=15)

Smooth The Data For CNV Inference

This writes a smoothed count matrix into adata.layers['M'].

icv.tl.smooth_data_for_cnv(
    adata,
    layer_w_norm_counts='norm',
    n_neighbors=SMOOTHING_NEIGHBORS,
)
adata.layers['smooth_norm'] = adata.layers['M'].copy()

Prepare The Smoothed Matrix For inferCNV

This follows the same pattern used in the publication workflow:

move the smoothed matrix into adata.X
normalize again
log-transform
add genomic positions

adata.X = adata.layers['M'].copy()
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
adata = icv.pp.add_genomic_positions(adata)
adata

Run inferCNV

The chosen reference categories should represent normal cells or normal tissue regions.

cnv.tl.infercnv(
    adata,
    window_size=WINDOW_SIZE,
    step=STEP,
    lfc_clip=LFC_CLIP,
    reference_key=REFERENCE_KEY,
    reference_cat=REFERENCE_CATEGORIES,
    chunksize=1000,
    calculate_gene_values=True,
)

CNV PCA, Neighbors, And Resolution Sweep

cnv.tl.pca(adata)
cnv.pp.neighbors(adata)

metrics = find_optimal_clustering(
    adata,
    resolutions=CLUSTER_RESOLUTIONS,
    obsm_key='X_cnv',
    spatial_key='spatial',
)
metrics

Choose A Clustering Resolution

Edit SELECTED_RESOLUTION after looking at the metrics table above.

For a quick first pass, 0.1 is often a sensible default on smaller samples.

SELECTED_RESOLUTION = 0.1
CLUSTER_KEY = f'cnv_leiden_{SELECTED_RESOLUTION:g}'

cnv.tl.leiden(adata, resolution=SELECTED_RESOLUTION, key_added=CLUSTER_KEY)
adata.obs[CLUSTER_KEY].value_counts()

Review CNV clusters before manual annotation

Inspect the cluster sizes, spatial plot, and chromosome heatmap before assigning any cluster as normal or tumor. No normal cluster is inferred automatically.

cluster_summary = (
    adata.obs[CLUSTER_KEY]
    .astype(str)
    .value_counts()
    .sort_index()
    .rename_axis(CLUSTER_KEY)
    .to_frame('n_cells')
)
cluster_summary

Visualize the CNV heatmap and spatial clusters

These are the main plots to inspect before deciding which clusters are normal, tumor, or distinct tumor clones.

sc.pl.embedding(adata, basis='spatial', color=REFERENCE_KEY, frameon=False, size=12)

sc.pl.embedding(adata, basis='spatial', color=CLUSTER_KEY, frameon=False, size=12)

cnv.pl.chromosome_heatmap(
    adata,
    groupby=CLUSTER_KEY,
    dendrogram=True,
    vmin=-0.4,
    vmax=0.4,
)

Manually label tumor/normal clusters and optional tumor clones

Edit the lists below after reviewing the heatmap and spatial CNV plot. Use tumor_clone_clusters only for tumor clusters with CNV profiles that are distinct enough to report separately.

NORMAL_CLUSTERS = []  # e.g. ['0']
TUMOR_CLUSTERS = []  # optional; leave empty to label every non-normal cluster as tumor
TUMOR_CLONE_CLUSTERS = []  # e.g. ['1', '2']

if not NORMAL_CLUSTERS and not TUMOR_CLUSTERS:
    raise ValueError('Edit NORMAL_CLUSTERS and/or TUMOR_CLUSTERS after reviewing the CNV plots above.')

all_clusters = set(adata.obs[CLUSTER_KEY].astype(str).unique())
normal_clusters_for_clones = NORMAL_CLUSTERS or sorted(all_clusters - set(TUMOR_CLUSTERS))

icv.tl.assign_cnv_status(
    adata,
    cluster_key=CLUSTER_KEY,
    normal_clusters=NORMAL_CLUSTERS or None,
    tumor_clusters=TUMOR_CLUSTERS or None,
    output_key='cnv_status',
)
icv.tl.assign_cnv_subclones(
    adata,
    cluster_key=CLUSTER_KEY,
    tumor_clusters=TUMOR_CLONE_CLUSTERS,
    normal_clusters=normal_clusters_for_clones,
    output_key='cnv_clone',
)

display(pd.crosstab(adata.obs[CLUSTER_KEY], adata.obs['cnv_status']))
display(pd.crosstab(adata.obs[CLUSTER_KEY], adata.obs['cnv_clone']))
sc.pl.embedding(adata, basis='spatial', color=['cnv_status', 'cnv_clone'], frameon=False, size=12)

Save The Final CNV Object

metrics.to_csv(OUTPUT_DIR / 'cluster_resolution_metrics.csv', index=False)
adata.write(OUTPUT_DIR / 'adata_cnv_results.h5ad', compression='gzip')
print(OUTPUT_DIR / 'adata_cnv_results.h5ad')