02. Smooth And Run inferCNV
This notebook runs the core insituCNV workflow in a step-by-step format.
The idea is that users can read each block, change a few parameters, and rerun only the relevant cells.
The expected starting point is a checked .h5ad file from Notebook 1, with:
adata.layers['raw_counts']adata.obsm['spatial']a valid annotation column that marks normal/reference cells
from pathlib import Path
import infercnvpy as cnv
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scanpy as sc
PROJECT_ROOT = Path('..').resolve()
import insitucnv as icv
from insitucnv.analysis import find_optimal_clustering
INPUT_PATH = PROJECT_ROOT / 'results' / 'example_workflow' / 'adata_checked_input.h5ad'
OUTPUT_DIR = PROJECT_ROOT / 'results' / 'example_workflow'
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
REFERENCE_KEY = 'cell_type'
REFERENCE_CATEGORIES = ['T_cells', 'B_cells', 'Stromal']
SMOOTHING_NEIGHBORS = 20
WINDOW_SIZE = 60
STEP = 10
LFC_CLIP = 4
CLUSTER_RESOLUTIONS = [0.1, 0.2, 0.3]
sc.settings.figdir = str(OUTPUT_DIR / 'figures')
Path(sc.settings.figdir).mkdir(parents=True, exist_ok=True)
Load The Checked Input
adata = sc.read_h5ad(INPUT_PATH)
adata
Keep Only The Reference Categories That Are Present
This makes the notebook more robust when users adapt it to a different dataset.
REFERENCE_CATEGORIES = [cat for cat in REFERENCE_CATEGORIES if cat in set(adata.obs[REFERENCE_KEY].astype(str))]
REFERENCE_CATEGORIES
Build The Expression Neighbor Graph Used For Smoothing
The smoothing step uses nearest-neighbor relationships from the expression space.
We normalize the raw counts, compute PCA and neighbors, and then smooth the normalized counts over the graph.
adata.X = adata.layers['raw_counts'].copy()
sc.pp.normalize_total(adata)
adata.layers['norm'] = adata.X.copy()
sc.pp.log1p(adata)
sc.pp.pca(adata)
sc.pp.neighbors(adata, n_neighbors=15)
Smooth The Data For CNV Inference
This writes a smoothed count matrix into adata.layers['M'].
icv.tl.smooth_data_for_cnv(
adata,
layer_w_norm_counts='norm',
n_neighbors=SMOOTHING_NEIGHBORS,
)
adata.layers['smooth_norm'] = adata.layers['M'].copy()
Prepare The Smoothed Matrix For inferCNV
This follows the same pattern used in the publication workflow:
move the smoothed matrix into
adata.Xnormalize again
log-transform
add genomic positions
adata.X = adata.layers['M'].copy()
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
adata = icv.pp.add_genomic_positions(adata)
adata
Run inferCNV
The chosen reference categories should represent normal cells or normal tissue regions.
cnv.tl.infercnv(
adata,
window_size=WINDOW_SIZE,
step=STEP,
lfc_clip=LFC_CLIP,
reference_key=REFERENCE_KEY,
reference_cat=REFERENCE_CATEGORIES,
chunksize=1000,
calculate_gene_values=True,
)
CNV PCA, Neighbors, And Resolution Sweep
cnv.tl.pca(adata)
cnv.pp.neighbors(adata)
metrics = find_optimal_clustering(
adata,
resolutions=CLUSTER_RESOLUTIONS,
obsm_key='X_cnv',
spatial_key='spatial',
)
metrics
Choose A Clustering Resolution
Edit SELECTED_RESOLUTION after looking at the metrics table above.
For a quick first pass, 0.1 is often a sensible default on smaller samples.
SELECTED_RESOLUTION = 0.1
CLUSTER_KEY = f'cnv_leiden_{SELECTED_RESOLUTION:g}'
cnv.tl.leiden(adata, resolution=SELECTED_RESOLUTION, key_added=CLUSTER_KEY)
adata.obs[CLUSTER_KEY].value_counts()
Review CNV clusters before manual annotation
Inspect the cluster sizes, spatial plot, and chromosome heatmap before assigning any cluster as normal or tumor. No normal cluster is inferred automatically.
cluster_summary = (
adata.obs[CLUSTER_KEY]
.astype(str)
.value_counts()
.sort_index()
.rename_axis(CLUSTER_KEY)
.to_frame('n_cells')
)
cluster_summary
Visualize the CNV heatmap and spatial clusters
These are the main plots to inspect before deciding which clusters are normal, tumor, or distinct tumor clones.
sc.pl.embedding(adata, basis='spatial', color=REFERENCE_KEY, frameon=False, size=12)
sc.pl.embedding(adata, basis='spatial', color=CLUSTER_KEY, frameon=False, size=12)
cnv.pl.chromosome_heatmap(
adata,
groupby=CLUSTER_KEY,
dendrogram=True,
vmin=-0.4,
vmax=0.4,
)
Manually label tumor/normal clusters and optional tumor clones
Edit the lists below after reviewing the heatmap and spatial CNV plot. Use tumor_clone_clusters only for tumor clusters with CNV profiles that are distinct enough to report separately.
NORMAL_CLUSTERS = [] # e.g. ['0']
TUMOR_CLUSTERS = [] # optional; leave empty to label every non-normal cluster as tumor
TUMOR_CLONE_CLUSTERS = [] # e.g. ['1', '2']
if not NORMAL_CLUSTERS and not TUMOR_CLUSTERS:
raise ValueError('Edit NORMAL_CLUSTERS and/or TUMOR_CLUSTERS after reviewing the CNV plots above.')
all_clusters = set(adata.obs[CLUSTER_KEY].astype(str).unique())
normal_clusters_for_clones = NORMAL_CLUSTERS or sorted(all_clusters - set(TUMOR_CLUSTERS))
icv.tl.assign_cnv_status(
adata,
cluster_key=CLUSTER_KEY,
normal_clusters=NORMAL_CLUSTERS or None,
tumor_clusters=TUMOR_CLUSTERS or None,
output_key='cnv_status',
)
icv.tl.assign_cnv_subclones(
adata,
cluster_key=CLUSTER_KEY,
tumor_clusters=TUMOR_CLONE_CLUSTERS,
normal_clusters=normal_clusters_for_clones,
output_key='cnv_clone',
)
display(pd.crosstab(adata.obs[CLUSTER_KEY], adata.obs['cnv_status']))
display(pd.crosstab(adata.obs[CLUSTER_KEY], adata.obs['cnv_clone']))
sc.pl.embedding(adata, basis='spatial', color=['cnv_status', 'cnv_clone'], frameon=False, size=12)
Save The Final CNV Object
metrics.to_csv(OUTPUT_DIR / 'cluster_resolution_metrics.csv', index=False)
adata.write(OUTPUT_DIR / 'adata_cnv_results.h5ad', compression='gzip')
print(OUTPUT_DIR / 'adata_cnv_results.h5ad')