{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "6a9c869b",
   "metadata": {},
   "source": [
    "# 02. Smooth And Run inferCNV\n",
    "\n",
    "This notebook runs the core `insituCNV` workflow in a step-by-step format.\n",
    "\n",
    "The idea is that users can read each block, change a few parameters, and rerun only the relevant cells.\n",
    "\n",
    "The expected starting point is a checked `.h5ad` file from Notebook 1, with:\n",
    "\n",
    "- `adata.layers['raw_counts']`\n",
    "- `adata.obsm['spatial']`\n",
    "- a valid annotation column that marks normal/reference cells"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bc5fec80",
   "metadata": {},
   "outputs": [],
    "source": [
    "from pathlib import Path\n",
    "import infercnvpy as cnv\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import scanpy as sc\n",
    "\n",
    "PROJECT_ROOT = Path('..').resolve()\n",
    "import insitucnv as icv\n",
    "from insitucnv.analysis import find_optimal_clustering\n",
    "\n",
    "INPUT_PATH = PROJECT_ROOT / 'results' / 'example_workflow' / 'adata_checked_input.h5ad'\n",
    "OUTPUT_DIR = PROJECT_ROOT / 'results' / 'example_workflow'\n",
    "OUTPUT_DIR.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "REFERENCE_KEY = 'cell_type'\n",
    "REFERENCE_CATEGORIES = ['T_cells', 'B_cells', 'Stromal']\n",
    "\n",
    "SMOOTHING_NEIGHBORS = 20\n",
    "WINDOW_SIZE = 60\n",
    "STEP = 10\n",
    "LFC_CLIP = 4\n",
    "CLUSTER_RESOLUTIONS = [0.1, 0.2, 0.3]\n",
    "\n",
    "sc.settings.figdir = str(OUTPUT_DIR / 'figures')\n",
    "Path(sc.settings.figdir).mkdir(parents=True, exist_ok=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c8855ad1",
   "metadata": {},
   "source": [
    "## Load The Checked Input"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "34bc5a90",
   "metadata": {},
   "outputs": [],
   "source": [
    "adata = sc.read_h5ad(INPUT_PATH)\n",
    "adata"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d10cedf5",
   "metadata": {},
   "source": [
    "## Keep Only The Reference Categories That Are Present\n",
    "\n",
    "This makes the notebook more robust when users adapt it to a different dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "53f05a80",
   "metadata": {},
   "outputs": [],
   "source": [
    "REFERENCE_CATEGORIES = [cat for cat in REFERENCE_CATEGORIES if cat in set(adata.obs[REFERENCE_KEY].astype(str))]\n",
    "REFERENCE_CATEGORIES"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19596d76",
   "metadata": {},
   "source": [
    "## Build The Expression Neighbor Graph Used For Smoothing\n",
    "\n",
    "The smoothing step uses nearest-neighbor relationships from the expression space.\n",
    "\n",
    "We normalize the raw counts, compute PCA and neighbors, and then smooth the normalized counts over the graph."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2f9a334c",
   "metadata": {},
   "outputs": [],
   "source": [
    "adata.X = adata.layers['raw_counts'].copy()\n",
    "sc.pp.normalize_total(adata)\n",
    "adata.layers['norm'] = adata.X.copy()\n",
    "sc.pp.log1p(adata)\n",
    "sc.pp.pca(adata)\n",
    "sc.pp.neighbors(adata, n_neighbors=15)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f5f43ca1",
   "metadata": {},
   "source": [
    "## Smooth The Data For CNV Inference\n",
    "\n",
    "This writes a smoothed count matrix into `adata.layers['M']`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cf249f7f",
   "metadata": {},
   "outputs": [],
   "source": [
    "icv.tl.smooth_data_for_cnv(\n",
    "    adata,\n",
    "    layer_w_norm_counts='norm',\n",
    "    n_neighbors=SMOOTHING_NEIGHBORS,\n",
    ")\n",
    "adata.layers['smooth_norm'] = adata.layers['M'].copy()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0ad22986",
   "metadata": {},
   "source": [
    "## Prepare The Smoothed Matrix For inferCNV\n",
    "\n",
    "This follows the same pattern used in the publication workflow:\n",
    "\n",
    "- move the smoothed matrix into `adata.X`\n",
    "- normalize again\n",
    "- log-transform\n",
    "- add genomic positions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "40a950dd",
   "metadata": {},
   "outputs": [],
   "source": [
    "adata.X = adata.layers['M'].copy()\n",
    "sc.pp.normalize_total(adata)\n",
    "sc.pp.log1p(adata)\n",
    "adata = icv.pp.add_genomic_positions(adata)\n",
    "adata"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6769ce59",
   "metadata": {},
   "source": [
    "## Run inferCNV\n",
    "\n",
    "The chosen reference categories should represent normal cells or normal tissue regions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "470b1f90",
   "metadata": {},
   "outputs": [],
   "source": [
    "cnv.tl.infercnv(\n",
    "    adata,\n",
    "    window_size=WINDOW_SIZE,\n",
    "    step=STEP,\n",
    "    lfc_clip=LFC_CLIP,\n",
    "    reference_key=REFERENCE_KEY,\n",
    "    reference_cat=REFERENCE_CATEGORIES,\n",
    "    chunksize=1000,\n",
    "    calculate_gene_values=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0fef5325",
   "metadata": {},
   "source": [
    "## CNV PCA, Neighbors, And Resolution Sweep"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bbda6691",
   "metadata": {},
   "outputs": [],
   "source": [
    "cnv.tl.pca(adata)\n",
    "cnv.pp.neighbors(adata)\n",
    "\n",
    "metrics = find_optimal_clustering(\n",
    "    adata,\n",
    "    resolutions=CLUSTER_RESOLUTIONS,\n",
    "    obsm_key='X_cnv',\n",
    "    spatial_key='spatial',\n",
    ")\n",
    "metrics"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fee159c7",
   "metadata": {},
   "source": [
    "## Choose A Clustering Resolution\n",
    "\n",
    "Edit `SELECTED_RESOLUTION` after looking at the metrics table above.\n",
    "\n",
    "For a quick first pass, `0.1` is often a sensible default on smaller samples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7b03a693",
   "metadata": {},
   "outputs": [],
   "source": [
    "SELECTED_RESOLUTION = 0.1\n",
    "CLUSTER_KEY = f'cnv_leiden_{SELECTED_RESOLUTION:g}'\n",
    "\n",
    "cnv.tl.leiden(adata, resolution=SELECTED_RESOLUTION, key_added=CLUSTER_KEY)\n",
    "adata.obs[CLUSTER_KEY].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1c0fb90b",
   "metadata": {},
   "source": [
    "## Review CNV clusters before manual annotation\n",
    "\n",
    "Inspect the cluster sizes, spatial plot, and chromosome heatmap before assigning any cluster as normal or tumor. No normal cluster is inferred automatically."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d140f57e",
   "metadata": {},
   "outputs": [],
   "source": [
    "cluster_summary = (\n",
    "    adata.obs[CLUSTER_KEY]\n",
    "    .astype(str)\n",
    "    .value_counts()\n",
    "    .sort_index()\n",
    "    .rename_axis(CLUSTER_KEY)\n",
    "    .to_frame('n_cells')\n",
    ")\n",
    "cluster_summary"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3cf2d06d",
   "metadata": {},
   "source": [
    "## Visualize the CNV heatmap and spatial clusters\n",
    "\n",
    "These are the main plots to inspect before deciding which clusters are normal, tumor, or distinct tumor clones."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "076f2eb2",
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.pl.embedding(adata, basis='spatial', color=REFERENCE_KEY, frameon=False, size=12)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0f22570a",
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.pl.embedding(adata, basis='spatial', color=CLUSTER_KEY, frameon=False, size=12)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9c1384e0",
   "metadata": {},
   "outputs": [],
   "source": [
    "cnv.pl.chromosome_heatmap(\n",
    "    adata,\n",
    "    groupby=CLUSTER_KEY,\n",
    "    dendrogram=True,\n",
    "    vmin=-0.4,\n",
    "    vmax=0.4,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ab7b65ac",
   "metadata": {},
   "source": [
    "## Manually label tumor/normal clusters and optional tumor clones\n",
    "\n",
    "Edit the lists below after reviewing the heatmap and spatial CNV plot. Use `tumor_clone_clusters` only for tumor clusters with CNV profiles that are distinct enough to report separately."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dc911e1d",
   "metadata": {},
   "outputs": [],
   "source": [
    "NORMAL_CLUSTERS = []  # e.g. ['0']\n",
    "TUMOR_CLUSTERS = []  # optional; leave empty to label every non-normal cluster as tumor\n",
    "TUMOR_CLONE_CLUSTERS = []  # e.g. ['1', '2']\n",
    "\n",
    "if not NORMAL_CLUSTERS and not TUMOR_CLUSTERS:\n",
    "    raise ValueError('Edit NORMAL_CLUSTERS and/or TUMOR_CLUSTERS after reviewing the CNV plots above.')\n",
    "\n",
    "all_clusters = set(adata.obs[CLUSTER_KEY].astype(str).unique())\n",
    "normal_clusters_for_clones = NORMAL_CLUSTERS or sorted(all_clusters - set(TUMOR_CLUSTERS))\n",
    "\n",
    "icv.tl.assign_cnv_status(\n",
    "    adata,\n",
    "    cluster_key=CLUSTER_KEY,\n",
    "    normal_clusters=NORMAL_CLUSTERS or None,\n",
    "    tumor_clusters=TUMOR_CLUSTERS or None,\n",
    "    output_key='cnv_status',\n",
    ")\n",
    "icv.tl.assign_cnv_subclones(\n",
    "    adata,\n",
    "    cluster_key=CLUSTER_KEY,\n",
    "    tumor_clusters=TUMOR_CLONE_CLUSTERS,\n",
    "    normal_clusters=normal_clusters_for_clones,\n",
    "    output_key='cnv_clone',\n",
    ")\n",
    "\n",
    "display(pd.crosstab(adata.obs[CLUSTER_KEY], adata.obs['cnv_status']))\n",
    "display(pd.crosstab(adata.obs[CLUSTER_KEY], adata.obs['cnv_clone']))\n",
    "sc.pl.embedding(adata, basis='spatial', color=['cnv_status', 'cnv_clone'], frameon=False, size=12)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "75bbb288",
   "metadata": {},
   "source": [
    "## Save The Final CNV Object"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3c1b7521",
   "metadata": {},
   "outputs": [],
   "source": [
    "metrics.to_csv(OUTPUT_DIR / 'cluster_resolution_metrics.csv', index=False)\n",
    "adata.write(OUTPUT_DIR / 'adata_cnv_results.h5ad', compression='gzip')\n",
    "print(OUTPUT_DIR / 'adata_cnv_results.h5ad')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}