{
  "cells": [
    {
      "cell_type": "markdown",
      "id": "3838b43a",
      "metadata": {},
      "source": [
        "# Run InSituCNV\n",
        "\n",
        "This notebook runs the full InSituCNV workflow on a spatial transcriptomics dataset:\n",
        "normalize raw counts, smooth over neighbors, normalize and log-transform, add genomic positions, run `infercnvpy`, cluster CNV profiles, plot heatmaps and spatial CNV maps, then manually choose normal/tumor clusters and optional tumor clone labels.\n",
        "\n",
        "The example setup uses generic terminology. Replace the paths, annotation column, and reference categories with values for your own dataset."
      ]
    },
    {
      "cell_type": "markdown",
      "id": "5da280e2",
      "metadata": {},
      "source": [
        "## 1. Setup\n",
        "\n",
        "Run this notebook from the package environment:\n",
        "\n",
        "```bash\n",
        "conda activate insitucnv_env\n",
        "cd InSituCNV\n",
        "pip install -e .\n",
        "jupyter lab\n",
        "```"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "2e6d355b",
      "metadata": {},
      "outputs": [],
      "source": [
        "from pathlib import Path\n",
        "import pandas as pd\n",
        "import scanpy as sc\n",
        "import insitucnv as icv\n",
        "\n",
        "sc.settings.verbosity = 2\n",
        "sc.set_figure_params(figsize=(6, 6), dpi=100)\n",
        "\n",
        "SAMPLE_NAME = \"example_sample\"\n",
        "DATA_PATH = Path(\"data/your_dataset.h5ad\")\n",
        "OUTPUT_DIR = Path(\"outputs\") / SAMPLE_NAME\n",
        "PLOTS_DIR = OUTPUT_DIR / \"plots\"\n",
        "\n",
        "RAW_LAYER = \"raw_counts\"  # Change this if your raw count layer has another name.\n",
        "REFERENCE_KEY = \"cell_type\"  # obs column containing normal/reference labels.\n",
        "REFERENCE_CATEGORIES = [\n",
        "    \"T_cells\",\n",
        "    \"B_cells\",\n",
        "    \"Stromal\",\n",
        "]\n",
        "SPATIAL_KEY = \"spatial\"\n",
        "\n",
        "OUTPUT_DIR.mkdir(parents=True, exist_ok=True)\n",
        "PLOTS_DIR.mkdir(parents=True, exist_ok=True)\n",
        "\n",
        "DATA_PATH"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "161fce91",
      "metadata": {},
      "source": [
        "## 2. Load the AnnData object\n",
        "\n",
        "Before running the CNV steps, check that your AnnData object has:\n",
        "\n",
        "- raw counts in `adata.layers[RAW_LAYER]`, or raw counts in `adata.X` that can be copied into that layer\n",
        "- normal/reference labels in `adata.obs[REFERENCE_KEY]`\n",
        "- spatial coordinates in `adata.obsm[SPATIAL_KEY]`\n",
        "- a precomputed neighbor graph for smoothing"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "fe0bd3c6",
      "metadata": {},
      "outputs": [],
      "source": [
        "adata = sc.read_h5ad(DATA_PATH)\n",
        "adata"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "4dddeb88",
      "metadata": {},
      "outputs": [],
      "source": [
        "print(\"layers:\", list(adata.layers.keys()))\n",
        "print(\"obsm:\", list(adata.obsm.keys()))\n",
        "print(\"obs columns:\", list(adata.obs.columns))\n",
        "print(\"reference labels:\", sorted(adata.obs[REFERENCE_KEY].astype(str).unique()))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "ccc7b8f0",
      "metadata": {},
      "outputs": [],
      "source": [
        "if SPATIAL_KEY not in adata.obsm:\n",
        "    raise KeyError(f\"adata.obsm[{SPATIAL_KEY!r}] is required for spatial plots.\")\n",
        "\n",
        "if REFERENCE_KEY not in adata.obs:\n",
        "    raise KeyError(f\"adata.obs[{REFERENCE_KEY!r}] is required for inferCNV reference labels.\")\n",
        "\n",
        "if RAW_LAYER not in adata.layers:\n",
        "    print(f\"Layer {RAW_LAYER!r} was not found. Copying current adata.X into adata.layers[{RAW_LAYER!r}].\")\n",
        "    adata.layers[RAW_LAYER] = adata.X.copy()\n",
        "\n",
        "icv.pl.plot_spatial(\n",
        "    adata,\n",
        "    color=REFERENCE_KEY,\n",
        "    spatial_key=SPATIAL_KEY,\n",
        "    output_path=PLOTS_DIR / \"reference_labels_spatial.png\",\n",
        "    point_size=15,\n",
        "    title=f\"{SAMPLE_NAME} reference labels\",\n",
        "    show=True,\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "548075e7",
      "metadata": {},
      "source": [
        "## 3. Define inferCNV reference cells\n",
        "\n",
        "The reference categories should be normal or healthy cells for your dataset. For a tissue sample this is often immune, stromal, endothelial, or other non-tumor categories, depending on your annotations."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "07db7ea6",
      "metadata": {},
      "outputs": [],
      "source": [
        "present_reference_categories = [\n",
        "    category\n",
        "    for category in REFERENCE_CATEGORIES\n",
        "    if category in set(adata.obs[REFERENCE_KEY].astype(str))\n",
        "]\n",
        "\n",
        "if not present_reference_categories:\n",
        "    raise ValueError(\"None of the configured REFERENCE_CATEGORIES were found in the reference annotation column.\")\n",
        "\n",
        "present_reference_categories"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "c527faf5",
      "metadata": {},
      "source": [
        "## 4. Normalize, smooth, log-normalize, and add genomic positions\n",
        "\n",
        "This step:\n",
        "\n",
        "1. restores raw counts from `adata.layers[RAW_LAYER]` into `adata.X`\n",
        "2. normalizes total counts into `adata.layers[\"norm\"]`\n",
        "3. smooths normalized counts over the neighbor graph into `adata.layers[\"M\"]`\n",
        "4. normalizes and log-transforms the smoothed matrix into `adata.layers[\"log_norm\"]`\n",
        "5. subsets to genes with genomic positions and adds `chromosome`, `start`, and `end`"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "81541f8e",
      "metadata": {},
      "outputs": [],
      "source": [
        "adata = icv.tl.prepare_cnv_input(\n",
        "    adata,\n",
        "    raw_layer=RAW_LAYER,\n",
        "    target_sum=1e4,\n",
        "    smoothing_neighbors=100,\n",
        "    normalized_layer=\"norm\",\n",
        "    smoothed_layer=\"M\",\n",
        "    log_layer=\"log_norm\",\n",
        ")\n",
        "\n",
        "adata"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "c225c573",
      "metadata": {},
      "outputs": [],
      "source": [
        "adata.var[[\"chromosome\", \"start\", \"end\"]].head()"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "07a3fd77",
      "metadata": {},
      "source": [
        "## 5. Run inferCNV\n",
        "\n",
        "Adjust `window_size`, `step`, and `lfc_clip` if you need different inferCNV settings for your panel or tissue."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "a4950494",
      "metadata": {},
      "outputs": [],
      "source": [
        "icv.tl.run_infercnv(\n",
        "    adata,\n",
        "    reference_key=REFERENCE_KEY,\n",
        "    reference_categories=present_reference_categories,\n",
        "    input_layer=\"log_norm\",\n",
        "    window_size=60,\n",
        "    step=10,\n",
        "    lfc_clip=4,\n",
        "    chunksize=1000,\n",
        "    calculate_gene_values=True,\n",
        ")\n",
        "\n",
        "adata"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "cca6225e",
      "metadata": {},
      "source": [
        "## 6. Compute CNV neighbors and cluster at multiple resolutions\n",
        "\n",
        "Testing several resolutions lets you compare how granular the CNV clone structure should be. The keys created here are `cnv_leiden_res0.1`, `cnv_leiden_res0.2`, and `cnv_leiden_res0.3`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "fa2d929b",
      "metadata": {},
      "outputs": [],
      "source": [
        "icv.tl.compute_cnv_neighbors(adata)\n",
        "\n",
        "resolutions = [0.1, 0.2, 0.3]\n",
        "cluster_keys = icv.tl.cluster_cnv_resolutions(\n",
        "    adata,\n",
        "    resolutions=resolutions,\n",
        "    dendrogram=True,\n",
        ")\n",
        "\n",
        "cluster_keys"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "1b70f81d",
      "metadata": {},
      "outputs": [],
      "source": [
        "for key in cluster_keys:\n",
        "    display(pd.crosstab(adata.obs[key], adata.obs[REFERENCE_KEY]))"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "0845e449",
      "metadata": {},
      "source": [
        "## 7. Plot CNV heatmaps and spatial CNV clusters\n",
        "\n",
        "Review these plots before assigning any cluster as normal, tumor, or a separate tumor clone."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "9c55cd40",
      "metadata": {},
      "outputs": [],
      "source": [
        "for key in cluster_keys:\n",
        "    print(key)\n",
        "    icv.pl.plot_chromosome_heatmap(\n",
        "        adata,\n",
        "        groupby=key,\n",
        "        output_path=PLOTS_DIR / f\"{key}_heatmap.png\",\n",
        "        dendrogram=True,\n",
        "        vmin=-0.4,\n",
        "        vmax=0.4,\n",
        "        show=True,\n",
        "    )\n",
        "    icv.pl.plot_spatial(\n",
        "        adata,\n",
        "        color=key,\n",
        "        output_path=PLOTS_DIR / f\"{key}_spatial.png\",\n",
        "        spatial_key=SPATIAL_KEY,\n",
        "        point_size=1,\n",
        "        title=f\"{SAMPLE_NAME} {key}\",\n",
        "        show=True,\n",
        "    )"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "7f2be605",
      "metadata": {},
      "source": [
        "## 8. Manually choose normal, tumor, and clone clusters\n",
        "\n",
        "Use the heatmaps and spatial maps above to decide which clustering resolution to annotate. The package no longer infers a normal cluster automatically; edit the lists below after visual review."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "900d97f7",
      "metadata": {},
      "outputs": [],
      "source": [
        "primary_key = \"cnv_leiden_res0.1\"\n",
        "\n",
        "cluster_summary = (\n",
        "    adata.obs[primary_key]\n",
        "    .astype(str)\n",
        "    .value_counts()\n",
        "    .sort_index()\n",
        "    .rename_axis(primary_key)\n",
        "    .to_frame(\"n_cells\")\n",
        ")\n",
        "cluster_summary"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "51dc68e9",
      "metadata": {},
      "outputs": [],
      "source": [
        "# Edit these lists after inspecting the heatmap and spatial CNV plot.\n",
        "normal_clusters = []  # e.g. [\"0\"]\n",
        "tumor_clusters = []  # optional; leave empty to label every non-normal cluster as tumor\n",
        "tumor_clone_clusters = []  # e.g. [\"1\", \"2\"] for tumor clusters with distinct CNV profiles\n",
        "\n",
        "if not normal_clusters and not tumor_clusters:\n",
        "    raise ValueError(\"Edit normal_clusters and/or tumor_clusters after reviewing the CNV plots above.\")\n",
        "\n",
        "all_clusters = set(adata.obs[primary_key].astype(str).unique())\n",
        "normal_clusters_for_clones = normal_clusters or sorted(all_clusters - set(tumor_clusters))\n",
        "\n",
        "icv.tl.assign_cnv_status(\n",
        "    adata,\n",
        "    cluster_key=primary_key,\n",
        "    normal_clusters=normal_clusters or None,\n",
        "    tumor_clusters=tumor_clusters or None,\n",
        "    output_key=\"cnv_status\",\n",
        ")\n",
        "icv.tl.assign_cnv_subclones(\n",
        "    adata,\n",
        "    cluster_key=primary_key,\n",
        "    tumor_clusters=tumor_clone_clusters,\n",
        "    normal_clusters=normal_clusters_for_clones,\n",
        "    output_key=\"cnv_clone\",\n",
        ")\n",
        "\n",
        "display(pd.crosstab(adata.obs[primary_key], adata.obs[\"cnv_status\"]))\n",
        "display(pd.crosstab(adata.obs[primary_key], adata.obs[\"cnv_clone\"]))\n",
        "\n",
        "for color, filename, title in [\n",
        "    (\"cnv_status\", \"cnv_status_spatial.png\", f\"{SAMPLE_NAME} manual CNV status\"),\n",
        "    (\"cnv_clone\", \"cnv_clone_spatial.png\", f\"{SAMPLE_NAME} manual CNV clones\"),\n",
        "]:\n",
        "    icv.pl.plot_spatial(\n",
        "        adata,\n",
        "        color=color,\n",
        "        output_path=PLOTS_DIR / filename,\n",
        "        spatial_key=SPATIAL_KEY,\n",
        "        point_size=1,\n",
        "        title=title,\n",
        "        show=True,\n",
        "    )"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "f89f8248",
      "metadata": {},
      "source": [
        "## 9. Optional: tumor-enriched CNV clustering\n",
        "\n",
        "For some datasets, you may want to recluster tumor-enriched cells (e.g. epithelial cells in carcinomas) by CNV profile while labeling all other cells as `non-tumor`. Edit `subset_values` if your tumor-enriched annotation uses a different name."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "ad62d046",
      "metadata": {},
      "outputs": [],
      "source": [
        "epi_cluster_keys = icv.tl.cluster_cnv_resolutions(\n",
        "    adata,\n",
        "    resolutions=[0.1],\n",
        "    key_prefix=\"epi_cnv_leiden_res\",\n",
        "    subset_key=REFERENCE_KEY,\n",
        "    subset_values=[\"Epithelial\"],\n",
        "    subset_label_prefix=\"epi_\",\n",
        "    outside_label=\"non-epi\",\n",
        "    dendrogram=True,\n",
        ")\n",
        "\n",
        "epi_cluster_keys"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "1b9c6c8e",
      "metadata": {},
      "outputs": [],
      "source": [
        "for key in epi_cluster_keys:\n",
        "    icv.pl.plot_spatial(\n",
        "        adata,\n",
        "        color=key,\n",
        "        output_path=PLOTS_DIR / f\"{key}_spatial.png\",\n",
        "        spatial_key=SPATIAL_KEY,\n",
        "        point_size=1,\n",
        "        title=f\"{SAMPLE_NAME} {key}\",\n",
        "        show=True,\n",
        "    )"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "bc2f0123",
      "metadata": {},
      "source": [
        "## 10. Export manually selected results\n",
        "\n",
        "Run these cells after filling in the manual `normal_clusters`, `tumor_clusters`, and optional `tumor_clone_clusters` lists above. The gene-level table uses `adata.layers[\"gene_values_cnv\"]`, which was created by `infercnvpy` because `calculate_gene_values=True`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "2fd6aaf6",
      "metadata": {},
      "outputs": [],
      "source": [
        "mean_cnv = icv.tl.export_mean_cnv_per_gene(\n",
        "    adata,\n",
        "    OUTPUT_DIR / f\"{SAMPLE_NAME}_tumor_mean_cnv_per_gene.tsv\",\n",
        "    layer=\"gene_values_cnv\",\n",
        "    obs_key=\"cnv_status\",\n",
        "    obs_values=(\"tumor\",),\n",
        ")\n",
        "\n",
        "mean_cnv.head()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "f3793f6a",
      "metadata": {},
      "outputs": [],
      "source": [
        "icv.tl.export_cell_groups(\n",
        "    adata,\n",
        "    OUTPUT_DIR / f\"{SAMPLE_NAME}_cnv_status_xenium_cell_groups.csv\",\n",
        "    group_key=\"cnv_status\",\n",
        ")\n",
        "icv.tl.export_cell_groups(\n",
        "    adata,\n",
        "    OUTPUT_DIR / f\"{SAMPLE_NAME}_cnv_clone_xenium_cell_groups.csv\",\n",
        "    group_key=\"cnv_clone\",\n",
        ")\n",
        "\n",
        "adata.write(OUTPUT_DIR / f\"{SAMPLE_NAME}_InSituCNV.h5ad\", compression=\"gzip\")\n",
        "OUTPUT_DIR"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3 (ipykernel)",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.11.9"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}