CNN prediction ============== This page covers the full lifecycle of CNN/DANN predictions: checking that your feature vectors are reasonable before training, visualising genome-wide sweep probabilities, and post-processing predictions into candidate regions and gene-level rankings. All functions are in ``flexsweep.utils``. Prediction visualization ------------------------ Manhattan plot ~~~~~~~~~~~~~~ .. code-block:: python from flexsweep.utils import plot_manhattan # CNN output — plots -log10(1 - prob_sweep) by default plot_manhattan( "yri_vcf/predictions.csv", out="manhattan.png", ) # Custom column — e.g. a scan pvalue column plot_manhattan( df, p_col="ihs_pvalue", chr_col="chrom", pos_col="pos", log_transform=True, out="manhattan_ihs.png", ) Generic genome-wide Manhattan plot. Accepts CNN prediction CSV/DataFrame or any tabular data with a genomic position and a value column. .. list-table:: :header-rows: 1 :widths: 25 12 63 * - Parameter - Default - Description * - ``input`` - required - File path (CSV) or Polars DataFrame. * - ``p_col`` - ``None`` - Column to use as the value. ``None`` → computes ``1 − prob_sweep`` (CNN default behaviour). * - ``chr_col`` - ``None`` - Chromosome column name. ``None`` → ``"chr"`` (CNN default). * - ``pos_col`` - ``None`` - Position column name. ``None`` → ``"start"`` (CNN default). * - ``log_transform`` - ``True`` - Plot ``-log10(value)`` on the y-axis. * - ``threshold_lines`` - ``None`` - List of ``(y_value, linestyle, label)`` for horizontal lines. ``None`` → CNN defaults (y = 3 solid, y = 2 dashed). Pass ``[]`` to suppress. * - ``figsize`` - ``(14, 5)`` - Figure size in inches. * - ``out`` - ``None`` - Save path. If ``None``, shows interactively. * - ``title`` - ``None`` - Plot title. Sweep probability density ~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python from flexsweep.utils import plot_sweep_density fig = plot_sweep_density( "yri_vcf/predictions.csv", output_path="sweep_density.svg", ) Per-chromosome histograms of ``prob_sweep``. Each panel shows the distribution of sweep probability for one contig and reports the percentage of windows above 0.5. Useful for a quick genome-wide sanity check after prediction. .. list-table:: :header-rows: 1 :widths: 25 12 63 * - Parameter - Default - Description * - ``prediction`` - required - Path to prediction CSV/Parquet or Polars DataFrame. Must have columns ``chr``, ``start``, ``end``, ``prob_sweep``. * - ``output_path`` - ``None`` - Save path (SVG). If ``None``, shows interactively. Post-processing predictions --------------------------- Merge candidate regions ~~~~~~~~~~~~~~~~~~~~~~~ Merge contiguous windows above a probability threshold into non-overlapping candidate sweep intervals: .. code-block:: python from flexsweep.utils import merge_regions df_merged, summary = merge_regions( "yri_vcf/predictions.csv", p=0.9, ) print(summary) # per-chromosome: merged_span, total_span, pct ``merge_regions`` filters windows where ``prob_sweep > p``, merges adjacent intervals on the same chromosome, and returns: * ``df_merged`` — Polars LazyFrame of merged intervals (``chr``, ``start``, ``end``, ``prob_sweep``). * ``summary`` — DataFrame with ``chr``, ``merged_span`` (bp in merged intervals), ``total_span`` (total analysed bp), and ``pct`` (fraction of the analysed genome above the threshold). .. list-table:: :header-rows: 1 :widths: 25 12 63 * - Parameter - Default - Description * - ``prediction`` - required - File path (CSV) or Polars DataFrame with columns ``chr``, ``start``, ``end``, ``prob_sweep``. * - ``p`` - required - Probability threshold. Windows with ``prob_sweep > p`` are merged. Rank genomic features ~~~~~~~~~~~~~~~~~~~~~ Assign sweep probabilities to genes or other genomic features using the nearest prediction window. Available as a Python function and via the ``flexsweep rank`` CLI. **CLI:** .. code-block:: bash flexsweep rank \ --prediction yri_vcf/predictions.csv \ --feature_coordinates genes.bed **Python:** .. code-block:: python from flexsweep.utils import rank_probabilities df_ranked = rank_probabilities( prediction="yri_vcf/predictions.csv", feature_coordinates="genes.bed", k=111, ) For each gene (or BED feature), ``rank_probabilities`` finds the *k* nearest prediction windows on the same chromosome (using a ``bedtools closest -k`` equivalent), then assigns the maximum ``prob_sweep`` among those windows. Genes are returned sorted by ``prob_sweep`` descending — the output is a ranked gene list. The ``feature_coordinates`` BED file must have columns ``chr``, ``start``, ``end``, ``gene_id``, ``strand`` (no header, 0-based). Chromosome labels must be numeric (``1``–``22``); the function prepends ``chr`` automatically. .. list-table:: :header-rows: 1 :widths: 25 12 63 * - Parameter - Default - Description * - ``prediction`` - required - CNN prediction CSV or Polars DataFrame with ``chr``, ``start``, ``end``, ``prob_sweep`` columns. * - ``feature_coordinates`` - required - BED file path (str) or Polars DataFrame of genomic features. * - ``rank_distance`` - ``False`` - If ``True``, additionally rank by distance to the nearest window. * - ``k`` - ``111`` - Number of nearest prediction windows to consider per gene. Equivalent to ``bedtools closest -k k``. Training diagnostics -------------------- Use these two functions to inspect your feature vectors before or after training, particularly to check for domain shift between simulations and empirical data. Feature vector PCA ~~~~~~~~~~~~~~~~~~ .. code-block:: python from flexsweep.utils import plot_fv_pca fig = plot_fv_pca( train_data="yri_test/fvs.parquet", empirical_data="yri_vcf/fvs_yri.parquet", subsample=5000, output_path="fv_pca.svg", ) Projects the feature matrix onto its first two principal components, coloured by neutral (blue) and sweep (red). Pass ``empirical_data`` to overlay empirical windows as a third colour — a large separation between the simulation cloud and the empirical cloud indicates domain shift that may require DANN training. .. list-table:: :header-rows: 1 :widths: 25 12 63 * - Parameter - Default - Description * - ``train_data`` - required - Path to ``fvs*.parquet`` or a Polars DataFrame. Must have a ``model`` column (``neutral`` / sweep label). * - ``empirical_data`` - ``None`` - Path to empirical ``fvs*.parquet`` or DataFrame (no ``model`` column). When provided, plotted as a third distribution. * - ``subsample`` - ``5000`` - Maximum rows to use (avoids slow PCA on large datasets). * - ``output_path`` - ``None`` - Save path (SVG). If ``None``, shows interactively. Statistic distributions ~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python from flexsweep.utils import plot_stat_distributions fig = plot_stat_distributions( train_data="yri_test/fvs.parquet", empirical_data="yri_vcf/fvs_yri.parquet", stats=["pi", "h12", "ihs", "nsl", "tajima_d"], output_path="stat_distributions.svg", ) Violin plots of each statistic split by neutral, sweep, and (optionally) empirical data. This is the primary diagnostic for identifying which statistics are shifted between simulations and real data — a stat whose empirical distribution is far from both the neutral and sweep simulation distributions is a candidate to exclude from DANN training (see ``ihs`` exclusion in CLAUDE.md for an example). .. list-table:: :header-rows: 1 :widths: 25 12 63 * - Parameter - Default - Description * - ``train_data`` - required - Path to ``fvs*.parquet`` or Polars DataFrame with ``model`` column. * - ``empirical_data`` - ``None`` - Empirical ``fvs*.parquet`` or DataFrame (no ``model`` column). * - ``stats`` - all stats - List of statistic base names to plot, e.g. ``["pi", "h12", "ihs"]``. Defaults to the full Flex-sweep stat set. * - ``output_path`` - ``None`` - Save path (SVG). If ``None``, shows interactively.