CNN prediction
This page covers the full lifecycle of CNN/DANN predictions: checking that your feature vectors are reasonable before training, visualising genome-wide sweep probabilities, and post-processing predictions into candidate regions and gene-level rankings.
All functions are in flexsweep.utils.
Prediction visualization
Manhattan plot
from flexsweep.utils import plot_manhattan
# CNN output — plots -log10(1 - prob_sweep) by default
plot_manhattan(
"yri_vcf/predictions.csv",
out="manhattan.png",
)
# Custom column — e.g. a scan pvalue column
plot_manhattan(
df,
p_col="ihs_pvalue",
chr_col="chrom",
pos_col="pos",
log_transform=True,
out="manhattan_ihs.png",
)
Generic genome-wide Manhattan plot. Accepts CNN prediction CSV/DataFrame or any tabular data with a genomic position and a value column.
Parameter |
Default |
Description |
|---|---|---|
|
required |
File path (CSV) or Polars DataFrame. |
|
|
Column to use as the value. |
|
|
Chromosome column name. |
|
|
Position column name. |
|
|
Plot |
|
|
List of |
|
|
Figure size in inches. |
|
|
Save path. If |
|
|
Plot title. |
Sweep probability density
from flexsweep.utils import plot_sweep_density
fig = plot_sweep_density(
"yri_vcf/predictions.csv",
output_path="sweep_density.svg",
)
Per-chromosome histograms of prob_sweep. Each panel shows the
distribution of sweep probability for one contig and reports the percentage
of windows above 0.5. Useful for a quick genome-wide sanity check after
prediction.
Parameter |
Default |
Description |
|---|---|---|
|
required |
Path to prediction CSV/Parquet or Polars DataFrame. Must have columns
|
|
|
Save path (SVG). If |
Post-processing predictions
Merge candidate regions
Merge contiguous windows above a probability threshold into non-overlapping candidate sweep intervals:
from flexsweep.utils import merge_regions
df_merged, summary = merge_regions(
"yri_vcf/predictions.csv",
p=0.9,
)
print(summary) # per-chromosome: merged_span, total_span, pct
merge_regions filters windows where prob_sweep > p, merges adjacent
intervals on the same chromosome, and returns:
df_merged— Polars LazyFrame of merged intervals (chr,start,end,prob_sweep).summary— DataFrame withchr,merged_span(bp in merged intervals),total_span(total analysed bp), andpct(fraction of the analysed genome above the threshold).
Parameter |
Default |
Description |
|---|---|---|
|
required |
File path (CSV) or Polars DataFrame with columns |
|
required |
Probability threshold. Windows with |
Rank genomic features
Assign sweep probabilities to genes or other genomic features using the
nearest prediction window. Available as a Python function and via the
flexsweep rank CLI.
CLI:
flexsweep rank \
--prediction yri_vcf/predictions.csv \
--feature_coordinates genes.bed
Python:
from flexsweep.utils import rank_probabilities
df_ranked = rank_probabilities(
prediction="yri_vcf/predictions.csv",
feature_coordinates="genes.bed",
k=111,
)
For each gene (or BED feature), rank_probabilities finds the k nearest
prediction windows on the same chromosome (using a bedtools closest -k
equivalent), then assigns the maximum prob_sweep among those windows.
Genes are returned sorted by prob_sweep descending — the output is a
ranked gene list.
The feature_coordinates BED file must have columns chr, start,
end, gene_id, strand (no header, 0-based). Chromosome labels
must be numeric (1–22); the function prepends chr automatically.
Parameter |
Default |
Description |
|---|---|---|
|
required |
CNN prediction CSV or Polars DataFrame with |
|
required |
BED file path (str) or Polars DataFrame of genomic features. |
|
|
If |
|
|
Number of nearest prediction windows to consider per gene. Equivalent
to |
Training diagnostics
Use these two functions to inspect your feature vectors before or after training, particularly to check for domain shift between simulations and empirical data.
Feature vector PCA
from flexsweep.utils import plot_fv_pca
fig = plot_fv_pca(
train_data="yri_test/fvs.parquet",
empirical_data="yri_vcf/fvs_yri.parquet",
subsample=5000,
output_path="fv_pca.svg",
)
Projects the feature matrix onto its first two principal components, coloured
by neutral (blue) and sweep (red). Pass empirical_data to overlay
empirical windows as a third colour — a large separation between the
simulation cloud and the empirical cloud indicates domain shift that may
require DANN training.
Parameter |
Default |
Description |
|---|---|---|
|
required |
Path to |
|
|
Path to empirical |
|
|
Maximum rows to use (avoids slow PCA on large datasets). |
|
|
Save path (SVG). If |
Statistic distributions
from flexsweep.utils import plot_stat_distributions
fig = plot_stat_distributions(
train_data="yri_test/fvs.parquet",
empirical_data="yri_vcf/fvs_yri.parquet",
stats=["pi", "h12", "ihs", "nsl", "tajima_d"],
output_path="stat_distributions.svg",
)
Violin plots of each statistic split by neutral, sweep, and (optionally)
empirical data. This is the primary diagnostic for identifying which
statistics are shifted between simulations and real data — a stat whose
empirical distribution is far from both the neutral and sweep simulation
distributions is a candidate to exclude from DANN training (see ihs
exclusion in CLAUDE.md for an example).
Parameter |
Default |
Description |
|---|---|---|
|
required |
Path to |
|
|
Empirical |
|
all stats |
List of statistic base names to plot, e.g.
|
|
|
Save path (SVG). If |