Enrichment analysis

The enrichment analysis tests whether a user-defined set of genomic elements shows a significant excess or deficit of positive selection signals relative to matched control sets. It is an adaptation of the method from Enard et al. (2020), which was originally developed to test whether virus-interacting proteins (VIPs) accumulate sweep signals. The term VIPs appears throughout the code and original pipeline as the label for the case element set; here it refers generically to whatever set you provide.

The case set can be any genomic annotation: immune genes, disease-associated loci, regulatory elements, conserved non-coding sequences, or any other category. The key requirement is supplying relevant confounding factors for your element type so that control sets can be properly matched (see Bootstrapping below).

Biological background

Gene ranking and the enrichment curve

Any per-element selection signal can be used — for example, the CNN sweep probability from flexsweep cnn. The score is averaged over a fixed-size genomic window centred on each element, giving one score per element per population. Elements are then ranked genome-wide from strongest signal (rank 1) to weakest.

A rank threshold is a cutoff applied to these ranks. Threshold = 500 means “the top 500 elements with the strongest selection signal”. The analysis tests many thresholds simultaneously (by default spanning top 20 to top 6000).

For each threshold t and population:

Case count = number of case elements with rank ≤ t
Control mean = mean count across all matched control sets
Ratio = (case count + 0.1) / (control mean + 0.1)

Plotting the ratio against the threshold produces the enrichment curve. A ratio > 1 means case elements are over-represented among the top-ranked elements. Enrichment concentrated at stringent thresholds (small t) suggests strong, rare sweeps; enrichment spread across lenient thresholds suggests a broader, moderate signal.

Bootstrapping

A naive comparison would be confounded because many genomic properties correlate with selection signals independently of biology — GC content, gene length, expression level, and local recombination rate all influence sweep scores. To avoid false positives, the bootstrap builds matched control sets: elements drawn from a background pool (non-case elements at least min_distance bp from any case element) such that the mean of every confounding factor stays within ± tolerance of the case-set mean.

The choice of confounders depends on your element type. For protein-coding genes (as in Enard et al. 2020), typical confounders are coding-sequence density, GC content, recombination rate, expression level, conservation scores, or dN/dS. For regulatory elements, relevant confounders might include chromatin accessibility, sequence conservation, or distance to the nearest transcription start site. Factors with a non-monotonic relationship to sweep prevalence (e.g. dN/dS) can be excluded from matching. Any numeric per-element values are accepted via the --factors file.

Matching is repeated n_runs × n_batches times, producing a distribution of expected counts reflecting what you would see by chance given the same genomic properties as the case set.

False Discovery Rate

The per-threshold p-values from the bootstrap are not independent — nearly the same elements appear at adjacent thresholds, so corrections across thresholds are not straightforward. Instead, the whole-curve FDR provides a single p-value by:

Genome shuffling — generating null genomes where selection ranks are randomly reassigned between elements while preserving local clustering structure (blocks of shuffling_segs elements are randomly relocated).
Re-running the sweep count on each shuffled genome with the same case and control sets as the real analysis, ensuring bootstrap biases are the same in the null as in the real test.
Computing the test statistic as the total area above the control mean under the enrichment curve (summed excess across all thresholds passing filters).
Reporting the FDR p-value as the fraction of null replicates where the null statistic ≥ the real statistic.

How the pipeline works

Bootstrap control sets — builds n_runs × n_batches sets of control elements matched to the case set on all confounding factors, keeping a minimum genomic distance from case elements. Each control set has the same size as the case set.
Count sweeps — for each rank threshold and population, counts case and control elements overlapping a sweep signal, producing enrichment ratios with bootstrap confidence intervals and empirical p-values. Neighbouring elements within cluster_distance bp can optionally be merged into sweep clusters before counting.
Genome shuffling — creates a null distribution by randomly shuffling selection rank blocks across the genome n_shuffles times.
FDR estimation — computes an empirical whole-curve FDR p-value by comparing the real enrichment statistic to the null distribution.

Quick start

Command line

flexsweep enrichment \
    --sweep_files yri_ceu_ranks.tsv \
    --gene_set case_elements.tsv \
    --factors confounders.tsv \
    --annotation genes.bed \
    --populations YRI,CEU \
    --groups AFR,EUR \
    --thresholds 6000,5000,4000,3000,2000,1500,1000,500,200,100,50,20 \
    --pop_interest All \
    --n_runs 10 \
    --nthreads 8

Python interface

from flexsweep.enrichment import run_enrichment

fdr_results = run_enrichment(
    sweep_files=["yri_ceu_ranks.tsv"],
    gene_set="case_elements.tsv",
    factors_file="confounders.tsv",
    annotation_file="genes.bed",
    populations=["YRI", "CEU"],
    groups=["AFR", "EUR"],
    thresholds=[6000, 5000, 4000, 3000, 2000, 1500, 1000, 500, 200, 100, 50, 20],
    pop_interest="All",
    n_runs=10,
    nthreads=8,
)

for df in fdr_results:
    print(df)

The function prints timing information (TIMING_BOOTSTRAP, TIMING_POSTBOOTSTRAP, TIMING_TOTAL) and a per-file FDR summary line to stdout, and returns a list of Polars DataFrames — one per sweep file — with columns scope, p_value, n_replicates, real_stat, max_null_stat.

Input file formats

sweep_files: Tab- or space-separated file (optionally gzipped). First column: gene_id. Remaining columns: per-population sweep ranks (integers). Column order must match --populations.
gene_set: Two-column TSV (no header): gene_id, label (yes/no). Elements labelled yes are the case set; no are the control-eligible pool. The file argument is named gene_set and the labels yes/no follow the convention of the original Enard et al. (2020) VIP pipeline.
factors: Tab- or space-separated file (no header). First column: gene_id. Remaining columns: numeric confounding factor values used for control matching.
annotation: BED file (0-based, no header): chr, start, end, gene_id. Used for genomic distance computation and genome shuffling.

Parameters

Option	Default	Description
`--sweep_files`	required	Comma-separated paths to sweep rank files.
`--gene_set`	required	TSV with `gene_id` and `yes`/`no` label columns.
`--factors`	required	TSV confounding factors file.
`--annotation`	required	BED gene coordinates file (0-based, no header).
`--populations`	required	Comma-separated population codes matching sweep file column order.
`--groups`	required	Comma-separated group labels (same length as `--populations`).
`--thresholds`	required	Comma-separated rank thresholds for the enrichment curve.
`--pop_interest`	`All`	Population or group scope for FDR output. Use `All` to aggregate across all populations.
`--cluster_distance`	`500000`	Maximum bp distance between elements to count them as neighbours when clustering sweep events.
`--n_runs`	`10`	Number of bootstrap batches. Total control sets = `n_runs × iterations_per_run`.
`--tolerance`	`0.05`	Allowed ± fraction deviation in factor averages when matching control elements to the case set.
`--min_distance`	`1250000`	Minimum bp distance from any case element for an element to be eligible as a control.
`--flip`	`False`	Flip test direction when the control pool is smaller than the case set. Increases statistical power in that scenario.
`--max_rep`	`25`	Maximum average resamples per control element across all bootstrap sets.
`--nthreads`	`1`	Number of parallel workers (joblib).
`--n_shuffles`	`8`	Number of FDR shuffle replicates. Must be a multiple of 8.
`--shuffling_segs`	`2`	Size of each genomic shuffle block in elements. Corresponds to `Shuffling_segments_number` in the Enard et al. (2020) Perl pipeline.
`--bootstrap_dir`	`""`	Folder containing pre-computed bootstrap output in the format produced by the Enard et al. (2020) Perl pipeline (`VIPs/file_1` and `nonVIPs/file_1`). When provided, the bootstrap step is skipped.

Key parameter notes

--min_distance (default 1.25 Mb) prevents nearby elements from acting as controls, ensuring controls are independent of the case set.
--tolerance (default 0.05 = ±5%) controls how strictly controls must match the case set on each confounding factor. Loosen it (e.g. 0.1) if the bootstrap produces very few valid control sets.
--shuffling_segs is the size of each genomic shuffle block (in elements), not the number of blocks. With ~16,000 valid genes and shuffling_segs=2, the genome is cut into ~8,000 two-element blocks that are randomly re-arranged. Note: the Enard et al. (2020) Perl manual describes this parameter as “Number of segments to shuffle”, which is incorrect — it is the block size.
--n_shuffles must be a multiple of 8 (parallelism constraint).

Understanding the output

run_enrichment returns one Polars DataFrame per sweep file. Each row represents the whole-curve FDR result for one population/group scope:

Column	Meaning
`scope`	Population code, group name, or `"All"` (aggregate across all populations).
`p_value`	Empirical FDR p-value. Small values (< 0.05) indicate significant enrichment of sweep signals in the case set. Large values (> 0.95) indicate depletion.
`n_replicates`	Number of null shuffles used to estimate the p-value.
`real_stat`	The observed test statistic — total area above the control mean under the enrichment curve, summed across all thresholds passing the filters. Larger = more enrichment.
`max_null_stat`	The maximum test statistic across all null shuffle replicates. If `real_stat > max_null_stat`, the FDR p-value is 0 / `n_shuffles`.

The intermediate per-threshold enrichment curve (ratio, p-value per threshold, confidence intervals) is produced internally by count_sweeps_multipop and can be accessed directly via the Python API if needed.

Interpreting the enrichment curve

Ratio > 1 — case elements are over-represented among the top-ranked elements at that threshold. The further above 1, the stronger the signal.
Ratio ≈ 1 — case elements behave like controls; no excess of selection.
Ratio < 1 — case elements are under-represented (possible depletion).
Enrichment at stringent thresholds (top 50–200) points to a small number of elements with very strong sweep signals.
Enrichment at lenient thresholds (top 1000–6000) suggests a broader, moderate signal across the set.

Clustering note (``use_clust``)

By default, elements within cluster_distance bp of each other are merged into a single genomic cluster before counting, preventing the same sweep event from being counted multiple times. The count_sweeps=True Python-API option counts distinct sweep events (connected components of the neighbour graph) rather than individual elements — this extension is not present in the original Enard et al. (2020) Perl pipeline.

Notes and constraints

Gene IDs must start with ENSG (Ensembl format).
n_shuffles must be a multiple of 8.
n_batches (Iterations_number in the Enard et al. (2020) Perl pipeline) must be a multiple of 10.
The maximum useful rank threshold is 2000 (hardcoded pipeline limit for speed).
shuffling_segs is the block size in elements, not the number of blocks. The Enard et al. (2020) Perl manual describes it as “Number of segments to shuffle” — this is incorrect.
HLA-region genes and histone genes should be excluded from the case set before passing to the pipeline.
Use --pop_interest All (the default) to reproduce the All: FDR scope from the Enard et al. (2020) Perl pipeline, which accumulates the test statistic over all individual populations.
count_sweeps=True (Python API only) counts sweep events via connected-component analysis instead of individual elements. The Enard et al. (2020) Perl pipeline always counts elements.