Enrichment analysis
The enrichment analysis tests whether a user-defined set of genomic elements shows a significant excess or deficit of positive selection signals relative to matched control sets. It is an adaptation of the method from Enard et al. (2020), which was originally developed to test whether virus-interacting proteins (VIPs) accumulate sweep signals. The term VIPs appears throughout the code and original pipeline as the label for the case element set; here it refers generically to whatever set you provide.
The case set can be any genomic annotation: immune genes, disease-associated loci, regulatory elements, conserved non-coding sequences, or any other category. The key requirement is supplying relevant confounding factors for your element type so that control sets can be properly matched (see Bootstrapping below).
Biological background
Gene ranking and the enrichment curve
Any per-element selection signal can be used — for example, the CNN sweep
probability from flexsweep cnn. The score is averaged over a fixed-size
genomic window centred on each element, giving one score per element per
population. Elements are then ranked genome-wide from strongest signal
(rank 1) to weakest.
A rank threshold is a cutoff applied to these ranks. Threshold = 500 means “the top 500 elements with the strongest selection signal”. The analysis tests many thresholds simultaneously (by default spanning top 20 to top 6000).
For each threshold t and population:
Case count = number of case elements with rank ≤ t
Control mean = mean count across all matched control sets
Ratio = (case count + 0.1) / (control mean + 0.1)
Plotting the ratio against the threshold produces the enrichment curve. A ratio > 1 means case elements are over-represented among the top-ranked elements. Enrichment concentrated at stringent thresholds (small t) suggests strong, rare sweeps; enrichment spread across lenient thresholds suggests a broader, moderate signal.
Bootstrapping
A naive comparison would be confounded because many genomic properties
correlate with selection signals independently of biology — GC content, gene
length, expression level, and local recombination rate all influence sweep
scores. To avoid false positives, the bootstrap builds matched control sets:
elements drawn from a background pool (non-case elements at least
min_distance bp from any case element) such that the mean of every
confounding factor stays within ± tolerance of the case-set mean.
The choice of confounders depends on your element type. For protein-coding
genes (as in Enard et al. 2020), typical confounders are coding-sequence
density, GC content, recombination rate, expression level, conservation scores,
or dN/dS. For regulatory elements, relevant confounders might include chromatin
accessibility, sequence conservation, or distance to the nearest transcription
start site. Factors with a non-monotonic relationship to sweep prevalence (e.g.
dN/dS) can be excluded from matching. Any numeric per-element values are
accepted via the --factors file.
Matching is repeated n_runs × n_batches times, producing a distribution of
expected counts reflecting what you would see by chance given the same genomic
properties as the case set.
False Discovery Rate
The per-threshold p-values from the bootstrap are not independent — nearly the same elements appear at adjacent thresholds, so corrections across thresholds are not straightforward. Instead, the whole-curve FDR provides a single p-value by:
Genome shuffling — generating null genomes where selection ranks are randomly reassigned between elements while preserving local clustering structure (blocks of
shuffling_segselements are randomly relocated).Re-running the sweep count on each shuffled genome with the same case and control sets as the real analysis, ensuring bootstrap biases are the same in the null as in the real test.
Computing the test statistic as the total area above the control mean under the enrichment curve (summed excess across all thresholds passing filters).
Reporting the FDR p-value as the fraction of null replicates where the null statistic ≥ the real statistic.
How the pipeline works
Bootstrap control sets — builds
n_runs × n_batchessets of control elements matched to the case set on all confounding factors, keeping a minimum genomic distance from case elements. Each control set has the same size as the case set.Count sweeps — for each rank threshold and population, counts case and control elements overlapping a sweep signal, producing enrichment ratios with bootstrap confidence intervals and empirical p-values. Neighbouring elements within
cluster_distancebp can optionally be merged into sweep clusters before counting.Genome shuffling — creates a null distribution by randomly shuffling selection rank blocks across the genome
n_shufflestimes.FDR estimation — computes an empirical whole-curve FDR p-value by comparing the real enrichment statistic to the null distribution.
Quick start
Command line
flexsweep enrichment \
--sweep_files yri_ceu_ranks.tsv \
--gene_set case_elements.tsv \
--factors confounders.tsv \
--annotation genes.bed \
--populations YRI,CEU \
--groups AFR,EUR \
--thresholds 6000,5000,4000,3000,2000,1500,1000,500,200,100,50,20 \
--pop_interest All \
--n_runs 10 \
--nthreads 8
Python interface
from flexsweep.enrichment import run_enrichment
fdr_results = run_enrichment(
sweep_files=["yri_ceu_ranks.tsv"],
gene_set="case_elements.tsv",
factors_file="confounders.tsv",
annotation_file="genes.bed",
populations=["YRI", "CEU"],
groups=["AFR", "EUR"],
thresholds=[6000, 5000, 4000, 3000, 2000, 1500, 1000, 500, 200, 100, 50, 20],
pop_interest="All",
n_runs=10,
nthreads=8,
)
for df in fdr_results:
print(df)
The function prints timing information (TIMING_BOOTSTRAP,
TIMING_POSTBOOTSTRAP, TIMING_TOTAL) and a per-file FDR summary line
to stdout, and returns a list of Polars DataFrames — one per sweep file — with
columns scope, p_value, n_replicates, real_stat,
max_null_stat.
Input file formats
- sweep_files
Tab- or space-separated file (optionally gzipped). First column:
gene_id. Remaining columns: per-population sweep ranks (integers). Column order must match--populations.- gene_set
Two-column TSV (no header):
gene_id,label(yes/no). Elements labelledyesare the case set;noare the control-eligible pool. The file argument is namedgene_setand the labelsyes/nofollow the convention of the original Enard et al. (2020) VIP pipeline.- factors
Tab- or space-separated file (no header). First column:
gene_id. Remaining columns: numeric confounding factor values used for control matching.- annotation
BED file (0-based, no header):
chr,start,end,gene_id. Used for genomic distance computation and genome shuffling.
Parameters
Option |
Default |
Description |
|---|---|---|
|
required |
Comma-separated paths to sweep rank files. |
|
required |
TSV with |
|
required |
TSV confounding factors file. |
|
required |
BED gene coordinates file (0-based, no header). |
|
required |
Comma-separated population codes matching sweep file column order. |
|
required |
Comma-separated group labels (same length as |
|
required |
Comma-separated rank thresholds for the enrichment curve. |
|
|
Population or group scope for FDR output. Use |
|
|
Maximum bp distance between elements to count them as neighbours when clustering sweep events. |
|
|
Number of bootstrap batches. Total control sets =
|
|
|
Allowed ± fraction deviation in factor averages when matching control elements to the case set. |
|
|
Minimum bp distance from any case element for an element to be eligible as a control. |
|
|
Flip test direction when the control pool is smaller than the case set. Increases statistical power in that scenario. |
|
|
Maximum average resamples per control element across all bootstrap sets. |
|
|
Number of parallel workers (joblib). |
|
|
Number of FDR shuffle replicates. Must be a multiple of 8. |
|
|
Size of each genomic shuffle block in elements. Corresponds to
|
|
|
Folder containing pre-computed bootstrap output in the format produced
by the Enard et al. (2020) Perl pipeline ( |
Key parameter notes
--min_distance(default 1.25 Mb) prevents nearby elements from acting as controls, ensuring controls are independent of the case set.--tolerance(default 0.05 = ±5%) controls how strictly controls must match the case set on each confounding factor. Loosen it (e.g. 0.1) if the bootstrap produces very few valid control sets.--shuffling_segsis the size of each genomic shuffle block (in elements), not the number of blocks. With ~16,000 valid genes andshuffling_segs=2, the genome is cut into ~8,000 two-element blocks that are randomly re-arranged. Note: the Enard et al. (2020) Perl manual describes this parameter as “Number of segments to shuffle”, which is incorrect — it is the block size.--n_shufflesmust be a multiple of 8 (parallelism constraint).
Understanding the output
run_enrichment returns one Polars DataFrame per sweep file. Each row
represents the whole-curve FDR result for one population/group scope:
Column |
Meaning |
|---|---|
|
Population code, group name, or |
|
Empirical FDR p-value. Small values (< 0.05) indicate significant enrichment of sweep signals in the case set. Large values (> 0.95) indicate depletion. |
|
Number of null shuffles used to estimate the p-value. |
|
The observed test statistic — total area above the control mean under the enrichment curve, summed across all thresholds passing the filters. Larger = more enrichment. |
|
The maximum test statistic across all null shuffle replicates. If
|
The intermediate per-threshold enrichment curve (ratio, p-value per threshold,
confidence intervals) is produced internally by count_sweeps_multipop and
can be accessed directly via the Python API if needed.
Interpreting the enrichment curve
Ratio > 1 — case elements are over-represented among the top-ranked elements at that threshold. The further above 1, the stronger the signal.
Ratio ≈ 1 — case elements behave like controls; no excess of selection.
Ratio < 1 — case elements are under-represented (possible depletion).
Enrichment at stringent thresholds (top 50–200) points to a small number of elements with very strong sweep signals.
Enrichment at lenient thresholds (top 1000–6000) suggests a broader, moderate signal across the set.
Clustering note (``use_clust``)
By default, elements within cluster_distance bp of each other are merged
into a single genomic cluster before counting, preventing the same sweep event
from being counted multiple times. The count_sweeps=True Python-API option
counts distinct sweep events (connected components of the neighbour graph)
rather than individual elements — this extension is not present in the original
Enard et al. (2020) Perl pipeline.
Notes and constraints
Gene IDs must start with
ENSG(Ensembl format).n_shufflesmust be a multiple of 8.n_batches(Iterations_numberin the Enard et al. (2020) Perl pipeline) must be a multiple of 10.The maximum useful rank threshold is 2000 (hardcoded pipeline limit for speed).
shuffling_segsis the block size in elements, not the number of blocks. The Enard et al. (2020) Perl manual describes it as “Number of segments to shuffle” — this is incorrect.HLA-region genes and histone genes should be excluded from the case set before passing to the pipeline.
Use
--pop_interest All(the default) to reproduce theAll:FDR scope from the Enard et al. (2020) Perl pipeline, which accumulates the test statistic over all individual populations.count_sweeps=True(Python API only) counts sweep events via connected-component analysis instead of individual elements. The Enard et al. (2020) Perl pipeline always counts elements.