Outlier scan
flexsweep scan is a standalone positive-selection scan independent of the
CNN/DANN pipeline. It computes one or more selection statistics from VCF data
and assigns each locus an empirical p-value based on the genome-wide
distribution of that statistic. No neutral simulations are required.
How it works
Each statistic is computed at its natural resolution — per-SNP statistics (iHS, nSL, etc.) produce one value per polymorphic site; sliding-window statistics (H12, LASSI, etc.) produce one value per window. Once all contigs are processed, raw values from the entire genome are pooled and ranked together to produce genome-wide empirical p-values.
Empirical p-values
For each statistic, an empirical p-value is assigned following the empirical outlier approach (Akey 2009):
\[p_{\mathrm{emp}}(x_i) = \frac{\mathrm{rank}(-x_i)}{N_{\mathrm{valid}}}\]
where the rank is computed on the negated value so that the largest statistic gets the smallest p-value (rank 1 → p ≈ 0, an outlier). \(N_{\mathrm{valid}}\) is the count of non-NaN loci. NaN values are excluded from the ranking and do not contribute to \(N_{\mathrm{valid}}\).
For signed statistics (iHS, nSL, Tajima’s D, Fay-Wu H, Zeng E), ranking is done on \(|x_i|\) before negation (that is, on \(-|x_i|\)), so that extreme values in both tails are flagged as outliers.
The output column is named {stat}_pvalue. A value close to 0 means the
locus is among the most extreme in the genome for that statistic.
Note
The empirical p-value is not an analytical p-value. It reflects the position of a locus within this genome’s distribution; it does not correspond to a controlled false-positive rate.
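The ranking rule described above can be sketched in a few lines of NumPy. This is an illustration only, not the flexsweep implementation: the function name `empirical_pvalues` and the ordinal handling of ties are assumptions made for the example.

```python
import numpy as np

def empirical_pvalues(values, signed=False):
    """Sketch of p = rank(-x) / N_valid (hypothetical helper, not flexsweep's API)."""
    x = np.asarray(values, dtype=float)
    # Signed statistics are ranked on |x| so both tails count as extreme.
    score = np.abs(x) if signed else x
    valid = ~np.isnan(score)
    pvals = np.full(x.shape, np.nan)
    n_valid = int(valid.sum())
    if n_valid == 0:
        return pvals
    # Ordinal ranks of the negated score: the largest score gets rank 1.
    # Tie handling is an assumption; the source does not specify it.
    order = np.argsort(-score[valid], kind="stable")
    ranks = np.empty(n_valid, dtype=float)
    ranks[order] = np.arange(1, n_valid + 1)
    pvals[valid] = ranks / n_valid
    return pvals

print(empirical_pvalues([0.5, 3.0, np.nan, -4.0, 1.0]))
print(empirical_pvalues([0.5, 3.0, np.nan, -4.0, 1.0], signed=True))
```

With `signed=True` the value -4.0 becomes the most extreme locus, whereas in the unsigned case it is the least extreme. The NaN entry stays NaN and does not count toward the denominator.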
Available statistics
Per-SNP statistics
One score per polymorphic site. No sliding window.
| Key | Rank column | Description |
|---|---|---|
|  |  | Integrated haplotype score (Voight et al. 2006). Detects incomplete hard sweeps via extended haplotype homozygosity. Normalized within DAF bins. Configurable: |
|  |  | Number of segregating sites by length (Ferrer-Admetlla et al. 2014). Robust alternative to iHS; no genetic map required. Normalized within DAF bins. Configurable: |
|  |  | Identifying the favored allele in a sweep (Akbari et al. 2018). Pinpoints the causal mutation within a detected sweep region. Runs on non-overlapping regions. Configurable: |
|  |  | Derived intra-allelic nucleotide diversity ratio (Barreiro et al. 2009). Configurable: |
|  |  | Frequency of high-frequency derived variants in a focal window (Lauterbur et al. 2023). Configurable: |
|  |  | Frequency of low-frequency derived variants in a focal window (Lauterbur et al. 2023). Configurable: |
|  |  | Ratio of segregating sites on derived vs. ancestral haplotypes (Lauterbur et al. 2023). Configurable: |
|  |  | Haplotype-derived allele frequency, other background (Lauterbur et al. 2023). Configurable: |
|  |  | Haplotype-derived allele frequency, sweep background (Lauterbur et al. 2023). Stricter ancestral-frequency thresholds than |
|  |  | Average pairwise haplotype homozygosity tract length H(x) (Messer 2015). Measures the mean shared haplotype block length across all sample pairs; detects hard and soft sweeps. Configurable: |
Sliding SNP-window statistics
One score per window of a fixed number of SNPs.
| Key | Rank column | Default window | Default step | Description |
|---|---|---|---|---|
|  |  | 201 SNPs | 10 SNPs | Haplotype allele frequency (Ronen et al. 2015). Mean pairwise haplotype similarity across a SNP window. Configurable via shared |
|  |  | 200 SNPs | 10 SNPs | H12 haplotype homozygosity (Garud et al. 2015). Combines the two most common haplotype frequencies. Configurable via shared |
|  |  | 200 SNPs | 10 SNPs | Full Garud statistics: H1, H12, H2/H1 (Garud et al. 2015). Configurable via shared |
|  |  | 201 SNPs | 10 SNPs | Composite likelihood sweep scan using the haplotype frequency spectrum (Harris & DeGiorgio 2020). Configurable: |
|  |  | 201 SNPs | 10 SNPs | Spatially-aware saltiLASSI (DeGiorgio & Szpiech 2022). Configurable: |
|  |  | 50 SNPs | 1 SNP | RAiSD μ composite statistic combining SFS, SNP density variation, and LD (Alachiotis & Pavlidis 2018). Configurable: |
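As an illustration of what the Garud statistics in the table measure, the following sketch computes H1, H12, and H2/H1 from the haplotypes observed in one SNP window, using the standard definitions from Garud et al. (2015). The function name `garud_h` and the string encoding of haplotypes are assumptions for the example; flexsweep's internal implementation may differ.

```python
from collections import Counter

def garud_h(haplotypes):
    """H1, H12 and H2/H1 for one window (hypothetical helper).

    haplotypes: non-empty list of hashable haplotypes, one per sampled chromosome.
    """
    counts = Counter(haplotypes)
    n = sum(counts.values())
    # Haplotype frequencies, most common first.
    freqs = sorted((c / n for c in counts.values()), reverse=True)
    p1 = freqs[0]
    p2 = freqs[1] if len(freqs) > 1 else 0.0
    h1 = sum(p * p for p in freqs)   # expected haplotype homozygosity
    h12 = h1 + 2.0 * p1 * p2         # pool the two most common haplotypes
    h2 = h1 - p1 * p1                # homozygosity excluding the most common one
    return {"h1": h1, "h12": h12, "h2_h1": h2 / h1}

window = ["AAAA"] * 5 + ["AATA"] * 3 + ["CAAA"] * 2
print(garud_h(window))
```

For the toy window above the haplotype frequencies are 0.5, 0.3, and 0.2, giving H1 = 0.38 and H12 = 0.68.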
Sliding bp-window statistics
One score per physical window. All configurable via shared w_size_bp
(default 1 Mb) and step_bp (default 10 kb), except omega, zns,
beta, and ncd which have their own narrower defaults.
| Key | Rank column | Default window | Default step | Description |
|---|---|---|---|---|
|  |  | 1 Mb | 10 kb | Tajima’s D (Tajima 1989). SFS-based test; negative values signal directional selection. Signed stat, ranked by absolute value. |
|  |  | 1 Mb | 10 kb | Nucleotide diversity θπ. |
|  |  | 1 Mb | 10 kb | Watterson’s θW. |
|  |  | 1 Mb | 10 kb | Fay & Wu’s H (Fay & Wu 2000). Sensitive to high-frequency derived alleles. Signed stat, ranked by absolute value. |
|  |  | 1 Mb | 10 kb | Zeng’s E (Zeng et al. 2006). Signed stat, ranked by absolute value. |
|  |  | 1 Mb | 10 kb | Achaz Y (Achaz 2009). Robust to sequencing errors. |
|  |  | 1 Mb | 10 kb | Fu & Li’s D (Fu & Li 1993). |
|  |  | 1 Mb | 10 kb | Fu & Li’s D* (no outgroup required). |
|  |  | 1 Mb | 10 kb | Fu & Li’s F (Fu & Li 1993). |
|  |  | 1 Mb | 10 kb | Fu & Li’s F* (no outgroup required). |
|  |  | 1 Mb | 10 kb | Composite: Tajima’s D, π, θW, Fay-Wu H in one pass. Ranked by |
|  |  | 100 kb | 10 kb | Kim & Nielsen’s ω (Kim & Nielsen 2004). LD patterns around a putative sweep centre. |
|  |  | 100 kb | 10 kb | Kelly’s ZnS (Kelly 1997). Mean pairwise r² across all SNP pairs in a window. |
|  |  | 50 kb | 5 kb | Beta1 statistic for balancing selection (Siewert & Voight 2017). Configurable: |
|  |  | 3 kb | 1.5 kb | NCD1 for balancing selection (Bitarello et al. 2018). Configurable: |
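To make the SFS-based entries more concrete, here is a minimal sketch of how π, θW, and Tajima's D can be computed for one window from per-site allele counts, using the textbook formulas of Tajima (1989). The function name `tajimas_d` and its input format are assumptions for illustration and say nothing about how flexsweep computes these values internally.

```python
import math

def tajimas_d(allele_counts, n):
    """pi, Watterson's theta and Tajima's D for one window (hypothetical helper).

    allele_counts: derived (or minor) allele count at each site in the window.
    n: number of sampled chromosomes.
    """
    if n < 2:
        raise ValueError("need at least two sampled chromosomes")
    # Keep only segregating sites.
    counts = [k for k in allele_counts if 0 < k < n]
    s = len(counts)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    # Mean number of pairwise differences.
    pi = sum(2.0 * k * (n - k) for k in counts) / (n * (n - 1))
    theta_w = s / a1
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    var = e1 * s + e2 * s * (s - 1)
    # D is undefined when there are no segregating sites or the variance vanishes.
    d = (pi - theta_w) / math.sqrt(var) if var > 0 else float("nan")
    return {"pi": pi, "theta_w": theta_w, "tajima_d": d}

print(tajimas_d([1, 1, 1, 1, 1], n=10))   # excess of rare variants
print(tajimas_d([5, 5, 5, 5, 5], n=10))   # excess of intermediate-frequency variants
```

An excess of singletons pushes π below θW and D becomes negative, which is the sweep-like signal the table refers to. An excess of intermediate-frequency variants gives a positive D.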
Window mode
SNP-count windows are required for H12, LASSI, saltiLASSI, and RAiSD — physical windows confound SNP density with haplotype diversity for those statistics. SFS-based statistics (Tajima’s D, Fay-Wu H, etc.) use physical bp windows by default.
With the default window_mode="auto", each statistic uses its built-in
mode. Pass window_mode="snp" or window_mode="bp" to force a uniform
mode across all window statistics.
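The difference between the two window modes can be sketched as follows. Both generators are hypothetical helpers written for this example; in particular, anchoring the first bp window at the first SNP position is an assumption, not documented flexsweep behaviour.

```python
import bisect

def snp_windows(n_snps, w_size, step):
    """Yield (start, end) SNP index pairs (end exclusive), each spanning w_size SNPs."""
    for start in range(0, n_snps - w_size + 1, step):
        yield start, start + w_size

def bp_windows(positions, w_size_bp, step_bp):
    """Yield (win_start, win_end, i, j) with positions[i:j] inside [win_start, win_end).

    positions must be sorted. Window anchoring is an assumption for this sketch.
    """
    if not positions:
        return
    start = positions[0]
    while start <= positions[-1]:
        end = start + w_size_bp
        i = bisect.bisect_left(positions, start)
        j = bisect.bisect_left(positions, end)
        yield start, end, i, j
        start += step_bp

positions = [100, 150, 400, 900]
print(list(snp_windows(len(positions), w_size=2, step=1)))
print([(s, e, j - i) for s, e, i, j in bp_windows(positions, 300, 300)])
```

Every SNP-count window holds the same number of SNPs, while the bp windows in the toy example hold 2, 1, and 1 SNPs. That variation in SNP density is what the haplotype-based statistics avoid by using SNP-count windows.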
Normalization
Statistics sensitive to allele frequency (iHS, nSL, DIND, high_freq, low_freq, s_ratio, hapdaf_o, hapdaf_s) are z-scored within genome-wide DAF bins before p-values are computed. This removes the frequency-dependent bias that would otherwise cause high-frequency SNPs to dominate outlier lists.
DAF-only normalization (default):
1. Compute 50 equal-frequency DAF bin edges over the genome-wide DAF distribution.
2. Assign each SNP to a bin.
3. Within each bin: z-score = (value − mean) / std. Bins with fewer than 2 SNPs are left as NaN.
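The DAF-only steps above can be sketched with NumPy. This is a simplified illustration, not the flexsweep code: the function name `daf_bin_zscore`, the use of the population standard deviation, and leaving zero-variance bins as NaN are all assumptions.

```python
import numpy as np

def daf_bin_zscore(values, daf, n_bins=50):
    """Z-score a per-SNP statistic within equal-frequency DAF bins (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    daf = np.asarray(daf, dtype=float)
    # Step 1: equal-frequency bin edges are quantiles of the genome-wide DAF.
    edges = np.quantile(daf, np.linspace(0.0, 1.0, n_bins + 1))
    # Step 2: assign each SNP to a bin id in 0..n_bins-1 using the interior edges.
    bins = np.searchsorted(edges[1:-1], daf, side="right")
    # Step 3: z-score within each bin; small or constant bins stay NaN.
    z = np.full(values.shape, np.nan)
    for b in np.unique(bins):
        idx = bins == b
        if idx.sum() < 2:
            continue
        sd = values[idx].std()  # population std (ddof=0) is an assumption
        if sd > 0:
            z[idx] = (values[idx] - values[idx].mean()) / sd
    return z

daf = [0.1, 0.1, 0.1, 0.9, 0.9, 0.9]
raw = [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]
print(daf_bin_zscore(raw, daf, n_bins=2))
```

In the toy input the high-DAF SNPs have raw values ten times larger, yet after normalization both bins yield the same z-scores, so neither frequency class dominates the ranking.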
Joint DAF × recombination rate normalization (when recombination_map
and n_r_bins are both set):
The genome is additionally stratified by recombination rate. Each SNP is
assigned to a (DAF bin, recombination-rate bin) cell and z-scored within that
cell. This further reduces false positives in low-recombination regions,
following the approach of Johnson et al. To enable it, pass both
recombination_map and n_r_bins (typically n_r_bins=10):
results = scan(
"data/vcf/",
"results/YRI",
stats=["ihs", "nsl"],
recombination_map="data/decode_sexavg_2019.txt.gz",
n_daf_bins=50,
n_r_bins=10,
)
Note
Passing only recombination_map without n_r_bins uses DAF-only
normalization with genetic-distance windows for T3 stats (dind, hapdaf,
s_ratio). Set n_r_bins explicitly to enable joint normalization.
Multi-contig usage
Pass a directory to --vcf_path to process all *.vcf.gz / *.bcf.gz
files. The scan uses a two-step approach:
1. Each contig is processed independently; raw unranked values are produced for every requested statistic.
2. Results from all contigs are concatenated per statistic. Normalization and empirical p-values are computed across all contigs together, ensuring p-values reflect the true genome-wide distribution.
flexsweep scan \
--vcf_path data/vcf/ \
--out_prefix results/YRI \
--stats ihs,nsl,h12,lassip \
--recombination_map data/decode_sexavg_2019.txt.gz \
--nthreads 4
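The two-step approach above can be illustrated with a small sketch: raw values are kept per contig, pooled, ranked once, and then split back. The function name `pool_and_rank` and the dictionary layout are assumptions for the example and do not mirror flexsweep's internal data structures.

```python
import numpy as np

def pool_and_rank(per_contig):
    """Genome-wide empirical p-values from per-contig raw values (hypothetical helper).

    per_contig: dict mapping contig name -> 1D sequence of raw statistic values.
    """
    names = list(per_contig)
    arrays = [np.asarray(per_contig[c], dtype=float) for c in names]
    # Step 2 of the scan: concatenate all contigs before ranking.
    pooled = np.concatenate(arrays)
    valid = ~np.isnan(pooled)
    n_valid = int(valid.sum())
    pvals = np.full(pooled.shape, np.nan)
    if n_valid > 0:
        order = np.argsort(-pooled[valid], kind="stable")
        ranks = np.empty(n_valid, dtype=float)
        ranks[order] = np.arange(1, n_valid + 1)
        pvals[valid] = ranks / n_valid
    # Split the pooled p-values back into per-contig arrays.
    splits = np.cumsum([a.size for a in arrays])[:-1]
    return dict(zip(names, np.split(pvals, splits)))

raw = {"chr1": [1.0, 5.0], "chr2": [3.0, np.nan, 9.0]}
print(pool_and_rank(raw))
```

Ranking each contig separately would give the top value on chr1 the smallest possible p-value for that contig, even though chr2 contains a more extreme locus. Pooling first avoids that.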
CLI reference
flexsweep scan [OPTIONS]
| Option | Default | Description |
|---|---|---|
|  | required | Directory of |
|  | required | Output prefix. Writes |
|  | required | Comma-separated stat keys, e.g. |
|  | 201 | SNP-count window size for SNP-mode stats. |
|  | 10 | SNP step size for SNP-mode stats. |
|  | 1000000 | Physical window size (bp) for bp-mode stats. |
|  | 10000 | Physical step size (bp) for bp-mode stats. |
|  | auto |  |
|  | 0.05 | Minimum minor allele frequency for iHS and nSL. |
|  | 50000 | Focal window size (bp) for per-SNP stats (dind, hapdaf_o, hapdaf_s, s_ratio, high_freq, low_freq). |
|  | None | TSV recombination map (chr, start, end, cm_mb, cm). Enables genetic-distance windows for T3 stats. |
|  | 50 | Number of equal-frequency DAF bins for normalization. |
|  | None | Number of recombination rate bins for joint DAF × recomb normalization. Set to 10 to match Johnson et al. Requires |
|  | 100000 | saltiLASSI spatial decay cutoff in bp. |
|  | 10 | K truncation for LASSI/saltiLASSI (number of HFS classes). |
|  | 4 | Sweep spectral model for LASSI/saltiLASSI (1–5; 4 = Gaussian decay). |
|  | 50 | SNP window size for RAiSD. |
|  | 1 | Number of parallel workers. |
Examples:
# iHS + nSL on a single chromosome
flexsweep scan \
--vcf_path YRI.chr22.vcf.gz \
--out_prefix results/YRI.chr22 \
--stats ihs,nsl
# H12 + saltiLASSI + RAiSD with custom SNP window
flexsweep scan \
--vcf_path YRI.chr22.vcf.gz \
--out_prefix results/YRI.chr22 \
--stats h12,lassip,raisd \
--w_size 400 \
--nthreads 4
# SFS statistics using physical bp windows
flexsweep scan \
--vcf_path YRI.chr22.vcf.gz \
--out_prefix results/YRI.chr22 \
--stats tajima_d,fay_wu_h,zeng_e,omega
# DIND + HapDAF with a larger focal window
flexsweep scan \
--vcf_path YRI.chr22.vcf.gz \
--out_prefix results/YRI.chr22 \
--stats dind,hapdaf_o,hapdaf_s \
--window_size 100000 \
--recombination_map data/decode_sexavg_2019.txt.gz
# Genome-wide scan — joint DAF × recomb normalization
flexsweep scan \
--vcf_path data/vcf/ \
--out_prefix results/YRI \
--stats ihs,nsl,h12,lassip \
--recombination_map data/decode_sexavg_2019.txt.gz \
--n_r_bins 10 \
--nthreads 4
Python API
from flexsweep.scan import scan, available_stats, stat_params
# List all available stat keys
print(available_stats())
# Inspect default parameters for all stats
stat_params()
# Inspect a single stat — shows rank_col, resolution, window_mode,
# default_window, default_step, shared_params, and stat_params
stat_params("raisd")
# {'raisd': {'rank_col': 'mu_total', 'resolution': 'window',
# 'window_mode': 'snp', 'default_window': '50 SNPs',
# 'default_step': '10 SNPs',
# 'stat_params': {'window_size': 50}, ...}}
stat_params("hscan")
# {'hscan': {'rank_col': 'hscan', 'resolution': 'snp',
# 'window_mode': 'n/a (per-SNP stat)',
# 'stat_params': {'max_gap': 200000, 'dist_mode': 0,
# 'hscan_step': 1}, ...}}
# Basic scan
results = scan(
"data/vcf/",
"results/YRI",
stats=["ihs", "nsl", "h12", "lassip"],
min_maf=0.05,
recombination_map="data/decode_sexavg_2019.txt.gz",
nthreads=4,
)
# results["ihs"] → Polars DataFrame, SNP resolution, ihs_pvalue column
# results["lassip"] → Polars DataFrame, window resolution, Lambda_pvalue column
# Joint DAF × recomb normalization
results = scan(
"data/vcf/",
"results/YRI",
stats=["ihs", "nsl"],
recombination_map="data/decode_sexavg_2019.txt.gz",
n_daf_bins=50,
n_r_bins=10,
nthreads=4,
)
# Per-stat parameter overrides via config dict
results = scan(
"data/vcf/",
"results/YRI",
stats=["lassip", "raisd", "hscan"],
config={
"lassip": {"max_extend": 5e4, "K_truncation": 15},
"raisd": {"window_size": 100},
"hscan": {"hscan_step": 5, "max_gap": 100_000},
},
nthreads=4,
)
Output format
Each statistic writes one tab-separated file {out_prefix}.{stat}.txt:
| Column | Description |
|---|---|
|  | Chromosome/contig name from VCF. |
|  | Genomic position (bp). SNP stats: SNP position. Window stats: centre of window. |
|  | Derived allele frequency (iHS, nSL, DIND, HapDAF, s_ratio, iSAFE). |
|  | Number of SNPs in window (window stats only). |
|  | Main statistic value (after DAF normalization for iHS/nSL/dind/hapdaf/s_ratio). |
|  | Empirical p-value: rank(−value) / N_valid. Range (0, 1]; smaller = more extreme. Signed stats (iHS, nSL, Tajima’s D, Fay-Wu H, Zeng E) are ranked by absolute value. |
Additional columns vary by statistic (e.g., h2_h1 for garud, m and
A for lassip, mu_var, mu_sfs, mu_ld for raisd).
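As a rough sketch of how such an output file might be post-processed, the snippet below reads a tab-separated file with the standard library and keeps rows whose empirical p-value is at or below a cutoff. The helper `load_outliers` is hypothetical, and the column layout of the toy file (`chr`, `pos`, `ihs`, `ihs_pvalue`) is assumed for illustration; check the header of your own output before relying on any column name.

```python
import csv
import os
import tempfile

def load_outliers(path, pvalue_col, alpha=0.01):
    """Return rows of a scan TSV whose p-value column is <= alpha (hypothetical helper)."""
    hits = []
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            try:
                p = float(row[pvalue_col])
            except (TypeError, ValueError):
                continue  # skip empty or non-numeric entries
            # NaN compares False, so missing p-values are never selected.
            if p <= alpha:
                hits.append(row)
    return hits

# Toy file with an assumed column layout, for demonstration only.
toy = "chr\tpos\tihs\tihs_pvalue\n22\t100\t0.1\t0.80\n22\t200\t3.2\t0.005\n22\t300\tNaN\tNaN\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as fh:
    fh.write(toy)
    toy_path = fh.name

print(load_outliers(toy_path, "ihs_pvalue", alpha=0.01))
os.unlink(toy_path)
```

Because the p-value is empirical, a cutoff such as 0.01 simply selects the top 1% of loci for that statistic. It is not a controlled false-positive rate.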
Scan results visualization
Use plot_scan from flexsweep.utils to generate Manhattan-style or
regional zoom plots directly from scan() output or saved TSV files.
Genome-wide plot:
from flexsweep.utils import plot_scan
# When passing a scan() dict, stat_cols is resolved automatically
# from STAT_REGISTRY (e.g. "raisd" → "mu_total", "lassi" → "T_m").
# No need to specify stat_cols.
# Single statistic — raw values, top 1% highlighted
plot_scan({"ihs": results["ihs"]}, out="results/YRI.ihs.png")
# Single statistic — empirical p-values (-log10 scale)
plot_scan({"ihs": results["ihs"]}, pvalue=True, out="results/YRI.ihs.png")
# Stacked multi-statistic panels — rank columns resolved automatically
plot_scan(
{k: results[k] for k in ["ihs", "h12", "lassip", "raisd"]},
pvalue=True,
out="results/YRI.multi.png",
)
# Plots: ihs, h12, Lambda (lassip), mu_total (raisd)
Regional zoom plot (raw + p-value side by side):
plot_scan(
{k: results[k] for k in ["ihs", "lassip"]},
chrom="22",
center=17_000_000,
window_bp=500_000,
out="results/YRI.zoom.png",
)
From saved TSV files (stat_cols must be provided explicitly):
plot_scan(
["results/YRI.ihs.txt", "results/YRI.lassip.txt"],
stat_cols=["ihs", "Lambda"], # must match column name in file
pvalue=True,
out="results/YRI.multi.png",
)
plot_scan parameters:
| Parameter | Default | Description |
|---|---|---|
|  | required |  |
|  | None | Stat column name(s). Required when |
|  | False | If True, plot \(-\log_{10}(p_{\mathrm{emp}})\) with threshold lines at p = 0.01 and p = 0.001. If False, plot raw values with outliers highlighted. |
|  | 0.01 | Fraction of loci highlighted as outliers in raw mode. |
|  | None | Chromosome for zoom mode. Provide together with |
|  | None | Centre position (bp) for zoom mode. |
|  | 500000 | Half-window in bp for zoom mode (±500 kb around |
|  | None | Save path. If None, shows interactively. |
|  | None | Figure size tuple. Defaults to (14, 4) genome-wide, (10, 2.5×n) zoom. |
|  | False | Share y-axis across stacked panels. |
|  | None | List of |
References
- iHS
Voight, B.F., Kudaravalli, S., Wen, X. and Pritchard, J.K. (2006) A map of recent positive selection in the human genome. PLOS Biology, 4, e72.
- nSL
Ferrer-Admetlla, A., Liang, M., Korneliussen, T. and Nielsen, R. (2014) On detecting incomplete soft or hard selective sweeps using haplotype structure. Molecular Biology and Evolution, 31, 1275–1286.
- iSAFE
Akbari, A., Vitti, J.J., Iranmehr, A., Bakhtiari, M., Sabeti, P.C., Mirarab, S. and Bafna, V. (2018) Identifying the favored mutation in a positive selective sweep. Nature Methods, 15, 183–185.
- DIND
Barreiro, L.B., Henriques, R., Soares, M.J., Oliveira, J., Gasche, C., … and Quintana-Murci, L. (2009) Evolutionary dynamics of human Toll-like receptors and their different contributions to host defense. PLOS Genetics, 5, e1000562.
- HapDAF-s/o, s_ratio, high_freq, low_freq
Lauterbur, M.E., Munch, K. and Enard, D. (2023) Versatile detection of diverse selective sweeps with Flex-sweep. Molecular Biology and Evolution, 40, msad139.
- H12, H2/H1
Garud, N.R., Messer, P.W., Buzbas, E.O. and Petrov, D.A. (2015) Recent selective sweeps in North American Drosophila melanogaster show signatures of soft sweeps. PLOS Genetics, 11, e1005004.
- h-scan
Schlamp, F. et al. (2016) Evaluating the performance of selection scans to detect selective sweeps in domestic dogs. Molecular Ecology.
- LASSI
Harris, A.M. and DeGiorgio, M. (2020) A likelihood approach for uncovering selective sweep signatures from haplotype data. Molecular Biology and Evolution.
- saltiLASSI
DeGiorgio, M. and Szpiech, Z.A. (2022) A spatially aware likelihood test to detect sweeps from haplotype distributions. PLOS Genetics, 18, e1010134.
- RAiSD
Alachiotis, N. and Pavlidis, P. (2018) RAiSD detects positive selection based on multiple signatures of a selective sweep and SNP vectors. Communications Biology, 1, 79.
- ω (omega)
Kim, Y. and Nielsen, R. (2004) Linkage disequilibrium as a signature of selective sweeps. Genetics, 167, 1513–1524.
- ZnS
Kelly, J.K. (1997) A test of neutrality based on interlocus associations. Genetics, 146, 1197–1206.
- Tajima’s D
Tajima, F. (1989) Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics, 123, 585–595.
- Fay-Wu H
Fay, J.C. and Wu, C.I. (2000) Hitchhiking under positive Darwinian selection. Genetics, 155, 1405–1413.
- Zeng E
Zeng, K., Fu, Y.X., Shi, S. and Wu, C.I. (2006) Statistical tests for detecting positive selection by utilizing high-frequency variants. Genetics, 174, 1431–1439.
- Fu-Li D, F
Fu, Y.X. and Li, W.H. (1993) Statistical tests of neutrality of mutations. Genetics, 133, 693–709.
- Beta (balancing selection)
Siewert, K.M. and Voight, B.F. (2020) BetaScan2: Standardized statistics to detect balancing selection utilizing substitution data. Genome Biology and Evolution, 12, evaa013.
- NCD1
Bitarello, B.D., de Filippo, C., Teixeira, J.C., Schmidt, J.M., Kleinert, P., Meyer, D. and Andrés, A.M. (2018) Signatures of long-term balancing selection in human genomes. The American Journal of Human Genetics, 102, 725–742.
- Outlier approach
Akey, J.M. (2009) Constructing genomic maps of positive selection in humans: where do we go from here? Genome Research, 19, 711–722.