Outlier scan

flexsweep scan is a standalone positive-selection scan independent of the CNN/DANN pipeline. It computes one or more selection statistics from VCF data and assigns each locus an empirical p-value based on the genome-wide distribution of that statistic. No neutral simulations are required.

How it works

Each statistic is computed at its natural resolution — per-SNP statistics (iHS, nSL, etc.) produce one value per polymorphic site; sliding-window statistics (H12, LASSI, etc.) produce one value per window. Once all contigs are processed, raw values from the entire genome are pooled and ranked together to produce genome-wide empirical p-values.

Empirical p-values

For each statistic an empirical p-value is assigned following the empirical outlier approach (Akey 2009):

\[p_i = \frac{\mathrm{rank}(-x_i)}{N_{\mathrm{valid}}}\]

where rank is computed on the negative value so that the largest statistic gets the smallest p-value (rank 1 → p ≈ 0, outlier). \(N_{\mathrm{valid}}\) is the count of non-NaN loci. NaN values are excluded from the ranking and do not contribute to \(N_{\mathrm{valid}}\).

For signed statistics (iHS, nSL, Tajima’s D, Fay-Wu H, Zeng E), ranking is done on \(|x_i|\) before convert to negative, so that extreme values at both tails are flagged as outliers.

The output column is named {stat}_pvalue. A value close to 0 means the locus is among the most extreme in the genome for that statistic.

Note

The empirical p-value is not an analytical p-value. It reflects the position of a locus within this genome’s distribution; it does not correspond to a controlled false-positive rate.

Available statistics

Per-SNP statistics

One score per polymorphic site. No sliding window.

Key	Rank column	Description
`ihs`	`ihs`	Integrated haplotype score (Voight et al. 2006). Detects incomplete hard sweeps via extended haplotype homozygosity. Normalized within DAF bins. Configurable: `min_maf` (0.05), `include_edges` (False), `gap_scale` (20000), `max_gap` (200000).
`nsl`	`nsl`	Number of segregating sites by length (Ferrer-Admetlla et al. 2014). Robust alternative to iHS; no genetic map required. Normalized within DAF bins. Configurable: `min_maf` (0.05).
`isafe`	`isafe`	Identifying the favored allele in a sweep (Akbari et al. 2018). Pinpoints the causal mutation within a detected sweep region. Runs on non-overlapping regions. Configurable: `region_size_bp` (1000000), `isafe_window` (300), `isafe_step` (150), `top_k` (1), `max_rank` (15).
`dind`	`dind`	Derived intra-allelic nucleotide diversity ratio (Barreiro et al. 2009). Configurable: `window_size` (50000), `min_focal_freq` (0.25), `max_focal_freq` (0.95).
`high_freq`	`high_freq`	Frequency of high-frequency derived variants in a focal window (Lauterbur et al. 2023). Configurable: `window_size` (50000), `min_focal_freq` (0.25), `max_focal_freq` (0.95).
`low_freq`	`low_freq`	Frequency of low-frequency derived variants in a focal window (Lauterbur et al. 2023). Configurable: `window_size` (50000), `min_focal_freq` (0.25), `max_focal_freq` (0.95).
`s_ratio`	`s_ratio`	Ratio of segregating sites on derived vs. ancestral haplotypes (Lauterbur et al. 2023). Configurable: `window_size` (50000), `min_focal_freq` (0.25), `max_focal_freq` (0.95).
`hapdaf_o`	`hapdaf_o`	Haplotype-derived allele frequency, other background (Lauterbur et al. 2023). Configurable: `window_size` (50000), `min_focal_freq` (0.25), `max_focal_freq` (0.95), `max_ancest_freq` (0.25), `min_tot_freq` (0.25).
`hapdaf_s`	`hapdaf_s`	Haplotype-derived allele frequency, sweep background (Lauterbur et al. 2023). Stricter ancestral-frequency thresholds than `hapdaf_o`. Configurable: `window_size` (50000), `min_focal_freq` (0.25), `max_focal_freq` (0.95), `max_ancest_freq` (0.10), `min_tot_freq` (0.10).
`hscan`	`hscan`	Average pairwise haplotype homozygosity tract length H(x) (Messer 2015). Measures the mean shared haplotype block length across all sample pairs; detects hard and soft sweeps. Configurable: `max_gap` (200000), `dist_mode` (0), `hscan_step` (1). (Use `hscan_step` (not `step`) to control scan resolution. `step` is a shared SNP-window parameter and is not used by hscan)

Sliding SNP-window statistics

One score per window of a fixed number of SNPs.

Key	Rank column	Default window	Default step	Description
`haf`	`haf`	201 SNPs	10 SNPs	Haplotype allele frequency (Ronen et al. 2015). Mean pairwise haplotype similarity across a SNP window. Configurable via shared `w_size` and `step`.
`h12`	`h12`	200 SNPs	10 SNPs	H12 haplotype homozygosity (Garud et al. 2015). Combines the two most common haplotype frequencies. Configurable via shared `w_size` and `step`.
`garud`	`h12`	200 SNPs	10 SNPs	Full Garud statistics: H1, H12, H2/H1 (Garud et al. 2015). Configurable via shared `w_size` and `step`.
`lassi`	`T_m`	201 SNPs	10 SNPs	Composite likelihood sweep scan using the haplotype frequency spectrum (DeGiorgio et al. 2014). Configurable: `K_truncation` (10), `sweep_mode` (4), and shared `w_size`, `step`.
`lassip`	`Lambda`	201 SNPs	10 SNPs	Spatially-aware saltiLASSI (DeGiorgio & Szpiech 2022). Configurable: `K_truncation` (10), `sweep_mode` (4), `max_extend` (100000), `n_A` (100), and shared `w_size`, `step`.
`raisd`	`mu_total`	50 SNPs	1 SNP	RAiSD μ composite statistic combining SFS, SNP density variation, and LD (Alachiotis & Pavlidis 2018). Configurable: `window_size` (50). Additional output columns: `mu_var`, `mu_sfs`, `mu_ld`.

Sliding bp-window statistics

One score per physical window. All configurable via shared w_size_bp (default 1 Mb) and step_bp (default 10 kb), except omega, zns, beta, and ncd which have their own narrower defaults.

Key	Rank column	Default window	Default step	Description
`tajima_d`	`tajima_d`	1 Mb	10 kb	Tajima’s D (Tajima 1989). SFS-based test; negative values signal directional selection. Signed stat — ranked by `abs(value)`.
`pi`	`pi`	1 Mb	10 kb	Nucleotide diversity θ_π.
`theta_w`	`theta_w`	1 Mb	10 kb	Watterson’s θ_W.
`fay_wu_h`	`fay_wu_h`	1 Mb	10 kb	Fay & Wu’s H (Fay & Wu 2000). Sensitive to high-frequency derived alleles. Signed stat — ranked by `abs(value)`.
`zeng_e`	`zeng_e`	1 Mb	10 kb	Zeng’s E (Zeng et al. 2006). Signed stat — ranked by `abs(value)`.
`achaz_y`	`achaz_y`	1 Mb	10 kb	Achaz Y (Achaz 2009). Robust to sequencing errors.
`fuli_d`	`fuli_d`	1 Mb	10 kb	Fu & Li’s D (Fu & Li 1993).
`fuli_d_star`	`fuli_d_star`	1 Mb	10 kb	Fu & Li’s D* (no outgroup required).
`fuli_f`	`fuli_f`	1 Mb	10 kb	Fu & Li’s F (Fu & Li 1993).
`fuli_f_star`	`fuli_f_star`	1 Mb	10 kb	Fu & Li’s F* (no outgroup required).
`neutrality`	`tajima_d`	1 Mb	10 kb	Composite: Tajima’s D, π, θ_W, Fay-Wu H in one pass. Ranked by `abs(tajima_d)`.
`omega`	`omega_max`	100 kb	10 kb	Kim & Nielsen’s ω (Kim & Nielsen 2004). LD patterns around a putative sweep centre.
`zns`	`zns`	100 kb	10 kb	Kelly’s Z_nS (Kelly 1997). Mean pairwise r² across all SNP pairs in a window.
`beta`	`beta1`	50 kb	5 kb	Beta1 statistic for balancing selection (Siewert & Voight 2017). Configurable: `m` (0.1).
`ncd`	`ncd1`	3 kb	1.5 kb	NCD1 for balancing selection (Bitarello et al. 2018). Configurable: `tf` (0.5), `w` (3000), `minIS` (2).

Window mode

SNP-count windows are required for H12, LASSI, saltiLASSI, and RAiSD — physical windows confound SNP density with haplotype diversity for those statistics. SFS-based statistics (Tajima’s D, Fay-Wu H, etc.) use physical bp windows by default.

With the default window_mode="auto", each statistic uses its built-in mode. Pass window_mode="snp" or window_mode="bp" to force a uniform mode across all window statistics.

Normalization

Statistics sensitive to allele frequency (iHS, nSL, DIND, high_freq, low_freq, s_ratio, hapdaf_o, hapdaf_s) are z-scored within genome-wide DAF bins before p-values are computed. This removes the frequency-dependent bias that would otherwise cause high-frequency SNPs to dominate outlier lists.

DAF-only normalization (default):

Compute 50 equal-frequency DAF bin edges over the genome-wide DAF distribution.
Assign each SNP to a bin.
Within each bin: z-score = (value − mean) / std. Bins with fewer than 2 SNPs are left as NaN.

Joint DAF × recombination rate normalization (when recombination_map and n_r_bins are both set):

The genome is additionally stratified by recombination rate. Each SNP is assigned to a (DAF bin, recomb rate bin) cell and z-scored within that cell. This further reduces false positives in low-recombination regions (Johnson et al. approach). To enable it, pass both recombination_map and n_r_bins (typically n_r_bins=10):

results = scan(
    "data/vcf/",
    "results/YRI",
    stats=["ihs", "nsl"],
    recombination_map="data/decode_sexavg_2019.txt.gz",
    n_daf_bins=50,
    n_r_bins=10,
)

Note

Passing only recombination_map without n_r_bins uses DAF-only normalization with genetic-distance windows for T3 stats (dind, hapdaf, s_ratio). Set n_r_bins explicitly to enable joint normalization.

Multi-contig usage

Pass a directory to --vcf_path to process all *.vcf.gz / *.bcf.gz files. The scan uses a two-step approach:

1. Each contig is processed independently; raw unranked values are produced for every requested statistic.

2. Results from all contigs are concatenated per statistic. Normalization and empirical p-values are computed across all contigs together, ensuring p-values reflect the true genome-wide distribution.

flexsweep scan \
    --vcf_path data/vcf/ \
    --out_prefix results/YRI \
    --stats ihs,nsl,h12,lassip \
    --recombination_map data/decode_sexavg_2019.txt.gz \
    --nthreads 4

CLI reference

flexsweep scan [OPTIONS]

Option	Default	Description
`--vcf_path PATH`	required	Directory of `*.vcf.gz` files (one per chromosome/contig).
`--out_prefix PREFIX`	required	Output prefix. Writes `{PREFIX}.{stat}.txt` for each stat.
`--stats LIST`	required	Comma-separated stat keys, e.g. `ihs,nsl,h12,lassip`.
`--w_size INT`	201	SNP-count window size for SNP-mode stats.
`--step INT`	10	SNP step size for SNP-mode stats.
`--w_size_bp INT`	1000000	Physical window size (bp) for bp-mode stats.
`--step_bp INT`	10000	Physical step size (bp) for bp-mode stats.
`--window_mode`	auto	`auto` uses per-stat defaults; `snp` forces SNP-count windows; `bp` forces physical bp windows for all window stats.
`--min_maf FLOAT`	0.05	Minimum minor allele frequency for iHS and nSL.
`--window_size INT`	50000	Focal window size (bp) for per-SNP stats (dind, hapdaf_o, hapdaf_s, s_ratio, high_freq, low_freq).
`--recombination_map PATH`	None	TSV recombination map (chr, start, end, cm_mb, cm). Enables genetic-distance windows for T3 stats.
`--n_daf_bins INT`	50	Number of equal-frequency DAF bins for normalization.
`--n_r_bins INT`	None	Number of recombination rate bins for joint DAF × recomb normalization. Set to 10 to match Johnson et al. Requires `--recombination_map`.
`--max_extend FLOAT`	100000	saltiLASSI spatial decay cutoff in bp.
`--K_truncation INT`	10	K truncation for LASSI/saltiLASSI (number of HFS classes).
`--sweep_mode INT`	4	Sweep spectral model for LASSI/saltiLASSI (1–5; 4 = Gaussian decay).
`--raisd_window INT`	50	SNP window size for RAiSD.
`--nthreads INT`	1	Number of parallel workers.

Examples:

# iHS + nSL on a single chromosome
flexsweep scan \
    --vcf_path YRI.chr22.vcf.gz \
    --out_prefix results/YRI.chr22 \
    --stats ihs,nsl

# H12 + saltiLASSI + RAiSD with custom SNP window
flexsweep scan \
    --vcf_path YRI.chr22.vcf.gz \
    --out_prefix results/YRI.chr22 \
    --stats h12,lassip,raisd \
    --w_size 400 \
    --nthreads 4

# SFS statistics using physical bp windows
flexsweep scan \
    --vcf_path YRI.chr22.vcf.gz \
    --out_prefix results/YRI.chr22 \
    --stats tajima_d,fay_wu_h,zeng_e,omega

# DIND + HapDAF with a larger focal window
flexsweep scan \
    --vcf_path YRI.chr22.vcf.gz \
    --out_prefix results/YRI.chr22 \
    --stats dind,hapdaf_o,hapdaf_s \
    --window_size 100000 \
    --recombination_map data/decode_sexavg_2019.txt.gz

# Genome-wide scan — joint DAF × recomb normalization
flexsweep scan \
    --vcf_path data/vcf/ \
    --out_prefix results/YRI \
    --stats ihs,nsl,h12,lassip \
    --recombination_map data/decode_sexavg_2019.txt.gz \
    --n_r_bins 10 \
    --nthreads 4

Python API

from flexsweep.scan import scan, available_stats, stat_params

# List all available stat keys
print(available_stats())

# Inspect default parameters for all stats
stat_params()

# Inspect a single stat — shows rank_col, resolution, window_mode,
# default_window, default_step, shared_params, and stat_params
stat_params("raisd")
# {'raisd': {'rank_col': 'mu_total', 'resolution': 'window',
#            'window_mode': 'snp', 'default_window': '50 SNPs',
#            'default_step': '10 SNPs',
#            'stat_params': {'window_size': 50}, ...}}

stat_params("hscan")
# {'hscan': {'rank_col': 'hscan', 'resolution': 'snp',
#            'window_mode': 'n/a (per-SNP stat)',
#            'stat_params': {'max_gap': 200000, 'dist_mode': 0,
#                            'hscan_step': 1}, ...}}

# Basic scan
results = scan(
    "data/vcf/",
    "results/YRI",
    stats=["ihs", "nsl", "h12", "lassip"],
    min_maf=0.05,
    recombination_map="data/decode_sexavg_2019.txt.gz",
    nthreads=4,
)
# results["ihs"]    → Polars DataFrame, SNP resolution, ihs_pvalue column
# results["lassip"] → Polars DataFrame, window resolution, Lambda_pvalue column

# Joint DAF × recomb normalization
results = scan(
    "data/vcf/",
    "results/YRI",
    stats=["ihs", "nsl"],
    recombination_map="data/decode_sexavg_2019.txt.gz",
    n_daf_bins=50,
    n_r_bins=10,
    nthreads=4,
)

# Per-stat parameter overrides via config dict
results = scan(
    "data/vcf/",
    "results/YRI",
    stats=["lassip", "raisd", "hscan"],
    config={
        "lassip": {"max_extend": 5e4, "K_truncation": 15},
        "raisd":  {"window_size": 100},
        "hscan":  {"hscan_step": 5, "max_gap": 100_000},
    },
    nthreads=4,
)

Output format

Each statistic writes one tab-separated file {out_prefix}.{stat}.txt:

Column	Description
`chrom`	Chromosome/contig name from VCF.
`pos`	Genomic position (bp). SNP stats: SNP position. Window stats: centre of window.
`daf`	Derived allele frequency (iHS, nSL, DIND, HapDAF, s_ratio, iSAFE).
`n_snps`	Number of SNPs in window (window stats only).
`{stat_col}`	Main statistic value (after DAF normalization for iHS/nSL/dind/hapdaf/ s_ratio).
`{stat_col}_pvalue`	Empirical p-value: rank(−value) / N_valid. Range (0, 1]; smaller = more extreme. Signed stats (iHS, nSL, Tajima’s D, Fay-Wu H, Zeng E) are ranked by `abs(value)`.

Additional columns vary by statistic (e.g., h2_h1 for garud, m and A for lassip, mu_var, mu_sfs, mu_ld for raisd).

Scan results visualization

Use plot_scan from flexsweep.utils to generate Manhattan-style or regional zoom plots directly from scan() output or saved TSV files.

Genome-wide plot:

from flexsweep.utils import plot_scan

# When passing a scan() dict, stat_cols is resolved automatically
# from STAT_REGISTRY (e.g. "raisd" → "mu_total", "lassi" → "T_m").
# No need to specify stat_cols.

# Single statistic — raw values, top 1% highlighted
plot_scan({"ihs": results["ihs"]}, out="results/YRI.ihs.png")

# Single statistic — empirical p-values (-log10 scale)
plot_scan({"ihs": results["ihs"]}, pvalue=True, out="results/YRI.ihs.png")

# Stacked multi-statistic panels — rank columns resolved automatically
plot_scan(
    {k: results[k] for k in ["ihs", "h12", "lassip", "raisd"]},
    pvalue=True,
    out="results/YRI.multi.png",
)
# Plots: ihs, h12, Lambda (lassip), mu_total (raisd)

Regional zoom plot (raw + p-value side by side):

plot_scan(
    {k: results[k] for k in ["ihs", "lassip"]},
    chrom="22",
    center=17_000_000,
    window_bp=500_000,
    out="results/YRI.zoom.png",
)

From saved TSV files (stat_cols must be provided explicitly):

plot_scan(
    ["results/YRI.ihs.txt", "results/YRI.lassip.txt"],
    stat_cols=["ihs", "Lambda"],   # must match column name in file
    pvalue=True,
    out="results/YRI.multi.png",
)

plot_scan parameters:

Parameter	Default	Description
`stats`	required	`dict` from `scan()`, a single TSV path, or a list of TSV paths.
`stat_cols`	None	Stat column name(s). Required when `stats` is a file path / list. Defaults to all keys when `stats` is a dict.
`pvalue`	False	If True, plot \(-\log_{10}(p_{\mathrm{emp}})\) with threshold lines at p = 0.01 and p = 0.001. If False, plot raw values with outliers highlighted.
`top_pct`	0.01	Fraction of loci highlighted as outliers in raw mode.
`chrom`	None	Chromosome for zoom mode. Provide together with `center`.
`center`	None	Centre position (bp) for zoom mode.
`window_bp`	500000	Half-window in bp for zoom mode (±500 kb around `center`).
`out`	None	Save path. If None, shows interactively.
`figsize`	None	Figure size tuple. Defaults to (14, 4) genome-wide, (10, 2.5×n) zoom.
`sharey`	False	Share y-axis across stacked panels.
`threshold_lines`	None	List of `(y_value, linestyle, label)` for horizontal lines in p-value mode. Pass `[]` to suppress defaults.

References

iHS: Voight, B.F., Kudaravalli, S., Wen, X. and Pritchard, J.K. (2006) A map of recent positive selection in the human genome. PLOS Biology, 4, e72.
nSL: Ferrer-Admetlla, A., Liang, M., Korneliussen, T. and Nielsen, R. (2014) On detecting incomplete soft or hard selective sweeps using haplotype structure. Molecular Biology and Evolution, 31, 1275–1286.
iSAFE: Akbari, A., Vitti, J.J., Iranmehr, A., Bakhtiari, M., Sabeti, P.C., Mirarab, S. and Bafna, V. (2018) Identifying the favored mutation in a positive selective sweep. Nature Methods, 15, 183–185.
DIND: Barreiro, L.B., Henriques, R., Soares, M.J., Oliveira, J., Gasche, C., … and Quintana-Murci, L. (2009) Evolutionary dynamics of human Toll-like receptors and their different contributions to host defense. PLOS Genetics, 5, e1000562.
HapDAF-s/o, s_ratio, high_freq, low_freq: Lauterbur, M.E., Munch, K. and Enard, D. (2023) Versatile detection of diverse selective sweeps with Flex-sweep. Molecular Biology and Evolution, 40, msad139.
H12, H2/H1: Garud, N.R., Messer, P.W., Buzbas, E.O. and Petrov, D.A. (2015) Recent selective sweeps in North American Drosophila melanogaster show signatures of soft sweeps. PLOS Genetics, 11, e1005004.
h-scan: Schlamp, et al. (2016) Evaluating the performance of selection scans to detect selective sweeps in domestic dogs.
LASSI: Harris, A. and DeGiorgio. (2020) A Likelihood Approach for Uncovering Selective Sweep Signatures from haplotype Data
saltiLASSI: DeGiorgio, M. and Szpiech, Z.A. (2022) A spatially aware likelihood test to detect sweeps from haplotype distributions. PLOS Genetics, 18, e1010134.
RAiSD: Alachiotis, N. and Pavlidis, P. (2018) RAiSD detects positive selection based on multiple signatures of a selective sweep and SNP vectors. Communications Biology, 1, 79.
ω (omega): Kim, Y. and Nielsen, R. (2004) Linkage disequilibrium as a signature of selective sweeps. Genetics, 167, 1513–1524.
ZnS: Kelly, J.K. (1997) A test of neutrality based on interlocus associations. Genetics, 146, 1197–1206.
Tajima’s D: Tajima, F. (1989) Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics, 123, 585–595.
Fay-Wu H: Fay, J.C. and Wu, C.I. (2000) Hitchhiking under positive Darwinian selection. Genetics, 155, 1405–1413.
Zeng E: Zeng, K., Fu, Y.X., Shi, S. and Wu, C.I. (2006) Statistical tests for detecting positive selection by utilizing high-frequency variants. Genetics, 174, 1431–1439.
Fu-Li D, F: Fu, Y.X. and Li, W.H. (1993) Statistical tests of neutrality of mutations. Genetics, 133, 693–709.
Beta (balancing selection): Siewert, K.M. and Voight, B.F. (2020) BetaScan2: Standardized statistics to detect balancing selection utilizing substitution data. Genome Biology and Evolution, 12, evaa013.
NCD1: Bitarello, B.D., de Filippo, C., Teixeira, J.C., Schmidt, J.M., Kleinert, P., Meyer, D. and Andrés, A.M. (2018) Signatures of long-term balancing selection in human genomes. The American Journal of Human Genetics, 102, 725–742.
Outlier approach: Akey, J.M. (2009) Constructing genomic maps of positive selection in humans: where do we go from here? Genome Research, 19, 711–722.