Outlier scan

flexsweep scan is a standalone positive-selection scan independent of the CNN/DANN pipeline. It computes one or more selection statistics from VCF data and assigns each locus an empirical p-value based on the genome-wide distribution of that statistic. No neutral simulations are required.

How it works

Each statistic is computed at its natural resolution — per-SNP statistics (iHS, nSL, etc.) produce one value per polymorphic site; sliding-window statistics (H12, LASSI, etc.) produce one value per window. Once all contigs are processed, raw values from the entire genome are pooled and ranked together to produce genome-wide empirical p-values.

Empirical p-values

For each statistic an empirical p-value is assigned following the empirical outlier approach (Akey 2009):

\[p_i = \frac{\mathrm{rank}(-x_i)}{N_{\mathrm{valid}}}\]

where rank is computed on the negative value so that the largest statistic gets the smallest p-value (rank 1 → p ≈ 0, outlier). \(N_{\mathrm{valid}}\) is the count of non-NaN loci. NaN values are excluded from the ranking and do not contribute to \(N_{\mathrm{valid}}\).

For signed statistics (iHS, nSL, Tajima’s D, Fay-Wu H, Zeng E), ranking is done on \(|x_i|\) before convert to negative, so that extreme values at both tails are flagged as outliers.

The output column is named {stat}_pvalue. A value close to 0 means the locus is among the most extreme in the genome for that statistic.

Note

The empirical p-value is not an analytical p-value. It reflects the position of a locus within this genome’s distribution; it does not correspond to a controlled false-positive rate.

Available statistics

Per-SNP statistics

One score per polymorphic site. No sliding window.

Key

Rank column

Description

ihs

ihs

Integrated haplotype score (Voight et al. 2006). Detects incomplete hard sweeps via extended haplotype homozygosity. Normalized within DAF bins. Configurable: min_maf (0.05), include_edges (False), gap_scale (20000), max_gap (200000).

nsl

nsl

Number of segregating sites by length (Ferrer-Admetlla et al. 2014). Robust alternative to iHS; no genetic map required. Normalized within DAF bins. Configurable: min_maf (0.05).

isafe

isafe

Identifying the favored allele in a sweep (Akbari et al. 2018). Pinpoints the causal mutation within a detected sweep region. Runs on non-overlapping regions. Configurable: region_size_bp (1000000), isafe_window (300), isafe_step (150), top_k (1), max_rank (15).

dind

dind

Derived intra-allelic nucleotide diversity ratio (Barreiro et al. 2009). Configurable: window_size (50000), min_focal_freq (0.25), max_focal_freq (0.95).

high_freq

high_freq

Frequency of high-frequency derived variants in a focal window (Lauterbur et al. 2023). Configurable: window_size (50000), min_focal_freq (0.25), max_focal_freq (0.95).

low_freq

low_freq

Frequency of low-frequency derived variants in a focal window (Lauterbur et al. 2023). Configurable: window_size (50000), min_focal_freq (0.25), max_focal_freq (0.95).

s_ratio

s_ratio

Ratio of segregating sites on derived vs. ancestral haplotypes (Lauterbur et al. 2023). Configurable: window_size (50000), min_focal_freq (0.25), max_focal_freq (0.95).

hapdaf_o

hapdaf_o

Haplotype-derived allele frequency, other background (Lauterbur et al. 2023). Configurable: window_size (50000), min_focal_freq (0.25), max_focal_freq (0.95), max_ancest_freq (0.25), min_tot_freq (0.25).

hapdaf_s

hapdaf_s

Haplotype-derived allele frequency, sweep background (Lauterbur et al. 2023). Stricter ancestral-frequency thresholds than hapdaf_o. Configurable: window_size (50000), min_focal_freq (0.25), max_focal_freq (0.95), max_ancest_freq (0.10), min_tot_freq (0.10).

hscan

hscan

Average pairwise haplotype homozygosity tract length H(x) (Messer 2015). Measures the mean shared haplotype block length across all sample pairs; detects hard and soft sweeps. Configurable: max_gap (200000), dist_mode (0), hscan_step (1).

(Use hscan_step (not step) to control scan resolution.

step is a shared SNP-window parameter and is not used by hscan)

Sliding SNP-window statistics

One score per window of a fixed number of SNPs.

Key

Rank column

Default window

Default step

Description

haf

haf

201 SNPs

10 SNPs

Haplotype allele frequency (Ronen et al. 2015). Mean pairwise haplotype similarity across a SNP window. Configurable via shared w_size and step.

h12

h12

200 SNPs

10 SNPs

H12 haplotype homozygosity (Garud et al. 2015). Combines the two most common haplotype frequencies. Configurable via shared w_size and step.

garud

h12

200 SNPs

10 SNPs

Full Garud statistics: H1, H12, H2/H1 (Garud et al. 2015). Configurable via shared w_size and step.

lassi

T_m

201 SNPs

10 SNPs

Composite likelihood sweep scan using the haplotype frequency spectrum (DeGiorgio et al. 2014). Configurable: K_truncation (10), sweep_mode (4), and shared w_size, step.

lassip

Lambda

201 SNPs

10 SNPs

Spatially-aware saltiLASSI (DeGiorgio & Szpiech 2022). Configurable: K_truncation (10), sweep_mode (4), max_extend (100000), n_A (100), and shared w_size, step.

raisd

mu_total

50 SNPs

1 SNP

RAiSD μ composite statistic combining SFS, SNP density variation, and LD (Alachiotis & Pavlidis 2018). Configurable: window_size (50). Additional output columns: mu_var, mu_sfs, mu_ld.

Sliding bp-window statistics

One score per physical window. All configurable via shared w_size_bp (default 1 Mb) and step_bp (default 10 kb), except omega, zns, beta, and ncd which have their own narrower defaults.

Key

Rank column

Default window

Default step

Description

tajima_d

tajima_d

1 Mb

10 kb

Tajima’s D (Tajima 1989). SFS-based test; negative values signal directional selection. Signed stat — ranked by abs(value).

pi

pi

1 Mb

10 kb

Nucleotide diversity θπ.

theta_w

theta_w

1 Mb

10 kb

Watterson’s θW.

fay_wu_h

fay_wu_h

1 Mb

10 kb

Fay & Wu’s H (Fay & Wu 2000). Sensitive to high-frequency derived alleles. Signed stat — ranked by abs(value).

zeng_e

zeng_e

1 Mb

10 kb

Zeng’s E (Zeng et al. 2006). Signed stat — ranked by abs(value).

achaz_y

achaz_y

1 Mb

10 kb

Achaz Y (Achaz 2009). Robust to sequencing errors.

fuli_d

fuli_d

1 Mb

10 kb

Fu & Li’s D (Fu & Li 1993).

fuli_d_star

fuli_d_star

1 Mb

10 kb

Fu & Li’s D* (no outgroup required).

fuli_f

fuli_f

1 Mb

10 kb

Fu & Li’s F (Fu & Li 1993).

fuli_f_star

fuli_f_star

1 Mb

10 kb

Fu & Li’s F* (no outgroup required).

neutrality

tajima_d

1 Mb

10 kb

Composite: Tajima’s D, π, θW, Fay-Wu H in one pass. Ranked by abs(tajima_d).

omega

omega_max

100 kb

10 kb

Kim & Nielsen’s ω (Kim & Nielsen 2004). LD patterns around a putative sweep centre.

zns

zns

100 kb

10 kb

Kelly’s ZnS (Kelly 1997). Mean pairwise r² across all SNP pairs in a window.

beta

beta1

50 kb

5 kb

Beta1 statistic for balancing selection (Siewert & Voight 2017). Configurable: m (0.1).

ncd

ncd1

3 kb

1.5 kb

NCD1 for balancing selection (Bitarello et al. 2018). Configurable: tf (0.5), w (3000), minIS (2).

Window mode

SNP-count windows are required for H12, LASSI, saltiLASSI, and RAiSD — physical windows confound SNP density with haplotype diversity for those statistics. SFS-based statistics (Tajima’s D, Fay-Wu H, etc.) use physical bp windows by default.

With the default window_mode="auto", each statistic uses its built-in mode. Pass window_mode="snp" or window_mode="bp" to force a uniform mode across all window statistics.

Normalization

Statistics sensitive to allele frequency (iHS, nSL, DIND, high_freq, low_freq, s_ratio, hapdaf_o, hapdaf_s) are z-scored within genome-wide DAF bins before p-values are computed. This removes the frequency-dependent bias that would otherwise cause high-frequency SNPs to dominate outlier lists.

DAF-only normalization (default):

  1. Compute 50 equal-frequency DAF bin edges over the genome-wide DAF distribution.

  2. Assign each SNP to a bin.

  3. Within each bin: z-score = (value − mean) / std. Bins with fewer than 2 SNPs are left as NaN.

Joint DAF × recombination rate normalization (when recombination_map and n_r_bins are both set):

The genome is additionally stratified by recombination rate. Each SNP is assigned to a (DAF bin, recomb rate bin) cell and z-scored within that cell. This further reduces false positives in low-recombination regions (Johnson et al. approach). To enable it, pass both recombination_map and n_r_bins (typically n_r_bins=10):

results = scan(
    "data/vcf/",
    "results/YRI",
    stats=["ihs", "nsl"],
    recombination_map="data/decode_sexavg_2019.txt.gz",
    n_daf_bins=50,
    n_r_bins=10,
)

Note

Passing only recombination_map without n_r_bins uses DAF-only normalization with genetic-distance windows for T3 stats (dind, hapdaf, s_ratio). Set n_r_bins explicitly to enable joint normalization.

Multi-contig usage

Pass a directory to --vcf_path to process all *.vcf.gz / *.bcf.gz files. The scan uses a two-step approach:

1. Each contig is processed independently; raw unranked values are produced for every requested statistic.

2. Results from all contigs are concatenated per statistic. Normalization and empirical p-values are computed across all contigs together, ensuring p-values reflect the true genome-wide distribution.

flexsweep scan \
    --vcf_path data/vcf/ \
    --out_prefix results/YRI \
    --stats ihs,nsl,h12,lassip \
    --recombination_map data/decode_sexavg_2019.txt.gz \
    --nthreads 4

CLI reference

flexsweep scan [OPTIONS]

Option

Default

Description

--vcf_path PATH

required

Directory of *.vcf.gz files (one per chromosome/contig).

--out_prefix PREFIX

required

Output prefix. Writes {PREFIX}.{stat}.txt for each stat.

--stats LIST

required

Comma-separated stat keys, e.g. ihs,nsl,h12,lassip.

--w_size INT

201

SNP-count window size for SNP-mode stats.

--step INT

10

SNP step size for SNP-mode stats.

--w_size_bp INT

1000000

Physical window size (bp) for bp-mode stats.

--step_bp INT

10000

Physical step size (bp) for bp-mode stats.

--window_mode

auto

auto uses per-stat defaults; snp forces SNP-count windows; bp forces physical bp windows for all window stats.

--min_maf FLOAT

0.05

Minimum minor allele frequency for iHS and nSL.

--window_size INT

50000

Focal window size (bp) for per-SNP stats (dind, hapdaf_o, hapdaf_s, s_ratio, high_freq, low_freq).

--recombination_map PATH

None

TSV recombination map (chr, start, end, cm_mb, cm). Enables genetic-distance windows for T3 stats.

--n_daf_bins INT

50

Number of equal-frequency DAF bins for normalization.

--n_r_bins INT

None

Number of recombination rate bins for joint DAF × recomb normalization. Set to 10 to match Johnson et al. Requires --recombination_map.

--max_extend FLOAT

100000

saltiLASSI spatial decay cutoff in bp.

--K_truncation INT

10

K truncation for LASSI/saltiLASSI (number of HFS classes).

--sweep_mode INT

4

Sweep spectral model for LASSI/saltiLASSI (1–5; 4 = Gaussian decay).

--raisd_window INT

50

SNP window size for RAiSD.

--nthreads INT

1

Number of parallel workers.

Examples:

# iHS + nSL on a single chromosome
flexsweep scan \
    --vcf_path YRI.chr22.vcf.gz \
    --out_prefix results/YRI.chr22 \
    --stats ihs,nsl

# H12 + saltiLASSI + RAiSD with custom SNP window
flexsweep scan \
    --vcf_path YRI.chr22.vcf.gz \
    --out_prefix results/YRI.chr22 \
    --stats h12,lassip,raisd \
    --w_size 400 \
    --nthreads 4

# SFS statistics using physical bp windows
flexsweep scan \
    --vcf_path YRI.chr22.vcf.gz \
    --out_prefix results/YRI.chr22 \
    --stats tajima_d,fay_wu_h,zeng_e,omega

# DIND + HapDAF with a larger focal window
flexsweep scan \
    --vcf_path YRI.chr22.vcf.gz \
    --out_prefix results/YRI.chr22 \
    --stats dind,hapdaf_o,hapdaf_s \
    --window_size 100000 \
    --recombination_map data/decode_sexavg_2019.txt.gz

# Genome-wide scan — joint DAF × recomb normalization
flexsweep scan \
    --vcf_path data/vcf/ \
    --out_prefix results/YRI \
    --stats ihs,nsl,h12,lassip \
    --recombination_map data/decode_sexavg_2019.txt.gz \
    --n_r_bins 10 \
    --nthreads 4

Python API

from flexsweep.scan import scan, available_stats, stat_params

# List all available stat keys
print(available_stats())

# Inspect default parameters for all stats
stat_params()

# Inspect a single stat — shows rank_col, resolution, window_mode,
# default_window, default_step, shared_params, and stat_params
stat_params("raisd")
# {'raisd': {'rank_col': 'mu_total', 'resolution': 'window',
#            'window_mode': 'snp', 'default_window': '50 SNPs',
#            'default_step': '10 SNPs',
#            'stat_params': {'window_size': 50}, ...}}

stat_params("hscan")
# {'hscan': {'rank_col': 'hscan', 'resolution': 'snp',
#            'window_mode': 'n/a (per-SNP stat)',
#            'stat_params': {'max_gap': 200000, 'dist_mode': 0,
#                            'hscan_step': 1}, ...}}

# Basic scan
results = scan(
    "data/vcf/",
    "results/YRI",
    stats=["ihs", "nsl", "h12", "lassip"],
    min_maf=0.05,
    recombination_map="data/decode_sexavg_2019.txt.gz",
    nthreads=4,
)
# results["ihs"]    → Polars DataFrame, SNP resolution, ihs_pvalue column
# results["lassip"] → Polars DataFrame, window resolution, Lambda_pvalue column

# Joint DAF × recomb normalization
results = scan(
    "data/vcf/",
    "results/YRI",
    stats=["ihs", "nsl"],
    recombination_map="data/decode_sexavg_2019.txt.gz",
    n_daf_bins=50,
    n_r_bins=10,
    nthreads=4,
)

# Per-stat parameter overrides via config dict
results = scan(
    "data/vcf/",
    "results/YRI",
    stats=["lassip", "raisd", "hscan"],
    config={
        "lassip": {"max_extend": 5e4, "K_truncation": 15},
        "raisd":  {"window_size": 100},
        "hscan":  {"hscan_step": 5, "max_gap": 100_000},
    },
    nthreads=4,
)

Output format

Each statistic writes one tab-separated file {out_prefix}.{stat}.txt:

Column

Description

chrom

Chromosome/contig name from VCF.

pos

Genomic position (bp). SNP stats: SNP position. Window stats: centre of window.

daf

Derived allele frequency (iHS, nSL, DIND, HapDAF, s_ratio, iSAFE).

n_snps

Number of SNPs in window (window stats only).

{stat_col}

Main statistic value (after DAF normalization for iHS/nSL/dind/hapdaf/ s_ratio).

{stat_col}_pvalue

Empirical p-value: rank(−value) / N_valid. Range (0, 1]; smaller = more extreme. Signed stats (iHS, nSL, Tajima’s D, Fay-Wu H, Zeng E) are ranked by abs(value).

Additional columns vary by statistic (e.g., h2_h1 for garud, m and A for lassip, mu_var, mu_sfs, mu_ld for raisd).

Scan results visualization

Use plot_scan from flexsweep.utils to generate Manhattan-style or regional zoom plots directly from scan() output or saved TSV files.

Genome-wide plot:

from flexsweep.utils import plot_scan

# When passing a scan() dict, stat_cols is resolved automatically
# from STAT_REGISTRY (e.g. "raisd" → "mu_total", "lassi" → "T_m").
# No need to specify stat_cols.

# Single statistic — raw values, top 1% highlighted
plot_scan({"ihs": results["ihs"]}, out="results/YRI.ihs.png")

# Single statistic — empirical p-values (-log10 scale)
plot_scan({"ihs": results["ihs"]}, pvalue=True, out="results/YRI.ihs.png")

# Stacked multi-statistic panels — rank columns resolved automatically
plot_scan(
    {k: results[k] for k in ["ihs", "h12", "lassip", "raisd"]},
    pvalue=True,
    out="results/YRI.multi.png",
)
# Plots: ihs, h12, Lambda (lassip), mu_total (raisd)

Regional zoom plot (raw + p-value side by side):

plot_scan(
    {k: results[k] for k in ["ihs", "lassip"]},
    chrom="22",
    center=17_000_000,
    window_bp=500_000,
    out="results/YRI.zoom.png",
)

From saved TSV files (stat_cols must be provided explicitly):

plot_scan(
    ["results/YRI.ihs.txt", "results/YRI.lassip.txt"],
    stat_cols=["ihs", "Lambda"],   # must match column name in file
    pvalue=True,
    out="results/YRI.multi.png",
)

plot_scan parameters:

Parameter

Default

Description

stats

required

dict from scan(), a single TSV path, or a list of TSV paths.

stat_cols

None

Stat column name(s). Required when stats is a file path / list. Defaults to all keys when stats is a dict.

pvalue

False

If True, plot \(-\log_{10}(p_{\mathrm{emp}})\) with threshold lines at p = 0.01 and p = 0.001. If False, plot raw values with outliers highlighted.

top_pct

0.01

Fraction of loci highlighted as outliers in raw mode.

chrom

None

Chromosome for zoom mode. Provide together with center.

center

None

Centre position (bp) for zoom mode.

window_bp

500000

Half-window in bp for zoom mode (±500 kb around center).

out

None

Save path. If None, shows interactively.

figsize

None

Figure size tuple. Defaults to (14, 4) genome-wide, (10, 2.5×n) zoom.

sharey

False

Share y-axis across stacked panels.

threshold_lines

None

List of (y_value, linestyle, label) for horizontal lines in p-value mode. Pass [] to suppress defaults.

References

iHS

Voight, B.F., Kudaravalli, S., Wen, X. and Pritchard, J.K. (2006) A map of recent positive selection in the human genome. PLOS Biology, 4, e72.

nSL

Ferrer-Admetlla, A., Liang, M., Korneliussen, T. and Nielsen, R. (2014) On detecting incomplete soft or hard selective sweeps using haplotype structure. Molecular Biology and Evolution, 31, 1275–1286.

iSAFE

Akbari, A., Vitti, J.J., Iranmehr, A., Bakhtiari, M., Sabeti, P.C., Mirarab, S. and Bafna, V. (2018) Identifying the favored mutation in a positive selective sweep. Nature Methods, 15, 183–185.

DIND

Barreiro, L.B., Henriques, R., Soares, M.J., Oliveira, J., Gasche, C., … and Quintana-Murci, L. (2009) Evolutionary dynamics of human Toll-like receptors and their different contributions to host defense. PLOS Genetics, 5, e1000562.

HapDAF-s/o, s_ratio, high_freq, low_freq

Lauterbur, M.E., Munch, K. and Enard, D. (2023) Versatile detection of diverse selective sweeps with Flex-sweep. Molecular Biology and Evolution, 40, msad139.

H12, H2/H1

Garud, N.R., Messer, P.W., Buzbas, E.O. and Petrov, D.A. (2015) Recent selective sweeps in North American Drosophila melanogaster show signatures of soft sweeps. PLOS Genetics, 11, e1005004.

h-scan

Schlamp, et al. (2016) Evaluating the performance of selection scans to detect selective sweeps in domestic dogs.

LASSI

Harris, A. and DeGiorgio. (2020) A Likelihood Approach for Uncovering Selective Sweep Signatures from haplotype Data

saltiLASSI

DeGiorgio, M. and Szpiech, Z.A. (2022) A spatially aware likelihood test to detect sweeps from haplotype distributions. PLOS Genetics, 18, e1010134.

RAiSD

Alachiotis, N. and Pavlidis, P. (2018) RAiSD detects positive selection based on multiple signatures of a selective sweep and SNP vectors. Communications Biology, 1, 79.

ω (omega)

Kim, Y. and Nielsen, R. (2004) Linkage disequilibrium as a signature of selective sweeps. Genetics, 167, 1513–1524.

ZnS

Kelly, J.K. (1997) A test of neutrality based on interlocus associations. Genetics, 146, 1197–1206.

Tajima’s D

Tajima, F. (1989) Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics, 123, 585–595.

Fay-Wu H

Fay, J.C. and Wu, C.I. (2000) Hitchhiking under positive Darwinian selection. Genetics, 155, 1405–1413.

Zeng E

Zeng, K., Fu, Y.X., Shi, S. and Wu, C.I. (2006) Statistical tests for detecting positive selection by utilizing high-frequency variants. Genetics, 174, 1431–1439.

Fu-Li D, F

Fu, Y.X. and Li, W.H. (1993) Statistical tests of neutrality of mutations. Genetics, 133, 693–709.

Beta (balancing selection)

Siewert, K.M. and Voight, B.F. (2020) BetaScan2: Standardized statistics to detect balancing selection utilizing substitution data. Genome Biology and Evolution, 12, evaa013.

NCD1

Bitarello, B.D., de Filippo, C., Teixeira, J.C., Schmidt, J.M., Kleinert, P., Meyer, D. and Andrés, A.M. (2018) Signatures of long-term balancing selection in human genomes. The American Journal of Human Genetics, 102, 725–742.

Outlier approach

Akey, J.M. (2009) Constructing genomic maps of positive selection in humans: where do we go from here? Genome Research, 19, 711–722.