API Reference

Simulation 

class flexsweep.Simulator(sample_size, demes, output_folder, mutation_rate={'dist': 'uniform', 'max': 2e-08, 'min': 5e-09}, recombination_rate={'dist': 'exponential', 'max': 1e-07, 'mean': 1e-08, 'min': 1e-09}, locus_length=1200000, discoal_path='/home/docs/checkouts/readthedocs.org/user_builds/flexsweep/checkouts/latest/flexsweep/data/discoal', num_simulations=250000, saf=[0, 0.1], eaf=[0.5, 1], s=[0.01, 0.05], time=[0, 5000], nthreads=1, ne=10000, fixed_ratio=0.1, split=False)[source]

Bases: object

Initialize discoal coalescent simulations.

The simulator draws per-replicate mutation (\(\mu\)) and recombination (\(r\)) rates, scales them to \(\theta = 4N_eL\mu\) and \(\rho = 4N_eLr\), and prepares neutral and sweep configurations to be executed via the discoal binary.

check_inputs()[source]

Validate inputs and prepare output directories.

Ensures mutation and recombination distributions are dictionaries, creates subdirectories for sweep and neutral simulations, and reads demography from the provided demes YAML.

Returns:: str – Discoal demographic flags string (e.g., “ -en <t> 0 <size> …”).

create_params()[source]

Generate and save parameter sets for neutral and sweep simulations into a Polars DataFrame

Steps:

Draw μ and ρ, compute θ = 4NeLμ and ρ = 4NeLr.
Sample sweep times (uniform between time[0], time[1]).
Sample selection coefficients between s[0] and s[1].
Partition simulations into hard/soft and complete/incomplete sweeps.
Save all parameters to ‘<output_folder>/params.txt.gz’.

Returns:

pl.DataFrame –

Parameters with columns: [‘iter’, ‘mu’, ‘r’, ‘eaf’, ‘saf’, ‘s’, ‘t’, ‘model’].

neutral(theta, rho, discoal_demes, _iter=1)[source]

Run a single neutral simulation.

Parameters:

theta (float) – 4NeLμ.
rho (float) – 4NeLr.
discoal_demes (str) – Demography string (from read_demes).
_iter (int, default=1) – Iteration index.

Returns:

str – Path to output gzipped ms file.

random_distribution(num)[source]

Draw mutation (μ) and recombination (ρ) rates.

Parameters:: num (int) – Number of draws.
Returns:: tuple[np.ndarray, np.ndarray] – Arrays (mu, rho) each of length num.

Supported distributions:

‘uniform’: Uniform[min, max]
‘exponential’: Exponential(mean), truncated to [min, max]
‘truncnorm’: Truncated normal with mean/std/min/max
‘fixed’: Placeholder only (currently not implemented)

Notes

Values are in per-basepair units.

read_demes()[source]

Parse the demes YAML and build the discoal -en demography string.

Behavior:

Loads the first deme and reverses epochs (oldest first).
Sets self.ne_0 to initial population size.
Converts epoch end times to 4Ne units and start sizes to Ne ratios.
Produces strings of the form ‘ -en <time> 0 <size>’.

Returns:: str – Concatenated discoal ‘-en’ flags

simulate()[source]

Run neutral and sweep simulations via discoal.

Returns:: dict[str, list[str]] – Paths to gzipped ms files, with keys ‘neutral’ and ‘sweep’.

simulate_batch()[source]

Run neutral and sweep simulations via discoal, in batches.

Returns:: dict[str, list[str]] – Paths to gzipped ms files, with keys ‘neutral’ and ‘sweep’.

sweep(theta, rho, f_t, f_i, t, s, discoal_demes, _iter=1)[source]

Run a single sweep simulation

Parameters:

theta (float) – 4NeLμ.
rho (float) – 4NeLr.
f_t (float, default=1) – Terminal frequency (completed sweep if =1).
f_i (float, default=0) – Initial frequency (0 = hard sweep).
t (float) – Sweep onset time in 4Ne generations.
s (float) – Selection coefficient scaled by 2Ne.
discoal_demes (str) – Demography string.
_iter (int, default=1) – Iteration index.

Returns:

str – Path to output gzipped ms file.

Data I/O 

class flexsweep.data.Data(data, region=None, samples=None, recombination_map=None, window_size=1200000, step=10000, nthreads=1, sequence_length=1200000, in_memory=False)[source]

Bases: object

genome_reader(region, samples)[source]

Read a genomic region from a VCF/BCF and return haplotypes and rec_map given the region interval.

Parameters:

region (str) – Region string ‘CHR:START-END’.
samples (list[str] | np.ndarray) – Sample subset to read.

Returns:

dict[str, tuple[np.ndarray, np.ndarray] | None] – {region: (hap, rec_map_subset)} or {region: None} if no biallelic sites. - hap: (S x N) int8 array of phased haplotypes (derived-allele indicators). - rec_map_subset: (S x 4) array [chrom, idx, genetic_pos(cm or proxy), physical_pos].

Notes

Filters to biallelic variants via scikit-allel’s AlleleCounts.is_biallelic_01.
If no recombination map is supplied, genetic distance is set to physical pos and later remapped to a relative [1..sequence_length] coordinate system.

get_cm(df_rec_map, positions)[source]

Interpolate genetic distances (cM) for given physical positions.

Parameters:

df_rec_map (pl.DataFrame) –

Recombination map with numeric columns where:
columns[1] holds physical coordinate (bp) and columns[-1] holds cumulative genetic distance (cM).
positions (np.ndarray[int]) – Physical positions to interpolate.

Returns:

np.ndarray[float] – Interpolated cumulative cM values (negative values clamped to 0).

Notes

Uses linear interpolation with extrapolation at ends.

read_simulations()[source]

Load paths (or parsed data if in_memory=True) for discoal simulations.

Expects a directory structure:: <self.data>/neutral/.ms.gz <self.data>/sweep/.ms.gz <self.data>/params.txt.gz

Returns:

tuple[OrderedDict, pl.DataFrame] – sims: OrderedDict({‘neutral’: <list|array>, ‘sweep’: <list|array>}) params: polars DataFrame with columns [‘model’, ‘s’, ‘t’, ‘saf’, ‘eaf’].

Raises:

AssertionError – If self.data is not a string.
ValueError – If required ‘neutral’ or ‘sweep’ subfolders are missing or empty.

read_vcf()[source]

Slide windows across a single-contig VCF/BCF and collect per-window data haplotypes

Returns:

dict | OrderedDict –

If in_memory=True:: {‘sweep’: list[[hap, rec_map, zeros(4)]], ‘region’: list[str]} where each element corresponds to a window with data.
If in_memory=False:: OrderedDict({‘sweep’: <vcf_path>, ‘region’: list[str]}) with region strings to be used downstream for lazy reading.

Raises:

IOError – If the file extension is not one of [.vcf, .vcf.gz, .bcf, .bcf.gz].
FileNotFoundError – If the VCF/BCF is not tabix-indexed (<path>.tbi is missing).

Notes

Determines contig name/length by reading last line of the vcf file.
Generates overlapping windows of size window_size and step step.
Uses multiprocessing when in_memory=True to read per-window haplotypes.

Feature vectors 

flexsweep.fv.summary_statistics(data_dir, vcf=False, nthreads=1, windows=[100000], step=100000.0, step_vcf=10000.0, locus_length=1200000, recombination_map=None, r_bins=None, min_rate=0.0, suffix=None, func=None, save_stats=False, only_normalize=False, stats=None)[source]

Compute summary statistics to create needed feature vectors for CNN training/prediction from either simulated or VCF/BCF input data.

This function dispatches automatically to the appropriate backend depending on the vcf flag. When vcf=False (default) it processes mdiscoal simulations; when vcf=True it processes VCF/BCF files.

Parameters:

data_dir (str) – Input directory. When vcf=False, the root folder of simulation outputs containing subdirectories neutral/, sweep/, and params.txt.gz. When vcf=True, a directory of bgzipped VCF/BCF files.
vcf (bool) – Whether to use the VCF/BCF processing pipeline (True) or the simulation pipeline (False). Default: False.
nthreads (int) – Number of threads. Default: 1.
windows (list) – List of window sizes (in base pairs) to compute summary statistics. Supports multi-scale: e.g. [50000, 100000, 500000]. Default: [100000].
step (int) – Step size (in base pairs) for sliding windows. Together with locus_length, determines the center positions: centers = range(step//2, locus_length - step//2 + step, step). Default: 1e5.
locus_length (int) – Total locus length in base pairs. Default: 1200000.
recombination_map (str) – Optional path to a recombination map. If None, physical distances are used as a proxy for genetic distances. Default: None.
suffix (str) – Optional suffix appended to output file names. Default: None.
func (callable) – Function used internally for computing summary statistics per replicate or genomic window. See calculate_stats_simulations or calculate_stats_vcf_flat

Returns:

Feature-vector DataFrame combining all computed summary statistics. Downstream pipelines may also write Parquet and normalization artifacts (e.g., fvs.parquet, empirical_bins.pickle).

Return type:

polars.DataFrame

Raises:

FileNotFoundError – If input files or directories are missing.
ValueError – If an input file or directory is malformed or inconsistent.

Notes:

Internally dispatches to _process_vcf() or _process_sims() depending on the vcf flag.
Center positions are derived automatically from locus_length and step as [step//2, locus_length - step//2]. For each center, stats are computed at every size in windows, producing n_centers × len(windows) rows per replicate.
Supports automatic normalization of features using empirical bins.
Designed for use as a top-level wrapper in the Flexsweep feature-vector generation pipeline.

Examples:

From simulated data (discoal/ms format):

>>> df = summary_statistics("./simulations", nthreads=8)

Multi-scale windows:

>>> df = summary_statistics("./simulations", windows=[50000, 100000, 500000], nthreads=8)

From VCF data with recombination map:

>>> df = summary_statistics(
...     "./vcf_data",
...     vcf=True,
...     recombination_map="recomb_map.csv",
...     nthreads=8
... )

flexsweep.fv._process_sims(data_dir, nthreads, windows=None, step=100000.0, r_bins=None, suffix=None, func=None, save_stats=False, locus_length=1200000, stats=None)[source]

Process ms files from simulation to compute, normalize, and estimate feature vectors.

It scans a directory for simulation outputs (neutral and sweep), computes summary statistics for each replicate, normalizes the results between classes, and exports Parquet feature vectors to traing the CNN.

Parameters:

data_dir (str) – Directory containing simulation output files (neutral and sweep). Expected substructure and file naming conventions follow the Flexsweep.Simulator class output (e.g., data_dir/sweeps/ and data_dir/neutral/).
nthreads (int) – Number of threads.
windows (list) – List of window sizes (in base pairs) to compute summary statistics.
step (int) – Step size (in base pairs) for sliding windows.
suffix (str) – Optional suffix appended to output file names.
func (callable) – Function to estimate summary statistics. See calculate_stats_simulations.

Returns:

A pair of Polars DataFrames:

df_pred: normalized feature vectors for CNN training.
df_pred_raw: raw feature vectors.

Return type:

tuple[polars.DataFrame, polars.DataFrame]

Raises:

FileNotFoundError – If no simulation files are found under data_dir.
ValueError – If simulation data are malformed or incompatible with the expected format.

Notes:

Writes the following files to data_dir:
- fvs{suffix}.parquet – normalized feature vectors.
- fvs_raw{suffix}.parquet – raw feature vectors.
- empirical_bins{suffix}.pickle – empirical normalization bins.
Neutral and sweep simulations are processed jointly to derive shared empirical normalization bins.
Designed for Flexsweep.Simulators class simulations folders but can handle compatible structures with proper naming.

flexsweep.fv._process_vcf(data_dir, nthreads, windows=None, step=100000.0, step_vcf=10000, locus_length=1200000, recombination_map=None, r_bins=None, min_rate=None, suffix=None, func=None, save_stats=False, stats=None)[source]

flexsweep.fv.calculate_stats_simulations(hap_data, _iter=1, center=None, windows=[100000], step=100000.0, locus_length=1200000, stats=None)[source]

flexsweep.fv.calculate_stats_vcf_flat(vcf_file, region, center=None, windows=[100000], step=100000.0, _iter=1, recombination_map=None, nthreads=1, locus_length=1200000, stats=None, stat_func=None, stat_cols=None, compute_snp_stats=True, parallel_manager=None, isafe_region_size=2000000)[source]

Compute per-locus-window summary statistics from a VCF with O(N) window work.

Instead of computing stats for every (locus_window × center) pair — which would recompute each overlapping sub-window up to numSubWins times — this function:

Enumerates the unique set of absolute (center, window_size) positions across all locus windows.
Dispatches one joblib task per unique pair (via batch_windowed_stats_flat).
iSAFE is fully supported via non-overlapping region tiling. It divides the chromosome into non-overlapping regions of isafe_region_size bp (default 2 Mb) and runs run_isafe once per region. SNP scores are absolute-position keyed so they join correctly to any locus window that contains them.
Assembles output rows by looking up cached results per locus window.

Parameters:: isafe_region_size (int) – Size in bp of non-overlapping regions used to compute iSAFE. Valid range 1e6–5e6. Default 2e6.

flexsweep.fv.normalize_neutral(d_stats_neutral, vcf=False, df_r_bins=None)[source]

Calculates the expected mean and standard deviation of summary statistics from neutral simulations, used for normalization in downstream analyses.

This function processes a DataFrame of neutral simulation statistics, bins the values based on frequency, and computes the mean (expected) and standard deviation for each bin. These statistics are used as a baseline to normalize sweep or neutral simulations

Parameters:

df_stats_neutral (list or pandas.DataFrame) – A list or concatenated pandas DataFrame containing the neutral simulation statistics. The DataFrame should contain frequency data and various summary statistics, including H12 and HAF, across multiple windows and bins.

Returns:

expected (pandas.DataFrame) – A DataFrame containing the mean (expected) values of the summary statistics for each frequency bin. The frequency bins are the index, and the columns are the summary statistics.
stdev (pandas.DataFrame) – A DataFrame containing the standard deviation of the summary statistics for each frequency bin. The frequency bins are the index, and the columns are the summary statistics.

Notes

The function first concatenates the neutral statistics, if provided as a list, and bins the values by frequency using the bin_values function.
It computes both the mean and standard deviation for each frequency bin, which can later be used to normalize observed statistics (e.g., from sweeps).
The summary statistics processed exclude window-specific statistics such as “h12” and “haf.”

flexsweep.fv.normalize_stats(stats_values, bins, region=None, center=[500000.0, 700000.0], windows=[50000, 100000, 200000, 500000, 1000000], step=10000.0, parallel_manager=None, nthreads=1, vcf=False, df_r_bins=None, locus_length=1200000)[source]

flexsweep.fv.normalize_cut_raw(snps_values, bins, center=[500000.0, 700000.0], windows=[50000, 100000, 200000, 500000, 1000000], step=10000, df_r_bins=None)[source]: Sims-only refactor: - join df_r onto SNP and window tables (by iter) BEFORE binning/normalization - normalize using neutral bins conditional on (freq_bins, r_bins) when available - drop r_bins from outputs so feature dimension stays unchanged

flexsweep.fv.resolve_stats(stats)[source]

Partition a user stat list into window_cols, snp_groups, compute_isafe, and snp_cols.

Returns (window_cols, snp_groups, compute_isafe, snp_cols).

snp_cols is the set of actual DataFrame column names to keep in SNP output, or None when stats is None (default = keep all).

When stats is None, returns defaults matching the full flexsweep pipeline.

Outlier scan 

flexsweep.scan.scan(vcf_path, out_prefix, stats, config=None, w_size=201, step=10, w_size_bp=1000000, step_bp=10000, min_maf=0.05, recombination_map=None, n_daf_bins=50, n_r_bins=None, nthreads=1, window_mode='auto', **kwargs) → dict[str, DataFrame][source]

Standalone outlier scan from a directory of per-chromosome VCF files.

Uses a global task pool across ALL VCF files simultaneously: all chromosomes are pre-loaded, then one Parallel(n_jobs=nthreads) call processes every stat × chromosome combination. This fully exploits nthreads regardless of how many chromosomes are present.

Parameters:

vcf_path – Directory containing *.vcf.gz (or *.bcf.gz) files, one per chromosome/contig. Must be a directory; single-file input is not supported.
out_prefix – Output file prefix. Writes {out_prefix}.{stat}.txt for each stat.
stats – List of stat keys to compute. See available_stats() for options. Per-SNP: ihs, nsl, isafe, dind, high_freq, low_freq, s_ratio, hapdaf_o, hapdaf_s. SNP-window: h12, garud, lassi, lassip, raisd. bp-window: tajima_d, pi, theta_w, fay_wu_h, zeng_e, achaz_y, fuli_f, fuli_f_star, fuli_d, fuli_d_star, neutrality, omega, beta, ncd.
config – Per-stat parameter overrides, e.g. {"raisd": {"window_size": 100}, "lassip": {"max_extend": 5e4}}. Overrides w_size, step, and any kwargs for that stat only.
w_size – SNP-count window size for SNP-mode sliding-window stats (default 201).
step – SNP step for SNP-mode sliding-window stats (default 10).
w_size_bp – Physical window size in bp for bp-mode stats (default 1 Mb).
step_bp – Physical step size in bp for bp-mode stats (default 10 kb).
min_maf – Minimum minor allele frequency for iHS and nSL (default 0.05).
recombination_map – Path to recombination map TSV (chr, start, end, cm_mb, cm). If provided: genetic distances used for T3 stat windows, and frequency-sensitive stats (iHS, nSL, dind, …) are normalized by joint (DAF × recomb_rate) bins (Johnson et al. approach).
n_daf_bins – Number of equal-frequency DAF bins for normalization (default 50).
n_r_bins – Number of equal-frequency recombination rate bins for joint (DAF × r_bins) normalization. None (default) → DAF-only normalization. Set to 10 (Johnson et al.) to enable joint normalization when recombination_map is provided.
nthreads – Total worker threads for the global task pool (default 1). Window-batchable stats split each chromosome into nthreads window batches so every thread stays busy.
window_mode – Override window mode for all sliding-window stats. “auto” (default) uses per-stat defaults from STAT_REGISTRY. “snp” forces SNP-count windows for all window stats. “bp” forces physical bp windows for all window stats.
kwargs – Shared overrides forwarded to all stats: max_extend, K_truncation, sweep_mode, raisd_window, tf, etc.

Returns:

dict[str, polars.DataFrame] – Keys are stat names; each DataFrame has a {rank_col}_pvalue column. Files written to {out_prefix}.{stat}.txt (tab-separated).

flexsweep.scan.available_stats() → list[str][source]: Return list of all available stat keys.

flexsweep.scan.stat_params(stat_key: str | None = None) → dict[source]

Return parameter documentation for scan stats.

Parameters:: stat_key (str, optional) – A specific stat key (e.g. "ihs", "hscan"). If None, returns the full table for all stats.
Returns:: dict – Keys are stat names. Each value is a dict with keys rank_col, resolution, window_mode, default_window, default_step, shared_params, and stat_params.

Notes

Shared params always injected by scan() for every stat: w_size (201), step (10), w_size_bp (1000000), step_bp (10000), min_maf (0.05), window_mode (“auto”).

Per-stat params are passed via config={"stat": {"param": value}}.

Examples

>>> from flexsweep.scan import stat_params
>>> stat_params("hscan")
>>> stat_params()   # full table

flexsweep.scan.empirical_pvalues(df: DataFrame, stat_col: str, abs_rank: bool = False) → DataFrame[source]

Empirical p-value following the empirical outlier approach (Akey 2009).

p = rank(−value, na.last=keep) / N_valid Small p → outlier/candidate (locus in extreme upper tail of the empirical distribution). NaN → null in output (excluded from ranking and from N_valid denominator).

abs_rank=True: rank by abs(value) before negating — for signed stats where large magnitude in either direction signals selection (iHS, nSL, Tajima’s D).

CNN / DANN 

class flexsweep.cnn.CNN(*args: Any, **kwargs: Any)[source]: Bases:

Enrichment 

flexsweep.enrichment.run_enrichment(sweep_files: list, gene_set: str, factors_file: str, annotation_file: str, populations: list, groups: list, thresholds: list, pop_interest: str = 'All', cluster_distance: int = 500000, n_runs: int = 10, tolerance: float = 0.05, min_distance: int = 1250000, flip: bool = False, max_rep: int = 25, nthreads: int = 1, n_shuffles: int = 8, shuffling_segs: int = 2, bootstrap_dir: str = None, distance_file: str = None) → list[source]

Run the full gene-set sweep enrichment pipeline.

Steps:

Load gene set and derive valid gene universe (factors ∩ sweep_files[0]).
Bootstrap control sets matched on confounding factors (or load pre-computed sets from bootstrap_dir).
For each sweep file: count sweep overlaps, shuffle genome for null distribution, and estimate FDR.

Parameters:

sweep_files – Paths to sweep rank files (gene_id + per-population rank columns, tab- or space-separated, optionally gzipped).
gene_set – TSV with gene_id and yes/no label columns (no header). Genes labelled yes are VIPs.
factors_file – TSV confounding factors file (gene_id + factor columns, no header).
annotation_file – BED gene coordinates file (0-based, no header).
populations – Population codes matching sweep file column order.
groups – Group label per population (same length as populations).
thresholds – Rank cutoffs for enrichment curve (e.g. [6000, …, 20]).
pop_interest – Population, group name, or 'All' for FDR scope.
cluster_distance – Max bp between genes to count as neighbours.
n_runs – Bootstrap batches.
tolerance – Allowed ± fraction deviation in factor averages for control gene matching.
min_distance – Minimum bp distance from VIPs for control eligibility.
flip – Flip test direction when len(VIPs) > len(controls) (increases power).
max_rep – Max average resamples per control gene across bootstrap sets.
nthreads – Parallel workers (joblib).
n_shuffles – FDR shuffle replicates (must be a multiple of 8).
shuffling_segs – Genes per genomic shuffle segment.
bootstrap_dir – Folder with pre-computed VIPs/ and nonVIPs/ sub-dirs. When non-empty, the bootstrap step is skipped.
distance_file – Optional pre-computed distance TSV (gene_id t distance, no header). Passed to run_bootstrap_nomatchomega to bypass internal distance computation and match Perl’s control pool exactly.

Returns:

List of FDR DataFrames, one per entry in sweep_files.

flexsweep.enrichment.run_bootstrap_nomatchomega(case_genes: list, factors_file: str, annotation_file: str, runs_number: int, tolerance: float, min_dist: float, flip: bool, max_rep: int, seed: int = None, nthreads: int = 1, control_genes: list = None, distance_file: str = None, skip_factor_idx: int = 8, offset_factor_idx: int = 27, offset_value: float = 2.0, n_batches: int = 10)[source]

Step 6: nomatchomega_fast variant of the matched bootstrap control set generator.

Implements the behaviour of bootstrap_test_script_nomatchomega_fast.pl:

Factor skip_factor_idx (default 8, omega/dN/dS) is unconstrained.
Factor offset_factor_idx (default 27, Perl column 28) gets +offset_value.
n_batches sequential internal batches per parallel call, with continuous state (used_counts, current_means, fake_seed persist across batches), matching Perl’s within-process continuous state.
Target for batch r: 100 + 11 * vip_number * r (grows each batch).
Total control sets = runs_number * n_batches * 10.

Parameters:

case_genes (list of VIP gene IDs (HLA/histone excluded by caller))
factors_file (path to confounding factors table (TSV, first col gene_id))
annotation_file (path to BED gene coordinates file (chrom start end gene_id strand). Distances to nearest VIP are computed automatically via compute_distances.)
runs_number (number of independent parallel calls (≡ Perl Runs_number))
tolerance (allowed ± fraction for confounder matching (e.g. 0.05))
min_dist (minimum distance (bp) from VIPs for control eligibility)
flip (if True, swap VIP and control pools)
max_rep (max times a control gene may be sampled on average)
nthreads (number of parallel workers)
control_genes (optional explicit candidate control gene IDs. If None, all genes in factors_file that are not case genes are used. Pass the “no”-labelled genes from genes_set_file to match Perl behaviour exactly.)
distance_file (optional path to pre-computed distance TSV (gene_id t distance, no header). When provided, distances are read directly instead of calling compute_distances(). Use Perl’s distance_genes_set_file.txt for exact control pool matching.)
skip_factor_idx (0-based factor index to leave unconstrained (default 8 = Perl factor 9, dN/dS / omega).)
offset_factor_idx (0-based factor index to apply +offset_value to (default 27 = Perl column 28).)
offset_value (value added to offset_factor_idx column (default 2.0).)
n_batches (sequential internal batches per parallel call (≡ Perl Iterations_number/10; default 10).)

Returns:

df_set (pl.DataFrame — VIP genes (gene_id column))
df_bootstrap_control (pl.DataFrame — rows = control sets, cols = gene positions (sample_id + column_0..column_N))

flexsweep.enrichment.estimate_fdr(real_results: DataFrame, shuffled_sweep_files_by_shuffle: list, vip_genes: list, control_sets: list, neighbors: DataFrame, thresholds: list, populations: list, groups: list, pop_interest: str, simplified_sweeps_by_pop_fn=None, count_sweeps: bool = False, use_clust: bool = True, nthreads: int = 1, p_cutoff: float = 0.05, max_threshold: int = None, min_threshold: int = 0, min_vip_count: int = 0) → DataFrame[source]

Step 9: Estimate FDR by comparing real results to a null distribution built from genome-shuffled sweep files (output of shuffle_genome).

Matches the behaviour of estimate_FPR.pl: for each scope (population / group / “All”), the test statistic is sum(vip_count − ctrl_mean) across rows that pass the significance and threshold filters. FDR p-value = fraction of null replicates where that statistic ≥ the real statistic.

Parameters:

real_results (pl.DataFrame — output of count_sweeps_multipop on the real data (columns: scope, threshold, vip_count, ctrl_mean, p_value))
shuffled_sweep_files_by_shuffle (list of lists — each inner list holds paths for one shuffle replicate. Layout A: one file per population [[esn_s1, gwd_s1, …], …]. Layout B: one multi-population file per replicate [[multi_s1], [multi_s2], …].)
vip_genes (list of VIP gene IDs)
control_sets (control sets from run_bootstrap)
neighbors (pl.DataFrame from build_gene_neighbors_numpy)
thresholds (list of int thresholds)
populations (ordered population list)
groups (group label per population)
pop_interest (single pop, group name, or “All”)
simplified_sweeps_by_pop_fn (callable(sweep_file, col_index) → pl.DataFrame, or callable(sweep_file) → dict[pop, pl.DataFrame] for multi-pop files. Defaults to simplify_sweeps_gz.)
use_clust (expand by neighbours before counting)
nthreads (parallel workers)
p_cutoff (p-value threshold for “significant” rows in the test statistic (Perl: cutoff, default 0.05))
max_threshold (upper bound on rank threshold included in the statistic (Perl: limit; None = no limit))
min_threshold (lower bound on rank threshold (Perl: cutoff2, default 0))
min_vip_count (minimum vip_count for a row to contribute to the statistic (Perl: enough, default 0))

Returns:

pl.DataFrame with columns (scope, p_value, n_replicates, real_stat, max_null_stat. One row per scope. real_stat = sum(vip_count − ctrl_mean) for significant rows. p_value = fraction of null replicates where null_stat ≥ real_stat.)

flexsweep.enrichment.count_sweeps_singlepop(vip_genes: list, control_sets: list, simplified_sweeps: DataFrame, neighbors: DataFrame, thresholds: list, use_clust: bool = True, count_sweeps: bool = False, neighbor_col: str = None) → DataFrame[source]

Count sweeps overlapping VIPs vs control sets for a single population.

Parameters:

vip_genes (list of VIP gene ID strings)
control_sets (list of lists of control gene IDs (one list per control set))
simplified_sweeps (pl.DataFrame from simplify_sweeps_gz (gene_id, selected_cutoff))
neighbors (pl.DataFrame from build_gene_neighbors_numpy (gene_id, neighbors))
thresholds (sorted list of int rank thresholds)
use_clust (expand gene sets by neighbours before counting (de-duplication))
count_sweeps (if True, count distinct sweep events (connected components of the neighbour graph within sweep genes) rather than individual genes.)
neighbor_col (column name for neighbours (auto-detected if None))

Returns:

pl.DataFrame with columns (threshold, vip_count, ctrl_mean, ctrl_ci_lo, ctrl_ci_hi, ratio, ci_lo_ratio, ci_hi_ratio, p_value)

flexsweep.enrichment.count_sweeps_multipop(vip_genes: list, control_sets: list, simplified_sweeps_by_pop: dict, neighbors: DataFrame, thresholds: list, populations: list, groups: list, pop_interest: str, count_sweeps: bool = False, use_clust: bool = True, nthreads: int = 1) → DataFrame[source]

Step 7: Multi-population sweep counting with group aggregation and adaptive depth.

Populations are processed in parallel. For a group or ‘All’, the sweep sets of member populations are unioned before counting (each sweep counted once per group, not once per population). Adaptive depth escalates from 100 → 1000 → all control sets depending on observed p-values.

Parameters:

vip_genes (list of VIP gene IDs)
control_sets (list of lists of control gene IDs (up to 10 000))
simplified_sweeps_by_pop (dict mapping population name → pl.DataFrame (output of simplify_sweeps_gz per pop))
neighbors (pl.DataFrame from build_gene_neighbors_numpy)
thresholds (list of int rank thresholds)
populations (ordered list of population codes)
groups (group label for each population (same length))
pop_interest (single pop name, group name, or “All”)
count_sweeps (if True count distinct sweep events (connected components of the neighbour graph within sweep genes); if False count genes in sweeps (default; matches Perl behaviour))
use_clust (expand gene sets by neighbours before counting (de-duplication; maps to Perl Count_sweeps parameter))
nthreads (parallel workers)

Returns:

pl.DataFrame with columns (scope, threshold, vip_count, ctrl_mean, ctrl_ci_lo, ctrl_ci_hi, ratio, ci_lo_ratio, ci_hi_ratio, p_value. scope is the population name, group name, or “All”.)

Ancestral polarization 

Flexsweep includes a Rust CLI (flexsweep-polarize) for annotating the ancestral allele in a VCF from a multi-species alignment (MAF). The Python wrappers below invoke the compiled binary; the binary must be built first with build_rust_polarization() or installed via conda-forge.

Workflow:

Sort the MAF by reference contig and position (run_sort_maf).
Polarize the VCF using the sorted MAF and one or more outgroup species (run_polarize). The binary streams both files together and writes a bgzipped VCF with REF/ALT swapped where the outgroup consensus supports the alternate allele as ancestral.

Polarization methods (--method):

parsimony — majority vote over outgroup alleles; no model fitting.
jc — Jukes–Cantor substitution model.
kimura — Kimura two-parameter model (default).
r6 — General reversible model (6 free parameters); slowest, most accurate.

Visualization 

flexsweep.utils.plot_scan(stats, stat_cols=None, pvalue: bool = False, top_pct: float = 0.01, threshold_lines: list | None = None, out: str | None = None, figsize: tuple | None = None, title: str | None = None, sharey: bool = False, chrom: str | None = None, center: int | None = None, window_bp: int = 500000) → Figure[source]

Genome-wide or regional scan plot — single stat or stacked multi-stat.

Accepts output from scan() (dict) or paths to {prefix}.{stat}.txt files.

Genome-wide mode (default, chrom=None): Single stat → 1-panel Manhattan. Multiple stats → stacked panels, shared x-axis. pvalue=False plots raw values with bipolar colouring for signed stats. pvalue=True plots -log10(p_emp) with threshold lines at p = 0.01 / 0.001.

Zoom mode (chrom + center provided): n rows × 2 columns: raw stat (left) | -log10(p_emp) (right). Filtered to [center ± window_bp]. Orange dashed line marks the centre.

Parameters:

stats (dict | str | list[str]) –
- dict from scan() — keys are stat names, values are DataFrames.
- str — path to a single {stat}.txt scan output file.
- list[str] — paths to multiple scan output files.
stat_cols (str | list[str], optional) – Stat column name(s). Required when stats is a file path / list of paths. When stats is a dict, defaults to all keys.
pvalue (bool) – Genome-wide mode only. If True, plot -log10(p_emp) with y=[0,10] and threshold lines. If False, plot raw stat values with bipolar colouring.
top_pct (float) – Fraction highlighted as outliers in genome-wide raw mode.
threshold_lines (list of (y_value, linestyle, label), optional) – Horizontal lines. pvalue=True default: y=2 dashed (p=0.01), y=3 solid. Pass [] to suppress.
out (str, optional) – Save path. If None, shows interactively.
figsize (tuple, optional) – Genome-wide default: (14, 4) / (14, 3×n). Zoom default: (10, 2.5×n).
title (str, optional) – Title for single-stat genome-wide plots.
sharey (bool) – Share y-axis across panels in genome-wide mode (default False).
chrom (str, optional) – Chromosome for zoom mode (e.g. "22" or "chr22").
center (int, optional) – Centre position (bp) for zoom mode.
window_bp (int) – Half-window size in bp for zoom mode (default 500 000 = ±500 kb).

flexsweep.utils.plot_manhattan(prediction, recombination_map: str | None = None, eps: float = 1e-10, chr_col: str | None = None, pos_col: str | None = None, p_col: str | None = None, log_transform: bool = True, threshold_lines: list | None = None, figsize: tuple = (14, 5), out: str | None = None, title: str | None = None, chrom: str | int | None = None, center: int | None = None, window_bp: int = 5000000, chr_prefix_pattern: str = '^chr')[source]

flexsweep.utils.plot_diversity(data_dir, figsize=None, title=None, out=None, nthreads=1)[source]

flexsweep.utils.plot_sweep_density(prediction, output_path=None)[source]

Histogram of per-window sweep probability split by chromosome.

Parameters:

prediction (pl.DataFrame | str) – Output of CNN.predict() or path to parquet/CSV with columns chr, start, end, prob_sweep.
output_path (str, optional) – If given, saves the figure as SVG.

Returns:

fig (matplotlib.figure.Figure)

flexsweep.utils.plot_fv_pca(train_data, empirical_data, subsample=5000, output_path=None)[source]

PCA of the feature vector matrix colored by neutral/sweep label.

Parameters:

train_data (str | pl.DataFrame) – Path to fvs*.parquet or already-loaded DataFrame. Must have a ‘model’ column.
subsample (int) – Max rows to use (avoids slow PCA on very large datasets). Default 5000.
output_path (str, optional) – If given, saves SVG.

Returns:

fig (matplotlib.figure.Figure)

flexsweep.utils.plot_stat_distributions(train_data, empirical_data=None, stats=['dind', 'dist_kurtosis', 'dist_skew', 'dist_var', 'h1', 'h12', 'h2_h1', 'haf', 'hapdaf_o', 'hapdaf_s', 'high_freq', 'ihs', 'isafe', 'k_counts', 'low_freq', 'max_fda', 'nsl', 'omega_max', 'pi', 's_ratio', 'tajima_d', 'theta_h', 'theta_w', 'zns'], output_path=None)[source]

Violin plots of feature stats split by neutral/sweep (and optionally empirical).

Parameters:

train_data (str | pl.DataFrame) – Training fvs*.parquet — must have a ‘model’ column.
empirical_data (str | pl.DataFrame, optional) – Empirical fvs*.parquet — no ‘model’ column; plotted as a third distribution.
stats (list[str], optional) – Stat base names to plot (e.g. [‘pi’, ‘h12’, ‘ihs’]). Default: one representative column per unique stat base name.
output_path (str, optional) – Save path for SVG.

Returns:

fig (matplotlib.figure.Figure)

flexsweep.utils.rank_probabilities(prediction, feature_coordinates, rank_distance=False, k=111)[source]

Goal: match the original pybedtools/bedtools output exactly, including tie order.

Strategy for exact tie-order:

Create an explicit, deterministic per-gene input order (gene_order) from the sorted gene table (chr,start). This mirrors the order bedtools processes A.
Carry gene_order through the pipeline and use it as the final tie-breaker in the final sort (stable + deterministic).

Strategy for bedtools closest -k parity:

We emulate -k K by sorting hits per gene by (d, chrom_pred, start_pred, end_pred) and taking the first K.
This gives deterministic selection when multiple windows share the same distance.

Population genetics statistics 

These are the low-level functions called internally by the feature-vector pipeline and the scan module. They can also be called directly on numpy arrays.

Site Frequency Spectrum (SFS)

flexsweep.fv.neutrality_stats(ac, positions)[source]

# Numba unified call to compute SFS neutrality statistics. Trying to avoid numba overhead as much as possible

Returns a length-12 array:: [0] tajima_d [1] theta_h (scaled by sequence length if positions provided) [2] h_raw (Fay & Wu’s H = pi - theta_h, unscaled) [3] h_norm (normalized Fay & Wu’s H) [8] pi (mean pairwise diversity, absolute) [9] theta_w (Watterson’s theta, absolute) [10] theta_w_per_base (nan if < 2 positions) [11] pi_per_base (nan if < 2 positions)

flexsweep.fv.theta_watterson(ac, positions)[source]

flexsweep.fv.sfs_nb(dac, n)[source]

Site-frequency spectrum (SFS) from derived-allele counts.

Parameters:

dac (numpy.ndarray) – 1D array of derived allele counts per site, values in [0..n].
n (int) – Total number of chromosomes. If n <= 0, it is inferred as max(dac).

Returns:

Integer array of length n+1; sfs[k] is the number of sites with k derived copies.

Return type:

numpy.ndarray

flexsweep.fv.theta_pi(ac)[source]

Per-site nucleotide diversity (π) from allele counts.

For each site j, computes pi_j = 2 * a_j * (n - a_j) / [n * (n - 1)], where a_j is the derived allele count and n is the total number of chromosomes.

Parameters:: ac (numpy.ndarray) – Allele counts array of shape (S, 2) (ancestral, derived), with constant n across sites.
Returns:: Array of per-site π values of length S.
Return type:: numpy.ndarray

flexsweep.fv.tajima_d(ac, min_sites=3)[source]

Tajima’s D from allele counts.

Compares the mean pairwise difference (sum of per-site π) to the Watterson estimator based on the number of segregating sites. Returns nan if the number of segregating sites is below min_sites.

Parameters:

ac (numpy.ndarray) – Allele counts array of shape (S, 2) (ancestral, derived), with constant n across sites.
min_sites (int) – Minimum required number of segregating sites. Default 3.

Returns:

Tajima’s D as a float (nan if insufficient sites).

Return type:

float

flexsweep.fv.achaz_y(ac)[source]

Achaz’s Y neutrality test (standardized).

Unfolded/polarized form — requires ancestral-state information (outgroup or polarization). Excludes ξ₁ (derived singletons) from both estimators.

Reference: Achaz 2008, Appendix B, Equations B28–B30. f = (n-2) / (n·(a_n-1)); Var[Y] = α_n·θ + β_n·θ²

Parameters:: ac (numpy.ndarray) – Allele counts array of shape (S, 2) (ancestral, derived), constant n. Derived allele counts in ac[:,1] define the unfolded SFS.
Returns:: Standardized Achaz’s Y as a float; returns nan if n < 3 or if there are no segregating sites excluding singletons.
Return type:: float

flexsweep.fv.achaz_y_star(ac)[source]

Achaz’s Y* neutrality test (standardized).

Folded form — does not require ancestral-state polarization. Excludes η₁ = ξ₁ + ξ_{n-1} (minor-allele singletons at frequency 1/n or (n-1)/n) from both the π and S estimators.

Reference: Achaz 2008, Appendix B, Equations B19–B21. f* = (n-3) / (a_n·(n-1) - n); Var[Y*] = α*_n·θ + β*_n·θ²

Parameters:: ac (numpy.ndarray) – Allele counts array of shape (S, 2) (ancestral, derived), constant n.
Returns:: Standardized Achaz Y* as a float; nan if n < 4 or no valid sites.
Return type:: float

flexsweep.fv.achaz_t(ac, decay=0.9)[source]

Achaz’s T_Ω neutrality test (Achaz 2009 Eq. 9).

Unfolded/polarized SFS. Uses exponential weight ω₁ᵢ = e^{−0.9·i} vs uniform ω₂ᵢ = 1. Sensitive to excess low-frequency polymorphisms such as those produced by severe bottlenecks.

Variance coefficients αₙ/βₙ are precomputed once per sample size via _achaz_t_coeffs (lru_cache), so per-window cost is O(n).

α = 0.9 is empirical (Achaz 2009 p.254): gives positive Ωᵢ only for i/n ≤ 0.13. θ² estimated as S(S-1)/(a₁²+a₂) following Fu (1995).

Parameters:

ac (numpy.ndarray) – Allele counts array of shape (S, 2) (ancestral, derived), constant n. Derived allele counts in ac[:,1] define the unfolded SFS.
decay (float) – Exponential decay α (default 0.9 per Achaz 2009).

Returns:

Standardized T_Ω as a float; nan if n < 3 or no segregating sites.

Return type:

float

flexsweep.fv.fay_wu_h_norm(ac, positions=None)[source]

Fay & Wu’s H and its normalized form (single-population, infinite sites).

Computes:

theta_h: estimator that upweights high-frequency derived alleles.
h = pi - theta_h (Fay & Wu’s H).
h_norm: normalized H using variance terms.

Parameters:

ac (numpy.ndarray) – Allele counts array of shape (S, 2) (ancestral, derived), constant n.
positions (numpy.ndarray) – Optional positions (length S). If provided, theta_h is divided by the accessible span positions[-1] - (positions[0] - 1).

Returns:

Tuple (theta_h, h, h_norm) as floats.

Return type:

tuple[float, float, float]

flexsweep.fv.zeng_e(ac)[source]

Zeng’s E statistic (single-population, infinite sites), standardized.

Contrasts Watterson’s estimator with a linear SFS component related to high-frequency derived signal. Useful alongside Tajima’s D and Fay & Wu’s H.

Parameters:: ac (numpy.ndarray) – Allele counts array of shape (S, 2) (ancestral, derived), constant n.
Returns:: Standardized Zeng’s E as a float.
Return type:: float

flexsweep.fv.fuli_f_star(ac)[source]

Fu and Li’s F* (starred) statistic (no outgroup required).

Focuses on deviations in the singleton class of the (folded) SFS, contrasting singleton abundance with diversity (π). The starred form does not require ancestral state polarization.

Parameters:: ac (numpy.ndarray) – Allele counts array of shape (S, 2) (ancestral, derived), constant n.
Returns:: Fu & Li’s F* as a float.
Return type:: float

flexsweep.fv.fuli_f(ac)[source]

Fu and Li’s F statistic (polarized).

Uses singleton counts and diversity (π); typically assumes derived states are known (e.g., via outgroup).

Parameters:: ac (numpy.ndarray) – Allele counts array of shape (S, 2) (ancestral, derived), constant n.
Returns:: Fu & Li’s F as a float.
Return type:: float

flexsweep.fv.fuli_d_star(ac)[source]

Fu and Li’s D* (starred) statistic (no outgroup required).

Compares the number of segregating sites against singleton counts in the folded spectrum. The starred form does not require ancestral state polarization.

Parameters:: ac (numpy.ndarray) – Allele counts array of shape (S, 2) (ancestral, derived), constant n.
Returns:: Fu & Li’s D* as a float.
Return type:: float

flexsweep.fv.fuli_d(ac)[source]

Fu and Li’s D statistic (polarized form).

Uses the total number of segregating sites and singletons; typically assumes derived states are known (e.g., via outgroup) to define singletons.

Parameters:: ac (numpy.ndarray) – Allele counts array of shape (S, 2) (ancestral, derived), constant n.
Returns:: Fu & Li’s D as a float.
Return type:: float

Haplotype-based 

flexsweep.fv.ihs_ihh(h, pos, map_pos=None, min_ehh=0.05, min_maf=0.05, include_edges=False, gap_scale=20000, max_gap=200000, is_accessible=None, use_threads=False)[source]

Compute iHS (integrated Haplotype Score) and delta iHH from haplotypes.

The routine integrates EHH (extended haplotype homozygosity) on both sides of each focal SNP to obtain iHH for ancestral and derived alleles, and then reports iHS (log ratio) and the absolute difference in iHH (delta_ihh).

Parameters:

h (numpy.ndarray) – Haplotype matrix of shape (n_snps, n_haplotypes) with 0/1 values, where rows are SNPs and columns are haplotypes.
pos (numpy.ndarray) – Physical positions for SNPs (length n_snps). Used for gap handling and, when map_pos is None, for integration spacing.
map_pos (numpy.ndarray) – Optional genetic map positions (same length as pos). If provided, integration uses these coordinates instead of pos. Default None.
min_ehh (float) – Minimum EHH value to include in the integration. Default 0.05.
min_maf (float) – Minimum minor-allele frequency required to compute iHS at a SNP. Default 0.05.
include_edges (bool) – If True, permit edge SNPs to contribute even when EHH dips below min_ehh. Default False.
gap_scale (int) – Scaling used for gaps between consecutive SNPs when integrating over physical distance (ignored if map_pos is provided). Default 20000.
max_gap (int) – Maximum gap allowed when integrating; larger gaps are capped to max_gap to avoid overweighting sparse regions. Default 200000.
is_accessible (numpy.ndarray) – Optional boolean mask (length n_snps) indicating accessible SNPs. If None, all SNPs are considered accessible. Default None.
use_threads (bool) – Enable threaded computation in downstream primitives when available. Default False.

Returns:

Polars DataFrame with columns: positions (physical position), daf (derived allele frequency), ihs (log iHH ratio), and delta_ihh (absolute difference between derived and ancestral iHH).

Return type:

polars.DataFrame

Raises:

ValueError – Propagated if inputs are inconsistent in length or malformed.

Note

SNPs that fail the MAF threshold or have invalid iHS values are omitted from the returned table.

flexsweep.fv.haf_top(hap, pos, cutoff=0.1, start=None, stop=None, window_size=None, n_snps=None)[source]

Compute the upper-tail HAF (Haplotype Allele Frequency) summary in a region.

Rows of hap are SNPs, columns are haplotypes. HAF values are computed per SNP from haplotypes, then restricted to the specified genomic region (start/stop or window_size) if given. The HAF values are sorted and the top portion after trimming by cutoff is summed.

Parameters:

hap (numpy.ndarray) – Haplotype matrix of shape (n_snps, n_haplotypes) with 0/1 values.
pos (numpy.ndarray) – Physical positions for SNPs (length n_snps).
cutoff (float) – Proportion used for tail trimming. For example, 0.1 trims the lowest 10% and the highest 10% before summing the remaining HAF values. Default 0.1.
start (float) – Optional region start position (inclusive). Default None.
stop (float) – Optional region end position (inclusive). Default None.
window_size (int) – Optional window size in base pairs centered by the caller’s convention. If provided, it can be used to define the region when start/stop are not specified. Default None.
n_snps (int) – Optional limit on the number of SNPs considered by certain strategies (implementation-dependent). Default 1001.

Returns:

Upper-tail HAF summary as a single float after trimming by cutoff.

Return type:

float

Raises:

ValueError – Propagated if inputs are malformed or if no SNPs fall within the region.

Note

If neither start/stop nor window_size is provided, the computation uses all SNPs in hap/pos.

flexsweep.fv.garud_h(h)[source]

Compute the H1, H12, H123 and H2/H1 statistics for detecting signatures of soft sweeps, as defined in Garud et al. (2015).

Parameters:

h (array_like, int, shape (n_variants, n_haplotypes)) – Haplotype array.

Returns:

h1 (float) – H1 statistic (sum of squares of haplotype frequencies).
h12 (float) – H12 statistic (sum of squares of haplotype frequencies, combining the two most common haplotypes into a single frequency).
h123 (float) – H123 statistic (sum of squares of haplotype frequencies, combining the three most common haplotypes into a single frequency).
h2_h1 (float) – H2/H1 statistic, indicating the “softness” of a sweep.

flexsweep.fv.h12_enard(hap, positions, focal_coord=600000, n_snps=None, window_size=500000, min_derived_freq=0.05, similarity_threshold=0.8, top_k=10)[source]

Estimate Garud’s H12, H2/H1, H1 around a focal coordinate, grouping haplotypes that are at least a given identity threshold (default 80%).

The method builds a count-based, symmetric SNP window centered at focal_coord, constructs a haplotype matrix H for the selected SNPs, collapses identical haplotypes (columns), orders the unique haplotypes by frequency (descending) and lexicographic order (ascending), selects a set of representative haplotypes, and then merges representatives into similarity groups whenever the column-wise identity meets or exceeds similarity_threshold (0.8 by default). Haplotype group frequencies are then used to compute the H12 family of statistics.

Identity between two haplotype columns is defined as:

\[\text{identity} = \frac{\#(1,1)}{\#(1,1) + \#(1,0) + \#(0,1)}\]

(i.e., matches on the derived allele over all non-equal-or-derived comparisons).

Parameters:

hap (numpy.ndarray) – Haplotype matrix of shape (L_total, n) with 0/1 values (ancestral/derived). Rows are SNPs; columns are haplotypes.
positions (numpy.ndarray) – 1D array (length L_total) of genomic coordinates (int64) aligned to hap rows.
focal_coord (int) – Genomic coordinate used to center the SNP window. Default 600000.
n_snps (int) – Target number of SNPs for the window (the focal SNP, if present, is excluded from the returned set). Default 1001.
min_derived_freq (float) – Minimum derived-allele frequency required for a SNP to enter the window (inclusive; upper bound is 1.0). Default 0.05.
similarity_threshold (float) – Column similarity threshold for grouping haplotypes. Two representative columns are merged if their identity fraction (formula above) is ≥ this value. Default 0.8 (80% identity).
top_k (int) – Limit controlling how many unique-haplotype representatives are considered before grouping. Default 10.

Returns:

Tuple (H12, H2_H1, H2) as floats. If no usable SNPs or groups are found, returns (0.0, 0.0, 0.0).

Return type:

tuple[float, float, float]

\[\begin{split}H1 = \sum_g p_g^2,\qquad H12 = (p_1 + p_2)^2,\qquad H2 = H1 - p_1^2,\qquad \frac{H2}{H1} = \begin{cases} (H1 - p_1^2)/H1, & H1 \ne 0,\\ 0, & H1 = 0~. \end{cases}\end{split}\]

Notes

The SNP window is built by balancing sites to the left and right of focal_coord by proximity, after applying the derived-frequency filter [min_derived_freq, 1.0], and excluding the focal position itself.
Unique haplotypes are detected via hashing of H columns; sample-to-group frequencies \(p_g\) are computed after the identity-based grouping step.
The default behavior corresponds to H12 with 80% identity grouping in the haplotype matrix, which can increase robustness by merging highly similar haplotypes.

flexsweep.fv.hscan(hap, pos, focal_pos=None, max_gap=1000000000, dist_mode=0, step=1, left_bound=0, right_bound=1000000000, return_pairs=False)[source]

flexsweep.fv.run_isafe(hap, positions, max_freq=1, min_region_size_bp=49000, min_region_size_ps=300, ignore_gaps=True, window=300, step=150, top_k=1, max_rank=15)[source]

Estimate iSAFE or SAFE on a genomic region following Flex-sweep default values.

The function removes monomorphic SNPs, then checks region size. If num_snps <= min_region_size_ps or positions.max() - positions.min() < min_region_size_bp, it computes SAFE; otherwise it computes iSAFE using the provided sliding-window settings. Results are returned as a Polars DataFrame with columns positions (bp), daf (derived allele frequency), and isafe (score). Variants with daf >= max_freq are filtered out.

Parameters:

hap (numpy.ndarray) – Haplotype matrix of shape (n_snps, n_haplotypes) with 0/1 values (ancestral/derived).
positions (numpy.ndarray) – 1D array of physical coordinates (length n_snps) aligned to hap rows.
max_freq (float) – Maximum allowed derived allele frequency in the output (daf < max_freq). Default 1 (no filter).
min_region_size_bp (int) – Minimum region span in base pairs required to run iSAFE. Default 49000.
min_region_size_ps (int) – Minimum number of polymorphic SNPs required to run iSAFE. Default 300.
ignore_gaps (bool) – Reserved for gap handling; currently not used. Default True.
window (int) – iSAFE sliding window size (number of SNPs or bp, depending on the downstream implementation). Default 300.
step (int) – iSAFE step between windows. Default 150.
top_k (int) – iSAFE parameter controlling the number of top candidates per window. Default 1.
max_rank (int) – iSAFE parameter controlling the maximum rank to track. Default 15.

Returns:

Polars DataFrame with columns positions (int), daf (float), and isafe (float), sorted by position and filtered to daf < max_freq. If the region is small, the isafe column contains SAFE scores.

Return type:

polars.DataFrame

Note

Monomorphic sites are removed using (1 - f) * f > 0, where f is the derived allele frequency per SNP. When computing iSAFE, the function passes window, step, top_k, and max_rank to the underlying implementation.

flexsweep.fv.isafe(hap, pos, w_size=300, w_step=150, top_k=1, max_rank=15)[source]

Derived-background diversity 

flexsweep.fv.dind_high_low(hap, ac, rec_map, max_ancest_freq=0.25, min_tot_freq=0, min_focal_freq=0.25, max_focal_freq=0.95, window_size=50000, genetic_distance=False)[source]

Compute DIND, highfreq, and lowfreq statistics per focal SNP.

Parameters:

hap (numpy.ndarray) – Haplotype matrix (n_snps, n_samples) with 0/1 values.
ac (numpy.ndarray) – Allele counts (n_snps, 2) as [ancestral, derived].
rec_map (numpy.ndarray) – Map array; penultimate column is the window coordinate.
max_ancest_freq (float) – Threshold used in high/low frequency components. Default 0.25.
min_tot_freq (float) – Unused here (kept for API symmetry). Default 0.
min_focal_freq (float) – Minimum focal derived frequency. Default 0.25.
max_focal_freq (float) – Maximum focal derived frequency. Default 0.95.
window_size (int) – Window size in coordinate physical units from rec_map[:, -2]. Default 50000.
genetic_distance (bool) – Unused here (kept for API symmetry).

Returns:

DataFrame with columns positions, daf, dind, high_freq, low_freq.

Return type:

polars.DataFrame

flexsweep.fv.s_ratio(hap, ac, rec_map, max_ancest_freq=1, min_tot_freq=0, min_focal_freq=0.25, max_focal_freq=0.95, window_size=50000, genetic_distance=False)[source]

Compute the S-ratio statistic for each focal SNP.

For each focal SNP (derived frequency in [min_focal_freq, max_focal_freq]), neighbors within window_size are summarized by indicators of intermediate frequency on the derived and ancestral partitions, and the ratio of their counts is reported.

Parameters:

hap (numpy.ndarray) – Haplotype matrix (n_snps, n_samples) with 0/1 values.
ac (numpy.ndarray) – Allele counts (n_snps, 2) as [ancestral, derived].
rec_map (numpy.ndarray) – Map array; penultimate column is the window coordinate.
max_ancest_freq (float) – Maximum ancestral-partition frequency threshold. Default 1.
min_tot_freq (float) – Minimum neighbor total derived frequency. Default 0.
min_focal_freq (float) – Minimum focal derived frequency. Default 0.25.
max_focal_freq (float) – Maximum focal derived frequency. Default 0.95.
window_size (int) – Window size in coordinate units of rec_map[:, -2]. Default 50000.
genetic_distance (bool) – Unused here (kept for API symmetry).

Returns:

DataFrame with columns positions, daf, s_ratio.

Return type:

polars.DataFrame

flexsweep.fv.hapdaf_o(hap, ac, rec_map, max_ancest_freq=0.25, min_tot_freq=0.25, min_focal_freq=0.25, max_focal_freq=0.95, window_size=50000, genetic_distance=False)[source]

Compute hapDAF-o for each focal SNP.

hapDAF-o averages f_d^2 - f_a^2 over neighbors that satisfy (f_d > f_a) & (f_a <= max_ancest_freq) & (f_tot >= min_tot_freq).

Parameters:

hap (numpy.ndarray) – Haplotype matrix (n_snps, n_samples) with 0/1 values.
ac (numpy.ndarray) – Allele counts (n_snps, 2) as [ancestral, derived].
rec_map (numpy.ndarray) – Map array; penultimate column is the window coordinate.
max_ancest_freq (float) – Ancestral partition frequency threshold. Default 0.25.
min_tot_freq (float) – Minimum neighbor total derived frequency. Default 0.25.
min_focal_freq (float) – Minimum focal derived frequency. Default 0.25.
max_focal_freq (float) – Maximum focal derived frequency. Default 0.95.
window_size (int) – Window size in coordinate units of rec_map[:, -2]. Default 50000.
genetic_distance (bool) – Unused here (kept for API symmetry).

Returns:

DataFrame with columns positions, daf, hapdaf_o.

Return type:

polars.DataFrame

flexsweep.fv.hapdaf_s(hap, ac, rec_map, max_ancest_freq=0.1, min_tot_freq=0.1, min_focal_freq=0.25, max_focal_freq=0.95, window_size=50000, genetic_distance=False)[source]

Compute hapDAF-s for each focal SNP.

hapDAF-s is the same construction as hapDAF-o but uses more stringent thresholds (e.g., smaller max_ancest_freq and min_tot_freq).

Parameters:

hap (numpy.ndarray) – Haplotype matrix (n_snps, n_samples) with 0/1 values.
ac (numpy.ndarray) – Allele counts (n_snps, 2) as [ancestral, derived].
rec_map (numpy.ndarray) – Map array; penultimate column is the window coordinate.
max_ancest_freq (float) – Ancestral partition frequency threshold. Default 0.1.
min_tot_freq (float) – Minimum neighbor total derived frequency. Default 0.1.
min_focal_freq (float) – Minimum focal derived frequency. Default 0.25.
max_focal_freq (float) – Maximum focal derived frequency. Default 0.95.
window_size (int) – Window size in coordinate units of rec_map[:, -2]. Default 50000.
genetic_distance (bool) – Unused here (kept for API symmetry).

Returns:

DataFrame with columns positions, daf, hapdaf_s.

Return type:

polars.DataFrame

flexsweep.fv.fast_sq_freq_pairs(hap, ac, rec_map, min_focal_freq=0.25, max_focal_freq=0.95, window_size=50000)[source]

flexsweep.fv.dind_high_low_from_pairs(sq_freqs, info, max_ancest_freq=0.25, min_tot_freq=0)[source]

flexsweep.fv.s_ratio_from_pairs(sq_freqs, max_ancest_freq=1, min_tot_freq=0)[source]

flexsweep.fv.hapdaf_from_pairs(sq_freqs, max_ancest_freq, min_tot_freq)[source]

Unified hapdaf_o and hapdaf_s — bodies were identical, only defaults differ.

Call as:

hapdaf_o = hapdaf_from_pairs(sq_freqs, max_ancest_freq=0.25, min_tot_freq=0.25)
hapdaf_s = hapdaf_from_pairs(sq_freqs, max_ancest_freq=0.10, min_tot_freq=0.10)

Removed dead args: hap, snps_indices (were passed to hapdaf_o but never used).

Linkage disequilibrium 

flexsweep.fv.Ld(hap, as_float32=True) → tuple[source]

Compute Kelly’s ZnS (mean pairwise \(r^2\)) and omega_max from an LD matrix.

The input r_2 is a square matrix of pairwise linkage disequilibrium values \(r^2\) among SNPs within a window. If mask is provided, the computation is restricted to the subset of indices where mask is True.

ZnS is defined as:

\[\mathrm{ZnS} = \frac{\sum_{i<j} r^2_{ij}}{\binom{S}{2}},\]

where \(S\) is the number of SNPs after masking.

The function also returns omega_max (Kim & Nielsen, 2004), computed via omega_linear_correct_mask(), which scans split points and compares the average LD within versus between the two partitions.

Parameters:

r_2 (numpy.ndarray) – Square matrix (S × S) of pairwise \(r^2\) values. The routine treats it as symmetric; values on and below the diagonal are ignored for ZnS.
mask (numpy.ndarray) – Optional boolean vector of length S to select a subset of SNPs. Default None.

Returns:

Tuple (zns, omega_max) as floats.

Return type:

tuple[float, float]

flexsweep.fv.r2(locus_A: ndarray, locus_B: ndarray) → float[source]

Compute the squared correlation coefficient \(r^2\) between two biallelic loci.

Given two 0/1 vectors of equal length (haplotypes across samples), this function computes:

\[D = P_{11} - p_A p_B,\quad r^2 = \frac{D^2}{p_A (1-p_A)\, p_B (1-p_B)},\]

where \(p_A\) and \(p_B\) are the allele-1 frequencies at loci A and B, and \(P_{11}\) is the empirical joint frequency that both loci equal 1.

Parameters:

locus_A (numpy.ndarray) – 1D array of 0/1 alleles for locus A (dtype int8 expected by Numba signature).
locus_B (numpy.ndarray) – 1D array of 0/1 alleles for locus B (same length and dtype as locus_A).

Returns:

The \(r^2\) value as a float.

Return type:

float

Note

If either locus is monomorphic (denominator zero), the result may be inf or nan depending on arithmetic; callers typically filter such sites beforehand.

flexsweep.fv.compute_r2_matrix_upper(hap, as_float32=False)[source]: r² via pre-scaled BLAS matmul. Avoids the outer-product subtraction step of the original compute_r2_matrix_upper, saving one O(S²) allocation.

flexsweep.fv.omega_linear_correct(r2_matrix)[source]

Compute \(\omega_\text{max}\) (Kim & Nielsen, 2004) from an \(r^2\) matrix.

The statistic compares the average LD within two partitions (left/right of a split) to the average LD between the partitions. For a split index \(\ell\) on a sequence of length \(S\), define:

\[\begin{split}\begin{aligned} &\text{within-left} &&= \sum_{0 \le i < j < \ell} r^2_{ij},\\ &\text{within-right} &&= \sum_{\ell \le i < j < S} r^2_{ij},\\ &\text{between} &&= \sum_{0 \le i < \ell} \sum_{\ell \le j < S} r^2_{ij}, \end{aligned}\end{split}\]

and the means are obtained by dividing by the corresponding pair counts \(\binom{\ell}{2}\), \(\binom{S-\ell}{2}\), and \(\ell(S-\ell)\). The omega score at \(\ell\) is:

\[\omega(\ell) = \frac{\dfrac{\text{within-left}}{\binom{\ell}{2}} + \dfrac{\text{within-right}}{\binom{S-\ell}{2}}} {\dfrac{\text{between}}{\ell(S-\ell)}}.\]

This function scans admissible \(\ell\) and returns the maximum value.

Parameters:

_r2 (numpy.ndarray) – Square matrix (S × S) of pairwise \(r^2\) values. Only the upper triangle (i < j) is required to hold valid values.
mask (numpy.ndarray) – Optional boolean vector selecting a subset of SNP indices to consider. Default None (use all SNPs).

Returns:

The maximum omega value over all candidate split points.

Return type:

float

Notes:

Modification from https://github.com/kr-colab/diploSHIC/blob/master/diploshic/utils.c taking advantages of numpy vectorized operations. Very small windows (S < 3) return 0.0.

Composite sweep statistics 

flexsweep.fv.LASSI_spectrum_and_Kspectrum(input_data, K_truncation=10, window=110, step=5)[source]

Compute haplotype count and frequency spectra within sliding windows.

Parameters: - hap (numpy.ndarray): Array of haplotypes where each column represents an individual and each row represents a SNP. - pos (numpy.ndarray): Array of SNP positions. - K_truncation (int): Number of haplotypes to consider. - window (int): Size of the sliding window. - step (int): Step size for sliding the window.

Returns: - K_count (numpy.ndarray): Haplotype count spectra for each window. - K_spectrum (numpy.ndarray): Haplotype frequency spectra for each window. - windows_centers (numpy.ndarray): Centers of the sliding windows.

flexsweep.fv.T_m_statistic_fast(K_counts, K_neutral, windows, K_truncation, sweep_mode=4, _iter=0)[source]

flexsweep.fv.compute_t_m(sim_list, K_truncation=10, w_size=201, w_step=10, K_neutral=None, sweep_mode=4, center=[50000.0, 1150000.0], windows=[100000], step=100000, nthreads=1, params=None, parallel_manager=None)[source]

Compute LASSI-style T and m-hat over a set of simulations.

The function builds truncated haplotype-frequency spectra per window, estimates a neutral spectrum if not provided, scores each window with T and m, and then reduces the scan to fixed physical windows around the specified centers. If params are provided, they are attached and the result may be pivoted to feature vectors format.

Parameters:

sim_list (sequence) – Iterable of simulation items consumable by LASSI_spectrum_and_Kspectrum.
K_truncation (int) – Number of top haplotype counts retained in the truncated spectrum. Default 5.
w_size (int) – Sliding window size in SNPs used to build K-spectra. Default 110.
step (int) – Step in SNPs between consecutive windows. Default 5.
K_neutral (array-like or None) – Precomputed neutral truncated spectrum; if None, estimated via neut_average. Optional.
windows (list[int]) – Physical window widths (bp) for cut_t_m_argmax. Default [50000, 100000, 200000, 500000, 1000000].
center (list[int]) – Inclusive physical range (bp) defining centers. Default [500000, 700000].
nthreads (int) – Number of joblib workers. Default 1.
params (array-like or None) – Optional parameter matrix aligned to sim_list with columns [s, t, f_i, f_t].
parallel_manager (joblib.Parallel or None) – Existing joblib.Parallel to reuse; if None, a new one is created.

Returns:

(t_m_cut, K_neutral)

Return type:

tuple

Notes:

T is a log-likelihood ratio comparing sweep-distorted vs neutral truncated spectra. m is the estimated number of sweeping haplotypes (1 = hard; >1 = soft), upper-bounded by K_truncation.

flexsweep.fv.Lambda_statistic_fast(K_counts, K_neutral, positions, K_truncation, n_A=101, sweep_mode=4, nthreads=1, max_extend=100000.0, _iter=0)[source]

Compute saltiLASSI Λ statistics for all windows, returning a Polars DataFrame.

Uses geometric mixture formula matching C++ lassip (lassip-winstats.cpp L265-267): L1 - L0 = Σ_i α_i · dF[mi, ei, i] where dF[mi, ei, i] = sweep_likelihood(i, m, ε) − easy_likelihood(i) is precomputed once.

Precomputes dF single-threaded (384 KB, fits in L2 cache), then distributes target windows across joblib loky workers. dF serializes cheaply (384 KB); each worker JITs _lassip_chunk once then processes all its assigned windows. Threading backend does not parallelize because numba @njit(parallel=False) does not release the GIL.

precompute: O(n_windows × K × n_ε) — all log() calls happen here. hot path: O(n_windows² × n_A × K × n_ε) — multiply-adds only, within max_extend.

Parameters:

K_counts (np.ndarray, shape (n_windows, K_truncation))
K_neutral (np.ndarray, shape (K_truncation,))
positions (np.ndarray, shape (n_windows,))
K_truncation (int)
n_A (int) – Number of log-spaced A values. Default 101 (matches C++ lassip 101-point grid).
sweep_mode (int) – Sweep haplotype redistribution model (1-5). Default 4 (exponential squared).
nthreads (int) – Number of joblib threads. Default 1 (single-threaded).
max_extend (float) – Maximum bp distance from target window to include in composite. Default 1e5 (100 kb), matching C++ lassip DEFAULT_MAX_EXTEND_BP. Use np.inf to sum all windows.
_iter (int) – Replicate index attached as ‘iter’ column.

Returns:

pl.DataFrame with columns (window_lassip, Lambda, m, A, frequency, iter. A column stores 1/actual_A (C++ lassip: maxA = 1.0/exp(A_loop)).)

flexsweep.fv.run_lassip(hap_data, K_truncation=10, w_size=201, step=10, K_neutral=None, n_A=100, sweep_mode=4, nthreads=1)[source]

Run saltiLASSI on a single haplotype dataset (VCF or ms format).

Extends run_lassi by computing the spatially-aware Λ statistic (DeGiorgio & Szpiech 2022) instead of the per-window T statistic. Reuses LASSI_spectrum_and_Kspectrum and neut_average unchanged.

Algorithm and performance vs C++ lassip

saltiLASSI composites LASSI per-window log-likelihoods with a spatial decay kernel:

L1(m,ε,A,z*) = Σ_i α_i · sweep_lik(i,m,ε) + (1−α_i) · null_lik(i) L1 − L0 = Σ_i α_i · dF[mi,ei,i] ← dot product, no log()

where α_i = exp(−A · abs(z_i − z*)) and dF[mi,ei,i] = sweep_lik(i,m,ε) − null_lik(i) is precomputed once in O(n_windows × K × n_ε) before the hot path.

Aspect	C++ lassip	This implementation
Mixture formula Hot-path inner work log() calls total q / dF working set Cache behaviour sweep_mode Spatial cutoff Parallelism Observed 1-thread speed	Geometric (paper uses arithmetic) K log() + K mul-add per source window ~23 B for 480 windows n_win × n_ε × m × K × 8B ≈ 38 MB L3 misses (38 MB) Fixed at compile time (oopt flag) MAX_EXTEND hard distance limit OpenMP (process forks) ~120 s for 480 windows (39K loci)	Geometric — identical ✓ 1 mul-add per source window ~48 K (precompute only) K × n_ε × n_win × 8B ≈ 384 KB L2 resident (384 KB) Runtime parameter 1–5 Sums all windows (no cutoff) joblib loky (separate processes) ~2–3 s for 480 windows

The ~40–60× single-thread speedup comes from three compounding effects: 1. dF precomputation collapses K from the hot path → 10× fewer operations per inner step. 2. 384 KB dF fits in L2 cache; C++ 38 MB q array thrashes L3. 3. Pure multiply-add hot path is SIMD-vectorizable; log() calls are not.

Complexity: O(n_windows² × n_A × K × n_ε). For n_windows≈480 on a 1.2 Mb locus, this is ~2.3 B multiply-adds ≈ 2–3 s single-thread; ~0.3 s with 8 threads. Use a larger step (e.g. step=50) to reduce n_windows and quadratically cut runtime.

Parameters:

hap_data (str or tuple) – Path to VCF/ms file, or (hap_int, position_masked) tuple.
K_truncation (int) – Number of top haplotypes in truncated spectrum. Default 10.
w_size (int) – Sliding window size in SNPs for building K-spectra. Default 201.
step (int) – Step in SNPs between consecutive K-spectrum windows. Default 10.
K_neutral (np.ndarray or None) – Pre-computed neutral spectrum. Estimated via neut_average if None.
n_A (int) – Number of log-spaced A grid points (spatial decay rates). Default 100.
sweep_mode (int) – Sweep haplotype redistribution model passed to sweep_likelihood (1–5). Default 4.
nthreads (int) – Number of joblib threads for the per-target phase. Default 1.

Returns:

pl.DataFrame with columns (window_lassip, Lambda, m, A, frequency, iter)

flexsweep.fv.mu_stat(hap, snp_positions, window_size=50)[source]

Compute RAiSD composite sweep score \(\mu\) over overlapping SNP windows.

For each sliding window of window_size consecutive SNPs (step = 1 SNP), this routine evaluates three components and their product:

mu_var – reduction-of-variation component (computed by compute_mu_var()), scaled by the region length.
mu_sfs – site-frequency-spectrum skew component (from compute_mu_sfs()), standardized using Watterson’s harmonic correction (_harmonic_sums(n)[0]).
mu_ld – linkage-disequilibrium contrast component (from compute_mu_ld()) using the supplied \(r^2\) matrix.
mu_total – composite statistic mu_var * mu_sfs * mu_ld.

The window center coordinate is recorded as the midpoint between the first and last SNP positions in the window. Results are returned as a Polars DataFrame with one row per window.

Parameters:

hap (numpy.ndarray) – Haplotype matrix of shape (S, n) with 0/1 alleles (rows = SNPs, columns = haplotypes or chromosomes).
snp_positions (numpy.ndarray) – Monotonically increasing physical positions of length S (aligned to rows of hap).
window_size (int) – Number of consecutive SNPs per sliding window; defaults to 50 (RAiSD’s -w default).

Returns:

A Polars DataFrame with columns * positions (int): window center (bp, midpoint of first/last SNP in window) * mu_var (float): variation component * mu_sfs (float): SFS component * mu_ld (float): LD component * mu_total (float): composite score mu_var * mu_sfs * mu_ld

Return type:

polars.DataFrame

Notes:

D_ln = (snp_positions[-1] + 1) - snp_positions[0] is the physical span of the full input region (bp), used to scale mu_var.
theta_w_correction = _harmonic_sums(n)[0] = Watterson’s harmonic sum (1 + 1/2 + … + 1/(n-1)), precomputed once for all windows.
get_pattern_ids is called once on the full haplotype matrix before the window loop; pattern IDs are indexed per-window by start/end.
Windows advance by one SNP (maximally overlapping). For S SNPs and window size W, the output has S - W + 1 rows.

See also:

compute_mu_var(), compute_mu_sfs(), compute_mu_ld(), get_pattern_ids()

flexsweep.fv.run_raisd(hap_data, window_size=50)[source]

Public entry point for the RAiSD mu statistic.

Accepts either a VCF/VCF.gz file path or an ms/discoal simulation text file and returns the per-SNP-window mu scores computed by mu_stat.

Input dispatch:

Tries genome_reader (allel-based VCF reader). This is the primary path for real data.
On any exception, falls back to parse_ms_numpy (ms/discoal text format). This allows the same function to be used for both empirical and simulated data without separate entry points.

Parameters:

hap_data – Path to a VCF/VCF.gz file or an ms/discoal output file.
window_size – Number of consecutive SNPs per sliding window (default 50, matching RAiSD’s -w default).

Returns:

polars.DataFrame – Same schema as mu_stat — columns [positions, mu_var, mu_sfs, mu_ld, mu_total], one row per window.

Balancing selection 

flexsweep.fv.ncd1(position_masked, freqs, tf=0.5, w=3000, minIS=2)[source]

flexsweep.fv.run_beta_window(ac, position_masked, p=2, m=0.1, w=None, theta=None)[source]