API Reference
Simulation
- class flexsweep.Simulator(sample_size, demes, output_folder, mutation_rate={'dist': 'uniform', 'max': 2e-08, 'min': 5e-09}, recombination_rate={'dist': 'exponential', 'max': 1e-07, 'mean': 1e-08, 'min': 1e-09}, locus_length=1200000, discoal_path='/home/docs/checkouts/readthedocs.org/user_builds/flexsweep/checkouts/latest/flexsweep/data/discoal', num_simulations=250000, saf=[0, 0.1], eaf=[0.5, 1], s=[0.01, 0.05], time=[0, 5000], nthreads=1, ne=10000, fixed_ratio=0.1, split=False)[source]
Bases:
objectInitialize discoal coalescent simulations.
The simulator draws per-replicate mutation (\(\mu\)) and recombination (\(r\)) rates, scales them to \(\theta = 4N_eL\mu\) and \(\rho = 4N_eLr\), and prepares neutral and sweep configurations to be executed via the
discoalbinary.- check_inputs()[source]
Validate inputs and prepare output directories.
Ensures mutation and recombination distributions are dictionaries, creates subdirectories for sweep and neutral simulations, and reads demography from the provided demes YAML.
- Returns:
str – Discoal demographic flags string (e.g., “ -en <t> 0 <size> …”).
- create_params()[source]
Generate and save parameter sets for neutral and sweep simulations into a Polars DataFrame
- Steps:
Draw μ and ρ, compute θ = 4NeLμ and ρ = 4NeLr.
Sample sweep times (uniform between time[0], time[1]).
Sample selection coefficients between s[0] and s[1].
Partition simulations into hard/soft and complete/incomplete sweeps.
Save all parameters to ‘<output_folder>/params.txt.gz’.
- Returns:
pl.DataFrame –
- Parameters with columns
[‘iter’, ‘mu’, ‘r’, ‘eaf’, ‘saf’, ‘s’, ‘t’, ‘model’].
- neutral(theta, rho, discoal_demes, _iter=1)[source]
Run a single neutral simulation.
- Parameters:
theta (float) – 4NeLμ.
rho (float) – 4NeLr.
discoal_demes (str) – Demography string (from read_demes).
_iter (int, default=1) – Iteration index.
- Returns:
str – Path to output gzipped ms file.
- random_distribution(num)[source]
Draw mutation (μ) and recombination (ρ) rates.
- Parameters:
num (int) – Number of draws.
- Returns:
tuple[np.ndarray, np.ndarray] – Arrays (mu, rho) each of length num.
- Supported distributions:
‘uniform’: Uniform[min, max]
‘exponential’: Exponential(mean), truncated to [min, max]
‘truncnorm’: Truncated normal with mean/std/min/max
‘fixed’: Placeholder only (currently not implemented)
Notes
Values are in per-basepair units.
- read_demes()[source]
Parse the demes YAML and build the discoal -en demography string.
- Behavior:
Loads the first deme and reverses epochs (oldest first).
Sets self.ne_0 to initial population size.
Converts epoch end times to 4Ne units and start sizes to Ne ratios.
Produces strings of the form ‘ -en <time> 0 <size>’.
- Returns:
str – Concatenated discoal ‘-en’ flags
- simulate()[source]
Run neutral and sweep simulations via discoal.
- Returns:
dict[str, list[str]] – Paths to gzipped ms files, with keys ‘neutral’ and ‘sweep’.
- simulate_batch()[source]
Run neutral and sweep simulations via discoal, in batches.
- Returns:
dict[str, list[str]] – Paths to gzipped ms files, with keys ‘neutral’ and ‘sweep’.
- sweep(theta, rho, f_t, f_i, t, s, discoal_demes, _iter=1)[source]
Run a single sweep simulation
- Parameters:
theta (float) – 4NeLμ.
rho (float) – 4NeLr.
f_t (float, default=1) – Terminal frequency (completed sweep if =1).
f_i (float, default=0) – Initial frequency (0 = hard sweep).
t (float) – Sweep onset time in 4Ne generations.
s (float) – Selection coefficient scaled by 2Ne.
discoal_demes (str) – Demography string.
_iter (int, default=1) – Iteration index.
- Returns:
str – Path to output gzipped ms file.
Data I/O
- class flexsweep.data.Data(data, region=None, samples=None, recombination_map=None, window_size=1200000, step=10000, nthreads=1, sequence_length=1200000, in_memory=False)[source]
Bases:
object- genome_reader(region, samples)[source]
Read a genomic region from a VCF/BCF and return haplotypes and rec_map given the region interval.
- Parameters:
region (str) – Region string ‘CHR:START-END’.
samples (list[str] | np.ndarray) – Sample subset to read.
- Returns:
dict[str, tuple[np.ndarray, np.ndarray] | None] – {region: (hap, rec_map_subset)} or {region: None} if no biallelic sites. - hap: (S x N) int8 array of phased haplotypes (derived-allele indicators). - rec_map_subset: (S x 4) array [chrom, idx, genetic_pos(cm or proxy), physical_pos].
Notes
Filters to biallelic variants via scikit-allel’s AlleleCounts.is_biallelic_01.
If no recombination map is supplied, genetic distance is set to physical pos and later remapped to a relative [1..sequence_length] coordinate system.
- get_cm(df_rec_map, positions)[source]
Interpolate genetic distances (cM) for given physical positions.
- Parameters:
df_rec_map (pl.DataFrame) –
- Recombination map with numeric columns where:
columns[1] holds physical coordinate (bp) and columns[-1] holds cumulative genetic distance (cM).
positions (np.ndarray[int]) – Physical positions to interpolate.
- Returns:
np.ndarray[float] – Interpolated cumulative cM values (negative values clamped to 0).
Notes
Uses linear interpolation with extrapolation at ends.
- read_simulations()[source]
Load paths (or parsed data if in_memory=True) for discoal simulations.
- Expects a directory structure:
<self.data>/neutral/.ms.gz <self.data>/sweep/.ms.gz <self.data>/params.txt.gz
- Returns:
tuple[OrderedDict, pl.DataFrame] – sims: OrderedDict({‘neutral’: <list|array>, ‘sweep’: <list|array>}) params: polars DataFrame with columns [‘model’, ‘s’, ‘t’, ‘saf’, ‘eaf’].
- Raises:
AssertionError – If self.data is not a string.
ValueError – If required ‘neutral’ or ‘sweep’ subfolders are missing or empty.
- read_vcf()[source]
Slide windows across a single-contig VCF/BCF and collect per-window data haplotypes
- Returns:
dict | OrderedDict –
- If in_memory=True:
{‘sweep’: list[[hap, rec_map, zeros(4)]], ‘region’: list[str]} where each element corresponds to a window with data.
- If in_memory=False:
OrderedDict({‘sweep’: <vcf_path>, ‘region’: list[str]}) with region strings to be used downstream for lazy reading.
- Raises:
IOError – If the file extension is not one of [.vcf, .vcf.gz, .bcf, .bcf.gz].
FileNotFoundError – If the VCF/BCF is not tabix-indexed (<path>.tbi is missing).
Notes
Determines contig name/length by reading last line of the vcf file.
Generates overlapping windows of size window_size and step step.
Uses multiprocessing when in_memory=True to read per-window haplotypes.
Feature vectors
- flexsweep.fv.summary_statistics(data_dir, vcf=False, nthreads=1, windows=[100000], step=100000.0, step_vcf=10000.0, locus_length=1200000, recombination_map=None, r_bins=None, min_rate=0.0, suffix=None, func=None, save_stats=False, only_normalize=False, stats=None)[source]
Compute summary statistics to create needed feature vectors for CNN training/prediction from either simulated or VCF/BCF input data.
This function dispatches automatically to the appropriate backend depending on the
vcfflag. Whenvcf=False(default) it processes mdiscoal simulations; whenvcf=Trueit processes VCF/BCF files.- Parameters:
data_dir (str) – Input directory. When
vcf=False, the root folder of simulation outputs containing subdirectoriesneutral/,sweep/, andparams.txt.gz. Whenvcf=True, a directory of bgzipped VCF/BCF files.vcf (bool) – Whether to use the VCF/BCF processing pipeline (
True) or the simulation pipeline (False). Default:False.nthreads (int) – Number of threads. Default:
1.windows (list) – List of window sizes (in base pairs) to compute summary statistics. Supports multi-scale: e.g.
[50000, 100000, 500000]. Default:[100000].step (int) – Step size (in base pairs) for sliding windows. Together with
locus_length, determines the center positions:centers = range(step//2, locus_length - step//2 + step, step). Default:1e5.locus_length (int) – Total locus length in base pairs. Default:
1200000.recombination_map (str) – Optional path to a recombination map. If
None, physical distances are used as a proxy for genetic distances. Default:None.suffix (str) – Optional suffix appended to output file names. Default:
None.func (callable) – Function used internally for computing summary statistics per replicate or genomic window. See
calculate_stats_simulationsorcalculate_stats_vcf_flat
- Returns:
Feature-vector DataFrame combining all computed summary statistics. Downstream pipelines may also write Parquet and normalization artifacts (e.g.,
fvs.parquet,empirical_bins.pickle).- Return type:
polars.DataFrame
- Raises:
FileNotFoundError – If input files or directories are missing.
ValueError – If an input file or directory is malformed or inconsistent.
- Notes:
Internally dispatches to
_process_vcf()or_process_sims()depending on thevcfflag.Center positions are derived automatically from
locus_lengthandstepas[step//2, locus_length - step//2]. For each center, stats are computed at every size inwindows, producingn_centers × len(windows)rows per replicate.Supports automatic normalization of features using empirical bins.
Designed for use as a top-level wrapper in the Flexsweep feature-vector generation pipeline.
- Examples:
From simulated data (discoal/ms format):
>>> df = summary_statistics("./simulations", nthreads=8)
Multi-scale windows:
>>> df = summary_statistics("./simulations", windows=[50000, 100000, 500000], nthreads=8)
From VCF data with recombination map:
>>> df = summary_statistics( ... "./vcf_data", ... vcf=True, ... recombination_map="recomb_map.csv", ... nthreads=8 ... )
- flexsweep.fv._process_sims(data_dir, nthreads, windows=None, step=100000.0, r_bins=None, suffix=None, func=None, save_stats=False, locus_length=1200000, stats=None)[source]
Process ms files from simulation to compute, normalize, and estimate feature vectors.
It scans a directory for simulation outputs (neutral and sweep), computes summary statistics for each replicate, normalizes the results between classes, and exports Parquet feature vectors to traing the CNN.
- Parameters:
data_dir (str) – Directory containing simulation output files (neutral and sweep). Expected substructure and file naming conventions follow the Flexsweep.Simulator class output (e.g.,
data_dir/sweeps/anddata_dir/neutral/).nthreads (int) – Number of threads.
windows (list) – List of window sizes (in base pairs) to compute summary statistics.
step (int) – Step size (in base pairs) for sliding windows.
suffix (str) – Optional suffix appended to output file names.
func (callable) – Function to estimate summary statistics. See
calculate_stats_simulations.
- Returns:
- A pair of Polars DataFrames:
df_pred: normalized feature vectors for CNN training.
df_pred_raw: raw feature vectors.
- Return type:
tuple[polars.DataFrame, polars.DataFrame]
- Raises:
FileNotFoundError – If no simulation files are found under
data_dir.ValueError – If simulation data are malformed or incompatible with the expected format.
- Notes:
- Writes the following files to
data_dir: fvs{suffix}.parquet– normalized feature vectors.fvs_raw{suffix}.parquet– raw feature vectors.empirical_bins{suffix}.pickle– empirical normalization bins.
- Writes the following files to
Neutral and sweep simulations are processed jointly to derive shared empirical normalization bins.
Designed for Flexsweep.Simulators class simulations folders but can handle compatible structures with proper naming.
- flexsweep.fv._process_vcf(data_dir, nthreads, windows=None, step=100000.0, step_vcf=10000, locus_length=1200000, recombination_map=None, r_bins=None, min_rate=None, suffix=None, func=None, save_stats=False, stats=None)[source]
- flexsweep.fv.calculate_stats_simulations(hap_data, _iter=1, center=None, windows=[100000], step=100000.0, locus_length=1200000, stats=None)[source]
- flexsweep.fv.calculate_stats_vcf_flat(vcf_file, region, center=None, windows=[100000], step=100000.0, _iter=1, recombination_map=None, nthreads=1, locus_length=1200000, stats=None, stat_func=None, stat_cols=None, compute_snp_stats=True, parallel_manager=None, isafe_region_size=2000000)[source]
Compute per-locus-window summary statistics from a VCF with O(N) window work.
Instead of computing stats for every (locus_window × center) pair — which would recompute each overlapping sub-window up to numSubWins times — this function:
Enumerates the unique set of absolute (center, window_size) positions across all locus windows.
Dispatches one joblib task per unique pair (via batch_windowed_stats_flat).
iSAFE is fully supported via non-overlapping region tiling. It divides the chromosome into non-overlapping regions of isafe_region_size bp (default 2 Mb) and runs run_isafe once per region. SNP scores are absolute-position keyed so they join correctly to any locus window that contains them.
Assembles output rows by looking up cached results per locus window.
- Parameters:
isafe_region_size (int) – Size in bp of non-overlapping regions used to compute iSAFE. Valid range 1e6–5e6. Default 2e6.
- flexsweep.fv.normalize_neutral(d_stats_neutral, vcf=False, df_r_bins=None)[source]
Calculates the expected mean and standard deviation of summary statistics from neutral simulations, used for normalization in downstream analyses.
This function processes a DataFrame of neutral simulation statistics, bins the values based on frequency, and computes the mean (expected) and standard deviation for each bin. These statistics are used as a baseline to normalize sweep or neutral simulations
- Parameters:
df_stats_neutral (list or pandas.DataFrame) – A list or concatenated pandas DataFrame containing the neutral simulation statistics. The DataFrame should contain frequency data and various summary statistics, including H12 and HAF, across multiple windows and bins.
- Returns:
expected (pandas.DataFrame) – A DataFrame containing the mean (expected) values of the summary statistics for each frequency bin. The frequency bins are the index, and the columns are the summary statistics.
stdev (pandas.DataFrame) – A DataFrame containing the standard deviation of the summary statistics for each frequency bin. The frequency bins are the index, and the columns are the summary statistics.
Notes
The function first concatenates the neutral statistics, if provided as a list, and bins the values by frequency using the bin_values function.
It computes both the mean and standard deviation for each frequency bin, which can later be used to normalize observed statistics (e.g., from sweeps).
The summary statistics processed exclude window-specific statistics such as “h12” and “haf.”
- flexsweep.fv.normalize_stats(stats_values, bins, region=None, center=[500000.0, 700000.0], windows=[50000, 100000, 200000, 500000, 1000000], step=10000.0, parallel_manager=None, nthreads=1, vcf=False, df_r_bins=None, locus_length=1200000)[source]
- flexsweep.fv.normalize_cut_raw(snps_values, bins, center=[500000.0, 700000.0], windows=[50000, 100000, 200000, 500000, 1000000], step=10000, df_r_bins=None)[source]
Sims-only refactor: - join df_r onto SNP and window tables (by iter) BEFORE binning/normalization - normalize using neutral bins conditional on (freq_bins, r_bins) when available - drop r_bins from outputs so feature dimension stays unchanged
- flexsweep.fv.resolve_stats(stats)[source]
Partition a user stat list into window_cols, snp_groups, compute_isafe, and snp_cols.
Returns
(window_cols, snp_groups, compute_isafe, snp_cols).snp_colsis the set of actual DataFrame column names to keep in SNP output, orNonewhen stats isNone(default = keep all).When stats is
None, returns defaults matching the full flexsweep pipeline.
Outlier scan
- flexsweep.scan.scan(vcf_path, out_prefix, stats, config=None, w_size=201, step=10, w_size_bp=1000000, step_bp=10000, min_maf=0.05, recombination_map=None, n_daf_bins=50, n_r_bins=None, nthreads=1, window_mode='auto', **kwargs) dict[str, DataFrame][source]
Standalone outlier scan from a directory of per-chromosome VCF files.
Uses a global task pool across ALL VCF files simultaneously: all chromosomes are pre-loaded, then one
Parallel(n_jobs=nthreads)call processes every stat × chromosome combination. This fully exploits nthreads regardless of how many chromosomes are present.- Parameters:
vcf_path – Directory containing
*.vcf.gz(or*.bcf.gz) files, one per chromosome/contig. Must be a directory; single-file input is not supported.out_prefix – Output file prefix. Writes
{out_prefix}.{stat}.txtfor each stat.stats – List of stat keys to compute. See
available_stats()for options. Per-SNP: ihs, nsl, isafe, dind, high_freq, low_freq, s_ratio, hapdaf_o, hapdaf_s. SNP-window: h12, garud, lassi, lassip, raisd. bp-window: tajima_d, pi, theta_w, fay_wu_h, zeng_e, achaz_y, fuli_f, fuli_f_star, fuli_d, fuli_d_star, neutrality, omega, beta, ncd.config – Per-stat parameter overrides, e.g.
{"raisd": {"window_size": 100}, "lassip": {"max_extend": 5e4}}. Overridesw_size,step, and any kwargs for that stat only.w_size – SNP-count window size for SNP-mode sliding-window stats (default 201).
step – SNP step for SNP-mode sliding-window stats (default 10).
w_size_bp – Physical window size in bp for bp-mode stats (default 1 Mb).
step_bp – Physical step size in bp for bp-mode stats (default 10 kb).
min_maf – Minimum minor allele frequency for iHS and nSL (default 0.05).
recombination_map – Path to recombination map TSV (chr, start, end, cm_mb, cm). If provided: genetic distances used for T3 stat windows, and frequency-sensitive stats (iHS, nSL, dind, …) are normalized by joint (DAF × recomb_rate) bins (Johnson et al. approach).
n_daf_bins – Number of equal-frequency DAF bins for normalization (default 50).
n_r_bins – Number of equal-frequency recombination rate bins for joint (DAF × r_bins) normalization.
None(default) → DAF-only normalization. Set to 10 (Johnson et al.) to enable joint normalization whenrecombination_mapis provided.nthreads – Total worker threads for the global task pool (default 1). Window-batchable stats split each chromosome into
nthreadswindow batches so every thread stays busy.window_mode – Override window mode for all sliding-window stats. “auto” (default) uses per-stat defaults from STAT_REGISTRY. “snp” forces SNP-count windows for all window stats. “bp” forces physical bp windows for all window stats.
kwargs – Shared overrides forwarded to all stats: max_extend, K_truncation, sweep_mode, raisd_window, tf, etc.
- Returns:
dict[str, polars.DataFrame] – Keys are stat names; each DataFrame has a
{rank_col}_pvaluecolumn. Files written to{out_prefix}.{stat}.txt(tab-separated).
- flexsweep.scan.stat_params(stat_key: str | None = None) dict[source]
Return parameter documentation for scan stats.
- Parameters:
stat_key (str, optional) – A specific stat key (e.g.
"ihs","hscan"). If None, returns the full table for all stats.- Returns:
dict – Keys are stat names. Each value is a dict with keys
rank_col,resolution,window_mode,default_window,default_step,shared_params, andstat_params.
Notes
Shared params always injected by
scan()for every stat:w_size(201),step(10),w_size_bp(1000000),step_bp(10000),min_maf(0.05),window_mode(“auto”).Per-stat params are passed via
config={"stat": {"param": value}}.Examples
>>> from flexsweep.scan import stat_params >>> stat_params("hscan") >>> stat_params() # full table
- flexsweep.scan.empirical_pvalues(df: DataFrame, stat_col: str, abs_rank: bool = False) DataFrame[source]
Empirical p-value following the empirical outlier approach (Akey 2009).
p = rank(−value, na.last=keep) / N_valid Small p → outlier/candidate (locus in extreme upper tail of the empirical distribution). NaN → null in output (excluded from ranking and from N_valid denominator).
abs_rank=True: rank by
abs(value)before negating — for signed stats where large magnitude in either direction signals selection (iHS, nSL, Tajima’s D).
CNN / DANN
Enrichment
- flexsweep.enrichment.run_enrichment(sweep_files: list, gene_set: str, factors_file: str, annotation_file: str, populations: list, groups: list, thresholds: list, pop_interest: str = 'All', cluster_distance: int = 500000, n_runs: int = 10, tolerance: float = 0.05, min_distance: int = 1250000, flip: bool = False, max_rep: int = 25, nthreads: int = 1, n_shuffles: int = 8, shuffling_segs: int = 2, bootstrap_dir: str = None, distance_file: str = None) list[source]
Run the full gene-set sweep enrichment pipeline.
Steps:
Load gene set and derive valid gene universe (factors ∩ sweep_files[0]).
Bootstrap control sets matched on confounding factors (or load pre-computed sets from
bootstrap_dir).For each sweep file: count sweep overlaps, shuffle genome for null distribution, and estimate FDR.
- Parameters:
sweep_files – Paths to sweep rank files (gene_id + per-population rank columns, tab- or space-separated, optionally gzipped).
gene_set – TSV with
gene_idandyes/nolabel columns (no header). Genes labelledyesare VIPs.factors_file – TSV confounding factors file (gene_id + factor columns, no header).
annotation_file – BED gene coordinates file (0-based, no header).
populations – Population codes matching sweep file column order.
groups – Group label per population (same length as populations).
thresholds – Rank cutoffs for enrichment curve (e.g. [6000, …, 20]).
pop_interest – Population, group name, or
'All'for FDR scope.cluster_distance – Max bp between genes to count as neighbours.
n_runs – Bootstrap batches.
tolerance – Allowed ± fraction deviation in factor averages for control gene matching.
min_distance – Minimum bp distance from VIPs for control eligibility.
flip – Flip test direction when
len(VIPs) > len(controls)(increases power).max_rep – Max average resamples per control gene across bootstrap sets.
nthreads – Parallel workers (joblib).
n_shuffles – FDR shuffle replicates (must be a multiple of 8).
shuffling_segs – Genes per genomic shuffle segment.
bootstrap_dir – Folder with pre-computed
VIPs/andnonVIPs/sub-dirs. When non-empty, the bootstrap step is skipped.distance_file – Optional pre-computed distance TSV (gene_id t distance, no header). Passed to
run_bootstrap_nomatchomegato bypass internal distance computation and match Perl’s control pool exactly.
- Returns:
List of FDR DataFrames, one per entry in
sweep_files.
- flexsweep.enrichment.run_bootstrap_nomatchomega(case_genes: list, factors_file: str, annotation_file: str, runs_number: int, tolerance: float, min_dist: float, flip: bool, max_rep: int, seed: int = None, nthreads: int = 1, control_genes: list = None, distance_file: str = None, skip_factor_idx: int = 8, offset_factor_idx: int = 27, offset_value: float = 2.0, n_batches: int = 10)[source]
Step 6: nomatchomega_fast variant of the matched bootstrap control set generator.
- Implements the behaviour of bootstrap_test_script_nomatchomega_fast.pl:
Factor skip_factor_idx (default 8, omega/dN/dS) is unconstrained.
Factor offset_factor_idx (default 27, Perl column 28) gets +offset_value.
n_batches sequential internal batches per parallel call, with continuous state (used_counts, current_means, fake_seed persist across batches), matching Perl’s within-process continuous state.
Target for batch r: 100 + 11 * vip_number * r (grows each batch).
Total control sets = runs_number * n_batches * 10.
- Parameters:
case_genes (list of VIP gene IDs (HLA/histone excluded by caller))
factors_file (path to confounding factors table (TSV, first col gene_id))
annotation_file (path to BED gene coordinates file (chrom start end gene_id strand). Distances to nearest VIP are computed automatically via compute_distances.)
runs_number (number of independent parallel calls (≡ Perl Runs_number))
tolerance (allowed ± fraction for confounder matching (e.g. 0.05))
min_dist (minimum distance (bp) from VIPs for control eligibility)
flip (if True, swap VIP and control pools)
max_rep (max times a control gene may be sampled on average)
nthreads (number of parallel workers)
control_genes (optional explicit candidate control gene IDs. If None, all genes in factors_file that are not case genes are used. Pass the “no”-labelled genes from genes_set_file to match Perl behaviour exactly.)
distance_file (optional path to pre-computed distance TSV (gene_id t distance, no header). When provided, distances are read directly instead of calling compute_distances(). Use Perl’s distance_genes_set_file.txt for exact control pool matching.)
skip_factor_idx (0-based factor index to leave unconstrained (default 8 = Perl factor 9, dN/dS / omega).)
offset_factor_idx (0-based factor index to apply +offset_value to (default 27 = Perl column 28).)
offset_value (value added to offset_factor_idx column (default 2.0).)
n_batches (sequential internal batches per parallel call (≡ Perl Iterations_number/10; default 10).)
- Returns:
df_set (pl.DataFrame — VIP genes (gene_id column))
df_bootstrap_control (pl.DataFrame — rows = control sets, cols = gene positions (sample_id + column_0..column_N))
- flexsweep.enrichment.estimate_fdr(real_results: DataFrame, shuffled_sweep_files_by_shuffle: list, vip_genes: list, control_sets: list, neighbors: DataFrame, thresholds: list, populations: list, groups: list, pop_interest: str, simplified_sweeps_by_pop_fn=None, count_sweeps: bool = False, use_clust: bool = True, nthreads: int = 1, p_cutoff: float = 0.05, max_threshold: int = None, min_threshold: int = 0, min_vip_count: int = 0) DataFrame[source]
Step 9: Estimate FDR by comparing real results to a null distribution built from genome-shuffled sweep files (output of shuffle_genome).
Matches the behaviour of estimate_FPR.pl: for each scope (population / group / “All”), the test statistic is sum(vip_count − ctrl_mean) across rows that pass the significance and threshold filters. FDR p-value = fraction of null replicates where that statistic ≥ the real statistic.
- Parameters:
real_results (pl.DataFrame — output of count_sweeps_multipop on the real data (columns: scope, threshold, vip_count, ctrl_mean, p_value))
shuffled_sweep_files_by_shuffle (list of lists — each inner list holds paths for one shuffle replicate. Layout A: one file per population [[esn_s1, gwd_s1, …], …]. Layout B: one multi-population file per replicate [[multi_s1], [multi_s2], …].)
vip_genes (list of VIP gene IDs)
control_sets (control sets from run_bootstrap)
neighbors (pl.DataFrame from build_gene_neighbors_numpy)
thresholds (list of int thresholds)
populations (ordered population list)
groups (group label per population)
pop_interest (single pop, group name, or “All”)
simplified_sweeps_by_pop_fn (callable(sweep_file, col_index) → pl.DataFrame, or callable(sweep_file) → dict[pop, pl.DataFrame] for multi-pop files. Defaults to simplify_sweeps_gz.)
use_clust (expand by neighbours before counting)
nthreads (parallel workers)
p_cutoff (p-value threshold for “significant” rows in the test statistic (Perl: cutoff, default 0.05))
max_threshold (upper bound on rank threshold included in the statistic (Perl: limit; None = no limit))
min_threshold (lower bound on rank threshold (Perl: cutoff2, default 0))
min_vip_count (minimum vip_count for a row to contribute to the statistic (Perl: enough, default 0))
- Returns:
pl.DataFrame with columns (scope, p_value, n_replicates, real_stat, max_null_stat. One row per scope. real_stat = sum(vip_count − ctrl_mean) for significant rows. p_value = fraction of null replicates where null_stat ≥ real_stat.)
- flexsweep.enrichment.count_sweeps_singlepop(vip_genes: list, control_sets: list, simplified_sweeps: DataFrame, neighbors: DataFrame, thresholds: list, use_clust: bool = True, count_sweeps: bool = False, neighbor_col: str = None) DataFrame[source]
Count sweeps overlapping VIPs vs control sets for a single population.
- Parameters:
vip_genes (list of VIP gene ID strings)
control_sets (list of lists of control gene IDs (one list per control set))
simplified_sweeps (pl.DataFrame from simplify_sweeps_gz (gene_id, selected_cutoff))
neighbors (pl.DataFrame from build_gene_neighbors_numpy (gene_id, neighbors))
thresholds (sorted list of int rank thresholds)
use_clust (expand gene sets by neighbours before counting (de-duplication))
count_sweeps (if True, count distinct sweep events (connected components of the neighbour graph within sweep genes) rather than individual genes.)
neighbor_col (column name for neighbours (auto-detected if None))
- Returns:
pl.DataFrame with columns (threshold, vip_count, ctrl_mean, ctrl_ci_lo, ctrl_ci_hi, ratio, ci_lo_ratio, ci_hi_ratio, p_value)
- flexsweep.enrichment.count_sweeps_multipop(vip_genes: list, control_sets: list, simplified_sweeps_by_pop: dict, neighbors: DataFrame, thresholds: list, populations: list, groups: list, pop_interest: str, count_sweeps: bool = False, use_clust: bool = True, nthreads: int = 1) DataFrame[source]
Step 7: Multi-population sweep counting with group aggregation and adaptive depth.
Populations are processed in parallel. For a group or ‘All’, the sweep sets of member populations are unioned before counting (each sweep counted once per group, not once per population). Adaptive depth escalates from 100 → 1000 → all control sets depending on observed p-values.
- Parameters:
vip_genes (list of VIP gene IDs)
control_sets (list of lists of control gene IDs (up to 10 000))
simplified_sweeps_by_pop (dict mapping population name → pl.DataFrame (output of simplify_sweeps_gz per pop))
neighbors (pl.DataFrame from build_gene_neighbors_numpy)
thresholds (list of int rank thresholds)
populations (ordered list of population codes)
groups (group label for each population (same length))
pop_interest (single pop name, group name, or “All”)
count_sweeps (if True count distinct sweep events (connected components of the neighbour graph within sweep genes); if False count genes in sweeps (default; matches Perl behaviour))
use_clust (expand gene sets by neighbours before counting (de-duplication; maps to Perl Count_sweeps parameter))
nthreads (parallel workers)
- Returns:
pl.DataFrame with columns (scope, threshold, vip_count, ctrl_mean, ctrl_ci_lo, ctrl_ci_hi, ratio, ci_lo_ratio, ci_hi_ratio, p_value. scope is the population name, group name, or “All”.)
Ancestral polarization
Flexsweep includes a Rust CLI (flexsweep-polarize) for annotating the
ancestral allele in a VCF from a multi-species alignment (MAF). The Python
wrappers below invoke the compiled binary; the binary must be built first with
build_rust_polarization() or installed via conda-forge.
Workflow:
Sort the MAF by reference contig and position (
run_sort_maf).Polarize the VCF using the sorted MAF and one or more outgroup species (
run_polarize). The binary streams both files together and writes a bgzipped VCF with REF/ALT swapped where the outgroup consensus supports the alternate allele as ancestral.
Polarization methods (--method):
parsimony— majority vote over outgroup alleles; no model fitting.jc— Jukes–Cantor substitution model.kimura— Kimura two-parameter model (default).r6— General reversible model (6 free parameters); slowest, most accurate.
Visualization
- flexsweep.utils.plot_scan(stats, stat_cols=None, pvalue: bool = False, top_pct: float = 0.01, threshold_lines: list | None = None, out: str | None = None, figsize: tuple | None = None, title: str | None = None, sharey: bool = False, chrom: str | None = None, center: int | None = None, window_bp: int = 500000) Figure[source]
Genome-wide or regional scan plot — single stat or stacked multi-stat.
Accepts output from
scan()(dict) or paths to{prefix}.{stat}.txtfiles.Genome-wide mode (default,
chrom=None): Single stat → 1-panel Manhattan. Multiple stats → stacked panels, shared x-axis.pvalue=Falseplots raw values with bipolar colouring for signed stats.pvalue=Trueplots-log10(p_emp)with threshold lines at p = 0.01 / 0.001.Zoom mode (
chrom+centerprovided): n rows × 2 columns: raw stat (left) |-log10(p_emp)(right). Filtered to[center ± window_bp]. Orange dashed line marks the centre.- Parameters:
stats (dict | str | list[str]) –
dictfromscan()— keys are stat names, values are DataFrames.str— path to a single{stat}.txtscan output file.list[str]— paths to multiple scan output files.
stat_cols (str | list[str], optional) – Stat column name(s). Required when
statsis a file path / list of paths. Whenstatsis a dict, defaults to all keys.pvalue (bool) – Genome-wide mode only. If True, plot
-log10(p_emp)with y=[0,10] and threshold lines. If False, plot raw stat values with bipolar colouring.top_pct (float) – Fraction highlighted as outliers in genome-wide raw mode.
threshold_lines (list of (y_value, linestyle, label), optional) – Horizontal lines.
pvalue=Truedefault: y=2 dashed (p=0.01), y=3 solid. Pass[]to suppress.out (str, optional) – Save path. If None, shows interactively.
figsize (tuple, optional) – Genome-wide default: (14, 4) / (14, 3×n). Zoom default: (10, 2.5×n).
title (str, optional) – Title for single-stat genome-wide plots.
sharey (bool) – Share y-axis across panels in genome-wide mode (default False).
chrom (str, optional) – Chromosome for zoom mode (e.g.
"22"or"chr22").center (int, optional) – Centre position (bp) for zoom mode.
window_bp (int) – Half-window size in bp for zoom mode (default 500 000 = ±500 kb).
- flexsweep.utils.plot_manhattan(prediction, recombination_map: str | None = None, eps: float = 1e-10, chr_col: str | None = None, pos_col: str | None = None, p_col: str | None = None, log_transform: bool = True, threshold_lines: list | None = None, figsize: tuple = (14, 5), out: str | None = None, title: str | None = None, chrom: str | int | None = None, center: int | None = None, window_bp: int = 5000000, chr_prefix_pattern: str = '^chr')[source]
- flexsweep.utils.plot_sweep_density(prediction, output_path=None)[source]
Histogram of per-window sweep probability split by chromosome.
- Parameters:
prediction (pl.DataFrame | str) – Output of CNN.predict() or path to parquet/CSV with columns chr, start, end, prob_sweep.
output_path (str, optional) – If given, saves the figure as SVG.
- Returns:
fig (matplotlib.figure.Figure)
- flexsweep.utils.plot_fv_pca(train_data, empirical_data, subsample=5000, output_path=None)[source]
PCA of the feature vector matrix colored by neutral/sweep label.
- Parameters:
train_data (str | pl.DataFrame) – Path to fvs*.parquet or already-loaded DataFrame. Must have a ‘model’ column.
subsample (int) – Max rows to use (avoids slow PCA on very large datasets). Default 5000.
output_path (str, optional) – If given, saves SVG.
- Returns:
fig (matplotlib.figure.Figure)
- flexsweep.utils.plot_stat_distributions(train_data, empirical_data=None, stats=['dind', 'dist_kurtosis', 'dist_skew', 'dist_var', 'h1', 'h12', 'h2_h1', 'haf', 'hapdaf_o', 'hapdaf_s', 'high_freq', 'ihs', 'isafe', 'k_counts', 'low_freq', 'max_fda', 'nsl', 'omega_max', 'pi', 's_ratio', 'tajima_d', 'theta_h', 'theta_w', 'zns'], output_path=None)[source]
Violin plots of feature stats split by neutral/sweep (and optionally empirical).
- Parameters:
train_data (str | pl.DataFrame) – Training fvs*.parquet — must have a ‘model’ column.
empirical_data (str | pl.DataFrame, optional) – Empirical fvs*.parquet — no ‘model’ column; plotted as a third distribution.
stats (list[str], optional) – Stat base names to plot (e.g. [‘pi’, ‘h12’, ‘ihs’]). Default: one representative column per unique stat base name.
output_path (str, optional) – Save path for SVG.
- Returns:
fig (matplotlib.figure.Figure)
- flexsweep.utils.rank_probabilities(prediction, feature_coordinates, rank_distance=False, k=111)[source]
Goal: match the original pybedtools/bedtools output exactly, including tie order.
- Strategy for exact tie-order:
Create an explicit, deterministic per-gene input order (gene_order) from the sorted gene table (chr,start). This mirrors the order bedtools processes A.
Carry gene_order through the pipeline and use it as the final tie-breaker in the final sort (stable + deterministic).
- Strategy for bedtools closest -k parity:
We emulate -k K by sorting hits per gene by (d, chrom_pred, start_pred, end_pred) and taking the first K.
This gives deterministic selection when multiple windows share the same distance.
Population genetics statistics
These are the low-level functions called internally by the feature-vector pipeline and the scan module. They can also be called directly on numpy arrays.
Site Frequency Spectrum (SFS)
- flexsweep.fv.neutrality_stats(ac, positions)[source]
# Numba unified call to compute SFS neutrality statistics. Trying to avoid numba overhead as much as possible
- Returns a length-12 array:
[0] tajima_d [1] theta_h (scaled by sequence length if positions provided) [2] h_raw (Fay & Wu’s H = pi - theta_h, unscaled) [3] h_norm (normalized Fay & Wu’s H) [8] pi (mean pairwise diversity, absolute) [9] theta_w (Watterson’s theta, absolute) [10] theta_w_per_base (nan if < 2 positions) [11] pi_per_base (nan if < 2 positions)
- flexsweep.fv.sfs_nb(dac, n)[source]
Site-frequency spectrum (SFS) from derived-allele counts.
- Parameters:
dac (numpy.ndarray) – 1D array of derived allele counts per site, values in
[0..n].n (int) – Total number of chromosomes. If
n <= 0, it is inferred asmax(dac).
- Returns:
Integer array of length
n+1;sfs[k]is the number of sites withkderived copies.- Return type:
numpy.ndarray
- flexsweep.fv.theta_pi(ac)[source]
Per-site nucleotide diversity (π) from allele counts.
For each site
j, computespi_j = 2 * a_j * (n - a_j) / [n * (n - 1)], wherea_jis the derived allele count andnis the total number of chromosomes.- Parameters:
ac (numpy.ndarray) – Allele counts array of shape
(S, 2)(ancestral, derived), with constantnacross sites.- Returns:
Array of per-site π values of length
S.- Return type:
numpy.ndarray
- flexsweep.fv.tajima_d(ac, min_sites=3)[source]
Tajima’s D from allele counts.
Compares the mean pairwise difference (sum of per-site π) to the Watterson estimator based on the number of segregating sites. Returns
nanif the number of segregating sites is belowmin_sites.- Parameters:
ac (numpy.ndarray) – Allele counts array of shape
(S, 2)(ancestral, derived), with constantnacross sites.min_sites (int) – Minimum required number of segregating sites. Default
3.
- Returns:
Tajima’s D as a float (
nanif insufficient sites).- Return type:
float
- flexsweep.fv.achaz_y(ac)[source]
Achaz’s Y neutrality test (standardized).
Unfolded/polarized form — requires ancestral-state information (outgroup or polarization). Excludes ξ₁ (derived singletons) from both estimators.
Reference: Achaz 2008, Appendix B, Equations B28–B30. f = (n-2) / (n·(a_n-1)); Var[Y] = α_n·θ + β_n·θ²
- Parameters:
ac (numpy.ndarray) – Allele counts array of shape
(S, 2)(ancestral, derived), constantn. Derived allele counts inac[:,1]define the unfolded SFS.- Returns:
Standardized Achaz’s Y as a float; returns
nanifn < 3or if there are no segregating sites excluding singletons.- Return type:
float
- flexsweep.fv.achaz_y_star(ac)[source]
Achaz’s Y* neutrality test (standardized).
Folded form — does not require ancestral-state polarization. Excludes η₁ = ξ₁ + ξ_{n-1} (minor-allele singletons at frequency 1/n or (n-1)/n) from both the π and S estimators.
Reference: Achaz 2008, Appendix B, Equations B19–B21. f* = (n-3) / (a_n·(n-1) - n); Var[Y*] = α*_n·θ + β*_n·θ²
- Parameters:
ac (numpy.ndarray) – Allele counts array of shape
(S, 2)(ancestral, derived), constant n.- Returns:
Standardized Achaz Y* as a float; nan if n < 4 or no valid sites.
- Return type:
float
- flexsweep.fv.achaz_t(ac, decay=0.9)[source]
Achaz’s T_Ω neutrality test (Achaz 2009 Eq. 9).
Unfolded/polarized SFS. Uses exponential weight ω₁ᵢ = e^{−0.9·i} vs uniform ω₂ᵢ = 1. Sensitive to excess low-frequency polymorphisms such as those produced by severe bottlenecks.
Variance coefficients αₙ/βₙ are precomputed once per sample size via
_achaz_t_coeffs(lru_cache), so per-window cost is O(n).α = 0.9 is empirical (Achaz 2009 p.254): gives positive Ωᵢ only for i/n ≤ 0.13. θ² estimated as S(S-1)/(a₁²+a₂) following Fu (1995).
- Parameters:
ac (numpy.ndarray) – Allele counts array of shape
(S, 2)(ancestral, derived), constant n. Derived allele counts inac[:,1]define the unfolded SFS.decay (float) – Exponential decay α (default 0.9 per Achaz 2009).
- Returns:
Standardized T_Ω as a float; nan if n < 3 or no segregating sites.
- Return type:
float
- flexsweep.fv.fay_wu_h_norm(ac, positions=None)[source]
Fay & Wu’s H and its normalized form (single-population, infinite sites).
- Computes:
theta_h: estimator that upweights high-frequency derived alleles.h = pi - theta_h(Fay & Wu’s H).h_norm: normalized H using variance terms.
- Parameters:
ac (numpy.ndarray) – Allele counts array of shape
(S, 2)(ancestral, derived), constantn.positions (numpy.ndarray) – Optional positions (length
S). If provided,theta_his divided by the accessible spanpositions[-1] - (positions[0] - 1).
- Returns:
Tuple
(theta_h, h, h_norm)as floats.- Return type:
tuple[float, float, float]
- flexsweep.fv.zeng_e(ac)[source]
Zeng’s E statistic (single-population, infinite sites), standardized.
Contrasts Watterson’s estimator with a linear SFS component related to high-frequency derived signal. Useful alongside Tajima’s D and Fay & Wu’s H.
- Parameters:
ac (numpy.ndarray) – Allele counts array of shape
(S, 2)(ancestral, derived), constantn.- Returns:
Standardized Zeng’s E as a float.
- Return type:
float
- flexsweep.fv.fuli_f_star(ac)[source]
Fu and Li’s F* (starred) statistic (no outgroup required).
Focuses on deviations in the singleton class of the (folded) SFS, contrasting singleton abundance with diversity (π). The starred form does not require ancestral state polarization.
- Parameters:
ac (numpy.ndarray) – Allele counts array of shape
(S, 2)(ancestral, derived), constantn.- Returns:
Fu & Li’s F* as a float.
- Return type:
float
- flexsweep.fv.fuli_f(ac)[source]
Fu and Li’s F statistic (polarized).
Uses singleton counts and diversity (π); typically assumes derived states are known (e.g., via outgroup).
- Parameters:
ac (numpy.ndarray) – Allele counts array of shape
(S, 2)(ancestral, derived), constantn.- Returns:
Fu & Li’s F as a float.
- Return type:
float
- flexsweep.fv.fuli_d_star(ac)[source]
Fu and Li’s D* (starred) statistic (no outgroup required).
Compares the number of segregating sites against singleton counts in the folded spectrum. The starred form does not require ancestral state polarization.
- Parameters:
ac (numpy.ndarray) – Allele counts array of shape
(S, 2)(ancestral, derived), constantn.- Returns:
Fu & Li’s D* as a float.
- Return type:
float
- flexsweep.fv.fuli_d(ac)[source]
Fu and Li’s D statistic (polarized form).
Uses the total number of segregating sites and singletons; typically assumes derived states are known (e.g., via outgroup) to define singletons.
- Parameters:
ac (numpy.ndarray) – Allele counts array of shape
(S, 2)(ancestral, derived), constantn.- Returns:
Fu & Li’s D as a float.
- Return type:
float
Haplotype-based
- flexsweep.fv.ihs_ihh(h, pos, map_pos=None, min_ehh=0.05, min_maf=0.05, include_edges=False, gap_scale=20000, max_gap=200000, is_accessible=None, use_threads=False)[source]
Compute iHS (integrated Haplotype Score) and delta iHH from haplotypes.
The routine integrates EHH (extended haplotype homozygosity) on both sides of each focal SNP to obtain iHH for ancestral and derived alleles, and then reports iHS (log ratio) and the absolute difference in iHH (
delta_ihh).- Parameters:
h (numpy.ndarray) – Haplotype matrix of shape
(n_snps, n_haplotypes)with 0/1 values, where rows are SNPs and columns are haplotypes.pos (numpy.ndarray) – Physical positions for SNPs (length
n_snps). Used for gap handling and, whenmap_posisNone, for integration spacing.map_pos (numpy.ndarray) – Optional genetic map positions (same length as
pos). If provided, integration uses these coordinates instead ofpos. DefaultNone.min_ehh (float) – Minimum EHH value to include in the integration. Default
0.05.min_maf (float) – Minimum minor-allele frequency required to compute iHS at a SNP. Default
0.05.include_edges (bool) – If
True, permit edge SNPs to contribute even when EHH dips belowmin_ehh. DefaultFalse.gap_scale (int) – Scaling used for gaps between consecutive SNPs when integrating over physical distance (ignored if
map_posis provided). Default20000.max_gap (int) – Maximum gap allowed when integrating; larger gaps are capped to
max_gapto avoid overweighting sparse regions. Default200000.is_accessible (numpy.ndarray) – Optional boolean mask (length
n_snps) indicating accessible SNPs. IfNone, all SNPs are considered accessible. DefaultNone.use_threads (bool) – Enable threaded computation in downstream primitives when available. Default
False.
- Returns:
Polars DataFrame with columns:
positions(physical position),daf(derived allele frequency),ihs(log iHH ratio), anddelta_ihh(absolute difference between derived and ancestral iHH).- Return type:
polars.DataFrame
- Raises:
ValueError – Propagated if inputs are inconsistent in length or malformed.
Note
SNPs that fail the MAF threshold or have invalid iHS values are omitted from the returned table.
- flexsweep.fv.haf_top(hap, pos, cutoff=0.1, start=None, stop=None, window_size=None, n_snps=None)[source]
Compute the upper-tail HAF (Haplotype Allele Frequency) summary in a region.
Rows of
hapare SNPs, columns are haplotypes. HAF values are computed per SNP from haplotypes, then restricted to the specified genomic region (start/stoporwindow_size) if given. The HAF values are sorted and the top portion after trimming bycutoffis summed.- Parameters:
hap (numpy.ndarray) – Haplotype matrix of shape
(n_snps, n_haplotypes)with 0/1 values.pos (numpy.ndarray) – Physical positions for SNPs (length
n_snps).cutoff (float) – Proportion used for tail trimming. For example,
0.1trims the lowest 10% and the highest 10% before summing the remaining HAF values. Default0.1.start (float) – Optional region start position (inclusive). Default
None.stop (float) – Optional region end position (inclusive). Default
None.window_size (int) – Optional window size in base pairs centered by the caller’s convention. If provided, it can be used to define the region when
start/stopare not specified. DefaultNone.n_snps (int) – Optional limit on the number of SNPs considered by certain strategies (implementation-dependent). Default
1001.
- Returns:
Upper-tail HAF summary as a single float after trimming by
cutoff.- Return type:
float
- Raises:
ValueError – Propagated if inputs are malformed or if no SNPs fall within the region.
Note
If neither
start/stopnorwindow_sizeis provided, the computation uses all SNPs inhap/pos.
- flexsweep.fv.garud_h(h)[source]
Compute the H1, H12, H123 and H2/H1 statistics for detecting signatures of soft sweeps, as defined in Garud et al. (2015).
- Parameters:
h (array_like, int, shape (n_variants, n_haplotypes)) – Haplotype array.
- Returns:
h1 (float) – H1 statistic (sum of squares of haplotype frequencies).
h12 (float) – H12 statistic (sum of squares of haplotype frequencies, combining the two most common haplotypes into a single frequency).
h123 (float) – H123 statistic (sum of squares of haplotype frequencies, combining the three most common haplotypes into a single frequency).
h2_h1 (float) – H2/H1 statistic, indicating the “softness” of a sweep.
- flexsweep.fv.h12_enard(hap, positions, focal_coord=600000, n_snps=None, window_size=500000, min_derived_freq=0.05, similarity_threshold=0.8, top_k=10)[source]
Estimate Garud’s
H12, H2/H1, H1around a focal coordinate, grouping haplotypes that are at least a given identity threshold (default 80%).The method builds a count-based, symmetric SNP window centered at
focal_coord, constructs a haplotype matrixHfor the selected SNPs, collapses identical haplotypes (columns), orders the unique haplotypes by frequency (descending) and lexicographic order (ascending), selects a set of representative haplotypes, and then merges representatives into similarity groups whenever the column-wise identity meets or exceedssimilarity_threshold(0.8by default). Haplotype group frequencies are then used to compute the H12 family of statistics.Identity between two haplotype columns is defined as:
\[\text{identity} = \frac{\#(1,1)}{\#(1,1) + \#(1,0) + \#(0,1)}\](i.e., matches on the derived allele over all non-equal-or-derived comparisons).
- Parameters:
hap (numpy.ndarray) – Haplotype matrix of shape
(L_total, n)with 0/1 values (ancestral/derived). Rows are SNPs; columns are haplotypes.positions (numpy.ndarray) – 1D array (length
L_total) of genomic coordinates (int64) aligned tohaprows.focal_coord (int) – Genomic coordinate used to center the SNP window. Default
600000.n_snps (int) – Target number of SNPs for the window (the focal SNP, if present, is excluded from the returned set). Default
1001.min_derived_freq (float) – Minimum derived-allele frequency required for a SNP to enter the window (inclusive; upper bound is
1.0). Default0.05.similarity_threshold (float) – Column similarity threshold for grouping haplotypes. Two representative columns are merged if their identity fraction (formula above) is ≥ this value. Default
0.8(80% identity).top_k (int) – Limit controlling how many unique-haplotype representatives are considered before grouping. Default
10.
- Returns:
Tuple
(H12, H2_H1, H2)as floats. If no usable SNPs or groups are found, returns(0.0, 0.0, 0.0).- Return type:
tuple[float, float, float]
\[\begin{split}H1 = \sum_g p_g^2,\qquad H12 = (p_1 + p_2)^2,\qquad H2 = H1 - p_1^2,\qquad \frac{H2}{H1} = \begin{cases} (H1 - p_1^2)/H1, & H1 \ne 0,\\ 0, & H1 = 0~. \end{cases}\end{split}\]Notes
The SNP window is built by balancing sites to the left and right of
focal_coordby proximity, after applying the derived-frequency filter[min_derived_freq, 1.0], and excluding the focal position itself.Unique haplotypes are detected via hashing of
Hcolumns; sample-to-group frequencies \(p_g\) are computed after the identity-based grouping step.The default behavior corresponds to H12 with 80% identity grouping in the haplotype matrix, which can increase robustness by merging highly similar haplotypes.
- flexsweep.fv.hscan(hap, pos, focal_pos=None, max_gap=1000000000, dist_mode=0, step=1, left_bound=0, right_bound=1000000000, return_pairs=False)[source]
- flexsweep.fv.run_isafe(hap, positions, max_freq=1, min_region_size_bp=49000, min_region_size_ps=300, ignore_gaps=True, window=300, step=150, top_k=1, max_rank=15)[source]
Estimate iSAFE or SAFE on a genomic region following Flex-sweep default values.
The function removes monomorphic SNPs, then checks region size. If
num_snps <= min_region_size_psorpositions.max() - positions.min() < min_region_size_bp, it computes SAFE; otherwise it computes iSAFE using the provided sliding-window settings. Results are returned as a Polars DataFrame with columnspositions(bp),daf(derived allele frequency), andisafe(score). Variants withdaf >= max_freqare filtered out.- Parameters:
hap (numpy.ndarray) – Haplotype matrix of shape
(n_snps, n_haplotypes)with 0/1 values (ancestral/derived).positions (numpy.ndarray) – 1D array of physical coordinates (length
n_snps) aligned tohaprows.max_freq (float) – Maximum allowed derived allele frequency in the output (
daf < max_freq). Default1(no filter).min_region_size_bp (int) – Minimum region span in base pairs required to run iSAFE. Default
49000.min_region_size_ps (int) – Minimum number of polymorphic SNPs required to run iSAFE. Default
300.ignore_gaps (bool) – Reserved for gap handling; currently not used. Default
True.window (int) – iSAFE sliding window size (number of SNPs or bp, depending on the downstream implementation). Default
300.step (int) – iSAFE step between windows. Default
150.top_k (int) – iSAFE parameter controlling the number of top candidates per window. Default
1.max_rank (int) – iSAFE parameter controlling the maximum rank to track. Default
15.
- Returns:
Polars DataFrame with columns
positions(int),daf(float), andisafe(float), sorted by position and filtered todaf < max_freq. If the region is small, theisafecolumn contains SAFE scores.- Return type:
polars.DataFrame
Note
Monomorphic sites are removed using
(1 - f) * f > 0, wherefis the derived allele frequency per SNP. When computing iSAFE, the function passeswindow,step,top_k, andmax_rankto the underlying implementation.
Derived-background diversity
- flexsweep.fv.dind_high_low(hap, ac, rec_map, max_ancest_freq=0.25, min_tot_freq=0, min_focal_freq=0.25, max_focal_freq=0.95, window_size=50000, genetic_distance=False)[source]
Compute DIND, highfreq, and lowfreq statistics per focal SNP.
- Parameters:
hap (numpy.ndarray) – Haplotype matrix
(n_snps, n_samples)with 0/1 values.ac (numpy.ndarray) – Allele counts
(n_snps, 2)as[ancestral, derived].rec_map (numpy.ndarray) – Map array; penultimate column is the window coordinate.
max_ancest_freq (float) – Threshold used in high/low frequency components. Default
0.25.min_tot_freq (float) – Unused here (kept for API symmetry). Default
0.min_focal_freq (float) – Minimum focal derived frequency. Default
0.25.max_focal_freq (float) – Maximum focal derived frequency. Default
0.95.window_size (int) – Window size in coordinate physical units from
rec_map[:, -2]. Default50000.genetic_distance (bool) – Unused here (kept for API symmetry).
- Returns:
DataFrame with columns
positions,daf,dind,high_freq,low_freq.- Return type:
polars.DataFrame
- flexsweep.fv.s_ratio(hap, ac, rec_map, max_ancest_freq=1, min_tot_freq=0, min_focal_freq=0.25, max_focal_freq=0.95, window_size=50000, genetic_distance=False)[source]
Compute the S-ratio statistic for each focal SNP.
For each focal SNP (derived frequency in
[min_focal_freq, max_focal_freq]), neighbors withinwindow_sizeare summarized by indicators of intermediate frequency on the derived and ancestral partitions, and the ratio of their counts is reported.- Parameters:
hap (numpy.ndarray) – Haplotype matrix
(n_snps, n_samples)with 0/1 values.ac (numpy.ndarray) – Allele counts
(n_snps, 2)as[ancestral, derived].rec_map (numpy.ndarray) – Map array; penultimate column is the window coordinate.
max_ancest_freq (float) – Maximum ancestral-partition frequency threshold. Default
1.min_tot_freq (float) – Minimum neighbor total derived frequency. Default
0.min_focal_freq (float) – Minimum focal derived frequency. Default
0.25.max_focal_freq (float) – Maximum focal derived frequency. Default
0.95.window_size (int) – Window size in coordinate units of
rec_map[:, -2]. Default50000.genetic_distance (bool) – Unused here (kept for API symmetry).
- Returns:
DataFrame with columns
positions,daf,s_ratio.- Return type:
polars.DataFrame
- flexsweep.fv.hapdaf_o(hap, ac, rec_map, max_ancest_freq=0.25, min_tot_freq=0.25, min_focal_freq=0.25, max_focal_freq=0.95, window_size=50000, genetic_distance=False)[source]
Compute hapDAF-o for each focal SNP.
hapDAF-o averages
f_d^2 - f_a^2over neighbors that satisfy(f_d > f_a) & (f_a <= max_ancest_freq) & (f_tot >= min_tot_freq).- Parameters:
hap (numpy.ndarray) – Haplotype matrix
(n_snps, n_samples)with 0/1 values.ac (numpy.ndarray) – Allele counts
(n_snps, 2)as[ancestral, derived].rec_map (numpy.ndarray) – Map array; penultimate column is the window coordinate.
max_ancest_freq (float) – Ancestral partition frequency threshold. Default
0.25.min_tot_freq (float) – Minimum neighbor total derived frequency. Default
0.25.min_focal_freq (float) – Minimum focal derived frequency. Default
0.25.max_focal_freq (float) – Maximum focal derived frequency. Default
0.95.window_size (int) – Window size in coordinate units of
rec_map[:, -2]. Default50000.genetic_distance (bool) – Unused here (kept for API symmetry).
- Returns:
DataFrame with columns
positions,daf,hapdaf_o.- Return type:
polars.DataFrame
- flexsweep.fv.hapdaf_s(hap, ac, rec_map, max_ancest_freq=0.1, min_tot_freq=0.1, min_focal_freq=0.25, max_focal_freq=0.95, window_size=50000, genetic_distance=False)[source]
Compute hapDAF-s for each focal SNP.
hapDAF-s is the same construction as hapDAF-o but uses more stringent thresholds (e.g., smaller
max_ancest_freqandmin_tot_freq).- Parameters:
hap (numpy.ndarray) – Haplotype matrix
(n_snps, n_samples)with 0/1 values.ac (numpy.ndarray) – Allele counts
(n_snps, 2)as[ancestral, derived].rec_map (numpy.ndarray) – Map array; penultimate column is the window coordinate.
max_ancest_freq (float) – Ancestral partition frequency threshold. Default
0.1.min_tot_freq (float) – Minimum neighbor total derived frequency. Default
0.1.min_focal_freq (float) – Minimum focal derived frequency. Default
0.25.max_focal_freq (float) – Maximum focal derived frequency. Default
0.95.window_size (int) – Window size in coordinate units of
rec_map[:, -2]. Default50000.genetic_distance (bool) – Unused here (kept for API symmetry).
- Returns:
DataFrame with columns
positions,daf,hapdaf_s.- Return type:
polars.DataFrame
- flexsweep.fv.fast_sq_freq_pairs(hap, ac, rec_map, min_focal_freq=0.25, max_focal_freq=0.95, window_size=50000)[source]
- flexsweep.fv.dind_high_low_from_pairs(sq_freqs, info, max_ancest_freq=0.25, min_tot_freq=0)[source]
- flexsweep.fv.hapdaf_from_pairs(sq_freqs, max_ancest_freq, min_tot_freq)[source]
Unified hapdaf_o and hapdaf_s — bodies were identical, only defaults differ.
Call as:
hapdaf_o = hapdaf_from_pairs(sq_freqs, max_ancest_freq=0.25, min_tot_freq=0.25) hapdaf_s = hapdaf_from_pairs(sq_freqs, max_ancest_freq=0.10, min_tot_freq=0.10)
Removed dead args: hap, snps_indices (were passed to hapdaf_o but never used).
Linkage disequilibrium
- flexsweep.fv.Ld(hap, as_float32=True) tuple[source]
Compute Kelly’s ZnS (mean pairwise \(r^2\)) and omega_max from an LD matrix.
The input
r_2is a square matrix of pairwise linkage disequilibrium values \(r^2\) among SNPs within a window. Ifmaskis provided, the computation is restricted to the subset of indices wheremaskisTrue.ZnS is defined as:
\[\mathrm{ZnS} = \frac{\sum_{i<j} r^2_{ij}}{\binom{S}{2}},\]where \(S\) is the number of SNPs after masking.
The function also returns
omega_max(Kim & Nielsen, 2004), computed viaomega_linear_correct_mask(), which scans split points and compares the average LD within versus between the two partitions.- Parameters:
r_2 (numpy.ndarray) – Square matrix (
S×S) of pairwise \(r^2\) values. The routine treats it as symmetric; values on and below the diagonal are ignored for ZnS.mask (numpy.ndarray) – Optional boolean vector of length
Sto select a subset of SNPs. DefaultNone.
- Returns:
Tuple
(zns, omega_max)as floats.- Return type:
tuple[float, float]
- flexsweep.fv.r2(locus_A: ndarray, locus_B: ndarray) float[source]
Compute the squared correlation coefficient \(r^2\) between two biallelic loci.
Given two 0/1 vectors of equal length (haplotypes across samples), this function computes:
\[D = P_{11} - p_A p_B,\quad r^2 = \frac{D^2}{p_A (1-p_A)\, p_B (1-p_B)},\]where \(p_A\) and \(p_B\) are the allele-1 frequencies at loci A and B, and \(P_{11}\) is the empirical joint frequency that both loci equal 1.
- Parameters:
locus_A (numpy.ndarray) – 1D array of 0/1 alleles for locus A (dtype
int8expected by Numba signature).locus_B (numpy.ndarray) – 1D array of 0/1 alleles for locus B (same length and dtype as
locus_A).
- Returns:
The \(r^2\) value as a float.
- Return type:
float
Note
If either locus is monomorphic (denominator zero), the result may be
infornandepending on arithmetic; callers typically filter such sites beforehand.
- flexsweep.fv.compute_r2_matrix_upper(hap, as_float32=False)[source]
r² via pre-scaled BLAS matmul. Avoids the outer-product subtraction step of the original compute_r2_matrix_upper, saving one O(S²) allocation.
- flexsweep.fv.omega_linear_correct(r2_matrix)[source]
Compute \(\omega_\text{max}\) (Kim & Nielsen, 2004) from an \(r^2\) matrix.
The statistic compares the average LD within two partitions (left/right of a split) to the average LD between the partitions. For a split index \(\ell\) on a sequence of length \(S\), define:
\[\begin{split}\begin{aligned} &\text{within-left} &&= \sum_{0 \le i < j < \ell} r^2_{ij},\\ &\text{within-right} &&= \sum_{\ell \le i < j < S} r^2_{ij},\\ &\text{between} &&= \sum_{0 \le i < \ell} \sum_{\ell \le j < S} r^2_{ij}, \end{aligned}\end{split}\]and the means are obtained by dividing by the corresponding pair counts \(\binom{\ell}{2}\), \(\binom{S-\ell}{2}\), and \(\ell(S-\ell)\). The omega score at \(\ell\) is:
\[\omega(\ell) = \frac{\dfrac{\text{within-left}}{\binom{\ell}{2}} + \dfrac{\text{within-right}}{\binom{S-\ell}{2}}} {\dfrac{\text{between}}{\ell(S-\ell)}}.\]This function scans admissible \(\ell\) and returns the maximum value.
- Parameters:
_r2 (numpy.ndarray) – Square matrix (
S×S) of pairwise \(r^2\) values. Only the upper triangle (i < j) is required to hold valid values.mask (numpy.ndarray) – Optional boolean vector selecting a subset of SNP indices to consider. Default
None(use all SNPs).
- Returns:
The maximum omega value over all candidate split points.
- Return type:
float
- Notes:
Modification from https://github.com/kr-colab/diploSHIC/blob/master/diploshic/utils.c taking advantages of numpy vectorized operations. Very small windows (
S < 3) return0.0.
Composite sweep statistics
- flexsweep.fv.LASSI_spectrum_and_Kspectrum(input_data, K_truncation=10, window=110, step=5)[source]
Compute haplotype count and frequency spectra within sliding windows.
Parameters: - hap (numpy.ndarray): Array of haplotypes where each column represents an individual and each row represents a SNP. - pos (numpy.ndarray): Array of SNP positions. - K_truncation (int): Number of haplotypes to consider. - window (int): Size of the sliding window. - step (int): Step size for sliding the window.
Returns: - K_count (numpy.ndarray): Haplotype count spectra for each window. - K_spectrum (numpy.ndarray): Haplotype frequency spectra for each window. - windows_centers (numpy.ndarray): Centers of the sliding windows.
- flexsweep.fv.T_m_statistic_fast(K_counts, K_neutral, windows, K_truncation, sweep_mode=4, _iter=0)[source]
- flexsweep.fv.compute_t_m(sim_list, K_truncation=10, w_size=201, w_step=10, K_neutral=None, sweep_mode=4, center=[50000.0, 1150000.0], windows=[100000], step=100000, nthreads=1, params=None, parallel_manager=None)[source]
Compute LASSI-style T and m-hat over a set of simulations.
The function builds truncated haplotype-frequency spectra per window, estimates a neutral spectrum if not provided, scores each window with T and m, and then reduces the scan to fixed physical windows around the specified centers. If
paramsare provided, they are attached and the result may be pivoted to feature vectors format.- Parameters:
sim_list (sequence) – Iterable of simulation items consumable by LASSI_spectrum_and_Kspectrum.
K_truncation (int) – Number of top haplotype counts retained in the truncated spectrum. Default 5.
w_size (int) – Sliding window size in SNPs used to build K-spectra. Default 110.
step (int) – Step in SNPs between consecutive windows. Default 5.
K_neutral (array-like or None) – Precomputed neutral truncated spectrum; if None, estimated via neut_average. Optional.
windows (list[int]) – Physical window widths (bp) for cut_t_m_argmax. Default [50000, 100000, 200000, 500000, 1000000].
center (list[int]) – Inclusive physical range (bp) defining centers. Default [500000, 700000].
nthreads (int) – Number of joblib workers. Default 1.
params (array-like or None) – Optional parameter matrix aligned to sim_list with columns [s, t, f_i, f_t].
parallel_manager (joblib.Parallel or None) – Existing joblib.Parallel to reuse; if None, a new one is created.
- Returns:
(t_m_cut, K_neutral)
- Return type:
tuple
- Notes:
T is a log-likelihood ratio comparing sweep-distorted vs neutral truncated spectra. m is the estimated number of sweeping haplotypes (1 = hard; >1 = soft), upper-bounded by
K_truncation.
- flexsweep.fv.Lambda_statistic_fast(K_counts, K_neutral, positions, K_truncation, n_A=101, sweep_mode=4, nthreads=1, max_extend=100000.0, _iter=0)[source]
Compute saltiLASSI Λ statistics for all windows, returning a Polars DataFrame.
Uses geometric mixture formula matching C++ lassip (lassip-winstats.cpp L265-267):
L1 - L0 = Σ_i α_i · dF[mi, ei, i]wheredF[mi, ei, i] = sweep_likelihood(i, m, ε) − easy_likelihood(i)is precomputed once.Precomputes dF single-threaded (384 KB, fits in L2 cache), then distributes target windows across joblib loky workers. dF serializes cheaply (384 KB); each worker JITs _lassip_chunk once then processes all its assigned windows. Threading backend does not parallelize because numba @njit(parallel=False) does not release the GIL.
precompute: O(n_windows × K × n_ε) — all log() calls happen here. hot path: O(n_windows² × n_A × K × n_ε) — multiply-adds only, within max_extend.
- Parameters:
K_counts (np.ndarray, shape (n_windows, K_truncation))
K_neutral (np.ndarray, shape (K_truncation,))
positions (np.ndarray, shape (n_windows,))
K_truncation (int)
n_A (int) – Number of log-spaced A values. Default 101 (matches C++ lassip 101-point grid).
sweep_mode (int) – Sweep haplotype redistribution model (1-5). Default 4 (exponential squared).
nthreads (int) – Number of joblib threads. Default 1 (single-threaded).
max_extend (float) – Maximum bp distance from target window to include in composite. Default 1e5 (100 kb), matching C++ lassip DEFAULT_MAX_EXTEND_BP. Use np.inf to sum all windows.
_iter (int) – Replicate index attached as ‘iter’ column.
- Returns:
pl.DataFrame with columns (window_lassip, Lambda, m, A, frequency, iter. A column stores 1/actual_A (C++ lassip: maxA = 1.0/exp(A_loop)).)
- flexsweep.fv.run_lassip(hap_data, K_truncation=10, w_size=201, step=10, K_neutral=None, n_A=100, sweep_mode=4, nthreads=1)[source]
Run saltiLASSI on a single haplotype dataset (VCF or ms format).
Extends run_lassi by computing the spatially-aware Λ statistic (DeGiorgio & Szpiech 2022) instead of the per-window T statistic. Reuses LASSI_spectrum_and_Kspectrum and neut_average unchanged.
Algorithm and performance vs C++ lassip
saltiLASSI composites LASSI per-window log-likelihoods with a spatial decay kernel:
L1(m,ε,A,z*) = Σ_i α_i · sweep_lik(i,m,ε) + (1−α_i) · null_lik(i) L1 − L0 = Σ_i α_i · dF[mi,ei,i] ← dot product, no log()
where α_i = exp(−A · abs(z_i − z*)) and dF[mi,ei,i] = sweep_lik(i,m,ε) − null_lik(i) is precomputed once in O(n_windows × K × n_ε) before the hot path.
Aspect
C++ lassip
This implementation
Mixture formula Hot-path inner work log() calls total q / dF working set Cache behaviour sweep_mode Spatial cutoff Parallelism Observed 1-thread speed
Geometric (paper uses arithmetic) K log() + K mul-add per source window ~23 B for 480 windows n_win × n_ε × m × K × 8B ≈ 38 MB L3 misses (38 MB) Fixed at compile time (oopt flag) MAX_EXTEND hard distance limit OpenMP (process forks) ~120 s for 480 windows (39K loci)
Geometric — identical ✓ 1 mul-add per source window ~48 K (precompute only) K × n_ε × n_win × 8B ≈ 384 KB L2 resident (384 KB) Runtime parameter 1–5 Sums all windows (no cutoff) joblib loky (separate processes) ~2–3 s for 480 windows
The ~40–60× single-thread speedup comes from three compounding effects: 1. dF precomputation collapses K from the hot path → 10× fewer operations per inner step. 2. 384 KB dF fits in L2 cache; C++ 38 MB q array thrashes L3. 3. Pure multiply-add hot path is SIMD-vectorizable; log() calls are not.
Complexity: O(n_windows² × n_A × K × n_ε). For n_windows≈480 on a 1.2 Mb locus, this is ~2.3 B multiply-adds ≈ 2–3 s single-thread; ~0.3 s with 8 threads. Use a larger step (e.g. step=50) to reduce n_windows and quadratically cut runtime.
- Parameters:
hap_data (str or tuple) – Path to VCF/ms file, or (hap_int, position_masked) tuple.
K_truncation (int) – Number of top haplotypes in truncated spectrum. Default 10.
w_size (int) – Sliding window size in SNPs for building K-spectra. Default 201.
step (int) – Step in SNPs between consecutive K-spectrum windows. Default 10.
K_neutral (np.ndarray or None) – Pre-computed neutral spectrum. Estimated via neut_average if None.
n_A (int) – Number of log-spaced A grid points (spatial decay rates). Default 100.
sweep_mode (int) – Sweep haplotype redistribution model passed to sweep_likelihood (1–5). Default 4.
nthreads (int) – Number of joblib threads for the per-target phase. Default 1.
- Returns:
pl.DataFrame with columns (window_lassip, Lambda, m, A, frequency, iter)
- flexsweep.fv.mu_stat(hap, snp_positions, window_size=50)[source]
Compute RAiSD composite sweep score \(\mu\) over overlapping SNP windows.
For each sliding window of
window_sizeconsecutive SNPs (step = 1 SNP), this routine evaluates three components and their product:mu_var – reduction-of-variation component (computed by
compute_mu_var()), scaled by the region length.mu_sfs – site-frequency-spectrum skew component (from
compute_mu_sfs()), standardized using Watterson’s harmonic correction (_harmonic_sums(n)[0]).mu_ld – linkage-disequilibrium contrast component (from
compute_mu_ld()) using the supplied \(r^2\) matrix.mu_total – composite statistic
mu_var * mu_sfs * mu_ld.
The window center coordinate is recorded as the midpoint between the first and last SNP positions in the window. Results are returned as a Polars DataFrame with one row per window.
- Parameters:
hap (numpy.ndarray) – Haplotype matrix of shape
(S, n)with 0/1 alleles (rows = SNPs, columns = haplotypes or chromosomes).snp_positions (numpy.ndarray) – Monotonically increasing physical positions of length
S(aligned to rows ofhap).window_size (int) – Number of consecutive SNPs per sliding window; defaults to
50(RAiSD’s-wdefault).
- Returns:
A Polars DataFrame with columns *
positions(int): window center (bp, midpoint of first/last SNP in window) *mu_var(float): variation component *mu_sfs(float): SFS component *mu_ld(float): LD component *mu_total(float): composite scoremu_var * mu_sfs * mu_ld- Return type:
polars.DataFrame
- Notes:
D_ln = (snp_positions[-1] + 1) - snp_positions[0]is the physical span of the full input region (bp), used to scale mu_var.theta_w_correction = _harmonic_sums(n)[0]= Watterson’s harmonic sum (1 + 1/2 + … + 1/(n-1)), precomputed once for all windows.get_pattern_idsis called once on the full haplotype matrix before the window loop; pattern IDs are indexed per-window by start/end.Windows advance by one SNP (maximally overlapping). For S SNPs and window size W, the output has S - W + 1 rows.
- See also:
compute_mu_var(),compute_mu_sfs(),compute_mu_ld(),get_pattern_ids()
- flexsweep.fv.run_raisd(hap_data, window_size=50)[source]
Public entry point for the RAiSD mu statistic.
Accepts either a VCF/VCF.gz file path or an ms/discoal simulation text file and returns the per-SNP-window mu scores computed by
mu_stat.- Input dispatch:
Tries
genome_reader(allel-based VCF reader). This is the primary path for real data.On any exception, falls back to
parse_ms_numpy(ms/discoal text format). This allows the same function to be used for both empirical and simulated data without separate entry points.
- Parameters:
hap_data – Path to a VCF/VCF.gz file or an ms/discoal output file.
window_size – Number of consecutive SNPs per sliding window (default 50, matching RAiSD’s
-wdefault).
- Returns:
polars.DataFrame – Same schema as
mu_stat— columns[positions, mu_var, mu_sfs, mu_ld, mu_total], one row per window.