Advanced usage

Custom summary statistics Select any combination of built-in statistics by passing a list of names to the stats argument of summary_statistics. The pipeline handles computation, windowing, and normalisation automatically for both simulations and VCF data.

By default (stats=None) the full Flex-sweep statistic set is used.

Normalisation follows two schemes. SNP-based statistics are normalised by frequency bin following Voight et al. 2006. Window-based statistics are Z-scored per window/centre combination:

\[\begin{split}Z\text{-score }\pi_i^{(w_1,c_1)} \;=\; \frac{\pi_i^{(w_1,c_1)} - \mu^{(w_1,c_1)}}{\sigma^{(w_1,c_1)}}, \qquad \\ \text{where }\mu^{(w_1,c_1)}=\operatorname{mean}_{i=1}^{n}\!\bigl(\pi_i^{(w_1,c_1)}\bigr)\end{split}\]

Available statistics

Window-based statistics — computed for every centre × window-size combination:

Name

Description

pi

Nucleotide diversity \(\pi\) per base pair

tajima_d

Tajima’s \(D\)

theta_w

Watterson’s \(\theta_W\) per base pair

theta_h

Fay & Wu’s \(\theta_H\) (absolute)

fay_wu_h

Normalised Fay & Wu’s \(H\)

k_counts

Number of distinct haplotypes

h1

Garud’s \(H_1\)

h12

Garud’s \(H_{12}\)

h2_h1

Garud’s \(H_2/H_1\)

zns

Kelly’s \(Z_{nS}\)

omega_max

Kim & Nielsen’s \(\omega_{max}\)

haf

Haplotype allele frequency (HAF-top)

max_fda

Maximum derived allele frequency in window

dist_var

Variance of pairwise haplotype distances

dist_skew

Skewness of pairwise haplotype distances

dist_kurtosis

Kurtosis of pairwise haplotype distances

SNP-based statistics — computed per variant and normalised by frequency bin:

Name

Description

ihs

Integrated haplotype score (iHS)

delta_ihh

\(\Delta iHH\)

nsl

Number of segregating sites by length (\(nS_L\))

isafe

iSAFE sweep-focal SNP score

dind / dind_high_low

DIND (derived-background diversity ratio)

highfreq / high_freq

High-frequency derived allele statistic

lowfreq / low_freq

Low-frequency derived allele statistic

s_ratio

S-ratio (singletons vs high-frequency variants)

hapdaf_o

HapDAF observed

hapdaf_s

HapDAF simulated

Additional statistics (Python API)

The following statistics are implemented in flexsweep.fv_v2 but are not accessible via the stats= argument. They must be called directly from the Python API (see API Reference for full signatures).

SFS-based neutrality tests:

Function

Description

achaz_y

Achaz’s \(Y\) (Achaz 2008)

zeng_e

Zeng’s \(E\) (Zeng et al. 2006)

fuli_f_star

Fu & Li’s \(F^*\) (Fu & Li 1993)

fuli_f

Fu & Li’s \(F\)

fuli_d_star

Fu & Li’s \(D^*\)

fuli_d

Fu & Li’s \(D\)

Composite sweep statistics:

Function

Description

compute_t_m

LASSI \(T\) and \(m\) statistics (Harris & DeGiorgio 2020)

mu_stat

RAiSD \(\mu\) statistic (Alachiotis & Pavlidis 2018)

Balancing selection:

Function

Description

run_beta_window

\(\beta^{(1)*}_{(std)}\) (Siewert & Voight 2020)

ncd1

NCD1 (Bitarello et al. 2018)

Selecting statistics

import flexsweep as fs

# A subset of window and SNP statistics
df = fs.summary_statistics(
    "./simulations",
    stats=["pi", "h12", "ihs", "nsl"],
    nthreads=8,
)

The same stats argument applies to VCF data:

df_vcf = fs.summary_statistics(
    "./vcf_data",
    vcf=True,
    stats=["pi", "h12", "ihs", "nsl"],
    recombination_map="recomb_map.csv",
    nthreads=8,
)

Passing an invalid stat name raises a ValueError listing all valid names.

diploS/HIC-like feature vectors

To estimate the same statistics as diploS/HIC\(\pi\), \(\theta_W\), \(\theta_H\), Fay & Wu’s \(H\), Tajima’s \(D\), distinct haplotypes, \(H_1\), \(H_{12}\), \(H_2/H_1\), \(Z_{nS}\), \(\omega_{max}\), maximum derived allele frequency, and pairwise distance moments — use the parameters below. The locus is divided into contiguous sub-windows matching the diploS/HIC subwindow approach.

From simulations:

import flexsweep as fs

diploshic_stats = [
    "pi", "fay_wu_h", "theta_h", "max_fda", "theta_w", "tajima_d",
    "k_counts", "h1", "h12", "h2_h1", "zns", "omega_max",
    "dist_var", "dist_skew", "dist_kurtosis",
]

df = fs.summary_statistics(
    "./simulations",
    stats=diploshic_stats,
    locus_length=1100000,
    step=100000,
    windows=[100000],
    nthreads=8,
)

From VCF data:

df_vcf = fs.summary_statistics(
    "./vcf_data",
    vcf=True,
    stats=diploshic_stats,
    locus_length=1100000,
    step=100000,
    step_vcf=1000000,
    windows=[100000],
    nthreads=8,
)

Locus and window settings

Three arguments control the locus geometry. They must be consistent between the simulation and VCF runs so that normalisation bins match.

Argument

Default

Description

locus_length

1200000

Total locus length in base pairs. Together with step, determines the centre grid: one centre every step bp from step//2 to locus_length step//2.

windows

[100000]

List of window sizes (bp). Multiple sizes produce a multi-scale feature vector. Each centre is evaluated at every window size, giving n_centres × len(windows) rows per replicate.

step

100000

Sliding-window step size (bp) for simulations and the within-locus centre grid for VCF.

step_vcf

10000

Step size (bp) for tiling VCF contigs into genomic windows. Independent of step.

# Multi-scale windows over the default 1.2 Mb locus
df = fs.summary_statistics(
    "./simulations",
    windows=[50000, 100000, 500000],
    nthreads=8,
)

# Larger locus with a finer step
df = fs.summary_statistics(
    "./simulations",
    locus_length=2000000,
    step=50000,
    windows=[100000],
    nthreads=8,
)

Recombination-rate stratified normalisation

By default, SNP statistics are normalised within frequency bins only. When r_bins is provided, normalisation is additionally stratified by recombination rate — a separate mean and standard deviation is computed for each recombination-rate stratum defined by the breakpoints.

r_bins accepts a list of breakpoints in cM/Mb (e.g. [1, 5, 10] creates four strata: <1, 1–5, 5–10, >10 cM/Mb). A recombination map must also be supplied. The r_bins column is dropped from the output feature matrix so the feature dimension stays unchanged regardless of whether stratification is used.

df = fs.summary_statistics(
    "./simulations",
    recombination_map="decode_like_map.csv",
    r_bins=[1, 5, 10],
    min_rate=0.01,
    nthreads=8,
)

Custom CNN

Flex-sweep is now able to work with custom CNN architectures. The API includes a CNN class able to pre-process the feature vectors while being ready to use with custom CNN implementations. By default CNN class will work with the default Flex-sweep architecture. Nonetheless, we changed the old 2D CNN behaviour so we now input statistics as channels into the 2D CNN: (batch, windows, centers, stats). If you are planning to use custom CNN architectures, please be extremely careful, you must pay attention to feature vector reshaping as needed.

import flexsweep as fs
from flexsweep.cnn import *

def cnn_finer(model_input, num_classes=1):
    """
    Changing filter and kernels sizes to look for finer summary statistics dimensions.
    Includes bach normalization, global pooling (no flatten), and dropout.
    """

    initializer = tf.keras.initializers.HeNormal()

    x = tf.keras.layers.Conv2D(
        64, (2, 1), padding="same", kernel_initializer=initializer, name="conv_2x1"
    )(model_input)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)

    x = tf.keras.layers.Conv2D(
        64, (1, 2), padding="same", kernel_initializer=initializer, name="conv_1x2"
    )(model_input)

    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)

    x = tf.keras.layers.Conv2D(
        128, (2, 2), padding="same", kernel_initializer=initializer, name="conv_2x2"
    )(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)

    # Optional deeper conv layer
    x = tf.keras.layers.Conv2D(
        128, (1, 1), padding="same", kernel_initializer=initializer, name="conv_1x1"
    )(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)

    # x = tf.keras.layers.MaxPooling2D(pool_size=(2, 1), name="gentle_pool")(x)
    x = tf.keras.layers.MaxPooling2D(pool_size=(1, 1), name="gentle_pool")(x)
    x = tf.keras.layers.Dropout(0.15, name="dropout_1")(x)

    # GlobalAveragePooling2D
    x = tf.keras.layers.Flatten()(x)
    # x = self.attention_pool_2d(x, name="attn_pool")

    x = tf.keras.layers.Dense(128, activation="relu", name="dense_1")(x)
    x = tf.keras.layers.Dropout(0.2, name="dropout_2")(x)

    x = tf.keras.layers.Dense(32, activation="relu", name="dense_2")(x)
    x = tf.keras.layers.Dropout(0.1, name="dropout_3")(x)

    output = tf.keras.layers.Dense(
        num_classes, activation="sigmoid", name="output"
    )(x)

    return output

fs_cnn = fs.CNN(
        train_data = "yri_test/fvs.parquet",
        predict_data = "yri_vcf/fvs_yri.parquet",
        output_folder = "yri_vcf",
)
fs_cnn.train(cnn = cnn_finer)

Because we’re providing new custom feature vectors (e.g, new genomic center and window size ranges), we’re also providing an interface to train and predict using a 1D CNN. When using 1D CNN, you must input your own CNN architecture. We’re providing a 1D CNN example with channel attention (Squeeze-and-Excitation) that learns local patterns across genomic positions and reweights features before classification. You can easily provide your own CNN similar to the example above:

def cnn_flexsweep_conv1d(model_input, num_classes=1):
    """
    Conv1D over spatial positions (steps) with stats as channels,
    followed by channel-wise (per-stat) attention.
    Expects model_input shape: (batch, positions=105, stats=11)
    """

    x = model_input

    # Conv1D over positions (channels_last): output (batch, positions, filters)
    x = tf.keras.layers.Conv1D(128, 3, padding="same", activation="relu")(x)
    x = tf.keras.layers.Conv1D(256, 2, padding="same", activation="relu")(x)

    # Channel attention (Squeeze-and-Excitation)
    se = tf.keras.layers.GlobalAveragePooling1D()(x)
    se = tf.keras.layers.Dense(256, activation="sigmoid")(se)
    se = tf.keras.layers.Reshape((1, 256))(se)
    x  = tf.keras.layers.Multiply()([x, se])

    # Head
    x = tf.keras.layers.GlobalAveragePooling1D()(x)
    x = tf.keras.layers.Dense(256, activation="relu")(x)
    x = tf.keras.layers.Dropout(0.3)(x)
    x = tf.keras.layers.Dense(128, activation="relu")(x)
    x = tf.keras.layers.Dropout(0.15)(x)
    output = tf.keras.layers.Dense(num_classes, activation="sigmoid")(x)

    return output

fs_cnn_1d = CNN(
        train_data = "yri_test/fvs.parquet",
        predict_data = "yri_vcf/fvs_yri.parquet",
        output_folder = "yri_vcf",
)
fs_cnn_1d.train(cnn = cnn_flexsweep_conv1d, one_dim = True)

Haplotype sorting

Before feeding a raw haplotype matrix into a custom CNN you may want to rearrange rows (haplotypes) or columns (SNPs) so that similar haplotypes are placed adjacently — this can improve the spatial patterns a 2D CNN learns. All functions below accept a binary (samples × sites) NumPy array. Please read for further information Zhao et al. 2023 and Tran et al. 2025 for further information.

Function

Description

daf_sorting(matrix)

Sort columns (SNPs) by descending derived allele frequency. Most common derived alleles appear first.

freq_sorting(matrix)

Sort rows (haplotypes) by descending number of derived alleles (Hamming weight). Most derived haplotypes appear first.

corr_sorting(matrix)

Sort rows by Pearson correlation coefficient with the most correlated haplotype. Groups similar haplotypes together.

pcc_column_sort_numba(matrix)

Sort columns by total PCC score — SNPs most correlated with the rest of the matrix appear first.

haplotype_freq_sorting(matrix)

Sort columns by haplotype frequency (most common haplotype group first). Returns (reordered, col_order, groups, freqs).

haplotype_freq_sorting_hamming(matrix)

Same as above but within each frequency group, haplotypes are additionally ordered by Hamming distance to the most frequent haplotype. Returns (reordered, freqs).

daf_sorting, freq_sorting, corr_sorting, and pcc_column_sort_numba are Numba-compiled and operate on integer arrays. The two haplotype_freq_sorting variants operate on general NumPy arrays.

import numpy as np
from flexsweep.utils import (
    daf_sorting, freq_sorting, haplotype_freq_sorting
)

# random hap: binary (n_haplotypes × n_sites) array from your VCF window
hap = np.random.randint(0, 2, (1000, 216), dtype=np.int32)

# Sort SNPs by DAF, then haplotypes by frequency
hap_daf = daf_sorting(hap.copy())
hap_sorted, col_order, groups, freqs = haplotype_freq_sorting(hap_daf)

# hap_sorted is now ready for a (batch, haplotypes, sites, 1) CNN input as needed

Demography mis-specification

Flex-sweep is now more versatile to analyse non-model organisms where the quality or availability of simulated parameters, such as the mutation rate, recombination rate, and demography, is limited. We extend the CNN with the Domain Adaptive model proposed by Mo, Z. and Siepel A. 2023. If you plan to use Flex-sweep DA, please cite Mo, Z. and Siepel A. 2023. We highly recommend to read deep the paper along with the code source code provided by the authors.

Flex-sweep-DA trainining takes into account not only labelled simulated data (source domain) as expected for a CNN, but also incorporates empirical unlabelled data (target domain) during the training. The goal then is to generalise the classification task across any feature distorting simulated feature vectors distribution from real data by learning a shared representation that is highly predictive for the CNN classifier but uninformative about the domain (whether the input is simulated or real).

Demography is known to highly shift summary statistics toward values that can mimic sweep signals, even assuming strict neutrality. It’s been a matter of debate how realistic demographic, along with BGS, could explain most sweep signals. In the case of ML approaches like CNN, trained models under unrealistic demographies easily confound sweep prediction due to the overfitting of demographic artifacts. The DA model implemented is explicitly designed to account for and mitigate such a mismatch between simulated and real data. Note that when working with extremely out-of-range demographies (e.g, training over constant population sizes) or simulated parameters, DA implementation may still perform worse than the original CNN, so to work safer, the simulations should span plausible demographic scenarios.

Flex-sweep DA will subset the exact same number of source_data (labelled simulations) from target_data (empirical data) to balance the discriminator during training. Once the model is trained, the software will use the entire target_data dataset to make the predictions.

import flexsweep as fs
fs_cnn = fs.CNN(
    source_data="yri_test/fvs.parquet",
    target_data="yri_vcfs/fvs_yri.parquet",
    output_folder="yri_vcfs",
)

fs_cnn.train_da(ramp_epochs=30,max_lambda=1)
df_prediction = fs_cnn.predict_da()