Advanced usage
Custom summary statistics
Select any combination of built-in statistics by passing a list of names to
the stats argument of summary_statistics. The pipeline handles
computation, windowing, and normalisation automatically for both simulations
and VCF data.
By default (stats=None) the full Flex-sweep statistic set is used.
Normalisation follows two schemes. SNP-based statistics are normalised by frequency bin following Voight et al. 2006. Window-based statistics are Z-scored per window/centre combination:
Available statistics
Window-based statistics — computed for every centre × window-size combination:
Name |
Description |
|---|---|
|
Nucleotide diversity \(\pi\) per base pair |
|
Tajima’s \(D\) |
|
Watterson’s \(\theta_W\) per base pair |
|
Fay & Wu’s \(\theta_H\) (absolute) |
|
Normalised Fay & Wu’s \(H\) |
|
Number of distinct haplotypes |
|
Garud’s \(H_1\) |
|
Garud’s \(H_{12}\) |
|
Garud’s \(H_2/H_1\) |
|
Kelly’s \(Z_{nS}\) |
|
Kim & Nielsen’s \(\omega_{max}\) |
|
Haplotype allele frequency (HAF-top) |
|
Maximum derived allele frequency in window |
|
Variance of pairwise haplotype distances |
|
Skewness of pairwise haplotype distances |
|
Kurtosis of pairwise haplotype distances |
SNP-based statistics — computed per variant and normalised by frequency bin:
Name |
Description |
|---|---|
|
Integrated haplotype score (iHS) |
|
\(\Delta iHH\) |
|
Number of segregating sites by length (\(nS_L\)) |
|
iSAFE sweep-focal SNP score |
|
DIND (derived-background diversity ratio) |
|
High-frequency derived allele statistic |
|
Low-frequency derived allele statistic |
|
S-ratio (singletons vs high-frequency variants) |
|
HapDAF observed |
|
HapDAF simulated |
Additional statistics (Python API)
The following statistics are implemented in flexsweep.fv_v2 but are not
accessible via the stats= argument. They must be called directly from the
Python API (see API Reference for full signatures).
SFS-based neutrality tests:
Function |
Description |
|---|---|
|
Achaz’s \(Y\) (Achaz 2008) |
|
Zeng’s \(E\) (Zeng et al. 2006) |
|
Fu & Li’s \(F^*\) (Fu & Li 1993) |
|
Fu & Li’s \(F\) |
|
Fu & Li’s \(D^*\) |
|
Fu & Li’s \(D\) |
Composite sweep statistics:
Function |
Description |
|---|---|
|
LASSI \(T\) and \(m\) statistics (Harris & DeGiorgio 2020) |
|
RAiSD \(\mu\) statistic (Alachiotis & Pavlidis 2018) |
Balancing selection:
Function |
Description |
|---|---|
|
\(\beta^{(1)*}_{(std)}\) (Siewert & Voight 2020) |
|
NCD1 (Bitarello et al. 2018) |
Selecting statistics
import flexsweep as fs
# A subset of window and SNP statistics
df = fs.summary_statistics(
"./simulations",
stats=["pi", "h12", "ihs", "nsl"],
nthreads=8,
)
The same stats argument applies to VCF data:
df_vcf = fs.summary_statistics(
"./vcf_data",
vcf=True,
stats=["pi", "h12", "ihs", "nsl"],
recombination_map="recomb_map.csv",
nthreads=8,
)
Passing an invalid stat name raises a ValueError listing all valid names.
diploS/HIC-like feature vectors
To estimate the same statistics as diploS/HIC — \(\pi\), \(\theta_W\), \(\theta_H\), Fay & Wu’s \(H\), Tajima’s \(D\), distinct haplotypes, \(H_1\), \(H_{12}\), \(H_2/H_1\), \(Z_{nS}\), \(\omega_{max}\), maximum derived allele frequency, and pairwise distance moments — use the parameters below. The locus is divided into contiguous sub-windows matching the diploS/HIC subwindow approach.
From simulations:
import flexsweep as fs
diploshic_stats = [
"pi", "fay_wu_h", "theta_h", "max_fda", "theta_w", "tajima_d",
"k_counts", "h1", "h12", "h2_h1", "zns", "omega_max",
"dist_var", "dist_skew", "dist_kurtosis",
]
df = fs.summary_statistics(
"./simulations",
stats=diploshic_stats,
locus_length=1100000,
step=100000,
windows=[100000],
nthreads=8,
)
From VCF data:
df_vcf = fs.summary_statistics(
"./vcf_data",
vcf=True,
stats=diploshic_stats,
locus_length=1100000,
step=100000,
step_vcf=1000000,
windows=[100000],
nthreads=8,
)
Locus and window settings
Three arguments control the locus geometry. They must be consistent between the simulation and VCF runs so that normalisation bins match.
Argument |
Default |
Description |
|---|---|---|
|
|
Total locus length in base pairs. Together with |
|
|
List of window sizes (bp). Multiple sizes produce a multi-scale
feature vector. Each centre is evaluated at every window size, giving
|
|
|
Sliding-window step size (bp) for simulations and the within-locus centre grid for VCF. |
|
|
Step size (bp) for tiling VCF contigs into genomic windows. Independent
of |
# Multi-scale windows over the default 1.2 Mb locus
df = fs.summary_statistics(
"./simulations",
windows=[50000, 100000, 500000],
nthreads=8,
)
# Larger locus with a finer step
df = fs.summary_statistics(
"./simulations",
locus_length=2000000,
step=50000,
windows=[100000],
nthreads=8,
)
Recombination-rate stratified normalisation
By default, SNP statistics are normalised within frequency bins only. When
r_bins is provided, normalisation is additionally stratified by
recombination rate — a separate mean and standard deviation is computed for
each recombination-rate stratum defined by the breakpoints.
r_bins accepts a list of breakpoints in cM/Mb (e.g. [1, 5, 10]
creates four strata: <1, 1–5, 5–10, >10 cM/Mb). A recombination map must
also be supplied. The r_bins column is dropped from the output feature
matrix so the feature dimension stays unchanged regardless of whether
stratification is used.
df = fs.summary_statistics(
"./simulations",
recombination_map="decode_like_map.csv",
r_bins=[1, 5, 10],
min_rate=0.01,
nthreads=8,
)
Custom CNN
Flex-sweep is now able to work with custom CNN architectures. The API includes a CNN class able to pre-process the feature vectors while being ready to use with custom CNN implementations. By default CNN class will work with the default Flex-sweep architecture. Nonetheless, we changed the old 2D CNN behaviour so we now input statistics as channels into the 2D CNN: (batch, windows, centers, stats). If you are planning to use custom CNN architectures, please be extremely careful, you must pay attention to feature vector reshaping as needed.
import flexsweep as fs
from flexsweep.cnn import *
def cnn_finer(model_input, num_classes=1):
"""
Changing filter and kernels sizes to look for finer summary statistics dimensions.
Includes bach normalization, global pooling (no flatten), and dropout.
"""
initializer = tf.keras.initializers.HeNormal()
x = tf.keras.layers.Conv2D(
64, (2, 1), padding="same", kernel_initializer=initializer, name="conv_2x1"
)(model_input)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.ReLU()(x)
x = tf.keras.layers.Conv2D(
64, (1, 2), padding="same", kernel_initializer=initializer, name="conv_1x2"
)(model_input)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.ReLU()(x)
x = tf.keras.layers.Conv2D(
128, (2, 2), padding="same", kernel_initializer=initializer, name="conv_2x2"
)(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.ReLU()(x)
# Optional deeper conv layer
x = tf.keras.layers.Conv2D(
128, (1, 1), padding="same", kernel_initializer=initializer, name="conv_1x1"
)(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.ReLU()(x)
# x = tf.keras.layers.MaxPooling2D(pool_size=(2, 1), name="gentle_pool")(x)
x = tf.keras.layers.MaxPooling2D(pool_size=(1, 1), name="gentle_pool")(x)
x = tf.keras.layers.Dropout(0.15, name="dropout_1")(x)
# GlobalAveragePooling2D
x = tf.keras.layers.Flatten()(x)
# x = self.attention_pool_2d(x, name="attn_pool")
x = tf.keras.layers.Dense(128, activation="relu", name="dense_1")(x)
x = tf.keras.layers.Dropout(0.2, name="dropout_2")(x)
x = tf.keras.layers.Dense(32, activation="relu", name="dense_2")(x)
x = tf.keras.layers.Dropout(0.1, name="dropout_3")(x)
output = tf.keras.layers.Dense(
num_classes, activation="sigmoid", name="output"
)(x)
return output
fs_cnn = fs.CNN(
train_data = "yri_test/fvs.parquet",
predict_data = "yri_vcf/fvs_yri.parquet",
output_folder = "yri_vcf",
)
fs_cnn.train(cnn = cnn_finer)
Because we’re providing new custom feature vectors (e.g, new genomic center and window size ranges), we’re also providing an interface to train and predict using a 1D CNN. When using 1D CNN, you must input your own CNN architecture. We’re providing a 1D CNN example with channel attention (Squeeze-and-Excitation) that learns local patterns across genomic positions and reweights features before classification. You can easily provide your own CNN similar to the example above:
def cnn_flexsweep_conv1d(model_input, num_classes=1):
"""
Conv1D over spatial positions (steps) with stats as channels,
followed by channel-wise (per-stat) attention.
Expects model_input shape: (batch, positions=105, stats=11)
"""
x = model_input
# Conv1D over positions (channels_last): output (batch, positions, filters)
x = tf.keras.layers.Conv1D(128, 3, padding="same", activation="relu")(x)
x = tf.keras.layers.Conv1D(256, 2, padding="same", activation="relu")(x)
# Channel attention (Squeeze-and-Excitation)
se = tf.keras.layers.GlobalAveragePooling1D()(x)
se = tf.keras.layers.Dense(256, activation="sigmoid")(se)
se = tf.keras.layers.Reshape((1, 256))(se)
x = tf.keras.layers.Multiply()([x, se])
# Head
x = tf.keras.layers.GlobalAveragePooling1D()(x)
x = tf.keras.layers.Dense(256, activation="relu")(x)
x = tf.keras.layers.Dropout(0.3)(x)
x = tf.keras.layers.Dense(128, activation="relu")(x)
x = tf.keras.layers.Dropout(0.15)(x)
output = tf.keras.layers.Dense(num_classes, activation="sigmoid")(x)
return output
fs_cnn_1d = CNN(
train_data = "yri_test/fvs.parquet",
predict_data = "yri_vcf/fvs_yri.parquet",
output_folder = "yri_vcf",
)
fs_cnn_1d.train(cnn = cnn_flexsweep_conv1d, one_dim = True)
Haplotype sorting
Before feeding a raw haplotype matrix into a custom CNN you may want to
rearrange rows (haplotypes) or columns (SNPs) so that similar haplotypes are
placed adjacently — this can improve the spatial patterns a 2D CNN learns.
All functions below accept a binary (samples × sites) NumPy array. Please read for further information Zhao et al. 2023 and Tran et al. 2025 for further information.
Function |
Description |
|---|---|
|
Sort columns (SNPs) by descending derived allele frequency. Most common derived alleles appear first. |
|
Sort rows (haplotypes) by descending number of derived alleles (Hamming weight). Most derived haplotypes appear first. |
|
Sort rows by Pearson correlation coefficient with the most correlated haplotype. Groups similar haplotypes together. |
|
Sort columns by total PCC score — SNPs most correlated with the rest of the matrix appear first. |
|
Sort columns by haplotype frequency (most common haplotype group
first). Returns |
|
Same as above but within each frequency group, haplotypes are
additionally ordered by Hamming distance to the most frequent
haplotype. Returns |
daf_sorting, freq_sorting, corr_sorting, and
pcc_column_sort_numba are Numba-compiled and operate on integer arrays.
The two haplotype_freq_sorting variants operate on general NumPy arrays.
import numpy as np
from flexsweep.utils import (
daf_sorting, freq_sorting, haplotype_freq_sorting
)
# random hap: binary (n_haplotypes × n_sites) array from your VCF window
hap = np.random.randint(0, 2, (1000, 216), dtype=np.int32)
# Sort SNPs by DAF, then haplotypes by frequency
hap_daf = daf_sorting(hap.copy())
hap_sorted, col_order, groups, freqs = haplotype_freq_sorting(hap_daf)
# hap_sorted is now ready for a (batch, haplotypes, sites, 1) CNN input as needed
Demography mis-specification
Flex-sweep is now more versatile to analyse non-model organisms where the quality or availability of simulated parameters, such as the mutation rate, recombination rate, and demography, is limited. We extend the CNN with the Domain Adaptive model proposed by Mo, Z. and Siepel A. 2023. If you plan to use Flex-sweep DA, please cite Mo, Z. and Siepel A. 2023. We highly recommend to read deep the paper along with the code source code provided by the authors.
Flex-sweep-DA trainining takes into account not only labelled simulated data (source domain) as expected for a CNN, but also incorporates empirical unlabelled data (target domain) during the training. The goal then is to generalise the classification task across any feature distorting simulated feature vectors distribution from real data by learning a shared representation that is highly predictive for the CNN classifier but uninformative about the domain (whether the input is simulated or real).
Demography is known to highly shift summary statistics toward values that can mimic sweep signals, even assuming strict neutrality. It’s been a matter of debate how realistic demographic, along with BGS, could explain most sweep signals. In the case of ML approaches like CNN, trained models under unrealistic demographies easily confound sweep prediction due to the overfitting of demographic artifacts. The DA model implemented is explicitly designed to account for and mitigate such a mismatch between simulated and real data. Note that when working with extremely out-of-range demographies (e.g, training over constant population sizes) or simulated parameters, DA implementation may still perform worse than the original CNN, so to work safer, the simulations should span plausible demographic scenarios.
Flex-sweep DA will subset the exact same number of source_data (labelled simulations) from target_data (empirical data) to balance the discriminator during training. Once the model is trained, the software will use the entire target_data dataset to make the predictions.
import flexsweep as fs
fs_cnn = fs.CNN(
source_data="yri_test/fvs.parquet",
target_data="yri_vcfs/fvs_yri.parquet",
output_folder="yri_vcfs",
)
fs_cnn.train_da(ramp_epochs=30,max_lambda=1)
df_prediction = fs_cnn.predict_da()