Basic usage
The examples below use the Yoruba (YRI) population from the 1000 Genomes Project (n = 108 individuals, i.e. 216 haplotypes), which is one of the populations with the largest sample sizes and highest nucleotide diversity. All timing estimates are based on a 12-core, 24-thread AMD Ryzen 9 PRO 5945 workstation with 64 GB RAM and an NVIDIA RTX 3080.
The full workflow (simulations, feature vector estimation for all 22 YRI autosomes, training, and genome-wide prediction) completes in about 2.5 hours on this configuration.
Command line
# Simulate 22,000 neutral + 22,000 sweep replicates (~70 min)
flexsweep simulator \
--sample_size 216 \
--demes yri_spiedel_2019.yaml \
--output_folder yri_test \
--nthreads 24 \
--num_simulations 22000
# Estimate feature vectors from simulations (~50 min)
flexsweep fvs-discoal \
--simulations_path yri_test \
--nthreads 24
# Estimate feature vectors from VCF (~27 min, 22 autosomes)
flexsweep fvs-vcf \
--vcf_path yri_vcfs \
--recombination_map decode_sexavg_2019.txt \
--nthreads 24 \
--suffix yri
# Train and predict (~4 min on RTX 3080)
flexsweep cnn \
--train_data yri_test/fvs.parquet \
--predict_data yri_vcfs/fvs_yri.parquet \
--output_folder yri_test
Python interface
import flexsweep as fs
simulator = fs.Simulator(
216, fs.DEMES_EXAMPLES['yri'], 'yri_test',
num_simulations=22000, nthreads=24
)
# Prior parameters to simulate
df_params = simulator.create_params()
# Simulate
sims_list = simulator.simulate()
# Estimate feature vectors from simulations
fvs_sims = fs.summary_statistics(data_dir="yri_test", nthreads=24)
# Estimate feature vectors from VCF
fvs_vcf = fs.summary_statistics(
data_dir="yri_vcfs",
vcf=True,
nthreads=24,
recombination_map=fs.DECODE_MAP,
suffix='yri',
)
# Train and predict
fs_cnn = fs.CNN(
train_data="yri_test/fvs.parquet",
predict_data="yri_vcfs/fvs_yri.parquet",
output_folder="yri_vcfs",
)
fs_cnn.train()
fs_cnn.predict()
Training uses early stopping to limit overfitting and typically converges in about 40 epochs for YRI-sized datasets.
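For readers unfamiliar with early stopping, the general idea can be sketched in a few lines of plain Python. This is an illustration of the technique only, not flexsweep's implementation: the function name, the `patience` and `min_delta` parameters, and the loss values are all hypothetical.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Hypothetical helper for illustration; not part of flexsweep.
    Training stops once the validation loss has failed to improve by more
    than `min_delta` for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New best validation loss: remember it and reset the counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                # Stop here; the weights from `best_epoch` would be kept.
                return epoch, best_epoch

    # The loss sequence ended before the patience budget was exhausted.
    return len(val_losses) - 1, best_epoch


# Made-up validation losses: the best value is reached at epoch 2.
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.6]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)
```

With `patience=3`, the loop stops at epoch 5 and never sees the later improvement at epoch 6, which is the usual trade-off when choosing a patience value.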
The package includes built-in defaults for YRI, CEU, and CHB demographic models estimated from Relate, as well as the deCODE recombination map. You can inspect them directly:
import flexsweep as fs
# View the YRI demographic model
print(fs.simulate_discoal.demes.load(fs.DEMES_EXAMPLES['yri']))
# View the deCODE recombination map
print(fs.pl.read_csv(
fs.DECODE_MAP,
separator="\t",
comment_prefix="#",
schema=fs.pl.Schema([
("chr", fs.pl.String),
("start", fs.pl.Int64),
("end", fs.pl.Int64),
("cm_mb", fs.pl.Float64),
("cm", fs.pl.Float64),
])
))
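To show how a map with these columns can be used, the sketch below converts a physical position into a genetic position by linear interpolation within an interval. It is not flexsweep code. It assumes that `cm` is the cumulative genetic position at `end` and that `cm_mb` is the rate within the interval, which should be checked against the map file before relying on it. The function names and the two example intervals are hypothetical.

```python
from bisect import bisect_right


def genetic_position(intervals, pos):
    """Interpolate the genetic position (cM) of a physical position (bp).

    `intervals` is a list of (start, end, cm_mb, cm) tuples for a single
    chromosome, sorted by `start`.
    ASSUMPTION: `cm` is the cumulative genetic position at `end`.
    """
    starts = [iv[0] for iv in intervals]
    i = bisect_right(starts, pos) - 1
    if i < 0:
        raise ValueError("position precedes the first map interval")

    start, end, cm_mb, cm_at_end = intervals[i]
    if pos > end:
        raise ValueError("position falls in a gap or beyond the map")

    # Walk back from the interval end at the interval's rate (cM per Mb).
    return cm_at_end - (end - pos) * cm_mb / 1e6


def genetic_distance(intervals, pos_a, pos_b):
    """Genetic distance (cM) between two positions on the same chromosome."""
    return abs(genetic_position(intervals, pos_b) - genetic_position(intervals, pos_a))


# Made-up intervals: 1 cM/Mb over the first Mb, 2 cM/Mb over the second.
toy_map = [
    (0, 1_000_000, 1.0, 1.0),
    (1_000_000, 2_000_000, 2.0, 3.0),
]
print(genetic_position(toy_map, 500_000))
print(genetic_distance(toy_map, 500_000, 1_500_000))
```

Positions that fall in a gap between intervals raise an error here rather than being extrapolated; whether to extrapolate, clamp, or skip such sites is a design choice left to the caller.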