Ancestral state polarization
Several statistics in Flexsweep require knowledge of the ancestral allele — that is, which allele was present in the ancestral population before a mutation arose. Examples include iHS, Fay-Wu H, Zeng E, Fu-Li D/F, hapdaf, DIND, and iSAFE. Without polarization, REF is used as a proxy for the ancestral state, which introduces error when REF is derived.
flexsweep polarize assigns the ancestral allele by comparing each variant
in a VCF to one or more outgroup species in a multi-species alignment (MAF).
Where the outgroup consensus supports the alternate allele as ancestral, REF and
ALT are swapped in the output VCF and all associated fields (AC, AF, GT) are
updated accordingly.
The polarization engine is implemented in Rust for performance and is invoked
via a compiled binary (flexsweep-polarize).
How it works
The pipeline has two steps:
Step 1 — Sort the MAF (if not already sorted by contig/position):
The Rust binary streams the MAF and VCF together in a merge-join. Both files
must be sorted by the same reference contig order and position. If your MAF is
unsorted, run sort-maf first.
Step 2 — Polarize the VCF:
For each biallelic SNP in the VCF, the binary looks up the aligned column in the MAF for each requested outgroup. The ancestral allele is inferred by one of four substitution models and compared to REF/ALT:
If the inferred ancestral allele matches REF → no change.
If the inferred ancestral allele matches ALT → REF and ALT are swapped; AC, AF, and all genotypes are flipped.
If the ancestral allele cannot be determined → the site is left unchanged.
The output is a bgzipped VCF with an AA INFO field recording the inferred
ancestral allele.
Polarization methods
Method |
Description |
|---|---|
|
Majority vote over outgroup alleles. Fast and parameter-free. Suitable when outgroups are closely related and substitution rate is low. |
|
Jukes–Cantor model. Assumes equal substitution rates among all four nucleotides. Fits one free parameter (transition rate). |
|
Kimura two-parameter (K80) model. Distinguishes transitions from transversions. Default. Recommended for mammalian data. |
|
General time-reversible (GTR) model with six free parameters. Most accurate; slowest. Recommended when outgroups are distantly related. |
Multiple random starts (--nrandom, default 10) are used for model fitting
to avoid local optima.
Input files
MAF (Multiple Alignment Format):
A whole-genome alignment anchored to the reference species (the same genome
used for the VCF). Common sources: UCSC Genome Browser multiz alignments.
The file can be plain text or gzipped (.maf.gz).
VCF:
A bgzipped, sorted VCF (*.vcf.gz) of the focal population. Must be sorted
by the same contig order as the MAF reference.
Outgroups:
One or more species names as they appear in the MAF s lines (e.g.
panTro4, ponAbe2, gorGor6 for human data). Multiple outgroups
improve robustness — the model aggregates information across all specified
outgroups.
CLI reference
flexsweep polarize [OPTIONS]
Option |
Default |
Description |
|---|---|---|
|
required |
Input MAF file. Can be gzipped. Must be sorted by contig/position
(or pass |
|
required |
Input VCF file. Can be gzipped. Must be sorted by contig/position. |
|
required |
Comma-separated outgroup species names, e.g.
|
|
required |
Substitution model: |
|
10 |
Number of random starts for model fitting (ignored by |
|
False |
If True, sort the MAF by contig/position before polarizing. |
Examples:
# Sort MAF first, then polarize with Kimura model
flexsweep polarize \
--maf hg38.multiz100way.maf.gz \
--vcf YRI.chr22.vcf.gz \
--outgroups panTro4,ponAbe2,gorGor6 \
--method kimura \
--sort True
# Use parsimony with a pre-sorted MAF
flexsweep polarize \
--maf hg38.multiz100way.sorted.maf.gz \
--vcf YRI.chr22.vcf.gz \
--outgroups panTro4,gorGor6 \
--method parsimony
Python API
import flexsweep as fs
# Build the Rust binary (only needed once)
fs.build_rust_polarization()
# Step 1: sort MAF (skip if already sorted)
sorted_maf = fs.run_sort_maf("hg38.multiz100way.maf.gz")
# Step 2: polarize
output_vcf = fs.run_polarize(
maf=sorted_maf,
vcf="YRI.chr22.vcf.gz",
outgroup=["panTro4", "ponAbe2", "gorGor6"],
method="kimura",
nrandom=10,
output="YRI.chr22.polarized.vcf.gz",
)
The polarized VCF is written to output (default:
{input}.polarized.vcf.gz in the same directory as the input VCF).
Using polarized VCFs with scan and feature vectors
Once polarized, pass the output VCF directly to any Flexsweep command that reads VCF data. Polarization improves the accuracy of all statistics that distinguish ancestral from derived alleles:
scan:
ihs,isafe,dind,high_freq,low_freq,s_ratio,hapdaf_o,hapdaf_s,fay_wu_h,zeng_e,achaz_y,fuli_d,fuli_f.fvs-vcf: all SNP-level features in the feature-vector pipeline.
# Scan with a polarized VCF
flexsweep scan \
--vcf_path polarized_vcfs/ \
--out_prefix results/YRI \
--stats ihs,isafe,hapdaf_o,hapdaf_s,fay_wu_h \
--nthreads 8