Studying variation at specific loci across populations or species usually
means either building an expensive whole-genome graph or falling back to
a single reference. impg takes a third path: it treats all-vs-all
pairwise alignments as an implicit pangenome graph and projects
target ranges through the alignment network to extract only the
homologous sequences you need. Query regions across hundreds of
genomes in seconds, walk transitive alignments, partition a cohort into
comparable loci, refine regions to maximize sample support — all
without ever materializing a graph structure.
At its core, impg lifts ranges from a target sequence (the reference
in a given alignment) into the queries aligned onto it. It outputs
BED / BEDPE / PAF — ready to feed FASTA extraction, multiple sequence
alignment, or a graph builder like pggb or minigraph-cactus — and
can also emit GFA directly by chaining sweepga + seqwish + smoothxg-style
smoothing.
impg uses coitrees (cache-oblivious
interval trees) for fast range lookup, and stores CIGAR strings as
compact deltas. The result is fast, memory-efficient projection of
sequence ranges through alignment networks.
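To make the idea of lifting a range concrete, the sketch below projects a target interval onto the query of a single PAF record by walking its =/X CIGAR. It is an illustration only, not impg's implementation (which uses interval trees and delta-encoded CIGARs as described above). The helper name `project_range` is hypothetical, and the code assumes standard 12-column PAF with a `cg:Z:` tag.

```shell
# project_range START END: read PAF records on stdin and print, for each record
# overlapping target range [START, END), the corresponding query interval.
# Hypothetical helper for illustration; not part of impg.
project_range() {
  awk -F'\t' -v a="$1" -v b="$2" '
  BEGIN { OFS = "\t" }
  {
    qstart = $3; qend = $4; strand = $5; tstart = $8; tend = $9
    cigar = ""
    for (i = 13; i <= NF; i++) if ($i ~ /^cg:Z:/) cigar = substr($i, 6)
    lo = (a + 0 > tstart + 0) ? a + 0 : tstart + 0   # clip the request to the aligned part
    hi = (b + 0 < tend + 0) ? b + 0 : tend + 0
    if (cigar == "" || lo >= hi) next
    tpos = tstart + 0; qoff = 0; first = -1; last = -1
    while (match(cigar, /^[0-9]+[=XMID]/)) {
      n = substr(cigar, 1, RLENGTH - 1) + 0
      op = substr(cigar, RLENGTH, 1)
      cigar = substr(cigar, RLENGTH + 1)
      tadv = (op == "I") ? 0 : n    # insertions consume query only
      qadv = (op == "D") ? 0 : n    # deletions consume target only
      if (tadv > 0 && tpos < hi && tpos + tadv > lo) {
        s = (tpos > lo) ? tpos : lo
        e = (tpos + tadv < hi) ? tpos + tadv : hi
        qs = (qadv > 0) ? qoff + (s - tpos) : qoff
        qe = (qadv > 0) ? qoff + (e - tpos) : qoff
        if (first < 0) first = qs
        last = qe
      }
      tpos += tadv; qoff += qadv
    }
    if (first < 0) next
    # On the reverse strand, offsets count back from the query end
    if (strand == "-") print $1, qend - last, qend - first
    else               print $1, qstart + first, qstart + last
  }'
}

# Toy record: query q[10,30) aligned to target t[50,70) with one insertion and one deletion
printf 'q\t100\t10\t30\t+\tt\t100\t50\t70\t18\t22\t60\tcg:Z:5=2I5=2D8=\n' | project_range 52 66
```

The projected interval is generally not the same length as the request, because insertions and deletions inside the range shift the query coordinates.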
# Bioconda
conda install -c bioconda impg
# Source
git clone --recursive https://github.com/pangenome/impg.git
cd impg
cargo install --force --path .

The source install places impg, gfaffix, and the companion aligner
binaries (wfmash, FastGA) into ~/.cargo/bin/.
docker pull pangenome/impg
docker run pangenome/impg

On older glibc systems (e.g. Debian Buster) a plain cargo build can
fail because wfmash needs modern CMake / GCC / glibc. Use Guix's
toolchain:
source ./env.sh
cargo build --release

See .guix/ for the Guix build recipe (guix build -L .guix/modules --file=guix.scm). For libclang link errors, set LIBCLANG_PATH to
your LLVM install (see env -i … LIBCLANG_PATH=…).
impg query -a cerevisiae.pan.paf.gz -r S288C#1#chrI:50000-100000 -x

- `-a`: alignment file (PAF / 1ALN / TPA). PAF must use `=`/`X` CIGAR ops (from `wfmash` or `minimap2 --eqx`).
- `-r`: target range, `seq:start-end`.
- `-x`: walk the transitive closure, recursively finding everything aligned to the initial result.
Example output (BED):
S288C#1#chrI 50000 100000
DBVPG6044#1#chrI 35335 85288
Y12#1#chrI 36263 86288
DBVPG6765#1#chrI 36166 86150
YPS128#1#chrI 47080 97062
UWOPS034614#1#chrI 36826 86817
SK1#1#chrI 52740 102721
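The transitive walk enabled by -x can be pictured as a breadth-first search over the alignment network. The sketch below is a deliberately simplified illustration: it tracks sequence names only and ignores coordinates, and the helper name `reachable_names`, its two-column input format, and its depth semantics are assumptions rather than impg behaviour.

```shell
# reachable_names START MAX_DEPTH: read "query<TAB>target" name pairs on stdin
# and print every sequence reachable from START within MAX_DEPTH hops,
# together with the hop count at which it was first reached.
# Hypothetical helper for illustration; not part of impg.
reachable_names() {
  awk -v start="$1" -v maxd="$2" '
  { adj[$1] = adj[$1] " " $2; adj[$2] = adj[$2] " " $1 }   # alignments link both ways
  END {
    seen[start] = 1; frontier = start
    for (d = 1; d <= maxd && frontier != ""; d++) {
      n = split(frontier, cur, " "); frontier = ""
      for (i = 1; i <= n; i++) {
        m = split(adj[cur[i]], nb, " ")
        for (j = 1; j <= m; j++)
          if (!(nb[j] in seen)) {
            seen[nb[j]] = 1
            print nb[j] "\t" d
            frontier = frontier " " nb[j]
          }
      }
    }
  }'
}

# B aligns to A, C aligns to B, D aligns to C: two hops from A reach B and C but not D
printf 'B\tA\nC\tB\nD\tC\n' | reachable_names A 2
```

A direct (non-transitive) query corresponds to a single hop; each extra hop can pull in sequences that are aligned only to intermediate results.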
All commands accept -a (alignment files, mixed PAF/1ALN/TPA) or
--alignment-list (text file, one per line), -t / --threads, and
-v 0|1|2 for verbosity. Every command has a --help with the
exhaustive flag list — this section covers the flags you'll actually turn.
# A single range
impg query -a aln.paf -r chr1:1000-2000
# Transitive closure (depth 2 by default)
impg query -a aln.paf -r chr1:1000-2000 -x -m 3
# Many regions from a BED, mixed PAF + 1ALN
impg query -a f1.paf f2.1aln -b regions.bed
# Output formats: auto | bed | bedpe | paf | gfa | maf | fasta | fasta+paf | fasta-aln
impg query -a aln.paf -r chr1:1000-2000 -o bed
impg query -a aln.paf -r chr1:1000-2000 -o gfa --sequence-files genomes.fa
impg query -a aln.1aln -r chr1:1000-2000 -o fasta --sequence-files *.fa \
--reverse-complement
# Filter / shape the result
impg query -a aln.paf -r chr1:1000-2000 --min-identity 0.9 -l 5000 -d 1000
# Restrict to a sequence whitelist (also filters transitive intermediates)
impg query -a aln.paf -r chr1:1000-2000 -x --subset-sequence-list seqs.txt
# Fast approximate mode (.1aln only; bed/bedpe output)
impg query -a aln.1aln -r chr1:1000-2000 --approximate

GFA / MAF / FASTA outputs need --sequence-files (FASTA or AGC
archive) or --sequence-list. See GFA engines for
engine selection and partitioned builds.
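One reason =/X CIGAR ops matter for filters such as --min-identity is that a plain M op does not distinguish matches from mismatches, so identity cannot be read off the CIGAR alone. The sketch below computes one possible identity measure from an =/X CIGAR. How impg defines identity is not stated here, so the formula (matching bases divided by all aligned columns, counting indel bases) and the helper name `cigar_identity` are assumptions.

```shell
# cigar_identity CIGAR: print matches / (matches + mismatches + indel bases)
# for an =/X/I/D CIGAR string. Hypothetical helper; the identity definition
# used by impg may differ.
cigar_identity() {
  awk -v cigar="$1" 'BEGIN {
    m = 0; other = 0
    while (match(cigar, /^[0-9]+[=XID]/)) {
      n = substr(cigar, 1, RLENGTH - 1) + 0
      op = substr(cigar, RLENGTH, 1)
      cigar = substr(cigar, RLENGTH + 1)
      if (op == "=") m += n; else other += n
    }
    if (m + other > 0) printf "%.4f\n", m / (m + other)
  }'
}

cigar_identity '90=5X5I'   # prints 0.9000
```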
impg graph runs alignment + seqwish + (optional) smoothing; no pre-computed alignment is needed.
# Default pipeline: pggb (align → seqwish → smooth → gfaffix)
impg graph --sequence-files genomes.fa -g output.gfa -t 16
# Partitioned mode for large inputs (aligns once, then builds per-window)
impg graph --sequence-files genomes.fa -g output.gfa --gfa-engine pggb:10000
# Reuse an existing PAF instead of aligning
impg graph --sequence-files genomes.fa -g output.gfa --paf-file aln.paf
# Batch alignment to cap per-batch RAM (wfmash) or disk (FastGA)
impg graph --sequence-files genomes.fa -g output.gfa --batch-bytes 2G

query -o gfa and graph share the same engine code and flags; the
only difference is where the sequences come from (IMPG index +
sequence files for query; FASTAs directly for graph).
# 1Mb windows, single BED output
impg partition -a aln.paf -w 1000000
# One FASTA per partition (for downstream pipelines)
impg partition -a aln.1aln -w 1000000 -o fasta --sequence-files *.fa \
--separate-files --output-folder partitions/
# Selection strategies pick the next starting sequence
impg partition -a aln.paf -w 1000000 --selection-mode longest # default
impg partition -a aln.paf -w 1000000 --selection-mode sample # PanSN sample
impg partition -a aln.paf -w 1000000 --selection-mode haplotype # PanSN haplotype
# Start from a fixed list of sequences
impg partition -a aln.paf -w 1000000 --starting-sequences-file seqs.txt
# GFA output per partition; engines: pggb | seqwish | poa
impg partition -a aln.paf -w 1000000 -o gfa --gfa-engine pggb \
--sequence-files *.fa --separate-files --output-folder gfas/
# Fully partitioned pipeline: build → lace → one gfaffix pass
impg partition -a aln.paf -w 100000 -o gfa --gfa-engine pggb:10000 \
--sequence-files *.fa --output-folder results/

impg refine explores asymmetric left/right expansions around each range, picking the smallest window that keeps the most sequences, samples, or haplotypes fully spanning it. Useful for anchoring loci outside structural variants.
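The support criterion behind this kind of refinement can be illustrated with a small counting sketch: given, for each aligned sequence, the span of the target it covers, count how many sequences (or distinct PanSN samples) fully span a candidate window. The three-column input format and the helper names `count_spanning` and `count_spanning_samples` are hypothetical; this is not impg's search procedure, only the scoring idea.

```shell
# Input on stdin: name<TAB>covered_start<TAB>covered_end (target coordinates).
# Hypothetical helpers for illustration; not part of impg.

# count_spanning START END: number of rows whose coverage fully contains [START, END)
count_spanning() {
  awk -v s="$1" -v e="$2" '$2 <= s && $3 >= e { n++ } END { print n + 0 }'
}

# count_spanning_samples START END: same, but counting distinct PanSN samples
# (the part of the name before the first "#")
count_spanning_samples() {
  awk -v s="$1" -v e="$2" '
    $2 <= s && $3 >= e { split($1, p, "#"); if (!(p[1] in seen)) { seen[p[1]] = 1; n++ } }
    END { print n + 0 }'
}

hits='HG002#1#chr1\t900\t2100\nHG002#2#chr1\t1000\t2000\nHG003#1#chr1\t1200\t2500\n'
printf "$hits" | count_spanning 1000 2000           # prints 2
printf "$hits" | count_spanning_samples 1000 2000   # prints 1
printf "$hits" | count_spanning 1200 2000           # prints 3
```

In this toy input, shrinking the left edge from 1000 to 1200 raises support from two sequences to three, which is the kind of trade-off a refinement step weighs.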
impg refine -a aln.paf -r chr1:1000-2000
impg refine -a aln.paf -b loci.bed --span-bp 2000 -d 200000
# Maximize PanSN samples / haplotypes instead of raw sequence count
impg refine -a aln.paf -r chr1:1000-2000 --pansn-mode sample
impg refine -a aln.paf -r chr1:1000-2000 --pansn-mode haplotype
# Cap expansion distance
impg refine -a aln.paf -r chr1:1000-2000 --max-extension 0.90 # 90% of locus
impg refine -a aln.paf -r chr1:1000-2000 --max-extension 50000 # 50kb absolute
# Emit the spanning-entity list alongside the refined BED
impg refine -a aln.paf -r chr1:1000-2000 --support-output support.bed

impg similarity -a aln.paf -r chr1:1000-2000 --sequence-files *.fa
impg similarity -a aln.1aln -b regions.bed --sequence-files *.fa --distances
# Group by PanSN prefix
impg similarity -a aln.paf -r chr1:1000-2000 --sequence-files *.fa \
--delim '#' --delim-pos 2 # sample#haplotype
# PCA / MDS on the distance matrix
impg similarity -a aln.paf -r chr1:1000-2000 --sequence-files *.fa \
--pca --pca-components 3 --pca-measure cosine

# GFAs (auto-detected)
impg lace -f gfa1.gfa gfa2.gfa gfa3.gfa -o combined.gfa
# From a list file, fill inter-window gaps with sequence
impg lace -l gfa_list.txt -o combined.gfa --fill-gaps 1 --sequence-files ref.fa
# VCFs
impg lace -f *.vcf -o combined.vcf --reference ref.fa

Path names must follow NAME:START-END (e.g.
HG002#1#chr20:1000-2000); the coordinates drive reassembly. NAME
may itself contain a colon, in which case the last colon is the separator.
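The last-colon rule for path names can be expressed with plain shell parameter expansion. This is a minimal sketch of the naming convention only; the helper name `split_path_name` is hypothetical and the code does not validate that the coordinates are numeric.

```shell
# split_path_name PATH: split NAME:START-END on the LAST ':' and print
# NAME, START and END separated by tabs. Hypothetical helper; not part of impg.
split_path_name() {
  path="$1"
  name="${path%:*}"      # everything before the last ':'
  range="${path##*:}"    # everything after the last ':'
  printf '%s\t%s\t%s\n' "$name" "${range%-*}" "${range#*-}"
}

split_path_name 'HG002#1#chr20:1000-2000'
split_path_name 'lab:HG002#1#chr20:1000-2000'   # NAME keeps its own ':'
```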
Recommended post-processing:
gfaffix combined.gfa -o combined.fix.gfa &> /dev/null
odgi unchop -i combined.fix.gfa -o - -t 16 | \
odgi sort -i - -o - -p gYs -t 16 | \
odgi view -i - -g > combined.final.gfa

# Single combined index
impg index -a aln.paf -i aln.impg
# Mixed PAF + 1ALN + TPA
impg index -a f1.paf f2.1aln f3.tpa -i all.impg
# Per-file index (faster incremental rebuilds for large cohorts)
impg index --alignment-list files.txt --index-mode per-file

--index-mode auto (default) picks per-file when ≥ 100 files are
listed, single otherwise. impg warns when the index is older than its
input alignments; -f/--force-reindex rebuilds.
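The two rules just described (the auto threshold and the staleness warning) are simple enough to mirror in a wrapper script, for example to decide ahead of time whether a rebuild is likely. The sketch below is an assumption-laden approximation: the helper names are hypothetical, and comparing file modification times is a guess at how staleness is detected, not a description of impg internals.

```shell
# pick_index_mode LIST: apply the documented auto rule to an alignment list file
# (per-file when at least 100 non-empty lines are listed, single otherwise).
pick_index_mode() {
  n=$(awk 'NF { c++ } END { print c + 0 }' "$1")
  if [ "$n" -ge 100 ]; then echo per-file; else echo single; fi
}

# index_is_stale INDEX ALN...: succeed if any alignment file is newer than INDEX.
# Assumes staleness is judged by modification time.
index_is_stale() {
  idx="$1"; shift
  for aln in "$@"; do
    [ "$aln" -nt "$idx" ] && return 0
  done
  return 1
}
```

A wrapper might call `index_is_stale aln.impg aln.paf && impg index -a aln.paf -i aln.impg -f` to rebuild only when needed.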
Bgzipped PAFs work natively (impg reads .paf.gz); optionally
bgzip -r alignments.paf.gz creates a .gzi sidecar to speed up the
first read.
impg stats -a aln.paf
impg stats -a f1.paf f2.1aln

The graph, query -o gfa, and partition -o gfa commands share
one set of engine implementations, selected via --gfa-engine:
| Engine | Pipeline | Use for |
|---|---|---|
| pggb (default) | sweepga + seqwish + smoothxg-style smoothing + gfaffix | smoothed variation graphs |
| seqwish | sweepga + seqwish + gfaffix | raw (unsmoothed) graphs |
| poa | single-pass SPOA | small regions, quick MSA-based output |
Append :WINDOW to any engine to build per-window and lace:
impg query -a aln.paf -r chr1:0-500000 -o gfa \
--gfa-engine pggb:10000 --sequence-files *.fa -O out
impg graph --sequence-files *.fa -g out.gfa --gfa-engine seqwish:10000
impg partition -a aln.paf -w 100000 -o gfa --gfa-engine pggb:10000 \
--sequence-files *.fa --output-folder results/

Window size is in bp (≥ 1000). Partitioned mode is the recommended approach for large regions: it caps peak memory and runs one final gfaffix pass over the laced graph.
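To get a feel for how many per-window builds a given :WINDOW value implies, the sketch below tiles a region into fixed-size windows and prints them as BED. It is a naive tiling for estimation purposes; the helper name `tile_windows` is hypothetical, and the windows impg actually chooses may differ.

```shell
# tile_windows NAME START END WINDOW: print consecutive BED windows of at most
# WINDOW bp covering [START, END). Hypothetical helper; not part of impg.
tile_windows() {
  awk -v n="$1" -v s="$2" -v e="$3" -v w="$4" 'BEGIN {
    OFS = "\t"
    for (a = s; a < e; a += w) {
      b = (a + w < e) ? a + w : e    # the last window is truncated at END
      print n, a, b
    }
  }'
}

# A 500 kb region with 10 kb windows yields 50 windows
tile_windows chr1 0 500000 10000 | wc -l
```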
The flags below are available on all three GFA-producing commands. Defaults match pggb's conventions; only tune if the default graph doesn't meet your need.
# Seqwish induction
--min-match-len 23 # minimum transitive-match length
--transclose-batch 10000000 # batch size (reduce for lower memory)
--sparse-factor 0.0 # drop this fraction of input matches
--disk-backed # use disk-backed interval trees
--repeat-max / --min-repeat-dist
# Smoothxg-style smoothing (pggb only)
--target-poa-length 700,1100 # one pass per value
--max-node-length 100
--poa-padding-fraction 0.001
# Alignment filtering (sweepga, seqwish + pggb only)
--no-filter # skip post-alignment filtering
--num-mappings many:many # plane-sweep cardinality
--scaffold-jump 50000 # scaffold chaining gap (0 = off)
--scaffold-mass 10000 # min scaffold chain length
--overlap 0.95
--min-aln-identity 0.9
# Aligner backend
--aligner wfmash # default; alt: fastga
--sparsify auto # wfmash-only; pair-selection heuristic
--map-pct-identity 90 # wfmash -p value
--fastga-frequency / --fastga-frequency-multiplier # fastga-only
# Temp files (can be large)
--temp-dir /scratch/tmp # explicit path
--temp-dir ramdisk # → /dev/shm on Linux

Combining --aligner fastga with --sparsify or --aligner wfmash
with --fastga-frequency is rejected at parse time.
- `-a` / `--alignment-files`: one or more PAF/1ALN/TPA files (can be `.gz`).
- `--alignment-list`: text file, one alignment path per line.
- `-i` / `--index`: existing IMPG index.
- `-f` / `--force-reindex`: rebuild even if the index is up-to-date.
- `-t` / `--threads`: default `4`.
- `-d` / `--merge-distance`: merge nearby hits within this gap (bp).
- `--no-merge`: disable merging.
- `--consider-strandness`: keep strands separate during merge.
- `--subset-sequence-list`: restrict results to listed sequences.
- `--unidirectional`: disable bidirectional alignment interpretation.
Sequence-requiring outputs (GFA/MAF/FASTA, similarity, lace --fill-gaps) take --sequence-files (FASTA or AGC) or
--sequence-list.
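The effect of a merge distance such as -d / --merge-distance can be approximated on any BED file with sort and awk. The sketch below merges intervals on the same sequence whose gap is at most the given number of bp. It ignores strand and is not impg's own merge routine; the helper name `merge_bed` is hypothetical.

```shell
# merge_bed GAP: read 3-column BED on stdin and merge intervals on the same
# sequence that overlap or lie within GAP bp of each other.
# Hypothetical helper for illustration; not part of impg.
merge_bed() {
  sort -k1,1 -k2,2n | awk -v gap="$1" '
    BEGIN { OFS = "\t" }
    NR > 1 && $1 == chr && $2 <= end + gap { if ($3 > end) end = $3; next }  # extend current run
    NR > 1 { print chr, start, end }                                          # flush previous run
    { chr = $1; start = $2; end = $3 }                                        # start a new run
    END { if (NR > 0) print chr, start, end }'
}

# The first two chr1 hits are 50 bp apart and merge with a 100 bp gap; the third stays separate
printf 'chr1\t100\t200\nchr1\t250\t300\nchr1\t2000\t2100\n' | merge_bed 100
```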
FASTA="cerevisiae.fa.gz"
PAF="cerevisiae.paf"
THREADS=16
# 1. Index
impg index -a "$PAF" -i yeast.impg -t "$THREADS"
# 2. Partition into 100kb windows, one FASTA per window
mkdir -p partitions gfas
impg partition -i yeast.impg -w 100000 \
--sequence-files "$FASTA" -o fasta \
--separate-files --output-folder partitions -t "$THREADS"
# 3. Build per-partition GFAs in parallel
ls partitions/*.fasta | xargs -P 4 -I {} bash -c '
f="{}"; base=$(basename "$f" .fasta)
impg graph --sequence-files "$f" -g "gfas/${base}.gfa" -t 4
'
# 4. Lace, filling inter-window gaps with reference sequence
find gfas -name "*.gfa" -size +0 | sort -V > gfa_list.txt
impg lace --file-list gfa_list.txt --sequence-files "$FASTA" \
-o yeast.gfa --fill-gaps 2 -t "$THREADS"
# 5. Post-process with odgi
odgi build -g yeast.gfa -o yeast.og -t "$THREADS"
odgi sort -i yeast.og -o yeast.sort.og -O -p Ygs -t "$THREADS"
odgi layout -i yeast.sort.og -o yeast.lay -t "$THREADS"
odgi viz -i yeast.sort.og -o yeast.viz.png -x 4000 -y 1000 -s '#'
odgi draw -i yeast.sort.og -c yeast.lay -p yeast.draw.png

For modern inputs, you can replace steps 1–4 with a single
impg graph --sequence-files "$FASTA" -g yeast.gfa --gfa-engine pggb:100000.
scripts/faln2html.py renders the fasta-aln output into an
interactive HTML MSA using react-msa
or ProSeqViewer.
impg query -a aln.paf -r chr1:1000-2000 -o fasta-aln --sequence-files *.fa \
| python scripts/faln2html.py -i - -o alignment.html [--tool proseqviewer]

Andrea Guarracino aguarra1@uthsc.edu · Bryce Kille brycekille@gmail.com · Erik Garrison erik.garrison@gmail.com
MIT.