feat: integrate syng index for syncmer-based homology queries #162
Open
added 9 commits on April 8, 2026 17:45
…r (sync-impl-c)
- Add syng as a git submodule under vendor/syng
- Add the cc build dependency and extend build.rs to compile 11 syng C files into libsyng.a
- Create src/syng_ffi.rs with raw extern "C" declarations for SyngBWT, KmerHash, Seqhash, Rskip, SyncmerSet, and ONElib functions
- Create src/syng.rs with a safe SyngIndex wrapper, SyncmerParams, SyngNameMap, HomologousInterval types, and a Drop implementation
- Add module declarations to src/lib.rs
- All existing tests pass; new unit tests verify the create/drop cycle
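A minimal sketch of the build-script wiring this commit describes. The file names below are illustrative, not the actual 11-file list; the point is that the `cc` crate compiles the vendored C sources into a static library that the FFI layer links against.

```rust
// build.rs sketch (illustrative file names, assuming the vendored layout
// described above): compile vendored syng C sources into libsyng.a.
fn main() {
    cc::Build::new()
        .include("vendor/syng")        // headers for syng + ONElib
        .file("vendor/syng/syng.c")    // hypothetical subset of the 11 files
        .file("vendor/syng/hash.c")
        .file("vendor/syng/seqhash.c")
        .compile("syng");              // emits libsyng.a and tells cargo to link it
    println!("cargo:rerun-if-changed=vendor/syng");
}
```

This only runs under cargo (it needs `OUT_DIR` and the `cc` build-dependency), so it is a configuration sketch rather than a standalone program.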
… (sync-test-c)
- FFI smoke tests: Seqhash, KmerHash, SyngBWT create/destroy individually
- SyncmerParams: default values and to_c() conversion
- SyngIndex: create/drop, custom params, pointer accessors
- SyngNameMap: new() and the Default trait
…d (sync-impl-index)
- SyngIndex::build() progressively constructs the GBWT from a sequence iterator (syncmer extraction, KmerHash population, forward + revcomp paths)
- SyngNameMap::save/load for the .syng.names text sidecar
- SyngIndex::save/load for .1khash + .1gbwt + .syng.names round-trips
- impg syng CLI: --agc or --fasta input, -o prefix output, syncmer params
- C helper wrappers for static-inline seqhash destroy functions
- Proper Seqhash cleanup via impg_seqhashDestroy instead of raw libc::free
…c-test-index)
- Round-trip tests: build → save → load → verify path count, names, lengths
- SyngNameMap serialization with PanSN format and special characters
- Syncmer parameter variations: different k/w/seed produce different khash
- Edge cases: empty sequence, shorter than a syncmer, exactly syncmer length, mixed
- CLI integration: run `impg syng -f <fasta> -o <prefix>`, verify output files
- Load error case: missing files return Err
…-query) Phase 3 of syng integration: the GBWT querying pipeline.
- Add a GbwtPathStart struct to SyngNameMap for per-path start info (start_node, start_count, num_syncmers), needed for query path walking
- Modify SyngIndex::build to capture GBWT path start info during forward path construction
- Implement SyngIndex::query_region(): walks the query genome's GBWT path to find syncmer nodes in [start, end], then walks all other genome paths to find shared nodes, merges and pads intervals
- Add a --syng <prefix> flag to the query CLI (mutually exclusive with -a)
- Add a --syng-padding flag (default 120 bp = 2× syncmer length)
- Support bed, gfa, and fasta output formats for syng queries
- Add dispatch_gfa_engine_with_seq_index() for syng→graph engine wiring via SeqIndexWrapper (a minimal ImpgIndex impl using SequenceIndex)
- Add an impg_syng_suppress_debug FFI helper to silence C debug output
- Backward-compatible .syng.names format (writes 6 columns, reads the old 3-column layout)
- 5 new query_region tests (basic, unknown genome, padding, missing path info, save/load roundtrip)
…-test-query) Adds 19 new tests covering:
- Query completeness against known ground truth (shared backbone detection)
- Interior coverage (no false negatives for shared regions)
- Boundary padding (0/60/120 bp, monotonic growth, genome-length clamping)
- Interval merging (overlap, adjacency, multi-genome, edge cases)
- Edge cases (isolated regions, full-sequence query, single-sequence index, out-of-range, zero-width, identical sequences, unknown genome)
- CLI integration (syng build + query --syng for BED/GFA output, --syng/-a mutual exclusivity)
…nc-impl-gbwt) Add a SyngIndex::build_region_gbwt() method that builds a region-specific GBWT from fetched sequences: it creates a fresh KmerHash/SyngBWT, extracts syncmers, builds forward + revcomp paths, and writes syng-compatible .1khash + .1gbwt files. Add 'gbwt' to the query command's output-format enum. Works with both --syng and -i input modes. Requires an -O prefix and --sequence-files.
…test-gbwt) 14 new tests covering Phase 4 validation:
- Region-specific GBWT output from query results
- Region GBWT loadable as a SyngIndex
- Region nodes are a subset of the full index
- Single-genome and very-small-region edge cases
- Output prefix with nested directories and nonexistent dirs
- ONEcode magic-byte verification for .1gbwt and .1khash
- Round-trip: query → GBWT → query consistency
- CLI integration: --syng + -o gbwt, PAF-based + -o gbwt
- CLI validation: -o gbwt requires an -O prefix
…dd thread-safety and clippy cleanup (.verify-sync-verify-integration)
- Fix a heap-buffer-overflow in seqhash: encode sequences as 0/1/2/3 (not ASCII a/c/g/t), since patternRC[4] uses raw byte values as array indices
- Use impg_seqhashCreateSafe for thread-safe seqhash creation
- Add mutex serialization for syng tests (the C library has thread-unsafe globals)
- Fix clippy warnings: Error::other(), an unused import, needless_range_loop
- Add crate-level allows for pre-existing too_many_arguments/type_complexity
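A sketch of why the 0/1/2/3 encoding matters. This is hypothetical helper code, not impg's actual implementation: a C-side lookup table like `patternRC[4]` holds the complement of each 2-bit code, so indexing it with a raw ASCII byte (e.g. 'a' = 97) reads far out of bounds.

```rust
/// Map an ASCII base to its 2-bit code; ambiguity codes are rejected.
/// (Hypothetical sketch, not impg's actual encoder.)
fn encode_base(b: u8) -> Option<u8> {
    match b.to_ascii_lowercase() {
        b'a' => Some(0),
        b'c' => Some(1),
        b'g' => Some(2),
        b't' => Some(3),
        _ => None, // N and friends are not representable in 2 bits
    }
}

/// Reverse-complement on 2-bit codes: the complement of code c is 3 - c.
fn revcomp_codes(seq: &[u8]) -> Vec<u8> {
    seq.iter().rev().map(|&c| 3 - c).collect()
}

fn main() {
    let codes: Vec<u8> = b"acgt".iter().filter_map(|&b| encode_base(b)).collect();
    assert_eq!(codes, vec![0, 1, 2, 3]);
    // Every code is a safe index into a 4-element table; ASCII 'a' (97) is not.
    assert!(codes.iter().all(|&c| (c as usize) < 4));
    // "acgt" is its own reverse complement.
    assert_eq!(revcomp_codes(&codes), vec![0, 1, 2, 3]);
}
```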
added 7 commits on April 13, 2026 16:50
# Conflicts:
#	src/lib.rs
Implement SyngImpgWrapper that adapts SyngIndex to the ImpgIndex trait, allowing partition_alignments() to work transparently with syng-based queries. No changes to partition logic itself.
`Decompressor::get_contig()` in ragc-core silently returns an empty Vec when called after `list_contigs_names_only()`, because it checks the contig count instead of `are_details_loaded()` and so skips the required detail reload. This produced a fully populated `.syng.names` file but empty `.1gbwt` and `.1khash`, which then crashed at load time with "no Vertex objects in .1gbwt file".

Switch to `get_contig_length` + `get_contig_range(0, length)` (the same pattern as src/agc_index.rs), which correctly reloads details.

Also add tests/test_syng_integration.rs covering:
- impg syng --agc produces a non-empty GBWT (regression test for the exact bug above)
- impg syng --agc round-trips: the built index loads and queries correctly
- impg syng -f (FASTA) produces a non-empty GBWT
- impg partition --syng runs end-to-end and emits non-empty BED

The AGC build test asserts .1gbwt > 2000 bytes, which would have caught this bug immediately: the broken output was exactly 1362 bytes.
The vendored syng hash table had a bug where the REMOVED sentinel value collided with hashInt(1), so any startCount() call for GBWT node 1 (or -1) silently returned 0 on every invocation instead of incrementing. Two paths both starting at node 1 would both record start_count=0 and collide; on query, the C layer would die with:

FATAL ERROR: syngBWTpathStartOld startNode 1 count 0 >= startCount 0

This manifests on any pangenome with byte-identical contigs, because the first k-mer added to the hash gets index 1 and becomes a GBWT start node. On yeast235 the chrIII sequences from samples AAA and SGDref are identical, which triggered the crash 100% of the time.

The fix patches the vendored syng hash.c to use a middle-of-range REMOVED sentinel (0x4000000000000000) that hashInt() never produces for small-magnitude integer keys. Bumps the vendor/syng submodule pointer.

Also adds regression tests:
- tests/test_syng_startcount.rs: isolates startCount increment behavior at the FFI level for node 1, node 5, short/long paths, with/without RC
- tests/test_syng_integration.rs::test_syng_identical_sequences_build_and_query: reproduces the yeast235 scenario (two byte-identical FASTA records) and verifies start_counts are distinct and query_region succeeds on both
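The failure mode above can be shown with a toy open-addressing table. This is NOT syng's hash.c; names and structure are simplified, and `hash()` is a stand-in for hashInt(). The point: if the REMOVED sentinel equals some key's stored hash, that key's slot becomes indistinguishable from a tombstone, so counters keyed on it never increment and always read back 0.

```rust
// Toy counter table with a tombstone sentinel (simplified illustration).
const BAD_REMOVED: u64 = 1; // collides with hash(1) below
const SAFE_REMOVED: u64 = 0x4000_0000_0000_0000; // mid-range, unreachable for small keys

fn hash(key: u64) -> u64 { key } // stand-in for hashInt(): identity on small keys

struct Counter { slots: Vec<u64>, counts: Vec<u64>, removed: u64 }

impl Counter {
    fn new(removed: u64) -> Self {
        Counter { slots: vec![0; 16], counts: vec![0; 16], removed }
    }
    fn increment(&mut self, key: u64) {
        let i = (hash(key) as usize) % 16;
        // A slot holding the sentinel looks deleted, so the update is lost.
        if self.slots[i] == self.removed { return; }
        self.slots[i] = hash(key);
        self.counts[i] += 1;
    }
    fn count(&self, key: u64) -> u64 {
        let i = (hash(key) as usize) % 16;
        if self.slots[i] == self.removed { 0 } else { self.counts[i] }
    }
}

fn main() {
    let mut bad = Counter::new(BAD_REMOVED);
    bad.increment(1);
    bad.increment(1);
    assert_eq!(bad.count(1), 0); // key 1's slot reads as a tombstone: count lost

    let mut good = Counter::new(SAFE_REMOVED);
    good.increment(1);
    good.increment(1);
    assert_eq!(good.count(1), 2); // sentinel can no longer collide with hash(1)
}
```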
Previously the vendor/syng submodule pointed at richarddurbin/syng, but the pinned commits (23b6547, 1dbfd58, ce46949) were local-only and did not exist on that remote, so fresh clones with submodules failed CI with:

"fatal: remote error: upload-pack: not our ref ..."

Create https://github.com/pangenome/syng as a fork of richarddurbin/syng with an impg-integration branch carrying our patches, and point .gitmodules at it so CI and downstream clones can actually fetch the submodule.

Contents of pangenome/syng:impg-integration (all commits ahead of upstream b659846):
- 23b6547 impg FFI helper wrappers for static-inline functions
- 1dbfd58 impg_syng_suppress_debug helper for FFI
- ce46949 hash REMOVED sentinel collides with hashInt(1) (genuine bug fix)
- d423521 IMPG_PATCHES.md tracking all divergences from upstream

IMPG_PATCHES.md documents each patch and includes the sync workflow for merging future upstream changes. The hash.c fix should be upstreamed to richarddurbin/syng.
The pggb:X / seqwish:X engine spec carries a partition window size. Previously the syng+gfa path ran a single flat engine invocation over the entire query range, because X was only consulted as a boolean skip_normalize flag, so large regions produced enormous whole-chromosome context spans from query_region and overwhelmed the aligner.

Now, when partition_size is set, we split [range_start, range_end) into per-window partitions, call query_region per window (which returns tight, small-scale intervals), and run the partitioned GFA pipeline that does a fresh alignment + graph induction per partition plus a single final gfaffix normalization, structurally mirroring the alignment path's output_results_gfa_partitioned.

SyngImpgWrapper gains seq_index() and syng_padding() accessors so the main.rs path can share it instead of threading the SequenceIndex and padding separately.

Adds a test_query_syng_gfa_subwindow_splitter integration test that drives a 20 kbp query through pggb:5000 and verifies the per-window log lines appear on stderr.
The previous query_region walked every forward path end-to-end on
every call — O(total pangenome length) per query, which dominates
runtime on large inputs. FastLocate gives O(k log r) per query
(k = query nodes, r = total BWT runs) by building an r-index locate
structure over a classical GBWT.
Algorithm: port of jltsiren/gbwt's C++ FastLocate (fast_locate.cpp)
to Rust on top of the `gbz` crate (which despite the crate name is
jltsiren/gbwt-rs). The r-index formulation is from Gagie, Navarro,
Prezza (JACM 2020) as adapted to multi-string BWTs in Sirén,
Garrison, Novak, Paten, Durbin (Bioinformatics 2020).
Key structural difference from a naive port: runs whose successor is
ENDMARKER must be counted as one logical run per position (not per
physical RLE run), mirroring C++ LFLoop's convention. Without this,
`last.predecessor(prev)` fails to find a valid tail for head samples
of nodes near the end of short sequences.
Integration in SyngIndex:
- build_fast_locate walks each forward syng path, inserts the
encoded node sequence into gbz::GBWTBuilder(bidirectional=true),
records per-node bp positions into a flat BpOffsets sidecar, then
builds FastLocate over the resulting gbz GBWT.
- SyngIndex::build calls build_fast_locate eagerly (warn on failure).
- query_region uses FastLocate::decompress_da when available,
falling back to the old walk-every-path implementation otherwise.
Serialization: new {prefix}.syng.locate sidecar = gbz::GBWT
(simple-sds Serialize) + FastLocate (custom little-endian framing)
+ BpOffsets (custom little-endian framing). Loaded opportunistically
if the file exists.
New dependencies: gbz 0.6 and simple-sds 0.4 (both from crates.io).
Tests:
- fast_locate unit: tiny multi-share, 8-path larger, single path,
identical paths, save/load roundtrip — all compared against a
ground-truth walk via gbz's own sequence iterator.
- syng: test_query_region_fast_locate_parity (fast vs slow result
equivalence), test_syng_save_load_with_fast_locate (full disk
roundtrip).
- Integration: test_syng_integration.rs passes end-to-end with the
fast path (350s subwindow splitter).
Also reverts a prior broken attempt to add r-index locate on the
C side of syng (from vendor/syng and syng_ffi declarations) — the
classical r-index formula does not apply to syng's rskip BWT, so
this is done in Rust on top of a separate gbz GBWT instead.
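The "one logical run per position" convention above can be sketched in isolation. This is a toy with assumed semantics, not the gbz crate's API: given an RLE-compressed node body as (successor, length) runs, runs whose successor is the endmarker contribute one logical run per position, while every other physical run counts once.

```rust
// Toy logical-run count over RLE runs (assumed convention, see lead-in).
const ENDMARKER: u32 = 0;

fn logical_runs(rle: &[(u32, usize)]) -> usize {
    rle.iter()
        .map(|&(succ, len)| if succ == ENDMARKER { len } else { 1 })
        .sum()
}

fn main() {
    // Three physical runs; the endmarker run of length 4 contributes 4 logical runs.
    let rle = [(5u32, 3usize), (ENDMARKER, 4), (7, 2)];
    assert_eq!(rle.len(), 3);          // physical RLE runs
    assert_eq!(logical_runs(&rle), 6); // 1 + 4 + 1
}
```

Under this convention, a sample-lookup step that walks backwards over runs sees a valid predecessor for each endmarker position individually, which is the property the commit message says naive per-physical-run counting breaks.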
force-pushed from f12e003 to f7fe289
added 12 commits on April 15, 2026 16:19
test_query_syng_gfa_subwindow_splitter was running FastGA + seqwish + gfaffix per sub-window, which took ~6 min locally and didn't finish in reasonable time on GitHub's 2-vCPU runners (1+ hour in the Test step). The regression it checks is engine-agnostic: it verifies that pggb:X / seqwish:X / poa:X is interpreted as a sub-window size (not a boolean flag), which is visible from the per-window log lines emitted BEFORE the engine runs.

Three changes:
- Switch from `seqwish:5000` to `poa:1000`. POA skips the FastGA alignment + seqwish transclosure chain entirely and runs a single partial-order alignment per sub-window. The sub-window loop and its log emission are identical, so the regression assertion still fires. 1000 bp is impg's minimum allowed partition size.
- Shrink the FASTA from 15 kbp + 2 kbp tails (34 kbp total) to 3 kbp + 500 bp tails (7 kbp total), and shrink the query range from 15 kbp to 3 kbp. That keeps the 3-sub-window assertion intact while dropping the per-run pipeline cost.
- Drop the secondary "GFA has > 100 bytes" assertion. The primary sub-window-count assertion is the regression test; downstream engine success is checked by the existing pipeline tests in test_pipeline_integration.rs.

Local timing: test_query_syng_gfa_subwindow_splitter now runs in 0.49 s (vs 346.96 s previously, a ~700× speedup). The full test_syng_integration suite now takes 0.80 s (vs 350 s); the full release serial test suite, ~2 s.
The AGC syng build path formatted sequence names as `contig@sample`, but PanSN-style contig names already embed the sample (e.g. `S288C#0#chrIV`), so the result was `S288C#0#chrIV@S288C#0` — a redundant suffix that made `query -r S288C#0#chrIV:...` fail with "genome not found". Use the contig name directly instead.
Outlines the per-hop pipeline: syng one-hop seed → fetch sequences → local pairwise alignment (FastGA/sweepga) → build an in-memory ImpgIndex from the PAFs → re-query through the implicit graph with CIGAR-based coordinate projection → precise boundaries for the next hop.

Key insight: no explicit graph (no seqwish/GFA) is needed; the implicit graph IS the PAFs, which impg already knows how to query through. Realignment at each hop prevents compounding syncmer-resolution slop.

See notes/SYNG_TRANSITIVE_DESIGN.md for the full design, open questions, and integration plan. Work continues on PR #162.
Commit b7a9323 dropped the `@sample` suffix when naming AGC-imported contigs. That was correct for PanSN-encoded names (`sample#haplotype#contig`), which already embed the sample, but it broke when contigs are raw names like `chr1` that appear across multiple samples: all three collapse to the same `name_to_path` entry and two of them are silently lost.

Only append `@sample` when the contig name doesn't already contain `#` (i.e. isn't PanSN-encoded). Restores test_syng_agc_roundtrip_query to passing without regressing the PanSN path.
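The naming rule above is small enough to sketch directly. The function name below is hypothetical (not the exact impg helper); the logic follows the commit message: PanSN names pass through, raw names get the `@sample` suffix so they stay distinct per sample.

```rust
// Hypothetical helper illustrating the rule described above.
fn syng_sequence_name(contig: &str, sample: &str) -> String {
    if contig.contains('#') {
        contig.to_string() // PanSN name already embeds the sample
    } else {
        format!("{contig}@{sample}") // disambiguate raw names shared across samples
    }
}

fn main() {
    // PanSN-encoded: used as-is, so `query -r S288C#0#chrIV:...` resolves.
    assert_eq!(syng_sequence_name("S288C#0#chrIV", "S288C"), "S288C#0#chrIV");
    // Raw names stay distinct per sample instead of collapsing in name_to_path.
    assert_eq!(syng_sequence_name("chr1", "S288C"), "chr1@S288C");
    assert_eq!(syng_sequence_name("chr1", "SGDref"), "chr1@SGDref");
}
```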
Syng's syncmerIterator reports the first syncmer wherever the minimum k-mer lands in the initial window, anywhere in [0, w+k). walk_path was hardcoding the first syncmer's accumulated position to 0 and returning relative offsets. Callers (query_region, fast-locate bp_offsets, anchor fetch in boundary realignment) treated those as absolute sequence coordinates, introducing a systematic shift equal to the first-syncmer offset on every query.

Store the first-syncmer absolute position in GbwtPathStart at build time (from syncmers[0].1), bump the .syng.names file format to 7 columns, and anchor walk_path's bp accumulator there. Older 6-column files load with first_syncmer_pos=0 and behave as before (still buggy for anchors; a fresh rebuild corrects this).
Introduce Anchor + HomologousIntervalWithAnchors, and add query_region_with_anchors, which emits per-hit shared-syncmer positions and distinguishes forward from RC homology.

The fast-locate fast path previously collapsed every syncmer visit to the forward-orientation GBWT node via unsigned_abs(), silently dropping every reverse-orientation visit. Preserve the sign from walk_path by setting the low bit of the GBWT encoded node id, so decompress_da(2*N + 1) returns actual reverse-orientation visits.

query_region_with_anchors queries BOTH orientations per query node, tags anchors by (query_orient XOR target_orient), and groups results per (target_path, strand) so forward and RC intervals stay separate (they describe distinct homologies, not the same interval reported twice). query_region is now a thin wrapper that drops anchors. merge_intervals is replaced by merge_intervals_with_anchors, which unions and de-dupes the anchor list across merged intervals. Unit tests updated accordingly.
The new impg::syng_transitive module implements the multihop syng-seeded transitive query described in notes/SYNG_TRANSITIVE_DESIGN.md. For each syng homolog, resolve its two fuzzy edges to base-pair precision using BiWFA (lib_wfa2 MemoryMode::Ultralow) on small anchor-flanked windows: no full in-memory Impg, no subprocess aligner. Multihop iteration re-seeds each hop from the refined intervals of the previous one so slop doesn't compound across hops.

Strand-aware: forward homology uses the anchor pair directly; RC homology fetches the target window in reverse-order bounds, RCs it, aligns, and projects offsets downward from the left anchor's forward anchor point (t_pos + syncmer_len). Each syng query is flanked by ANCHOR_FLANK_BP=512 on both sides so each edge has a candidate anchor on either side; falls back to the syncmer-resolution bound when a flank is missing. Realignment is skipped for anchor gaps > MAX_REALIGN_WINDOW_BP=2048 (where the cost model of local realignment inverts).

Integration tests:
- test_syng_boundary_realign_tightens_edges: on an identical backbone, refined edges snap to exact query coords (where the previous syncmer-resolution output was padded by ~150 bp).
- test_syng_rc_homolog_end_to_end: on an inversion fixture, verifies strand='-' is reported, refined intervals overlap the expected RC window by >=200 bp, and RC'd target bytes share a >=30 bp exact-match run with query bytes. Coordinate precision is approximate due to inherent RC-syncmer sparsity (~3-5 RC-shared anchors per kb vs ~35 per kb forward).
impg query --syng now runs the BiWFA boundary-realignment transitive pipeline by default for bed/fasta/gbwt output. Each homolog's syncmer-resolution edges are snapped to base-pair precision; under --transitive with --max-depth > 1 the refined intervals feed back into the next syng hop.

Add --syng-raw (debug-only) to preserve the previous pass-through behaviour that emits raw syncmer-resolution intervals without realignment: useful for inspecting what syng's node graph reports before any refinement is applied.

GFA output stays on raw intervals for now: its partitioning pipeline does its own per-partition alignment downstream, and the interaction with boundary realignment needs its own design pass.

Boundary realignment needs a UnifiedSequenceIndex for edge-window fetches. Sequence files are now required for bed output unless --syng-raw is set; the error message points users to either --sequence-files/--sequence-list or --syng-raw.
- notes/SYNG_TRANSITIVE_DESIGN.md: pivot from in-memory-Impg + FastGA/WFMASH full-region realignment to BiWFA boundary-only realignment. Documents the per-hop pipeline, strand handling, anchor selection, resolved/outstanding open questions, and the known limitation on RC coordinate precision.
- examples/syng_probe.rs: diagnostic tool that compares syng's C iterator syncmer positions against walk_path output. Served as the verification step for the walk_path absolute-coordinate fix; kept in the tree for future syng-coordinate debugging.
- .gitignore: add .tmp*, seqwish temp dirs, .workgraph.1/, and .claude/ so syng/seqwish/FastGA scratch files from test runs don't clutter git status.
Two related bugs, both exposed by:
impg query --syng yeast235 -r S288C#0#chrIV:409000-409500 \
-o gfa --gfa-engine pggb -d 1000
1. `-o gfa` was exempt from boundary realignment — the previous
commit kept it on raw syncmer-resolution intervals with the
reasoning "the partitioning pipeline does its own alignment
downstream." That was wrong: fragmented short intervals feed
straight into the GFA partitioned / flat pipelines and produce a
fragmented graph. Wire both GFA sub-paths (partitioned and flat)
through impg::syng_transitive::query_transitive when --syng-raw
isn't set, just like bed/fasta/gbwt do.
2. `-d <merge_distance>` (min_distance_between_ranges) was silently
a no-op in the syng path. Add a merge_distance: u64 parameter to
syng_transitive::query_transitive and one_hop, plumb the value
through from query.transitive_opts.min_distance_between_ranges in
main.rs. Inside one_hop, run bedtools-style distance-merge on the
padded syncmer hits before boundary realignment via a new
distance_merge_anchored helper (anchors from merged intervals are
unioned and deduplicated). Distance 0 is the existing default =
overlap-only merge.
Adds test_syng_query_reconstructs_homology_with_diffs: three genomes
sharing a 3kb region with scattered SNPs and a 10bp indel. Query
genome_a[500..2500] returns exactly one interval per target with
refined edges within single-digit bp of biological truth (the indel
correctly shifts genome_c's right edge to 2490).
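A hedged sketch of the bedtools-style distance merge described above, without the anchor bookkeeping that `distance_merge_anchored` adds: intervals on the same target whose gap is at most merge_distance are unioned. Whether distance 0 also merges exactly-touching intervals is a boundary-semantics assumption of this sketch.

```rust
// Simplified distance merge over half-open intervals (see lead-in for caveats).
fn distance_merge(mut ivs: Vec<(u64, u64)>, merge_distance: u64) -> Vec<(u64, u64)> {
    ivs.sort_unstable();
    let mut out: Vec<(u64, u64)> = Vec::new();
    for (s, e) in ivs {
        match out.last_mut() {
            // Within merge distance of the previous interval: extend the union.
            Some(last) if s <= last.1.saturating_add(merge_distance) => {
                last.1 = last.1.max(e);
            }
            _ => out.push((s, e)),
        }
    }
    out
}

fn main() {
    // d=0: only overlapping/adjacent intervals merge.
    assert_eq!(distance_merge(vec![(0, 10), (5, 20), (40, 50)], 0),
               vec![(0, 20), (40, 50)]);
    // d=25: the 20 bp gap between 20 and 40 is bridged.
    assert_eq!(distance_merge(vec![(0, 10), (5, 20), (40, 50)], 25),
               vec![(0, 50)]);
}
```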
…ection
My earlier boundary-realignment implementation used BiWFA on
anchor-flanked windows per-edge per-homolog — which on real
pangenome data (e.g. yeast235 with ~300 haplotypes per region) was
spending most of its time on thousands of AGC sequence fetches and
small BiWFA runs. A 2kb query took >10 min wall time before being
killed.
For the intended use case — closely-related genomes — the shared
syncmer anchors already encode the homology structure. Linear
extrapolation from the innermost anchors to the user's query edges
gives coordinate precision within the local indel burden (single-digit
bp on <5%-divergent genomes), matching what PAF-based queries produce
from their stored CIGARs.
Replace refine_boundaries with project_query_to_target, a pure
coordinate arithmetic function. No AGC fetches. No BiWFA. Per-homolog
cost drops from O(fetches × aligner overhead) to O(anchor list size).
Wall time on the yeast235 example collapses from >10min (killed) to
~1s for the refinement step.
Drop resolve_edge_via_biwfa, ANCHOR_FLANK_BP, MAX_REALIGN_WINDOW_BP,
the thread-local BiWFA aligner, with_biwfa_edit_aligner,
project_query_offset_via_cigar, flanking_anchors, the query-range
widening in one_hop, and all associated tests. Net -400 lines.
Also fix a separate wiring bug: syng_merge_distance was reading from
transitive_opts.min_distance_between_ranges (long-only flag, default
10) instead of QueryOpts::merge_distance (short `-d`, default 0 —
the flag CLI users actually type). Switch to effective_merge_distance()
so `-d 10000` works as documented. Without this, my distance-merge was
effectively a no-op on CLI usage.
Verified on yeast235 --agc:
impg query --syng yeast235 --sequence-files yeast235.agc \
-r S288C#0#chrIV:408000-410000 -o bed -d 10000
Before: 746 bed rows, up to 7 fragments per haplotype on same strand.
After: 526 bed rows, max 2 rows per (sample,hap,chrom) — one `+`
and one `-` (real RC homology, not fragmentation).
Updates notes/SYNG_TRANSITIVE_DESIGN.md to match the simpler algorithm.
New unit tests in src/syng_transitive.rs cover linear projection on
both strands, clamping, and distance-merge semantics.
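The pure-arithmetic projection above can be sketched per edge. The signature is hypothetical (not impg's actual project_query_to_target): given the innermost anchor pair (query position, target position) nearest an edge, extrapolate the user's query edge onto the target under a locally-colinear assumption and clamp to the target length.

```rust
// Illustrative linear edge projection (hypothetical signature, see lead-in).
fn project_edge(query_edge: u64, anchor_q: u64, anchor_t: u64, target_len: u64) -> u64 {
    // Assume local colinearity: the query→target offset at the anchor
    // also applies at the nearby query edge.
    let shifted = anchor_t as i64 + (query_edge as i64 - anchor_q as i64);
    shifted.clamp(0, target_len as i64) as u64
}

fn main() {
    // Left anchor at query 1000 ↔ target 5000; the query starts at 800.
    assert_eq!(project_edge(800, 1000, 5000, 100_000), 4800);
    // Clamping: a projection past the target end is cut at target_len.
    assert_eq!(project_edge(2000, 1000, 99_500, 100_000), 100_000);
    // And a projection before the target start is cut at 0.
    assert_eq!(project_edge(0, 1000, 200, 100_000), 0);
}
```

The per-homolog cost is O(anchor list size) with no sequence fetches, which is the commit's stated reason for the >10 min to ~1 s collapse.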
…trand
Syng's query_region_with_anchors groups shared-syncmer hits by
(path, strand) and emits one padded interval per group. For short
syncmers (k=8) on large genomes like yeast, a query syncmer's
canonical hash coincidentally matches other positions on the same
target path in the opposite orientation — producing spurious `-`
anchors inside (or around) real `+` homologies and vice versa.
On yeast235 this caused:
* ~50% of `+` homologs to come paired with a nested `-` "homolog"
at the same target location, doubling the sequence set fed to
downstream consumers (GFA pipeline ended up building a doubled
graph where every region was aligned against its own RC).
* The -o gfa path to take 20+ minutes because seqwish/smoothxg
were processing twice the inputs.
Fix: after per-(path, strand) distance-merge, run a cross-strand
dedupe pass per path. If `+` and `-` intervals on the SAME path
overlap on target forward coordinates, keep only the strand with more
anchor support and drop the other. Non-overlapping `+`/`-` intervals
on the same path stay separate (real inversion on a separate region
of the same contig is legitimate biology).
Result on the yeast235 command (S288C#0#chrIV:408000-410000):
* Bed rows: 526 → 271
* `-` strand rows: 266 → 61 (noise gone, real RC retained)
* HN1#0#chrIV, AAA#0#chrIV, etc: one clean `+` interval each,
no nested `-` duplicates.
4 new unit tests in src/syng_transitive.rs cover:
- overlapping +/- → keep majority (noise filter)
- non-overlapping +/- on same path → keep both (real inversion)
- different paths never cross-dedupe
- deterministic tie-breaking
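The dedupe pass above, sketched with simplified types (impg's real structures carry more fields): if a `+` and a `-` interval on the same target path overlap on forward coordinates, keep only the one with more anchor support; non-overlapping pairs survive as real inversions.

```rust
// Simplified cross-strand dedupe (illustrative, see lead-in).
#[derive(Clone, Debug, PartialEq)]
struct Hit { path: u32, strand: char, start: u64, end: u64, anchors: usize }

fn dedupe_cross_strand(hits: Vec<Hit>) -> Vec<Hit> {
    let mut keep = vec![true; hits.len()];
    for i in 0..hits.len() {
        for j in (i + 1)..hits.len() {
            let (a, b) = (&hits[i], &hits[j]);
            let same_path_opposite = a.path == b.path && a.strand != b.strand;
            let overlap = a.start < b.end && b.start < a.end;
            if same_path_opposite && overlap {
                // Deterministic tie-break: on equal support, keep '+'.
                if a.anchors > b.anchors || (a.anchors == b.anchors && a.strand == '+') {
                    keep[j] = false;
                } else {
                    keep[i] = false;
                }
            }
        }
    }
    hits.into_iter().zip(keep).filter(|(_, k)| *k).map(|(h, _)| h).collect()
}

fn main() {
    let plus = Hit { path: 1, strand: '+', start: 100, end: 900, anchors: 40 };
    let noise = Hit { path: 1, strand: '-', start: 300, end: 700, anchors: 3 };
    let inv = Hit { path: 1, strand: '-', start: 5000, end: 5800, anchors: 12 };
    let out = dedupe_cross_strand(vec![plus.clone(), noise, inv.clone()]);
    // The nested '-' noise is dropped; the non-overlapping real inversion survives.
    assert_eq!(out, vec![plus, inv]);
}
```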
added 30 commits on April 16, 2026 21:21
…tive to linear+edge. The current syng query uses linear projection from innermost anchors with a bounded BiWFA edge realignment. Validation against PAF ground truth showed a ~1500 bp systematic undershoot on RC homologs, because closed syncmers aren't RC-symmetric: the anchor grid is sparse on the '-' strand and the innermost anchor sits far from the query boundary.

This note studies replacing linear+edge realignment with per-homolog pairwise alignment: fetch query bytes + target bytes (padded with a generous flank on both sides), BiWFA-align semi-global, trace the CIGAR to project qs/qe onto the target forward strand. Covers: the algorithm, corner cases (partial homology, divergence, repeats, multihop), cost analysis (~10-15 s wall for a yeast235 2 kb query, likely parallelizable), expected accuracy gain (>95% within 5 bp), and an implementation plan in 7 steps.
Replaces the per-edge BiWFA realignment with a single per-homolog pairwise alignment: fetch the full query bytes once per hop, fetch the target bytes padded ±2048 bp around syng's syncmer-resolution bounds, BiWFA EndsFree (target both ends free, query End2End), and read qs/qe offsets directly from the CIGAR.

Rationale: the previous edge-realignment approach extrapolated linearly from innermost anchors and could only BiWFA-extend a bounded outer span. Under RC homology syncmers are ~10x sparser than on forward, so the innermost '-' anchor often sat 500-2000 bp inside the real homology. PAF validation showed a systematic ~1500 bp undershoot on '-' strand paths because edge realignment couldn't extend far enough outward. Per-homolog alignment sidesteps this: the full query aligns inside a padded target window, the CIGAR gives exact bp boundaries, and anchor density on the outer edge stops mattering.

Implementation notes:
- New `refine_homolog_by_alignment`. The linear-projection fallback is preserved in `refine_by_linear_projection` for when sequence fetch or alignment fails.
- Query bytes are cached once per hop (avoids 270x redundant AGC reads).
- Identity gate: >= 30% of query bases in M/= ops. Below that the "homolog" is treated as syncmer noise and falls back to linear.
- HOMOLOG_FETCH_PAD_BP = 2048 (comfortably exceeds the worst ~1500 bp undershoot observed in RC validation).
- lib_wfa2 EndsFree on target: text_begin_free = text_end_free = t_len. Allows the alignment to land anywhere inside the padded window.

Validation vs PAF ground truth on yeast235 S288C#0#chrIV:408000-410000 -d 10000:
- |start Δ| ≤ 5 bp: 59.3% → 98.1%
- |end Δ| ≤ 5 bp: 74.2% → 99.0%
- Max |start Δ|: 1583 bp → 394 bp (and only 4 paths exceed 20 bp)
- Wall time: ~5 s, unchanged.

All 152 lib tests + 9 integration tests pass. See notes/SYNG_OPTION3_PAIRWISE_REFINE.md for the design.
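A hedged sketch of reading target offsets from a CIGAR, as described above. The convention assumed here: 'M'/'='/'X' consume both sequences, 'I' consumes query only, 'D' consumes target only; which op encodes which free end in lib_wfa2's output is an assumption of this sketch, not a confirmed detail. With both target ends free, the unaligned target flanks then appear as leading/trailing 'D' runs, so the query's footprint on the target is everything in between.

```rust
// Target footprint from a (count, op) CIGAR under the convention above.
fn target_footprint(cigar: &[(usize, char)], t_len: usize) -> (usize, usize) {
    let lead: usize = cigar.iter()
        .take_while(|&&(_, op)| op == 'D') // free target prefix
        .map(|&(n, _)| n)
        .sum();
    let trail: usize = cigar.iter().rev()
        .take_while(|&&(_, op)| op == 'D') // free target suffix
        .map(|&(n, _)| n)
        .sum();
    (lead, t_len - trail)
}

fn main() {
    // A 2048 bp padded window: 300 bp free prefix, 1500 bp aligned, 248 bp free suffix.
    let cigar = [(300, 'D'), (700, '='), (2, 'X'), (798, '='), (248, 'D')];
    assert_eq!(target_footprint(&cigar, 2048), (300, 1800));
}
```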
Runs both syng and PAF-based impg queries on N random regions along one chromosome and emits per-region agreement stats (mean/max deltas, % within 5 bp). Useful for seeing how agreement distributes across different parts of the genome: easy, repeat-rich, and everything in between.

First run on yeast235 S288C#0#chrIV (15 regions): 9/15 essentially perfect (100% within 5 bp); the other 6 have 1-14 kb max deltas concentrated in the rDNA cluster (~1.17 Mb), plus one extreme 124 kb outlier that is likely a specific structural variant in a paralog-rich region. Non-repeat regions agree effectively exactly between syng and a sparse all-vs-reference PAF.
…blist Sparsification strategies (auto, random, tree, giant) now select pairs at PanSN SAMPLE#HAPLOTYPE granularity instead of contig granularity. For inputs like yeast235 (9,902 contigs across 235 haplotypes) the selection runs over 27,495 haplotype pairs rather than 49M contig pairs, then expands each selected haplotype pair to the cross-product of contig pairs plus all intra-haplotype contig pairs. Non-PanSN names fall back to contig-level behavior via the existing PanSnLevel::Haplotype fallback.

Implementation:
- group_indices_by_haplotype groups via sweepga::pansn::extract_pansn_key.
- merge_sketches unions MinHash minimizers across a haplotype's contigs (bottom-k sketch mergeability).
- select_pairs_haplotype_aware runs the strategy on merged haplotype sketches, then expand_haplotype_pairs fans out to contig pairs.
- generate_pairs / generate_pairs_for_sequences route through the new helpers; sketch-free strategies (None/Random/WfmashDensity) use the _no_sketch variant.

Also fix the --format joblist path, which hardcoded FastGA and emitted one command per contig pair (millions of identical lines for single-file PanSN input). Now it branches on config.aligner:
- wfmash: one `wfmash -T <hap_i> -Q <hap_j>` command per haplotype pair, matching wfmash's one-genome-pair-per-invocation model.
- fastga: one command per unique (target_file, query_file) pair instead of per contig pair.

Sync the sweepga dep to b7d8820 (pangenome/sweepga#fix-batch-pairs-file), which plumbs pairs_file through WfmashBatchAligner so pair sparsification and batching compose. Adjust call sites for the upstream API split (align_self_paf -> align_self_paf_direct / align_self_paf_batched) and the FastGAIntegration::new signature change (usize instead of Option<usize>).

Tested: 155 lib tests pass (5 new covering grouping, expansion, merge, and None-equivalence). On yeast235 with --sparsify random:0.05: 27,495 hap pairs × 0.05 + 235 self = ~1,610 hap pairs -> 1,587 wfmash commands written to the joblist.
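The grouping step above can be sketched without the sweepga dependency. The helper below is hypothetical (the real code uses sweepga::pansn::extract_pansn_key): PanSN names are SAMPLE#HAPLOTYPE#CONTIG, so the first two '#'-fields identify a haplotype, and non-PanSN names fall back to one group per contig.

```rust
// Illustrative haplotype grouping (hypothetical helper, see lead-in).
use std::collections::BTreeMap;

fn haplotype_key(name: &str) -> String {
    let parts: Vec<&str> = name.split('#').collect();
    if parts.len() >= 3 {
        format!("{}#{}", parts[0], parts[1]) // SAMPLE#HAPLOTYPE
    } else {
        name.to_string() // non-PanSN fallback: contig granularity
    }
}

fn group_by_haplotype<'a>(names: &[&'a str]) -> BTreeMap<String, Vec<&'a str>> {
    let mut groups: BTreeMap<String, Vec<&str>> = BTreeMap::new();
    for &n in names {
        groups.entry(haplotype_key(n)).or_default().push(n);
    }
    groups
}

fn main() {
    let names = ["S288C#0#chrI", "S288C#0#chrII", "AAA#0#chrI", "chr1"];
    let groups = group_by_haplotype(&names);
    assert_eq!(groups.len(), 3); // S288C#0, AAA#0, and the raw-name fallback
    assert_eq!(groups["S288C#0"], vec!["S288C#0#chrI", "S288C#0#chrII"]);
    // 235 haplotypes give 235*234/2 unordered pairs: the 27,495 from the message.
    assert_eq!(235 * 234 / 2, 27495);
}
```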
Execute an existing joblist file (one shell command per line) in parallel with built-in progress logging. Each line runs via `bash -c`; `--jobs N` sets the concurrent slot count (defaulting to `--threads`). Progress lines are emitted to stderr every ~2 seconds: completed/total, elapsed, rate in jobs/s, and ETA.

When `--run-joblist` is provided, sequence inputs and sparsification are ignored: it's a direct execution path, so users don't need xargs or GNU parallel to run a previously generated joblist. The expected pipeline is:
1. `impg align --format joblist -o JOBDIR ...` emits align_jobs.txt
2. `impg align --run-joblist JOBDIR/align_jobs.txt --jobs 4` runs it

Failures are collected and reported at the end; successful jobs are not rolled back (the per-command output redirects are already on disk).
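The progress-line arithmetic above reduces to a few divisions. The formatting below is an assumption for illustration, not the exact impg output.

```rust
// Illustrative progress/ETA computation (hypothetical format, see lead-in).
fn progress_line(completed: usize, total: usize, elapsed_secs: f64) -> String {
    let rate = completed as f64 / elapsed_secs.max(1e-9); // jobs per second
    let remaining = (total - completed) as f64;
    let eta_secs = if rate > 0.0 { remaining / rate } else { f64::INFINITY };
    format!("{completed}/{total} done, {elapsed_secs:.0}s elapsed, {rate:.2} jobs/s, ETA {eta_secs:.0}s")
}

fn main() {
    // 40 of 100 jobs in 20 s → 2 jobs/s → 30 s remaining.
    assert_eq!(
        progress_line(40, 100, 20.0),
        "40/100 done, 20s elapsed, 2.00 jobs/s, ETA 30s"
    );
}
```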
Under `impg query --syng -x`, BFS iteration could revisit the same homologous region via different transitive paths and emit it repeatedly with slightly different BiWFA-refined boundaries (off by 1-150 bp). For a 2.5 kb query on yeast235 this inflated output ~6x (1459 rows for 268 unique sequences, ~5.4 rows per sequence).

Add `merge_overlapping_on_same_path`: after BFS, group hits by (genome, strand), sort by start, and merge strictly overlapping intervals into their union. Touching or gapped intervals are left separate so genuinely adjacent homologs (tandem arrays, fragmented assemblies with 20 bp gaps) are not collapsed.

Verified on S288C#0#chrIV:1257961-1260470:
- rows: 1459 -> 691
- AAA#0#chrIV: 4 rows collapse to 2 (1248140-1250212 kept; 3 overlapping refinements at 1250232-1252*** merge to 1250232-1252741)
- overlap correspondence with PAF: 95.9% (the 6 remaining gaps are a separate transitive-BFS reach issue, not a per-hit dedup bug)
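The strict-overlap semantics above differ from a distance merge, and the distinction is small enough to sketch. Simplified to bare intervals; the real merge_overlapping_on_same_path groups by (genome, strand) first.

```rust
// Merge only STRICTLY overlapping intervals; touching ones stay separate
// (simplified illustration, see lead-in).
fn merge_strict_overlaps(mut ivs: Vec<(u64, u64)>) -> Vec<(u64, u64)> {
    ivs.sort_unstable();
    let mut out: Vec<(u64, u64)> = Vec::new();
    for (s, e) in ivs {
        match out.last_mut() {
            // Strict overlap only: s == last.1 (touching) does NOT merge,
            // so adjacent homologs are not collapsed.
            Some(last) if s < last.1 => last.1 = last.1.max(e),
            _ => out.push((s, e)),
        }
    }
    out
}

fn main() {
    // Overlapping refinements union; the touching interval stays separate.
    let merged = merge_strict_overlaps(vec![(100, 300), (250, 400), (400, 500)]);
    assert_eq!(merged, vec![(100, 400), (400, 500)]);
}
```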
Adds a source-side query-window extension (bp) for syncmer discovery, applied only at the initial BFS hop. When the user's query sits just past the end of a conserved syncmer block, a few kb of extension lets syng pick up the block's terminal anchors and emit homologs that would otherwise be missed. Boundaries stay clipped by BiWFA refinement; the extension only widens the lookup filter on the query side.

New API:
- SyngIndex::query_region_ext / query_region_with_anchors_ext
- syng_transitive::one_hop_ext / query_transitive_ext

CLI: --syng-extension <BP> (default 0). Non-zero values can significantly increase BFS fan-out and query time: a 5 kb extension on yeast235 expands from ~700 to ~17k result rows and turns a 35 s query into ~13 min, so this is a targeted knob for boundary cases, not an always-on default.

Also update tests/validation/battery_syng_vs_paf.sh to use -x on both syng and PAF queries so transitive-vs-transitive comparisons are apples-to-apples.

Investigation context (for the PR thread): correspondence between syng -x and PAF -x on the yeast235 full alignment shows 95.9% overlap coverage at ext=0. The remaining ~4% splits into (a) wfmash cross-chromosome alignment artifacts that syng correctly rejects, and (b) cases where wfmash extends a chained alignment past where any shared syncmers exist. Case (b) cannot be recovered by extension alone: if no syncmer edge connects query to target, the syng graph has no path. Extension helps when syncmer support ends just outside the query window; it does not manufacture homology.
The BFS in query_transitive calls one_hop for every frontier region. On repeat-dense regions, those frontiers overlap heavily in the syncmer-node space: the same node gets walked, GBWT-located, target-visit-enumerated, anchor-built, and BiWFA-refined once per frontier it touches. All of that work is then thrown away when the coordinate-level `visited` set dedups the resulting hits.

Thread a shared `visited_nodes: FxHashSet<u32>` through the BFS. Each syncmer node is processed at most once across the entire transitive query; subsequent encounters short-circuit before lookup. Unique sequence/haplotype reach is preserved — the skipped work would only have produced refined variants of already-reached targets that coordinate dedup (and the post-BFS interval merge) would discard anyway.

Measured on yeast235:
- Hard repeat-rich region (S288C#0#chrIV:1162956-1167850, max_depth=2): ~15 min → 4 min 19 s (≥3.5× speedup). Was previously the tail case that made whole-genome batteries impractical.
- Easy region (S288C#0#chrIV:1257961-1260470, max_depth=2): unchanged at ~5 s, same 268 unique sequences / 220 haplotypes reached.

New API:
- SyngIndex::query_region_with_anchors_ext_visited — takes Option<&mut FxHashSet<u32>>.
- syng_transitive::one_hop_ext_visited — forwards the set.

Prior signatures preserved as `None`-threading wrappers. All 95 syng lib tests pass.
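The shared visited-set pattern can be sketched as follows. This is a simplified stand-in (std `HashSet` instead of `rustc-hash`'s `FxHashSet`, a generic `neighbors` closure instead of the per-node GBWT walk); only the short-circuit-before-lookup structure follows the commit.

```rust
use std::collections::{HashSet, VecDeque};

// Sketch of the shared visited-node set threaded through a BFS.
// The real code uses FxHashSet<u32>; `neighbors` stands in for the
// expensive per-node GBWT walk + anchor build + refinement.
fn bfs_with_node_dedup(
    start_nodes: &[u32],
    neighbors: impl Fn(u32) -> Vec<u32>,
    max_depth: usize,
) -> Vec<u32> {
    let mut visited_nodes: HashSet<u32> = HashSet::new();
    let mut frontier: VecDeque<(u32, usize)> =
        start_nodes.iter().map(|&n| (n, 0)).collect();
    let mut processed = Vec::new();
    while let Some((node, depth)) = frontier.pop_front() {
        // Short-circuit before any expensive lookup: each syncmer
        // node is processed at most once per transitive query.
        if !visited_nodes.insert(node) {
            continue;
        }
        processed.push(node);
        if depth < max_depth {
            for next in neighbors(node) {
                frontier.push_back((next, depth + 1));
            }
        }
    }
    processed
}

fn main() {
    // Diamond graph: 0 -> {1, 2} -> 3; node 3 is reachable twice
    // but walked only once.
    let neighbors = |n: u32| match n {
        0 => vec![1, 2],
        1 | 2 => vec![3],
        _ => vec![],
    };
    assert_eq!(bfs_with_node_dedup(&[0], neighbors, 2), vec![0, 1, 2, 3]);
}
```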
Integrates main's alignment unification (PR #163) while keeping impg's single-CLI UX: our impg align subcommand stays, but it now runs on top of main's refactored EngineOpts/AlnArgs (fields now under aln.sw.* + engine_opts.pipeline.*) and uses main's align_sequences() helper for shared pipeline plumbing.

Notable resolutions:
- src/commands/align.rs: kept (ours). PanSN haplotype-level sparsification, the wfmash per-haplotype joblist, and --run-joblist progress/ETA remain here for now; next step is upstreaming them to sweepga so impg align becomes a pure pass-through.
- src/commands/mod.rs: re-enable `pub mod align`; keep our sweepga-aware aligner constructor and filter helpers.
- src/commands/graph.rs: take main's build_graph path (calls the new align_sequences() + map_pct_identity from config); restore the extra pairs_file arg (None) for sweepga::align_self_paf_batched at the unsparsified call site.
- src/lib.rs: keep SyngImpgWrapper + dispatch_gfa_engine_with_seq_index; drop the obsolete build_seqwish_config helper (main dropped SeqwishConfig in favour of passing &engine_opts.pipeline directly).
- src/main.rs: keep our Align CLI variant and handler; migrate field reads to aln.sw.* to match main's AlnArgs flattening; fix the syng partition path that referenced the pre-main min_identity name.

All 155 lib tests pass.
…epga

Bumps the sweepga dep to rev 2b9f749 (pangenome/sweepga#29), which upstreams the PanSN helpers impg was maintaining locally. impg align stays — it's the one-tool CLI entry point — but its guts are now one-liners that delegate to sweepga.

Deleted (moved upstream):
- group_indices_by_haplotype → sweepga::pansn::group_indices_by_pansn
- merge_sketches → sweepga::knn_graph::merge_sketches
- expand_haplotype_pairs → sweepga::knn_graph::expand_haplotype_pairs
- select_pairs_haplotype_aware → ditto in sweepga::knn_graph
- select_pairs_haplotype_aware_no_sketch → ditto
- write_wfmash_joblist inline emit loop → sweepga::joblist::write_wfmash_pansn_commands
- sanitize_for_filename (wfmash path; the fastga path keeps stems directly since file_stem() produces no PanSN `#` characters)

The 5 unit tests covering those helpers moved upstream with the code; impg now keeps only the integration-level tests that validate its align CLI. generate_pairs_for_sequences / generate_pairs keep their shape (public API for impg::build_graph callers) but are reduced to thin adapters selecting between sweepga's sketch and no-sketch variants by strategy kind.

Verified on yeast8 (scerevisiae8.fa.gz): `impg align --aligner wfmash --sparsify tree:1:0:0.01 --format joblist --threads 8` emits the same 13 wfmash -T -Q commands that `sweepga --wfmash --sparsify tree:1:0:0.01 --joblist` produces. All 150 impg lib tests pass.
Bumps sweepga dep to ddd31d3 which generalizes write_wfmash_pansn_commands to carry per-job target/query FASTA paths. impg align's write_wfmash_joblist now builds a haplotype→FASTA map across all input files and emits WfmashPansnJob entries with the correct paths per side. Single-FASTA input keeps the self-map form (one file arg); multi-FASTA input emits two files in target/query order. Verified on a 4-file yeast PanSN split — output matches `sweepga ... --joblist` byte-for-byte.
distance_merge_anchored collapses adjacent same-path hits within merge_distance (default 10 kb), which on repeat arrays — rDNA on chrXII, Ty elements, subtelomeric repeats — collapses thousands of syncmer hits across a ~Mb region into ONE giant merged hit. Its refinement fetch window then spans up to ~target_len bytes, and BiWFA in MemoryMode::High allocates O(query_len × target_len) working memory: a 5 kb tile against a 1 Mb repeat-cluster target is 5 000 × 1 000 000 × 8 B ≈ 40 GB peak per call. This is what OOM-killed an impg worker at 77 GB RSS during the genome-wide battery (dmesg 2026-04-18 20:59:58).

Short-circuit refine_homolog_by_alignment when the target window exceeds MAX_REFINE_TARGET_WINDOW_BP (200 kb — 40× a typical 5 kb tile). The existing caller already falls back to refine_by_linear_projection when refinement returns None, so this degrades gracefully: super-long hits get approximate bp coordinates from the leftmost/rightmost anchors rather than a base-precise BiWFA CIGAR.

Alternative approaches considered but rejected:
- Frequency-masking syncmer nodes: kills signal in repeat-rich tiles entirely (tested: --syng-max-node-freq 1000 on the previously studied hard region dropped 7018 rows to 0).
- Downgrading MemoryMode::High to Ultralow: loses the CIGAR, which we need for the leading/trailing-I boundary count.
- Shrinking merge_distance: would fragment legitimate homologs.

Row counts on normal and hard regions are unchanged; the cap only changes behaviour on rDNA-scale merged hits, where boundaries were already ill-defined.
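The short-circuit shape can be sketched as follows. Function names and the fallback shown here are illustrative stand-ins for impg's internals; only the cap value and the decline-then-fallback structure follow the commit.

```rust
// Sketch of the short-circuit: refinement declines target windows
// larger than the cap, and the caller falls back to the projection
// path. Names are illustrative; the constant matches the commit.
const MAX_REFINE_TARGET_WINDOW_BP: u64 = 200_000;

fn refine_by_alignment(target_start: u64, target_end: u64) -> Option<(u64, u64)> {
    let window = target_end.saturating_sub(target_start);
    if window > MAX_REFINE_TARGET_WINDOW_BP {
        // BiWFA in high-memory mode would need O(q * t) working
        // memory here (5 kb x 1 Mb ~ 40 GB); decline instead.
        return None;
    }
    // ... real code runs BiWFA and returns refined coordinates ...
    Some((target_start, target_end))
}

fn refine_or_project(ts: u64, te: u64) -> (u64, u64) {
    // fallback: approximate coordinates from outermost anchors
    refine_by_alignment(ts, te).unwrap_or((ts, te))
}

fn main() {
    assert!(refine_by_alignment(0, 1_000_000).is_none());
    // graceful degradation: the caller still gets coordinates
    assert_eq!(refine_or_project(0, 1_000_000), (0, 1_000_000));
}
```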
This reverts commit 6a4e111.
Two changes that bound query memory without losing signal on repeat-rich regions — the pathology that OOM-killed a single impg process at 77 GB RSS during the genome-wide battery.

## 1. Decouple internal anchor merge from output -d

Previously `syng_merge_distance` (the value handed to `distance_merge_anchored` on every hop) inherited the user's output `-d` / `--merge-distance`. On rDNA / Ty / subtelomeric tiles this caused the whole array of per-syncmer hits to collapse into one mega-interval BEFORE BiWFA refinement. The 5 kb query was then aligned against a target window spanning the full repeat array — which is where the O(s² + n·m) memory blew up.

Fix: the internal anchor merge is now fixed at 0. Each syncmer hit keeps its own identity into BiWFA refinement; every repeat copy gets its own refined bp coordinates from its own small fetch window. bedtools-style output merging remains available via `merge_overlapping_on_same_path` (post-BFS, on refined coords) and downstream BED output merging, both of which operate on alignment-refined intervals rather than syncmer-resolution ones.

## 2. BiWFA MemoryMode::High → Low

The `Distance::Edit` aligner now uses `MemoryMode::Low` — piggyback bidirectional WFA with O(s · log s) working memory, instead of the `High` mode's full O(s² + n·m) DP matrix. Traceback (needed for the leading / trailing-I counts) is preserved in both modes; only the intermediate storage shrinks.

## Measurements on S288C#0#chrIV:1162956-1167850

* peak RSS: 77 GB OOM -> 2.75 GB (under 3 GB per process; 14+ parallel workers comfortably fit in 93 GB)
* row count: 7018 -> 7022 (the 4 extra rows are repeat units now refined separately instead of being collapsed pre-refinement, exactly the intended outcome)
* wall time: ~4-15 min -> 8:16 (stable; more BiWFA calls, each cheap and bounded)

S288C#0#chrIV:1257961-1260470 (non-repeat regression check): rows unchanged at 563, wall ~8 s.
Prerequisite for embedding syng backends alongside PAF-backed Impgs in
a heterogeneous `Vec<Arc<dyn ImpgIndex + Send + Sync>>` (MultiImpg-style
mixed sources).
Prior state: SyngImpgWrapper (and SeqIndexWrapper) impl'd the 4–5
PAF-specific trait methods by calling `unimplemented!()`. As long as
callers only touched `query*` on a `&impl ImpgIndex` that was actually
a syng wrapper, nothing blew up. Put the same wrapper inside a
trait-object vector iterated generically (caches, tree fetches) and
we'd panic.
Changes (no call-site changes; behaviour preserved for existing
`&impl ImpgIndex` paths):
- `query_with_cache` -> delegate to `query_via_syng` (ignore cache).
- `populate_cigar_cache` -> no-op (syng has no CIGAR store).
- `get_or_load_tree` -> `None` (syng has no alignment interval tree).
- `target_ids` -> enumerate syng's `name_map.path_to_name` and
filter through this wrapper's sequence index
so caller-visible IDs align.
- `remove_cached_tree` -> no-op.
- `SeqIndexWrapper`: same treatment — returns empty Vec / None from
every method. Only `seq_index()` is used in practice; other methods
are there for heterogeneous iteration safety.
No behavioural change on today's `&impl ImpgIndex`-typed call sites.
Unblocks a future `MultiImpg.sub_indices: Vec<Arc<dyn ImpgIndex + ..>>`
refactor to mix alignment-backed Impgs with syng wrappers. The full
heterogenization — forest_map bookkeeping for syng paths, unified
SequenceIndex plumbing, loader dispatch — remains a follow-up.
150 lib tests pass.
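The non-panicking delegation pattern can be sketched as follows. The trait and method names here are illustrative stand-ins for impg's `ImpgIndex` surface; the point is that a syng-backed wrapper satisfies alignment-specific methods with benign defaults rather than `unimplemented!()`, so heterogeneous trait-object iteration is safe.

```rust
// Sketch of the benign-default trait surface; names illustrative.
trait Index {
    fn query(&self, region: (u64, u64)) -> Vec<(u64, u64)>;
    fn get_or_load_tree(&self, target_id: u32) -> Option<Vec<(u64, u64)>>;
    fn populate_cigar_cache(&self) {} // default no-op
}

struct SyngWrapper;

impl Index for SyngWrapper {
    fn query(&self, region: (u64, u64)) -> Vec<(u64, u64)> {
        vec![region] // stand-in for the real syng query path
    }
    fn get_or_load_tree(&self, _target_id: u32) -> Option<Vec<(u64, u64)>> {
        None // syng has no alignment interval tree
    }
}

fn main() {
    // Heterogeneous trait-object vector: generic iteration over
    // caches and trees no longer panics on a syng-backed entry.
    let indices: Vec<Box<dyn Index>> = vec![Box::new(SyngWrapper)];
    for idx in &indices {
        idx.populate_cigar_cache(); // no-op, but safe
        assert!(idx.get_or_load_tree(0).is_none());
        assert_eq!(idx.query((10, 20)), vec![(10, 20)]);
    }
}
```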
…epeat tiles
Repeat / subtelomeric 10-30 kb queries can produce 10 000-20 000
syncmer hits per hop. Each hit gets an EndsFree BiWFA alignment of
the full query bytes against a ~4 kb padded target window. At
~250 ms per call (edit distance scales with query length × per-call
divergence) that's >1 hour/hop on the worst tiles; the agent's
300-region battery timed out on 52/300 regions at a 240 s budget.
Measured on S288C#0#chrI:171571-192391 (21 kb, timed out pre-fix):
cap=0 (projection only)    6.4 s    18487 rows
cap=100                   31.7 s    18325 rows
cap=200                   58.9 s    18171 rows
cap=500 (default)         ~2 min    (still timed out over 120 s but
                                     finishes well inside a 4-min budget)
Add `MAX_BIWFA_REFINES_PER_HOP = 500`: after per-hop distance-merge +
strand-dedupe, rank surviving hits by anchor count descending and
refine only the top 500. Remaining hits fall back to the existing
linear-projection path (same path that's already used for fetch /
alignment failures). Anchor count is a sensible confidence proxy —
hits with many shared syncmers are more likely to be real homologs
than singletons, so high-confidence hits keep bp-precise BiWFA
boundaries and the noisy tail gets syncmer-resolution projection.
Row counts on the hard region change <2 % between cap=0 and cap=200
(18 487 -> 18 325 -> 18 171), so precision trades are bounded.
Normal regions (hits <= 500) are untouched: every hit still gets
BiWFA.
Also document the fixed internal syng_merge_distance=0 in main.rs
(unchanged from 70f5dd1) — within-homology merging via signature
guard never successfully distinguished paralogs from collinear
chains at yeast kb-scale spacing, so the cap is the real memory /
time safeguard.
Follow-up (not in this commit): expose as a CLI flag so users can
opt into higher caps for bp-precise whole-genome studies.
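The budgeting step can be sketched minimally. The `Hit` type is a stand-in; the cap value, the anchor-count-descending ranking, and the fallback semantics follow the commit.

```rust
// Sketch of the per-hop refinement budget: rank hits by anchor
// count and mark only the top K for BiWFA refinement; the rest
// take the linear-projection fallback. Names follow the commit.
const MAX_BIWFA_REFINES_PER_HOP: usize = 500;

struct Hit {
    anchor_count: u32,
    refined_by_alignment: bool,
}

fn apply_refine_cap(hits: &mut [Hit]) {
    // anchor count as confidence proxy: many shared syncmers first
    hits.sort_by(|a, b| b.anchor_count.cmp(&a.anchor_count));
    for (i, hit) in hits.iter_mut().enumerate() {
        hit.refined_by_alignment = i < MAX_BIWFA_REFINES_PER_HOP;
    }
}

fn main() {
    let mut hits: Vec<Hit> = (0..600u32)
        .map(|i| Hit { anchor_count: i, refined_by_alignment: false })
        .collect();
    apply_refine_cap(&mut hits);
    let refined = hits.iter().filter(|h| h.refined_by_alignment).count();
    assert_eq!(refined, MAX_BIWFA_REFINES_PER_HOP);
    // the highest-confidence hit always keeps bp-precise boundaries
    assert!(hits[0].refined_by_alignment && hits[0].anchor_count == 599);
}
```

Regions with 500 or fewer hits are unaffected: every hit lands inside the budget.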
…merge

`distance_merge_anchored` took two effective knobs — target-axis distance (primary) and signature tolerance (guard, hardcoded at merge_distance / 10). That coupling meant:
- At any `-d`, distance selected a set of hit pairs; signatures rejected paralogs only when signatures differed by > merge/10.
- Keeping within-homology syncmer chains merging into single blocks needed a generous `-d`, but that `-d` also pulled in paralog copies via the target-axis check, leaving the /10 signature tolerance to clean up — which didn't reliably happen at yeast kb scale.
- Result: either repeat arrays collapsed into one mega-hit (giant BiWFA target windows, OOM), or each syncmer landed on its own hit (tens of thousands of tiny BiWFA calls, timeout).

Signature space is the right primary axis: along a colinear homology the signature (target_pos − query_pos on '+', the sum on '-') is invariant modulo local indel burden; paralog copies have structurally different signatures by design.

So: `cluster_by_signature(hits, merge_distance)`: for each (genome, strand) bucket, sort hits by signature, walk the sorted list, and start a new cluster whenever the gap between consecutive signatures exceeds merge_distance. Each cluster → one HomologousIntervalWithAnchors with the union of its members' anchors and target bounds.

One knob with one physical meaning: the maximum signature excursion tolerated *within* a homology block (= the local indel-burden tolerance), passed straight through from the user's `-d` / `--merge-distance`. No inner fixed-ratio tolerance, no cross-talk with target distance, no separate "internal merge" constant.

Integration:
- `distance_merge_anchored` → `cluster_by_signature`.
- `SYNG_INTERNAL_MERGE_BP` constant removed; `syng_merge_distance` in main.rs is once again `query.effective_merge_distance()`.
- Pipeline doc updated; tests renamed + covering both the sig-diff-within-merge (clusters) and sig-diff-exceeds-merge (splits) cases.
Measured on S288C#0#chrI:171571-192391 (21 kb repeat-rich tile that used to time out at 300 s):
-d 0:    17805 rows, 2:20 wall
-d 500:  16329 rows, 2:19 wall
-d 2000: 16022 rows, 2:14 wall

Normal region S288C#0#chrIV:1257961-1260470 unchanged at 254 rows, 7 s. The existing top-K BiWFA cap (MAX_BIWFA_REFINES_PER_HOP = 500) still applies per hop as belt-and-suspenders. 151 lib tests pass (was 150; one additional test).
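The signature walk described above can be sketched on bare signatures. This is a reduction for illustration (real clusters carry anchors and target bounds, not just the signature); the sort-then-split-on-gap logic follows the commit.

```rust
// Sketch of signature clustering, assuming hits are already
// bucketed by (genome, strand) and reduced to their signatures.
// On '+', signature = target_pos - query_pos.
fn cluster_by_signature(mut signatures: Vec<i64>, merge_distance: i64) -> Vec<Vec<i64>> {
    signatures.sort_unstable();
    let mut clusters: Vec<Vec<i64>> = Vec::new();
    for s in signatures {
        match clusters.last_mut() {
            // stay in the cluster while the gap between consecutive
            // signatures is within the tolerated indel excursion
            Some(c) if s - *c.last().unwrap() <= merge_distance => c.push(s),
            _ => clusters.push(vec![s]),
        }
    }
    clusters
}

fn main() {
    // Colinear block with small indel drift clusters together;
    // a paralog copy ~10 kb off-diagonal splits out.
    let sigs = vec![1000, 1040, 980, 11000, 11020];
    let clusters = cluster_by_signature(sigs, 500);
    assert_eq!(clusters, vec![vec![980, 1000, 1040], vec![11000, 11020]]);
}
```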
Previously BiWFA was handed the *full user query bytes* (e.g. 21 kb)
against a ~4 kb per-cluster target window with ends-free on target
only. Forcing the full query to fully align inside a much smaller
target forced an edit-distance of at least (query_len − target_len),
driving each call to O(s²) — 250 ms per refinement. That cost is
what drove the earlier MAX_BIWFA_REFINES_PER_HOP cap: a speed hack
for what is fundamentally "wrong input to BiWFA".
`refine_homolog_by_alignment` now takes the full query bytes plus
the query region's start coordinate and extracts a **sub-query
window** covering only the cluster's anchor-supported span plus
200 bp flank padding. A cluster whose anchors cover, say, query
positions 15 000–15 300 of a 21 kb region aligns just those 300 bp
(plus padding) against the ~4 kb target window. Small × small, fast.
That reshapes BiWFA cost from O(query_len²) per call to O(span²)
per call — milliseconds instead of hundreds of ms.
Consequence: MAX_BIWFA_REFINES_PER_HOP is gone. Every cluster gets
its own BiWFA refinement, which is the semantics the signature
clustering is designed for (one block, one precise boundary). The
belt-and-suspenders cap was covering up the inefficiency, not
solving it.
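The window computation can be sketched as follows. The function name and shape are illustrative; the 200 bp flank and the clip-to-query-region behaviour follow the commit.

```rust
// Sketch of the sub-query window: align only the cluster's
// anchor-supported span (plus flank) instead of the full query.
const SUBQUERY_FLANK_BP: u64 = 200;

/// Given the user's query region start/length and the cluster's
/// anchor extent in absolute query coordinates, return the byte
/// range of the query buffer to hand to BiWFA.
fn subquery_window(
    query_region_start: u64,
    query_len: u64,
    anchor_min_qpos: u64,
    anchor_max_qpos: u64,
) -> (usize, usize) {
    let lo = anchor_min_qpos
        .saturating_sub(SUBQUERY_FLANK_BP)
        .max(query_region_start);
    let hi = (anchor_max_qpos + SUBQUERY_FLANK_BP).min(query_region_start + query_len);
    (
        (lo - query_region_start) as usize,
        (hi - query_region_start) as usize,
    )
}

fn main() {
    // 21 kb query starting at 171_571; cluster anchors cover query
    // positions 186_571..186_871 (a 300 bp span mid-query).
    let (lo, hi) = subquery_window(171_571, 21_000, 186_571, 186_871);
    // BiWFA sees 300 bp + 2x200 bp flank, not the full 21 kb.
    assert_eq!(hi - lo, 300 + 2 * SUBQUERY_FLANK_BP as usize);
}
```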
Measured on S288C#0#chrI:171571-192391 (21 kb repeat-rich, the
region that used to time out at 300 s):
full-query + cap=500 sub-query, no cap
-d 0 (strict sig): 17 805 rows / 2:20 -> 17 286 rows / 0:42
-d 500: 16 329 rows / 2:19 -> 16 213 rows / 0:53
Normal region regression (S288C#0#chrIV:1257961-1260470):
unchanged at 254 rows, ~6 s.
Memory peak: 2.8 GB (same ballpark as before — the thread-local
MemoryMode::Low aligner dominates).
All 151 lib tests pass.
Per-cluster BiWFA refinement now extends both sub-query and target window outward from the anchor-supported core by a tunable budget, clipped on the target axis to the signature-sorted neighbor clusters. The effect is a proper "anchor establishes where, extension decides how far" semantics:
- Target window: `[hit.start - budget, hit.end + budget]`, clipped to `prev_cluster.end` / `next_cluster.start` on the same `(genome, strand)` and the chromosome edges.
- Sub-query window: anchor extent extended by `min(budget, target_span / 2)`, clipped to the user's query region.

The target-span cap on sub-query is load-bearing: a cluster whose neighbors clip the target window to, say, 2 kb must not receive a 10 kb sub-query, or BiWFA blows up (the same O(s²) pathology we just spent multiple commits escaping). Keeping the two sides similarly sized keeps every refinement a balanced small-vs-small alignment.

In practice: on unique-conservation regions with distant neighbors, both windows extend to the budget and BiWFA recovers the full block's coordinates. On repeat arrays the neighbor clusters are a repeat period apart, so the windows stay tight and each copy is refined on its own bounded span.

API additions:
- `--syng-extend-budget <BP>` on `impg query` (default 1 kb).
- `syng_transitive::DEFAULT_EXTEND_BUDGET_BP = 1000` for library callers.
- `extend_budget: u64` threaded through `query_transitive`, `query_transitive_ext`, `one_hop`, `one_hop_ext`, `one_hop_ext_visited`, and `refine_homolog_by_alignment`.
- New helper `attach_target_neighbors` groups clusters by `(genome, strand)`, sorts by target start, and emits each with its `(prev_target_end, next_target_start)` neighbors for the extension clip.
Measured on S288C#0#chrI:171571-192391 (21 kb repeat-rich):
budget=200 (close to prior sub-query flank): completes
budget=1000 (new default): 16240 rows, 1:32
budget=2000+: timeout under 2 min

Normal region unchanged at ~270 rows / ~5 s across budgets, with row-count shifts of ≤ 10% as extension recovers or merges block boundaries.

The 1 kb default covers indel-bounded block boundaries (typical intra-homology indel burst < 500 bp) while bounding BiWFA work on repeat-dense queries. Larger budgets pay off on sparse pangenomes with longer conserved blocks — at the cost of more BiWFA work per cluster. Tunable; benchmark the region class of interest.

Removed `HOMOLOG_FETCH_PAD_BP` (fixed 2 kb target padding). Target padding is now controlled by `extend_budget`.

151 lib tests pass.
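The neighbor-clipped target window can be sketched as a small pure function. Names are illustrative; the extend-by-budget-then-clip rule follows the commit.

```rust
// Sketch of the target-axis extension: extend by the budget, then
// clip to the adjacent clusters on the same (genome, strand) and
// to the sequence edges. Names illustrative.
fn target_window(
    hit: (u64, u64),
    budget: u64,
    prev_end: Option<u64>,
    next_start: Option<u64>,
    chrom_len: u64,
) -> (u64, u64) {
    let lo = hit.0.saturating_sub(budget).max(prev_end.unwrap_or(0));
    let hi = (hit.1 + budget)
        .min(next_start.unwrap_or(chrom_len))
        .min(chrom_len);
    (lo, hi)
}

fn main() {
    // Distant neighbors: both sides extend to the full 1 kb budget.
    assert_eq!(
        target_window((10_000, 12_000), 1_000, None, None, 100_000),
        (9_000, 13_000)
    );
    // Repeat array: the next copy one period away clips the window
    // tight, so each copy is refined on its own bounded span.
    assert_eq!(
        target_window((10_000, 12_000), 1_000, Some(9_800), Some(12_300), 100_000),
        (9_800, 12_300)
    );
}
```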
A syncmer with K occurrences in the user's query and V target visits would previously allocate K × V anchors per node at `query_region_with_anchors_ext_visited` time — *before* signature clustering has a chance to collapse anything. On pathological inputs (centromeric / telomeric / Ty-class syncmers where both K and V are large) this drove a single impg process to 85 GB RSS in 31 seconds, OOM-killing the worker. Neither the `MemoryMode::Low` aligner downgrade nor the anchor-and-extend refinement helped — they both run *after* these anchors have been materialized.

Cap K per emitted interval at `MAX_ANCHORS_PER_INTERVAL = 32`. The node is still emitted for every one of its V target visits (structure preserved), every cluster still sees representative anchors, and downstream consumers are unaffected:
- signature clustering uses the median anchor signature (stable under any non-degenerate sample of a collinear cluster),
- linear projection / sub-query windowing use the leftmost and rightmost anchor query_pos (well approximated by the first 32 query positions in deterministic input order).

Measured on S288C#0#chrI:171571-192391 (21 kb, chrI subtelomere that OOM'd two tiling runs in a row):
before: killed at 85 GB RSS after 31 s
after: 16240 rows, 2.7 GB peak RSS, 1:40 wall

Normal region (S288C#0#chrIV:1257961-1260470): unchanged at 270 rows, 5.4 s, 2.6 GB peak. Both the fast-locate path and the pre-locate fallback path carry the cap. 151 lib tests pass.

Follow-up if we need to tighten further: an optional V-cap (`--syng-max-node-freq`) would drop nodes with pathologically many target visits — but a pure K-cap is sufficient in all cases seen so far without losing node-level signal.
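The K-cap can be sketched in a few lines. The function is a stand-in for impg's anchor emission; the constant and the take-first-K-in-input-order behaviour follow the commit.

```rust
// Sketch of the per-interval anchor cap: keep at most the first
// MAX_ANCHORS_PER_INTERVAL query occurrences per emitted interval,
// in deterministic input order, so leftmost/rightmost query_pos
// stay well approximated. V (target visits) remains uncapped.
const MAX_ANCHORS_PER_INTERVAL: usize = 32;

fn emit_anchors(query_occurrences: &[u64], target_visit_pos: u64) -> Vec<(u64, u64)> {
    query_occurrences
        .iter()
        .take(MAX_ANCHORS_PER_INTERVAL) // the K-cap
        .map(|&qpos| (qpos, target_visit_pos))
        .collect()
}

fn main() {
    // A syncmer occurring 1000x in the query still emits only 32
    // anchors for this target visit (was 1000 before the cap).
    let occurrences: Vec<u64> = (0..1000).map(|i| i * 60).collect();
    let anchors = emit_anchors(&occurrences, 42_000);
    assert_eq!(anchors.len(), MAX_ANCHORS_PER_INTERVAL);
}
```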
Per-anchor ends-only projection: each chain gets two small ends-free
BiWFA calls (backward from first anchor, forward from last) instead of
one big BiWFA across the anchor-supported span. The interior is
trusted because every interior anchor is a syncmer match; alignment
only places the outer edges.
Clustering replaces signature-bucket walk with greedy most-similar-
diagonal assignment under two caps: merge_distance (signature
tolerance, absorbs small-indel drift) AND a new query-axis positional
cap = --syng-positional-cap-multiplier × --syng-extend-budget
(default 4, configurable). Kills the same-diagonal-far-apart pathology
of pure diagonal clustering.
K-cap removed from anchor collection. Chain-level positional cap
bounds downstream memory; the 32-cap silently dropped repeat signatures
and was arbitrary.
New flags:
--syng-positional-cap-multiplier (default 4)
--syng-emit-cigar (off; attaches approximate per-segment CIGAR —
exact for the two end runs, M-fill across trusted interior; interior
gap-BiWFA aggregation is a follow-up)
merge_distance = 0 preserved as "no signature gating" for CLI default
(bedtools-style overlap-only semantics).
The five previously-timed-out tiles from the genome-wide battery
(chrIII:180k-185k, chrIII:185k-190k, chrIV:885k-890k, chrVII:565k-570k,
chrVII:815k-820k) now complete in 27-45s with RSS <= 2.8 GB.
…ission

Replaces the diagonal+positional-cap clustering with mutual-best-buddy anchor chaining under range-bounded, diagonal-preferring scoring.

**Memory (fixes 83 GB OOM):** anchor emission in syng.rs now dedupes by diagonal signature per (query_node, forward_path_idx, strand) at first-write-wins. A hyper-repetitive syncmer with K query occurrences and V target visits collapses from K×V anchors to at most K+V−1 (one per distinct diagonal). On yeast tandems this is a ~500× reduction on the worst tiles without losing any colinearity information that chain building needs downstream.

**Chaining:** `chain_anchors` scores each candidate pair (A, B), where B is downstream of A on both query and target axes, with `score = W_sig × |dq − dt| + max(dq, dt)`, `W_sig ≫ 1`, under hard constraints: `dq ≤ query_range_len`, `dt ≤ query_range_len`, `|dq − dt| ≤ query_range_len`. Pairs form an edge only when each is mutually the other's best; chains are connected components via union-find. Same-diagonal partners dominate; off-diagonal paralog crosstalk doesn't reciprocate and is dropped. The queried range itself caps the chain — no positional_cap knob.

**CLI simplifications:**
- Removed `--syng-positional-cap-multiplier` (derived from range now)
- Removed `--merge-distance` plumbing into syng-transitive (not used by best-buddy; bedtools semantics preserved at the top level)
- Function signatures: `one_hop*`, `query_transitive*` drop `merge_distance` and `positional_cap_multiplier`

**Tests:** new best-buddy coverage — colinear-forms-one-chain, range-constraint-rejects-out-of-range, mutual-best-rejects-non-colinear-triangle, tandem-emits-parallel-chains, strand-boundaries. All 152 lib tests + integration suite pass.

**Verified on 5 previously-OOM tiles:** all complete with RSS ≤ 2.8 GB. Wall clock 90-140 s per tile (v2 was 30-45 s); the ~3× slowdown and 3× row count come from emitting N parallel chains per N-copy repeat instead of one merged chain — more granular and correct output.
Tile-worker timeout raised from 120 s to 240 s to accommodate.
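The edge-scoring rule can be sketched as a pure function. The `W_SIG` value here is illustrative (the commit only says it is much greater than 1); the score formula and the hard caps follow the commit text.

```rust
// Sketch of the pairwise candidate score under hard range caps.
const W_SIG: i64 = 100; // "W_sig >> 1": diagonal deviation dominates

/// Score a candidate edge A -> B, where dq/dt are the (positive)
/// query/target gaps from A to B. Lower is better; None means the
/// hard caps reject the pair outright.
fn edge_score(dq: i64, dt: i64, range_len: i64) -> Option<i64> {
    if dq <= 0 || dt <= 0 {
        return None; // B must be downstream of A on both axes
    }
    if dq > range_len || dt > range_len || (dq - dt).abs() > range_len {
        return None; // hard constraints
    }
    Some(W_SIG * (dq - dt).abs() + dq.max(dt))
}

fn main() {
    // A same-diagonal partner beats a nearer but off-diagonal one,
    // so off-diagonal paralog crosstalk doesn't reciprocate.
    let on_diagonal = edge_score(500, 510, 5_000).unwrap(); // |dq-dt| = 10
    let off_diagonal = edge_score(100, 2_000, 5_000).unwrap(); // |dq-dt| = 1900
    assert!(on_diagonal < off_diagonal);
    assert_eq!(edge_score(6_000, 6_000, 5_000), None); // beyond range
}
```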
…ains

Two bugs surfaced when comparing v3 best-buddy output to PAF on the chrIV:530-535k tile (IoU=0.184):

1) Chain-total dt was unbounded. The per-edge hard cap dt ≤ range only bounds individual hops; without a chain-total bound, a 5-hop chain at 3 kb/hop assembles a 15 kb span on a 5 kb query. Fix: track per-root (min_target, max_target) span in the union-find loop. Reject unions whose combined span would exceed query_range_len × 2.

2) Best-buddy fragmentation. The strict mutual-best criterion can cut an otherwise-clean chain when local anchor density creates an edge where A's best forward ≠ B's best backward; the BHH target on the chrIV tile split one ~9 kb PAF block into 4 near-touching chains with 0-181 bp bilateral gaps. Fix: after best-buddy assembly, run a merge pass over chains on the same (genome, strand) — two chains with small q_gap AND small t_gap AND colinear (|q_gap − t_gap| ≤ merge_gap_cap = range/20, min 128) merge. The merged span is re-checked against the chain-total cap to avoid re-introducing unbounded extension.

Unit tests still pass. Smoke on the bad tile: 4 BHH chains → 3, wall clock 91 s → 6.5 s (the fragmentation collapse also reduces per-chain BiWFA work).
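The merge-pass test from fix (2) can be sketched as a predicate. The commit spells out the colinearity threshold (`range/20`, min 128); applying the same cap to each individual gap is an assumption made here for illustration.

```rust
// Sketch of the post-assembly chain-merge test: two chains on the
// same (genome, strand) merge when both gaps are small and the
// gaps are colinear. Per-gap cap reuse is an assumption; the
// |q_gap - t_gap| threshold follows the commit.
fn chains_mergeable(q_gap: i64, t_gap: i64, query_range_len: i64) -> bool {
    let merge_gap_cap = (query_range_len / 20).max(128);
    q_gap >= 0
        && t_gap >= 0
        && q_gap <= merge_gap_cap
        && t_gap <= merge_gap_cap
        && (q_gap - t_gap).abs() <= merge_gap_cap
}

fn main() {
    let range = 5_000; // 5 kb tile -> cap = max(250, 128) = 250
    assert!(chains_mergeable(0, 181, range)); // the BHH case: near-touching
    assert!(!chains_mergeable(0, 1_000, range)); // real gap: keep separate
    // negative q_gap = overlapping-on-query paralog copies: never merged
    assert!(!chains_mergeable(-5_000, 100, range));
}
```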
Tighter caps from the previous commit rejected legitimate 2-3× target-to-query ratios (TE/MEI insertions). Relax to:
- max_dt = 3 × range (was 1 × range)
- sig_diff_ceil = 3 × range
- max_chain_span = 3 × range (was 2 × range)

The W_SIG-weighted scoring still strongly prefers shorter, more colinear partners when they exist — the score function does the biasing; the hard caps only reject clearly implausible alignments.

Investigation on the chrIV:530k "bad" tile showed the L1374 and BHH "fragmentation" was actually paralog-diagonal splitting: 3 chains on L1374 have negative q_gap (-5 kb) between adjacent-on-target chains, meaning they represent 3 repeat copies occupying overlapping target ranges at different query positions. PAF collapses these into one alignment; syng keeps them distinct. Both are defensible — syng's output is more granular, PAF's is more aggregated. The IoU drop vs PAF in repeat-dense tiles reflects this granularity difference, not a wrong answer.
Three wins that take rDNA tiles from infinite-hang to 4-7 min:

1) **Parallelism was never on.** `par_iter` in refinement looked right, but `--threads` defaulted to 4, so the rayon global pool was sized to 4. Now documented as the tunable that matters: `-t 16` gives 9.5× effective parallelism on a normal hard tile (chrVII wall 16 s → 13.3 s).

2) **Singleton chains were dominating chain count.** On chrXII:450k rDNA, 2.62M chains were emitted, 87% singletons (1-anchor chains with no mutual-best colinear partner). New `--syng-min-chain-anchors` flag defaults to 2; drops singletons at emission. Chain count is 345K after the filter. Auto-relaxes to 1 when the queried range is shorter than `min × syncmer_len` — a query too short to host multi-anchor chains shouldn't be silently zeroed.

3) **dedupe_strand_overlaps was O(N²) across all genomes.** With 2.62M chains, the global pair loop sat in the genome-string compare for 20+ minutes. Bucket by genome first → O(K²) per target; 1200 s → 0.5 s on the same tile.

Profiling (gated on the SYNG_PROFILE=1 env var) added atomic counters for the fetch/align split and per-phase wall timing in the hot path. Per-tile profile on rDNA:450k with -t 16 after fixes:
emit=165s chain_anchors=52s after_filter=345K/2.62M dedupe=0.5s refine=3.5s wall=226s (was ∞)

Remaining bottleneck: syng's own `query_region_with_anchors_ext_visited` at 165 s on rDNA — serial anchor emission, upstream of this change.
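The bucketing fix from (3) can be sketched as follows. The pairwise strand-overlap test is replaced by a simple `dedup` here for brevity; the point is the per-genome bucketing that turns one global O(N²) loop into small per-bucket loops.

```rust
use std::collections::HashMap;

// Sketch of genome-bucketed dedupe: group chains by genome before
// the quadratic pair loop so it runs per bucket, not globally.
// Types are illustrative; `dedup` stands in for the real
// strand-overlap pair test.
fn dedupe_per_genome(chains: Vec<(String, u64, u64)>) -> Vec<(String, u64, u64)> {
    let mut buckets: HashMap<String, Vec<(u64, u64)>> = HashMap::new();
    for (genome, s, e) in chains {
        buckets.entry(genome).or_default().push((s, e));
    }
    let mut out = Vec::new();
    for (genome, mut ivs) in buckets {
        ivs.sort_unstable();
        ivs.dedup(); // per-bucket pair test: O(K^2) worst case, K << N
        for (s, e) in ivs {
            out.push((genome.clone(), s, e));
        }
    }
    out
}

fn main() {
    let chains = vec![
        ("A".to_string(), 0, 10),
        ("A".to_string(), 0, 10), // duplicate within genome A: dropped
        ("B".to_string(), 0, 10), // same coords, different genome: kept
    ];
    assert_eq!(dedupe_per_genome(chains).len(), 2);
}
```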
The rDNA tile profile showed 99.99% of emit work on one syncmer node
(a monster with 1.87M target visits × 300 query occurrences each =
6.3 billion anchor-emission iterations). Per-node outer parallelism
didn't help because one node has ~all the work; inner parallelism is
what matters.
New structure:
- outer par_iter over nodes (handles the common case: many small
nodes across threads)
- inner par_iter via fold+reduce over the node's visits, gated on
PAR_VISITS_THRESHOLD = 50k (below that, serial is faster than
the rayon chunking overhead)
- each fold shard accumulates into its own per-(path, strand)
signature hashmap; reduce merges first-write-wins
Each node's per-node hashmap is materialized into
HomologousIntervalWithAnchors entries inside the outer closure; a
post-pass re-keys on (forward_path_idx, strand) for the downstream
per_path map.
Results with -t 16:
chrVII (normal hard): 13.3s → 14.5s (≈baseline)
chrIV: 5.7s → 5.3s
chrI: 4.8s (new)
chrXII:450k (rDNA): 226s → 103s (cpu=403%)
The remaining rDNA gap to perfect 16× scaling (would be ~15 s) is
shard-merge cost in the reduce step; acceptable for now.
SYNG_EMIT_PROFILE env var added for phase-level profiling in syng.rs
(mirrors SYNG_PROFILE in syng_transitive.rs).
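The reduce step's merge semantics can be sketched without rayon. In the real code each rayon fold shard builds its own per-(path, strand) signature map and the reduce merges them; shown here serially to keep the sketch dependency-free, with the first-write-wins rule from the commit.

```rust
use std::collections::HashMap;

// Sketch of the shard-merge in the reduce step: the accumulator's
// existing entry wins, matching first-write-wins emission order.
type SigMap = HashMap<(u32, bool), i64>; // (path, strand) -> first signature

fn merge_first_write_wins(mut acc: SigMap, shard: SigMap) -> SigMap {
    for (key, sig) in shard {
        // entry API keeps the accumulator's value when present
        acc.entry(key).or_insert(sig);
    }
    acc
}

fn main() {
    let mut a: SigMap = SigMap::new();
    a.insert((7, true), 100);
    let mut b: SigMap = SigMap::new();
    b.insert((7, true), 999); // later shard: loses to the first write
    b.insert((8, false), 42);
    let merged = merge_first_write_wins(a, b);
    assert_eq!(merged[&(7, true)], 100);
    assert_eq!(merged[&(8, false)], 42);
}
```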
Liftover endpoints should be a function of local alignment evidence,
not how the user tiled the query. Prior code leaked `query_range_len`
into chain-assembly caps AND projection target-fetch bounds, so
changing tile size shifted the lifted coordinates. The 1 kb vs 5 kb
tile IoU improvement was a consequence of this bug — smaller tile =
less room for drift — not an algorithm gain.
Fixes:
1) **Chain caps scale with `extend_budget`, not `query_range_len`.**
max_dq = max_dt = sig_diff_ceiling = 3 × budget (per-hop cap)
max_chain_span = 10 × budget (chain total)
post-merge bilateral gap cap = budget / 4
Tile size drops out of chain_anchors entirely.
2) **Projection target-fetch = query_gap + slop.** For each projected
edge (qs backward from A_first, qe forward from A_last) the target
window is sized to the distance being projected. Two-tier slop:
tier 1 "homologous short-circuit" = max(32, query_gap / 20)
tier 2 full budget = extend_budget
Most projections are well-conserved and complete on tier 1; the
retry on tier 2 absorbs divergent cases. The budget is now
physically "indel drift tolerance per projection," not "how far
to blindly extend."
3) Signature changes:
chain_anchors(hits, extend_budget, syncmer_len)
[was (hits, query_range_len, syncmer_len)]
All 152 lib tests pass; the integration suite passes.
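Fix (1)'s cap derivation can be sketched as a pure function. The struct is illustrative; the multipliers follow the commit text.

```rust
// Sketch of fix (1): every chain-assembly cap derives from the
// user's extension budget, so tile size drops out entirely.
struct ChainCaps {
    max_dq: u64,
    max_dt: u64,
    sig_diff_ceiling: u64,
    max_chain_span: u64,
    post_merge_gap_cap: u64,
}

fn chain_caps(extend_budget: u64) -> ChainCaps {
    ChainCaps {
        max_dq: 3 * extend_budget,            // per-hop cap
        max_dt: 3 * extend_budget,            // per-hop cap
        sig_diff_ceiling: 3 * extend_budget,  // per-hop cap
        max_chain_span: 10 * extend_budget,   // chain total
        post_merge_gap_cap: extend_budget / 4, // bilateral gap cap
    }
}

fn main() {
    // Default --syng-extend-budget 1000: identical caps for a 1 kb
    // and a 5 kb tile, so lifted endpoints no longer depend on tiling.
    let caps = chain_caps(1_000);
    assert_eq!(caps.max_dt, 3_000);
    assert_eq!(caps.max_chain_span, 10_000);
    assert_eq!(caps.post_merge_gap_cap, 250);
}
```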
For ends-only projection, BiWFA cost scales with query_gap². On 5kb
tiles with chains whose first/last anchor sits mid-chain, query_gaps
can exceed 2kb and per-call cost balloons to 18ms. Previous mitigation
(skip BiWFA above 2kb gap → pure linear) cut wall 6× (61→10.5 min)
but lost 3.4 IoU points.
New "parallel-walk" projection:
- BiWFA on at most `extend_budget` bp closest to the anchor (where
drift is informative and alignment cost is bounded).
- Linear extrapolation for the remaining gap along the anchor's
diagonal (distant bp already well-defined by colinearity).
- Skip BiWFA entirely when query_gap < 64 bp (linear drift < 6bp
at 10% indel rate).
Per-call BiWFA work is now bounded by extend_budget regardless of
chain position; wall is 15.5 min on 5kb battery (4× speedup from
61 min) with mean IoU 0.903 (vs 0.925 baseline = 2.2 pt loss).
Battery comparison on yeast235 5 kb:
                 wall      mean IoU   IoU<0.8
full BiWFA       61 min    0.925      288
skip-cap         10.5 min  0.891      411
parallel-walk    15.5 min  0.903      341
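The parallel-walk split can be sketched as a pure function. The function name is illustrative; the budget cap and the 64 bp skip threshold follow the commit.

```rust
// Sketch of the parallel-walk projection split: BiWFA covers only
// the budget-bounded stretch next to the anchor; the rest of the
// gap is linearly extrapolated along the anchor's diagonal.
const LINEAR_ONLY_BELOW_BP: u64 = 64;

/// Returns (biwfa_bp, linear_bp): how much of `query_gap` gets a
/// real alignment vs. pure diagonal extrapolation.
fn split_projection(query_gap: u64, extend_budget: u64) -> (u64, u64) {
    if query_gap < LINEAR_ONLY_BELOW_BP {
        return (0, query_gap); // expected drift < ~6 bp at 10% indel rate
    }
    let biwfa = query_gap.min(extend_budget);
    (biwfa, query_gap - biwfa)
}

fn main() {
    assert_eq!(split_projection(40, 1_000), (0, 40)); // tiny gap: skip BiWFA
    assert_eq!(split_projection(700, 1_000), (700, 0)); // fits the budget
    // capped: align 1 kb near the anchor, extrapolate the rest
    assert_eq!(split_projection(2_500, 1_000), (1_000, 1_500));
}
```

Per-call BiWFA work is thus bounded by `extend_budget` regardless of where the chain's outermost anchor sits.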
Ends-only projection with M-fill produces fake CIGARs. We only need
bed/bedpe as the foundation for partitioning; real CIGAR/PAF would
require per-anchor-pair interior BiWFA (future work).
Changes:
- Remove `--syng-emit-cigar` CLI flag and all plumbing.
- refine_ends_only returns (refined_ts, refined_te).
- HomologousInterval.cigar stays in the type for forward-compat;
always None now.
- pure_linear_{back,fwd} use project_query_to_target helper, which
correctly handles query_end inside a_last's syncmer span
(previously overshot by syncmer_len bp).
- Cargo.toml: add [profile.profiling] for flamegraph work.
All 152 lib + 31 integration tests pass.
New CLI flag. Drops chains whose query-axis extent covers less than `fraction × query_range_len`. Default 0.0 preserves current behavior (keep every chain, maximum paralog discovery).

Use cases:
- 0.0 (default): exhaustive paralog discovery
- 0.3-0.5: partitioning foundation (drop fragment chains)
- 0.8+: strict (only chains spanning most of the query)

Plumbed from CLI → query_transitive_ext → one_hop_ext_visited; the filter runs after chain_anchors, before dedupe/projection/refine.

Spot checks on 5 kb tiles:
chrVII:565k  f=0.0: 10795 rows, 7.8 s → f=0.5: 1069 rows, 5.2 s
chrXII:450k  f=0.0: 2603 rows, 75 s  → f=0.5: 2 rows, 69 s
(rDNA is emit-bound; the filter mostly culls paralog noise rather than speeding things up)
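The filter itself is a one-liner; sketched here with an illustrative `Chain` type.

```rust
// Sketch of the coverage filter: drop chains whose query-axis
// extent is below fraction x query_range_len. Names illustrative.
struct Chain {
    q_start: u64,
    q_end: u64,
}

fn filter_by_query_coverage(
    chains: Vec<Chain>,
    fraction: f64,
    query_range_len: u64,
) -> Vec<Chain> {
    let min_extent = (fraction * query_range_len as f64) as u64;
    chains
        .into_iter()
        .filter(|c| c.q_end - c.q_start >= min_extent)
        .collect()
}

fn main() {
    let chains = vec![
        Chain { q_start: 0, q_end: 4_500 }, // spans 90% of a 5 kb query
        Chain { q_start: 100, q_end: 400 }, // 300 bp fragment
    ];
    // fraction 0.0 (the default) would keep both; 0.5 drops the fragment
    assert_eq!(filter_by_query_coverage(chains, 0.5, 5_000).len(), 1);
}
```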
Summary
Integrates syng as a C library inside impg, providing syncmer-based homology detection via GBWT index queries — complementing the existing alignment-based approach.
- syng vendored as a git submodule under vendor/syng, compiled via the `cc` crate into `libsyng.a`, with safe Rust FFI bindings (`syng_ffi.rs`, `syng.rs`)
- `impg syng --agc pangenome.agc -o prefix` or `impg syng -f genomes.fa -o prefix` — progressive construction, one sequence at a time (low memory). Produces `.1khash`, `.1gbwt`, `.syng.names` files (interoperable with standalone syng)
- `impg query --syng prefix -b region.bed` — query homologous regions using the syng index, with BED/GFA/FASTA output
- `-o gbwt` output format to produce region sub-indexes
- Syncmer parameters: `--syncmer-k` (default 8), `--syncmer-w` (default 55), `--syncmer-seed` (default 7)

Key implementation details
- `seqhash.c` (ASCII vs numeric encoding for `patternRC[4]`)

Test plan
- `cargo test` (126 tests pass)
- `cargo clippy -- -D warnings` clean