[WIP] Add stable timestamps mode with VAD-aware timing #3675

Draft
thewh1teagle wants to merge 3 commits into ggml-org:master from thewh1teagle:feature/stable-timestamps

@thewh1teagle (Contributor)

Add a new stable timestamps mode that makes word/segment timing less likely to drift into silence.

What it includes:

  • VAD-based silence map and post-hoc timestamp snapping
  • DTW alignment improvements (gap padding + dynamic head selection)
  • Silence-constrained timestamp decoding via logits filtering
  • CLI/API wiring for --stable-timestamps
  • Synthetic TTS verification scripts to compare baseline vs stable outputs and timing quality
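
To illustrate the post-hoc snapping idea from the list above, here is a minimal sketch, not the PR's implementation. It assumes the silence map is a sorted list of non-overlapping intervals in centiseconds; the `Interval` type and the `snap_out_of_silence` function are hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical type for illustration; not a struct from the PR.
struct Interval {
    int64_t t0; // start, in centiseconds
    int64_t t1; // end, in centiseconds
};

// Move a word's endpoints out of any silence region they fall into.
// Assumes `silences` is sorted by start time and non-overlapping.
Interval snap_out_of_silence(Interval word, const std::vector<Interval> & silences) {
    assert(word.t0 <= word.t1);
    for (const Interval & s : silences) {
        // Start lies inside silence: push it forward to where speech resumes.
        if (word.t0 >= s.t0 && word.t0 < s.t1) {
            word.t0 = s.t1;
        }
        // End lies inside silence: pull it back to where speech stopped.
        if (word.t1 > s.t0 && word.t1 <= s.t1) {
            word.t1 = s.t0;
        }
    }
    // A word that sat entirely inside silence collapses to zero length.
    if (word.t1 < word.t0) {
        word.t1 = word.t0;
    }
    return word;
}
```

With silences at [0, 100] and [150, 200], a word timed at [95, 180] would be snapped to [100, 150]. How the PR handles words that fall entirely inside silence is not stated; collapsing them to zero length is an assumption of this sketch.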

Early 5-minute synthetic verification results (large-v3-turbo), baseline -> stable:

  • Start-in-silence: 43.3% -> 10.9% (213 -> 55)
  • Silence overlaps: 240 -> 88
  • WER: 23.8% -> 2.6%
  • CER: 21.6% -> 2.2%
  • Token count stayed close: 492 -> 506
  • Runtime in this run: 10.6s -> 30.2s (needs optimization)

@thewh1teagle force-pushed the feature/stable-timestamps branch 3 times, most recently from cdf2ea4 to b967a8f on February 23, 2026 04:07
thewh1teagle and others added 2 commits on March 6, 2026 02:15
…stamps

Replace concatenate-decode-remap pipeline with per-segment VAD decoding,
matching how stable-ts/faster-whisper works. Each VAD speech segment is
decoded independently and timestamps are offset by the segment's original
start time — no mapping table or interpolation needed.
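
The per-segment approach described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `whisper_full_vad_segments()`: the `SpeechSpan` and `Segment` types, the `DecodeFn` callback, and `decode_per_span` are hypothetical, and the callback stands in for whatever decodes one VAD speech span and returns timestamps relative to that span's start.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types for illustration; times are in centiseconds.
struct SpeechSpan { int64_t t0; int64_t t1; };            // position in the original audio
struct Segment    { int64_t t0; int64_t t1; std::string text; };

// Stand-in for the real decoder: returns segments whose timestamps are
// relative to the start of the span it was given.
using DecodeFn = std::function<std::vector<Segment>(const SpeechSpan &)>;

// Decode each speech span independently, then shift its segments by the
// span's original start time. No mapping table or interpolation is involved.
std::vector<Segment> decode_per_span(const std::vector<SpeechSpan> & spans,
                                     const DecodeFn & decode) {
    std::vector<Segment> out;
    for (const SpeechSpan & span : spans) {
        assert(span.t0 <= span.t1);
        for (Segment seg : decode(span)) {
            seg.t0 += span.t0;
            seg.t1 += span.t0;
            out.push_back(std::move(seg));
        }
    }
    return out;
}
```

Because each span is decoded on its own, a segment's final timestamps depend only on the span's start offset, which is why the remapping helpers listed under "Code removed" are no longer needed.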

Results on 5-min synthetic audio (46 utterances, 7x 20s pauses):
  pct_words_overlap: 0.89% (vs 5.7% stable-ts, 22.6% previous v2)
  n_words_overlap:   5     (vs 22  stable-ts, 144   previous v2)
  Wall time:         22.8s (vs 43.2s stable-ts — 1.9x faster via Metal)
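
For readers interpreting the overlap figures above, here is one plausible way such a metric could be computed. The benchmark scripts are not part of this PR view, so the definition is an assumption: a word counts as overlapping if its interval intersects any known silence interval. The `Span` and `OverlapStats` types and `count_silence_overlaps` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical types for illustration; times are in centiseconds.
struct Span { int64_t t0; int64_t t1; };

struct OverlapStats {
    size_t n_words_overlap;   // words intersecting at least one silence interval
    double pct_words_overlap; // the same count as a percentage of all words
};

// Assumed definition: half-open intervals, so a word that merely touches
// the edge of a silence region does not count as overlapping.
OverlapStats count_silence_overlaps(const std::vector<Span> & words,
                                    const std::vector<Span> & silences) {
    size_t n = 0;
    for (const Span & w : words) {
        assert(w.t0 <= w.t1);
        for (const Span & s : silences) {
            if (w.t0 < s.t1 && s.t0 < w.t1) {
                ++n;
                break; // count each word at most once
            }
        }
    }
    const double pct = words.empty()
        ? 0.0
        : 100.0 * static_cast<double>(n) / static_cast<double>(words.size());
    return {n, pct};
}
```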

Code removed:
- whisper_vad() concatenation + mapping table building
- vad_time_mapping struct, vad_mapping_table, has_vad_segments from state
- map_processed_to_original_time() in whisper.cpp
- whisper_stable_map_processed_to_original() in whisper-stable.cpp
- mapping params from whisper_stable_snap_segments()

Code added:
- whisper_full_vad_segments(): ~70-line per-segment decode loop
- whisper_full_parallel() with VAD delegates to whisper_full()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@thewh1teagle force-pushed the feature/stable-timestamps branch from ab381e3 to 4f5d796 on March 6, 2026 00:15
Plans, notes, test outputs, and benchmark scripts are internal development
artifacts — not relevant to the PR review.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>