[WIP] Add stable timestamps mode with VAD-aware timing #3675
Draft
thewh1teagle wants to merge 3 commits into ggml-org:master
cdf2ea4 to b967a8f
…stamps

Replace the concatenate-decode-remap pipeline with per-segment VAD decoding, matching how stable-ts/faster-whisper works. Each VAD speech segment is decoded independently and its timestamps are offset by the segment's original start time, so no mapping table or interpolation is needed.

Results on 5-min synthetic audio (46 utterances, 7x 20s pauses):
- pct_words_overlap: 0.89% (vs 5.7% stable-ts, 22.6% previous v2)
- n_words_overlap: 5 (vs 22 stable-ts, 144 previous v2)
- Wall time: 22.8s (vs 43.2s stable-ts; 1.9x faster via Metal)

Code removed:
- whisper_vad() concatenation + mapping table building
- vad_time_mapping struct, vad_mapping_table, has_vad_segments from state
- map_processed_to_original_time() in whisper.cpp
- whisper_stable_map_processed_to_original() in whisper-stable.cpp
- mapping params from whisper_stable_snap_segments()

Code added:
- whisper_full_vad_segments(): ~70-line per-segment decode loop
- whisper_full_parallel() with VAD delegates to whisper_full()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
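The per-segment approach described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: `SpeechSegment`, `Word`, `DecodeFn`, and `decode_per_segment` are hypothetical names, and the decoder is passed in as a callback so the sketch does not depend on whisper.cpp itself. It assumes the decoder returns timestamps relative to the start of the segment it was given.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical types for illustration only; not whisper.cpp's actual structs.
struct SpeechSegment {
    int64_t start_ms; // segment start in the original audio
    int64_t end_ms;   // segment end in the original audio
};

struct Word {
    std::string text;
    int64_t t0_ms;
    int64_t t1_ms;
};

// Assumed decoder contract: returns words whose timestamps are relative
// to the start of the segment that was passed in.
using DecodeFn = std::function<std::vector<Word>(const SpeechSegment &)>;

// Decode each VAD segment independently and shift its timestamps by the
// segment's original start time. No mapping table is involved.
std::vector<Word> decode_per_segment(const std::vector<SpeechSegment> & segments,
                                     const DecodeFn & decode) {
    std::vector<Word> out;
    for (const auto & seg : segments) {
        assert(seg.end_ms >= seg.start_ms);
        for (Word w : decode(seg)) {
            w.t0_ms += seg.start_ms;
            w.t1_ms += seg.start_ms;
            out.push_back(w);
        }
    }
    return out;
}
```

Because each segment carries its own original start time, the offset is a single addition per timestamp, which is what makes the earlier mapping and interpolation step unnecessary.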
ab381e3 to
4f5d796
Compare
Plans, notes, test outputs, and benchmark scripts are internal development artifacts — not relevant to the PR review. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add a new stable timestamps mode that makes word/segment timing less likely to drift into silence.
What it includes:
Early 5-minute synthetic verification results (large-v3-turbo):
- start-in-silence dropped 43.3% -> 10.9% (213 -> 55)
- silence overlaps dropped 240 -> 88
- WER improved 23.8% -> 2.6%
- CER improved 21.6% -> 2.2%
- token count stayed close (492 -> 506)
- runtime in this run was 10.6s baseline vs 30.2s stable (needs optimization)
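For readers who want to reproduce a metric like start-in-silence, one plausible definition is the share of words whose start time falls outside every reference speech interval. The sketch below assumes that definition; the PR does not spell out how the metric is computed, and `Interval`, `SilenceStats`, and `start_in_silence` are hypothetical names. The reported figures are at least consistent with a per-token denominator: 213 / 492 ≈ 43.3% and 55 / 506 ≈ 10.9%.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical reference speech interval, half-open: [start_ms, end_ms).
struct Interval {
    int64_t start_ms;
    int64_t end_ms;
};

struct SilenceStats {
    int n_total;            // number of words examined
    int n_start_in_silence; // words whose start lies outside all speech intervals
    double pct;             // n_start_in_silence as a percentage of n_total
};

// Returns true if t lies inside at least one speech interval.
bool in_speech(const std::vector<Interval> & speech, int64_t t) {
    for (const auto & s : speech) {
        if (t >= s.start_ms && t < s.end_ms) {
            return true;
        }
    }
    return false;
}

// Assumed metric: count word start times that land in silence.
SilenceStats start_in_silence(const std::vector<int64_t> & word_starts_ms,
                              const std::vector<Interval> & speech) {
    SilenceStats st{0, 0, 0.0};
    for (int64_t t : word_starts_ms) {
        st.n_total++;
        if (!in_speech(speech, t)) {
            st.n_start_in_silence++;
        }
    }
    if (st.n_total > 0) {
        st.pct = 100.0 * st.n_start_in_silence / st.n_total;
    }
    return st;
}
```

The linear scan over intervals keeps the sketch short; for long recordings with many intervals, a sorted interval list with binary search would be the obvious refinement.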