[BugFix][Engine] Fix functional gaps in ZMQ token processing path vs legacy batch output path#6954
Draft
Commit: …atch output path (Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>)
Copilot AI changed the title from "[WIP] Fix functional gaps in new ZMQ-based token processing" to "[BugFix][Engine] Fix functional gaps in ZMQ token processing path vs legacy batch output path" on Mar 20, 2026.
The ZMQ-based token post-processing path (`_process_batch_output_use_zmq` + `_process_per_token`) had multiple functional gaps compared to the legacy `_process_batch_output` path, causing behavioral divergence in production scenarios.

Modifications
`_process_per_token`:

- `FD_ENABLE_INTERNAL_ADAPTER` eos filtering: eos tokens were unconditionally appended to `result.outputs.token_ids`; now filtered out when the internal adapter is enabled (still appended to `task.output_token_ids`)
- `_compute_speculative_status` missing arg: called without `result`, causing a runtime error in speculative decoding paths
- `cache_output_tokens` missing: output token KV cache not persisted when `enable_prefix_caching` + `enable_output_caching` are both on
- `ttft_s`: added `ttft_s = ttft + task.metrics.time_in_queue` to match the legacy log format

`_process_batch_output_use_zmq`:

- `RequestOutput` missing fields: `output_type=3` and `prompt_token_ids_len` were absent; downstream usage stats and serialization depended on these
- `num_cached_tokens` set only on first token: moved outside the `tokens_counter == 0` block so it reflects the current value on every step
- `prefill_chunk_info` not handled: chunked prefill would produce premature intermediate results before all chunks completed
- `scheduler_metrics_logger` not notified: decode token metrics were never reported via the ZMQ path
- `draft_token_ids` not populated for multi-token prefill: splitwise prefill scenarios missing draft token passthrough

Checklist

- Run `pre-commit` before commit.
- For a `release` branch PR, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Original prompt
Background
In PR #6879 (the `feat/zmq_mtp_new` branch from sunlei1024/FastDeploy), `token_processor.py` introduces a new ZMQ-based token post-processing path via `process_model_runner_output()` + `_process_per_token()`, intended to be functionally equivalent to the legacy `_process_batch_output()` path. However, a detailed comparison reveals multiple functional gaps and missing logic in the new path. These need to be fixed to ensure behavioral parity.

File to modify

`fastdeploy/output/token_processor.py` — the file can be viewed at: https://github.com/sunlei1024/FastDeploy/blob/c8eaaec4ff504248cf86c04fac422d44822c4f3c/fastdeploy/output/token_processor.py
Specific Issues to Fix
1. Missing `prompt_token_ids_len` and `output_type` in RequestOutput construction (`process_model_runner_output`, ~L1006-L1017)

The old path (`_process_batch_output`, L656-L669) constructs `RequestOutput` with `output_type=mtype` and `prompt_token_ids_len=task.prompt_token_ids_len`. The new path is missing both fields.

Fix: Add `output_type=model_output.decode_mode` and `prompt_token_ids_len=task.prompt_token_ids_len` to the `RequestOutput` constructor in `process_model_runner_output`.
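The shape of this fix can be sketched as follows. `RequestOutput` and `CompletionOutput` here are simplified stand-ins for FastDeploy's actual classes, and `build_result` is a hypothetical helper; only the two added fields mirror the fix described above.

```python
from dataclasses import dataclass, field

# Simplified stand-ins for FastDeploy's output classes (assumption: the
# real classes carry many more fields than shown here).
@dataclass
class CompletionOutput:
    index: int
    token_ids: list = field(default_factory=list)

@dataclass
class RequestOutput:
    request_id: str
    outputs: CompletionOutput
    output_type: int = 0            # previously absent in the ZMQ path
    prompt_token_ids_len: int = 0   # previously absent in the ZMQ path

def build_result(task_id: str, decode_mode: int, prompt_len: int) -> RequestOutput:
    # The fix: populate both fields at construction time, mirroring the
    # legacy path's output_type=mtype / prompt_token_ids_len=task.prompt_token_ids_len.
    return RequestOutput(
        request_id=task_id,
        outputs=CompletionOutput(index=0),
        output_type=decode_mode,
        prompt_token_ids_len=prompt_len,
    )

result = build_result("req-0", 3, 17)
```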
2. Missing `FD_ENABLE_INTERNAL_ADAPTER` filtering in `_process_per_token` (~L1070-L1072)

The old path (L685-L689) filters eos tokens from `result.outputs.token_ids` when `FD_ENABLE_INTERNAL_ADAPTER` is enabled; the new path unconditionally appends.

Fix: Add the same `FD_ENABLE_INTERNAL_ADAPTER` check in `_process_per_token`.
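A minimal sketch of the filtering behavior, assuming the flag and eos token ids are available (in FastDeploy they come from `envs` and the model config; both are simplified to module-level constants here):

```python
# Stand-in for envs.FD_ENABLE_INTERNAL_ADAPTER (assumption: a boolean flag).
FD_ENABLE_INTERNAL_ADAPTER = True
EOS_TOKEN_IDS = {2}  # illustrative eos token id set

def append_token(task_output_token_ids: list, result_token_ids: list, token_id: int) -> None:
    # The token is always recorded on the task, as on both paths.
    task_output_token_ids.append(token_id)
    # With the internal adapter enabled, eos tokens are kept out of the
    # client-visible result, matching the legacy _process_batch_output path.
    if FD_ENABLE_INTERNAL_ADAPTER and token_id in EOS_TOKEN_IDS:
        return
    result_token_ids.append(token_id)

task_ids, result_ids = [], []
for t in [5, 7, 2]:
    append_token(task_ids, result_ids, t)
```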
3. Missing `cache_output_tokens` call in `_process_per_token` (~L1074-L1092)

The old path (L758-L765) caches output tokens when prefix caching + output caching is enabled. The new path has no such call before `_recycle_resources`.

Fix: Add the same output caching logic in `_process_per_token` before the `_recycle_resources` call.
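The gating logic can be sketched like this; `maybe_cache_output_tokens`, the config dict, and the cache mapping are stand-ins for FastDeploy's `cache_config` and cache hooks:

```python
def maybe_cache_output_tokens(cfg: dict, task: dict, cache: dict) -> None:
    # Persist output tokens only when both flags are on, mirroring the
    # legacy path's condition (names are illustrative stand-ins).
    if cfg.get("enable_prefix_caching") and cfg.get("enable_output_caching"):
        cache[task["request_id"]] = list(task["output_token_ids"])

cache = {}
cfg = {"enable_prefix_caching": True, "enable_output_caching": True}
task = {"request_id": "req-0", "output_token_ids": [5, 7, 2]}

# Called before resources are recycled, so the tokens are still available.
maybe_cache_output_tokens(cfg, task, cache)
```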
4. Missing `num_cached_tokens` assignment outside first-token block (`process_model_runner_output`, ~L1037-L1044)

The old path sets `num_cached_tokens` unconditionally (outside the `tokens_counter == 0` block). The new path only sets it inside the first-token block.

Fix: Move `result.num_cached_tokens = task.num_cached_tokens` outside the `if self.tokens_counter[task_id] == 0` block. Keep `multimodal_inputs` handling inside the first-token block (matching old behavior).
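The before/after difference boils down to which assignments sit inside the first-token guard. A sketch with dicts standing in for the task and result objects:

```python
def fill_result(result: dict, task: dict, tokens_counter: dict, task_id: str) -> None:
    if tokens_counter[task_id] == 0:
        # First-token-only payload stays inside the guard (old behavior kept).
        result["multimodal_inputs"] = task.get("multimodal_inputs")
    # Moved outside the guard: reflects the current value on every step.
    result["num_cached_tokens"] = task["num_cached_tokens"]

counter = {"req-0": 0}
task = {"num_cached_tokens": 4, "multimodal_inputs": None}

r1, r2 = {}, {}
fill_result(r1, task, counter, "req-0")   # first token
counter["req-0"] += 1
task["num_cached_tokens"] = 9             # value changes between steps
fill_result(r2, task, counter, "req-0")   # later token still gets the field
```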
5. Missing `prefill_chunk_info` handling (`process_model_runner_output`)

The old path (L623-L628) handles chunked prefill. The new path has no equivalent logic.

Fix: Add `prefill_chunk_info` handling in `process_model_runner_output`, after the abort handling and before the metrics section.
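A sketch of the chunked-prefill guard: intermediate chunks are swallowed until the final chunk completes. `prefill_chunk_info` and `chunk_idx` follow the description; the surrounding task structure is a simplified stand-in.

```python
def should_emit(task: dict) -> bool:
    # Not a chunked-prefill request: emit as usual.
    chunks = task.get("prefill_chunk_info")
    if chunks is None:
        return True
    # Advance the chunk counter; only the final chunk produces a result,
    # preventing premature intermediate outputs.
    task["chunk_idx"] = task.get("chunk_idx", 0) + 1
    return task["chunk_idx"] >= len(chunks)

task = {"prefill_chunk_info": ["c0", "c1", "c2"]}
emitted = [should_emit(task) for _ in range(3)]
```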
6. Missing `scheduler_metrics_logger` callback (`process_model_runner_output`)

The old path (L620-L621) notifies the scheduler metrics logger. The new path has no equivalent.

Fix: Add the callback in `process_model_runner_output`, after abort handling and chunk handling, before metrics.
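The notification can be sketched as below; the logger class and its `on_decode_tokens` method are hypothetical stand-ins for FastDeploy's `scheduler_metrics_logger` interface:

```python
class SchedulerMetricsLogger:
    """Stand-in for the scheduler metrics logger (assumed interface)."""
    def __init__(self) -> None:
        self.decode_tokens = 0

    def on_decode_tokens(self, n: int) -> None:
        self.decode_tokens += n

def report_decode_tokens(logger, token_ids: list) -> None:
    # The logger may be disabled, so guard the call.
    if logger is not None:
        logger.on_decode_tokens(len(token_ids))

logger = SchedulerMetricsLogger()
report_decode_tokens(logger, [5, 7, 2])
report_decode_tokens(None, [1])  # disabled logger: no-op
```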
7. Missing `draft_token_ids` for prefill scenario

The old path (L679-L680) populates `draft_token_ids` during prefill. The new path has no equivalent.

Fix: Populate `draft_token_ids` in `process_model_runner_output`, after constructing the result and before calling `_process_per_token`.
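One plausible shape of the passthrough, assuming (as in the splitwise prefill scenario) that a multi-token prefill step emits one regular token plus draft tokens; the split point and the dict-based result are illustrative stand-ins:

```python
def attach_draft_tokens(result: dict, token_ids: list, is_prefill: bool) -> None:
    # For multi-token prefill, the first token is the normal output and the
    # remainder travel as draft tokens (assumption about the split point).
    if is_prefill and len(token_ids) > 1:
        result["draft_token_ids"] = list(token_ids[1:])

result = {}
attach_draft_tokens(result, [11, 12, 13], is_prefill=True)

decode_result = {}
attach_draft_tokens(decode_result, [11], is_prefill=False)  # decode: untouched
```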
8. Missing `_record_speculative_decoding_metrics` in ZMQ loop

The old path calls global speculative decoding metrics recording after processing all batches. The new ZMQ path never calls it.

Fix: In `process_sampling_results_use_zmq`, after `process_model_runner_output` returns, compute `accept_num` from `cu_num_generated_tokens` and call `_record_speculative_decoding_metrics`.
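The `accept_num` computation amounts to differencing a cumulative-count array. A sketch, with the metrics recorder reduced to a hypothetical stand-in for `_record_speculative_decoding_metrics`:

```python
def accept_num_from_cumulative(cu_num_generated_tokens: list) -> list:
    # cu[i] is the cumulative generated-token count after the first i
    # requests, so adjacent differences recover the per-request counts.
    return [b - a for a, b in zip(cu_num_generated_tokens, cu_num_generated_tokens[1:])]

def record_speculative_metrics(metrics: dict, accept_num: list) -> None:
    # Stand-in for _record_speculative_decoding_metrics: aggregate the
    # per-request accepted-token counts.
    metrics["total_accepted"] = metrics.get("total_accepted", 0) + sum(accept_num)

metrics = {}
accepts = accept_num_from_cumulative([0, 3, 5, 9])  # three requests
record_speculative_metrics(metrics, accepts)
```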