Description
Summary
When a 429 (rate-limit) error occurs during async streaming, it is raised directly to the caller instead of being wrapped in MidStreamFallbackError. This prevents the Router's _acompletion_streaming_iterator from catching it and triggering the fallback chain.
```python
router = litellm.Router(
    model_list=[
        {"model_name": "gemini", "litellm_params": {"model": "vertex_ai/gemini-3-flash"}},
        {"model_name": "openai", "litellm_params": {"model": "gpt-4o"}},
    ],
    fallbacks=[{"gemini": ["openai"]}],
)

# When Vertex AI returns 429 during streaming:
response = await router.acompletion(model="gemini", messages=[...], stream=True)
async for chunk in response:  # ← RateLimitError raised here, no fallback to "openai"
    print(chunk)
```

Expected: Router falls back to the `openai` model group.

Actual: `RateLimitError` is raised directly; no fallback is attempted.
Steps to Reproduce
- Configure Router with fallbacks: `{"gemini": ["openai"]}`
- Send a streaming request that triggers a 429 from the primary model
- Observe that `RateLimitError` is raised without fallback
Root Cause
PR #18698 added a blanket 4xx filter in `CustomStreamWrapper.__anext__()` (streaming_handler.py:2159-2162):

```python
if mapped_status_code is not None and 400 <= mapped_status_code < 500:
    raise mapped_exception  # ← 429 hits this, skips MidStreamFallbackError
```

This correctly prevents non-retriable client errors (400, 401, 403, 404) from triggering fallbacks. But 429 is fundamentally different: it is transient and retriable, and should trigger the Router's fallback chain.
Additionally, when MidStreamFallbackError fires before any content is generated (e.g. rate-limit on the very first chunk during lazy stream initialization), _acompletion_streaming_iterator still appends a continuation prompt with empty generated_content, wasting ~100 tokens. The is_pre_first_chunk flag is already available on the exception but not checked.
Suggested Fix
- streaming_handler.py: Exclude 429 from the 4xx filter by adding `and mapped_status_code != 429` to the condition
- router.py: Check `is_pre_first_chunk` / empty `generated_content` in `_acompletion_streaming_iterator` and skip the continuation prompt when no content was generated
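A sketch of the router-side half of the fix; the function and argument names here are hypothetical, not litellm's actual internals:

```python
def build_resume_messages(original_messages, generated_content, continuation_prompt):
    """Only append the assistant's partial output and a continuation
    prompt when the failed stream actually produced content. On a
    pre-first-chunk failure (empty generated_content), fall back with
    the original messages unchanged, avoiding ~100 wasted tokens."""
    if not generated_content:  # pre-first-chunk failure: plain retry
        return list(original_messages)
    return list(original_messages) + [
        {"role": "assistant", "content": generated_content},
        {"role": "user", "content": continuation_prompt},
    ]

msgs = [{"role": "user", "content": "hi"}]
print(len(build_resume_messages(msgs, "", "Please continue.")))         # 1
print(len(build_resume_messages(msgs, "partial…", "Please continue.")))  # 3
```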
Impact
Any user relying on Router fallbacks for streaming completions will not get fallback behavior when their primary model returns 429. The request fails immediately instead of falling back to an alternative model group.
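Until a fix lands, affected callers can work around this by catching the rate-limit error themselves and restarting the stream on a fallback group. A self-contained sketch of that pattern; `StreamRateLimitError` and `fake_stream` are stand-ins for litellm's `RateLimitError` and a real `router.acompletion` stream:

```python
import asyncio

class StreamRateLimitError(Exception):
    """Stand-in for litellm.RateLimitError raised mid-stream."""

async def stream_with_fallback(stream_factory, model_groups):
    """Try each model group in order; on a mid-stream rate limit,
    discard partial output and restart on the next group."""
    last_err = None
    for group in model_groups:
        chunks = []
        try:
            async for chunk in stream_factory(group):
                chunks.append(chunk)
            return group, chunks
        except StreamRateLimitError as err:
            last_err = err  # rate-limited: fall through to next group
    raise last_err

async def fake_stream(group):
    # Simulate: "gemini" is rate-limited after one chunk, "openai" succeeds.
    if group == "gemini":
        yield "partial"
        raise StreamRateLimitError("429 from vertex_ai")
    yield "hello"
    yield " world"

used, out = asyncio.run(stream_with_fallback(fake_stream, ["gemini", "openai"]))
print(used, out)  # openai ['hello', ' world']
```

Note this restarts the stream from scratch rather than resuming with a continuation prompt, which is the simpler (and for pre-first-chunk failures, equivalent) behavior.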
What part of LiteLLM is this about?
SDK (litellm Python package)
What LiteLLM version are you on?
Latest main (post PR #18698, merged 2026-01-06)
Related Issues
- [Bug]: Fallbacks do not work with Pydantic-AI when Gemini 3 Preview throws 429 rate limit error #20870 — Gemini 429 during streaming doesn't trigger fallbacks (user report, same root cause)
- Allow disabling or customizing mid-stream fallback continuation prompt #18229 — Empty continuation prompt on pre-first-chunk errors
- [Feature]: Improving Retry Mechanism Consistency and Logging for Streamed Responses in LiteLLM Proxy #8648 — Streaming retry inconsistency for 429 errors
- [Bug]: NO FALLBACK when streaming and litellm.InternalServerError: AnthropicException - Overloaded. Handle with litellm.InternalServerError. #6532 / #6957 — Streaming fallbacks not working