[Bug]: 429 RateLimitError excluded from MidStreamFallbackError — streaming fallbacks don't fire for rate limits #22296

@CSteigstra

Description

Summary

When a 429 (rate-limit) error occurs during async streaming, it is raised directly to the caller instead of being wrapped in MidStreamFallbackError. This prevents the Router's _acompletion_streaming_iterator from catching it and triggering the fallback chain.

import litellm

router = litellm.Router(
    model_list=[
        {"model_name": "gemini", "litellm_params": {"model": "vertex_ai/gemini-3-flash"}},
        {"model_name": "openai", "litellm_params": {"model": "gpt-4o"}},
    ],
    fallbacks=[{"gemini": ["openai"]}],
)

# When Vertex AI returns 429 during streaming:
response = await router.acompletion(model="gemini", messages=[...], stream=True)
async for chunk in response:  # ← RateLimitError raised here, no fallback to "openai"
    print(chunk)

Expected: Router falls back to openai model group.
Actual: RateLimitError raised directly, no fallback attempted.
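The failure mode is independent of litellm internals: an exception raised inside an async generator's `__anext__` surfaces directly at the `async for` site unless something in between catches and re-wraps it. A minimal toy sketch (the `RateLimitError` stand-in here is a hypothetical stub, not litellm's class) illustrates why the caller sees the raw error when no wrapping happens:

```python
import asyncio

class RateLimitError(Exception):
    """Hypothetical stand-in for a provider 429 error."""
    status_code = 429

async def toy_stream():
    # Simulate a provider rejecting the request with a 429 before
    # the first chunk is ever produced.
    raise RateLimitError("429: quota exceeded")
    yield  # unreachable; makes this function an async generator

async def consume():
    try:
        async for chunk in toy_stream():
            print(chunk)
    except RateLimitError as e:
        # Without a wrapping layer (e.g. MidStreamFallbackError),
        # the raw error reaches the caller and no fallback can fire.
        return f"surfaced to caller: {e}"

print(asyncio.run(consume()))
```

This is exactly the position the Router's `_acompletion_streaming_iterator` is in: it can only trigger fallbacks for exception types it is written to catch.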

Steps to Reproduce

  1. Configure Router with fallbacks: {"gemini": ["openai"]}
  2. Send a streaming request that triggers a 429 from the primary model
  3. Observe that RateLimitError is raised without fallback

Root Cause

PR #18698 added a blanket 4xx filter in CustomStreamWrapper.__anext__() (streaming_handler.py:2159-2162):

if mapped_status_code is not None and 400 <= mapped_status_code < 500:
    raise mapped_exception  # ← 429 hits this, skips MidStreamFallbackError

This correctly prevents non-retriable client errors (400, 401, 403, 404) from triggering fallbacks. But 429 is fundamentally different: it is transient and retriable, so it should trigger the Router's fallback chain.
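The intended classification can be sketched as a small predicate. This is a hypothetical helper written for this issue (`should_trigger_mid_stream_fallback` is not an existing litellm function), mirroring the filter above with the proposed 429 carve-out:

```python
from typing import Optional

def should_trigger_mid_stream_fallback(status_code: Optional[int]) -> bool:
    """Return True if a streaming error with this HTTP status should be
    wrapped in MidStreamFallbackError so the Router can fall back.

    Hypothetical sketch of the proposed behavior, not litellm code:
    - non-retriable 4xx client errors (400, 401, 403, 404, ...) -> raise directly
    - 429 (rate limit) -> transient/retriable, eligible for fallback
    - 5xx and unknown statuses -> eligible for fallback
    """
    if status_code is None:
        return True
    if 400 <= status_code < 500 and status_code != 429:
        return False
    return True
```

Under the current blanket filter, the `status_code != 429` branch is missing, so 429 falls into the "raise directly" bucket along with 400/401/403/404.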

Additionally, when MidStreamFallbackError fires before any content is generated (e.g. rate-limit on the very first chunk during lazy stream initialization), _acompletion_streaming_iterator still appends a continuation prompt with empty generated_content, wasting ~100 tokens. The is_pre_first_chunk flag is already available on the exception but not checked.

Suggested Fix

  1. streaming_handler.py: Exclude 429 from the 4xx filter: and mapped_status_code != 429
  2. router.py: Check is_pre_first_chunk / empty generated_content in _acompletion_streaming_iterator and skip the continuation prompt when no content was generated
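The router-side half of the fix (item 2) amounts to gating the continuation prompt on whether any content was actually generated. A self-contained sketch, with a hypothetical helper name and a hypothetical continuation-prompt wording (litellm's actual message construction may differ):

```python
from typing import List, Dict

def build_continuation_messages(
    original_messages: List[Dict],
    generated_content: str,
    is_pre_first_chunk: bool,
) -> List[Dict]:
    """Sketch of the proposed _acompletion_streaming_iterator behavior.

    If the stream failed before producing any content (is_pre_first_chunk,
    or an empty generated_content), retry the fallback model with the
    original messages unchanged; appending a continuation prompt around
    empty content would only waste tokens.
    """
    if is_pre_first_chunk or not generated_content:
        return list(original_messages)  # fresh retry, no continuation prompt
    # Mid-stream failure with partial output: ask the fallback model
    # to continue from the partial content.
    return list(original_messages) + [
        {"role": "assistant", "content": generated_content},
        {"role": "user", "content": "Please continue from where you left off."},
    ]
```

With this check, a 429 on the very first chunk (the lazy-initialization case described above) falls back with the original prompt intact.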

Impact

Any user relying on Router fallbacks for streaming completions will not get fallback behavior when their primary model returns 429. The request fails immediately instead of falling back to an alternative model group.

What part of LiteLLM is this about?

SDK (litellm Python package)

What LiteLLM version are you on?

Latest main (post PR #18698, merged 2026-01-06)
