
fix: allow 429 RateLimitError to trigger MidStreamFallbackError in streaming#22297

Open
CSteigstra wants to merge 1 commit into BerriAI:main from CSteigstra:fix/429-streaming-midstream-fallback

Conversation

@CSteigstra
Contributor

Summary

PR #18698 introduced a blanket 4xx filter in CustomStreamWrapper.__anext__() that prevents all 400-499 status codes from being wrapped in MidStreamFallbackError. While correct for non-retriable client errors (400, 401, 403, 404), 429 (rate-limit) is fundamentally transient and should trigger the Router's fallback system to switch to a different model group.

```python
# Works for 5xx errors (503, 529):
# → MidStreamFallbackError → Router catches → falls back ✅

# Broken for 429:
# → RateLimitError raised directly → no fallback ❌
```

Changes

  • litellm/litellm_core_utils/streaming_handler.py: Exclude 429 from the 4xx filter in __anext__() so rate-limit errors raise MidStreamFallbackError instead of RateLimitError directly. Other 4xx errors (400, 401, 403, 404) still raised directly as before.
  • litellm/router.py: When MidStreamFallbackError has is_pre_first_chunk=True or empty generated_content (e.g. 429 before any tokens), skip the continuation prompt and retry with original messages. Previously this always appended a "continue from this text:" system message with empty content, wasting ~100 tokens.
  • Tests: Added test_vertex_streaming_rate_limit_triggers_midstream_fallback (streaming handler) and test_acompletion_streaming_iterator_pre_first_chunk_skips_continuation (router). Updated existing edge case test for new empty-content behavior.
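The filter change above can be sketched as a small predicate. This is an illustrative sketch only — the function name and shape are hypothetical, and litellm's actual `__anext__()` implementation inlines this check differently:

```python
# Hypothetical helper illustrating the 4xx-filter change; the real code in
# litellm/litellm_core_utils/streaming_handler.py differs in structure.
def should_wrap_in_midstream_fallback(mapped_status_code: int) -> bool:
    """Return True when a streaming error should be wrapped in
    MidStreamFallbackError so the Router's fallback system can engage."""
    is_4xx = 400 <= mapped_status_code <= 499
    # Before the fix: every 4xx was excluded from wrapping.
    # After the fix: 429 is carved out, since rate limits are transient
    # and retriable on a different model group.
    return (not is_4xx) or mapped_status_code == 429

assert should_wrap_in_midstream_fallback(503) is True   # 5xx: wrap, fall back
assert should_wrap_in_midstream_fallback(429) is True   # rate limit: now wraps
assert should_wrap_in_midstream_fallback(401) is False  # auth error: raise directly
```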

Fixes #22296
Relates to #20870, #18229, #8648, #6532

fix: allow 429 RateLimitError to trigger MidStreamFallbackError in streaming

PR BerriAI#18698 introduced a blanket 4xx filter that prevents all 400-499 status
codes from being wrapped in MidStreamFallbackError during async streaming.
While this is correct for non-retriable client errors (400, 401, 403, 404),
429 (rate-limit) is fundamentally transient and should trigger the Router's
fallback system to switch to a different model group.

Changes:
1. streaming_handler.py: Exclude 429 from the 4xx filter in __anext__()
   so rate-limit errors raise MidStreamFallbackError instead of
   RateLimitError directly.
2. router.py: When MidStreamFallbackError has is_pre_first_chunk=True or
   empty generated_content (e.g. 429 before any tokens), skip the
   continuation prompt and retry with original messages. Previously this
   always appended a "continue from this text:" system message with empty
   content, wasting ~100 tokens.

Fixes BerriAI#20870
Relates to BerriAI#18229, BerriAI#8648, BerriAI#6532
@vercel

vercel bot commented Feb 27, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| litellm | Ready | Preview, Comment | Feb 27, 2026 4:48pm |


@greptile-apps
Contributor

greptile-apps bot commented Feb 27, 2026

Greptile Summary

This PR fixes 429 (rate-limit) errors during streaming being raised directly as RateLimitError instead of being wrapped in MidStreamFallbackError, which prevented the Router's fallback system from switching to a different model group. The fix excludes 429 from the blanket 4xx filter introduced in PR #18698 and adds an optimization to skip the continuation prompt when no content was generated before the error.

  • streaming_handler.py: Added `and mapped_status_code != 429` to both 4xx checks so rate-limit errors flow through to MidStreamFallbackError wrapping, consistent with `_should_retry(429)` returning True
  • router.py: When is_pre_first_chunk=True or generated_content is empty, retries with original messages instead of appending a wasteful continuation prompt with empty content
  • Behavioral change: The empty-content optimization applies to all MidStreamFallbackError cases (not just 429), improving token efficiency for any error that occurs before content is generated
  • Tests: Two new mock-only regression tests cover the 429-to-MidStreamFallbackError conversion and the pre-first-chunk message handling; one existing edge-case test updated for the new empty-content behavior
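The router-side behavior described above can be illustrated with a small sketch. The helper name and message shape here are hypothetical (litellm/router.py's real implementation differs); it only demonstrates the branch on `is_pre_first_chunk` / empty `generated_content`:

```python
# Illustrative sketch of the retry-message decision; names are hypothetical.
def build_retry_messages(original_messages, generated_content, is_pre_first_chunk):
    """Decide which messages to send to the fallback deployment after a
    MidStreamFallbackError."""
    if is_pre_first_chunk or not generated_content:
        # Nothing was streamed yet (e.g. a 429 before the first token):
        # retry with the untouched original messages, skipping the
        # continuation prompt entirely.
        return list(original_messages)
    # Otherwise ask the fallback model to continue the partial output.
    return list(original_messages) + [
        {"role": "system",
         "content": f"continue from this text: {generated_content}"}
    ]

msgs = [{"role": "user", "content": "hello"}]
assert build_retry_messages(msgs, "", is_pre_first_chunk=True) == msgs
assert build_retry_messages(msgs, "partial answer", False)[-1]["role"] == "system"
```

Under the old behavior the second branch ran even with empty `generated_content`, appending a continuation prompt that carried no text.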

Confidence Score: 4/5

  • This PR is safe to merge — the changes are minimal, well-targeted, and backed by regression tests.
  • Score of 4 reflects: (1) the streaming_handler.py change is a simple, correct two-line addition that aligns with existing _should_retry() semantics; (2) the router.py change is a sensible optimization with a minor behavioral change for all empty-content MidStreamFallbackErrors (not just 429), but the change is clearly an improvement; (3) tests are mock-only and comprehensive; (4) downstream compatibility verified — _should_retry(429) returns True, and MidStreamFallbackError with status_code=429 flows correctly through the fallback system. Docked one point because the behavioral change to empty-content handling is broader than the PR title implies.
  • Pay attention to litellm/router.py — the empty-content condition change affects all MidStreamFallbackErrors, not just 429 rate-limit errors.

Important Files Changed

| Filename | Overview |
| --- | --- |
| litellm/litellm_core_utils/streaming_handler.py | Excludes 429 from the 4xx filter so rate-limit errors are wrapped in MidStreamFallbackError instead of raised directly. Minimal, targeted change consistent with `_should_retry()` semantics. |
| litellm/router.py | Adds a pre-first-chunk / empty-content check to skip the continuation prompt and use the original messages. Behavioral change for all empty-content MidStreamFallbackErrors, not just 429. Logic is sound and avoids wasting tokens. |
| tests/test_litellm/litellm_core_utils/test_streaming_handler.py | Adds a mock-only regression test for 429 rate limits triggering MidStreamFallbackError. No real network calls; follows existing test patterns. |
| tests/test_litellm/test_router.py | Adds a mock-only regression test for the pre-first-chunk skip behavior and updates an existing edge-case test for the new empty-content handling. No real network calls. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant Router
    participant StreamWrapper as CustomStreamWrapper
    participant Provider as LLM Provider

    Client->>Router: acompletion(stream=True)
    Router->>StreamWrapper: iterate stream
    StreamWrapper->>Provider: make_call()
    Provider-->>StreamWrapper: 429 RateLimitError

    Note over StreamWrapper: Before fix: raise RateLimitError directly ❌
    Note over StreamWrapper: After fix: 429 excluded from 4xx filter

    StreamWrapper-->>Router: MidStreamFallbackError(is_pre_first_chunk=True)

    alt is_pre_first_chunk or empty generated_content
        Note over Router: Use original messages (skip continuation prompt)
    else has generated content
        Note over Router: Append continuation prompt + assistant prefix
    end

    Router->>Router: async_function_with_fallbacks_common_utils()
    Router->>Provider: Fallback to next model group
    Provider-->>Router: Successful response
    Router-->>Client: Stream fallback response ✅
```

Last reviewed commit: 257f29a


@greptile-apps greptile-apps bot left a comment


4 files reviewed, no comments


@CSteigstra CSteigstra marked this pull request as draft February 27, 2026 17:29
@CSteigstra
Contributor Author

Some failing checks. Can have a look later.

@CSteigstra
Contributor Author

> Some failing checks. Can have a look later.

Okay these are flaky tests unrelated to this PR.

@CSteigstra CSteigstra marked this pull request as ready for review February 27, 2026 17:43


Development

Successfully merging this pull request may close these issues.

[Bug]: 429 RateLimitError excluded from MidStreamFallbackError — streaming fallbacks don't fire for rate limits
