fix: allow 429 RateLimitError to trigger MidStreamFallbackError in streaming #22297
Open
CSteigstra wants to merge 1 commit into BerriAI:main
Conversation
PR BerriAI#18698 introduced a blanket 4xx filter that prevents all 400-499 status codes from being wrapped in MidStreamFallbackError during async streaming. While this is correct for non-retriable client errors (400, 401, 403, 404), 429 (rate-limit) is fundamentally transient and should trigger the Router's fallback system to switch to a different model group.

Changes:
1. streaming_handler.py: Exclude 429 from the 4xx filter in __anext__() so rate-limit errors raise MidStreamFallbackError instead of RateLimitError directly.
2. router.py: When MidStreamFallbackError has is_pre_first_chunk=True or empty generated_content (e.g. a 429 before any tokens), skip the continuation prompt and retry with the original messages. Previously this always appended a "continue from this text:" system message with empty content, wasting ~100 tokens.

Fixes BerriAI#20870
Relates to BerriAI#18229, BerriAI#8648, BerriAI#6532
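The first change can be sketched as a small, self-contained example. Note that the class and function names below are illustrative stand-ins, not litellm's actual internals: the point is only that 429 is carved out of the blanket 4xx filter.

```python
# Hypothetical sketch of the 4xx filter change: status codes in 400-499 are
# treated as non-retriable client errors, except 429 (rate limit), which is
# transient and is wrapped so the router's fallback system can handle it.

class MidStreamFallbackError(Exception):
    """Signals the router to fall back to another model group."""
    def __init__(self, status_code: int, is_pre_first_chunk: bool = False):
        super().__init__(f"mid-stream fallback, status={status_code}")
        self.status_code = status_code
        self.is_pre_first_chunk = is_pre_first_chunk

def wrap_stream_error(status_code: int, received_first_chunk: bool) -> Exception:
    # Before the fix: every 4xx was re-raised directly.
    # After the fix: 429 is excluded from the blanket 4xx filter.
    is_non_retriable_4xx = 400 <= status_code < 500 and status_code != 429
    if is_non_retriable_4xx:
        return RuntimeError(f"client error {status_code}")  # raised as-is
    return MidStreamFallbackError(
        status_code, is_pre_first_chunk=not received_first_chunk
    )
```

With this split, a 429 that arrives before the first chunk produces a fallback error carrying `is_pre_first_chunk=True`, while a 401 or 404 is still raised directly.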
Contributor
Greptile Summary: This PR fixes 429 (rate-limit) errors during streaming being raised directly as RateLimitError, so they now trigger MidStreamFallbackError and the Router's fallback system.
Confidence Score: 4/5
| Filename | Overview |
|---|---|
| litellm/litellm_core_utils/streaming_handler.py | Excludes 429 from the 4xx filter so rate-limit errors are wrapped in MidStreamFallbackError instead of raised directly. Minimal, targeted change consistent with _should_retry() semantics. |
| litellm/router.py | Adds pre-first-chunk / empty-content check to skip continuation prompt and use original messages. Behavioral change for all empty-content MidStreamFallbackErrors, not just 429. Logic is sound and avoids wasting tokens. |
| tests/test_litellm/litellm_core_utils/test_streaming_handler.py | Adds mock-only regression test for 429 rate-limit triggering MidStreamFallbackError. No real network calls. Follows existing test patterns. |
| tests/test_litellm/test_router.py | Adds mock-only regression test for pre-first-chunk skip behavior and updates existing edge case test for new empty-content handling. No real network calls. |
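The mock-only regression tests described above follow a common pattern: a fake async stream stands in for the provider and raises a 429 before yielding anything, and the test asserts that a fallback signal surfaces instead of the raw rate-limit error. A minimal sketch of that pattern, with all names illustrative rather than litellm's real API:

```python
# Hedged sketch of a mock-only regression test: no network calls, just an
# async generator that fails with a 429 before producing its first chunk.
import asyncio

class FakeRateLimitError(Exception):
    status_code = 429

class FallbackSignal(Exception):
    """Stand-in for MidStreamFallbackError in this sketch."""

async def fake_provider_stream():
    # Fails before yielding any chunk, like a 429 on the first read.
    raise FakeRateLimitError("rate limited")
    yield  # unreachable; makes this function an async generator

async def consume(stream):
    try:
        async for _chunk in stream:
            pass
    except FakeRateLimitError as e:
        if e.status_code == 429:
            raise FallbackSignal() from e  # 429 -> fallback, not re-raise
        raise

def run() -> str:
    try:
        asyncio.run(consume(fake_provider_stream()))
    except FallbackSignal:
        return "fallback"
    return "no-fallback"
```

Running `run()` exercises the whole path without any real provider, which is why these tests are safe to run in CI.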
Sequence Diagram
```mermaid
sequenceDiagram
    participant Client
    participant Router
    participant StreamWrapper as CustomStreamWrapper
    participant Provider as LLM Provider
    Client->>Router: acompletion(stream=True)
    Router->>StreamWrapper: iterate stream
    StreamWrapper->>Provider: make_call()
    Provider-->>StreamWrapper: 429 RateLimitError
    Note over StreamWrapper: Before fix: raise RateLimitError directly ❌
    Note over StreamWrapper: After fix: 429 excluded from 4xx filter
    StreamWrapper-->>Router: MidStreamFallbackError(is_pre_first_chunk=True)
    alt is_pre_first_chunk or empty generated_content
        Note over Router: Use original messages (skip continuation prompt)
    else has generated content
        Note over Router: Append continuation prompt + assistant prefix
    end
    Router->>Router: async_function_with_fallbacks_common_utils()
    Router->>Provider: Fallback to next model group
    Provider-->>Router: Successful response
    Router-->>Client: Stream fallback response ✅
```
Last reviewed commit: 257f29a
Contributor
Author
Some failing checks. Can have a look later.
Contributor
Author
Okay, these are flaky tests unrelated to this PR.
Summary
PR #18698 introduced a blanket 4xx filter in `CustomStreamWrapper.__anext__()` that prevents all 400-499 status codes from being wrapped in `MidStreamFallbackError`. While correct for non-retriable client errors (400, 401, 403, 404), 429 (rate-limit) is fundamentally transient and should trigger the Router's fallback system to switch to a different model group.

Changes

- `litellm/litellm_core_utils/streaming_handler.py`: Exclude 429 from the 4xx filter in `__anext__()` so rate-limit errors raise `MidStreamFallbackError` instead of `RateLimitError` directly. Other 4xx errors (400, 401, 403, 404) are still raised directly as before.
- `litellm/router.py`: When `MidStreamFallbackError` has `is_pre_first_chunk=True` or empty `generated_content` (e.g. a 429 before any tokens), skip the continuation prompt and retry with the original messages. Previously this always appended a "continue from this text:" system message with empty content, wasting ~100 tokens.
- Tests: Added `test_vertex_streaming_rate_limit_triggers_midstream_fallback` (streaming handler) and `test_acompletion_streaming_iterator_pre_first_chunk_skips_continuation` (router). Updated an existing edge case test for the new empty-content behavior.

Fixes #22296
Relates to #20870, #18229, #8648, #6532
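The router-side change described above can be sketched as a small message-building helper. This is a minimal sketch under stated assumptions, not litellm's actual `router.py` code; the function name and message shape are hypothetical.

```python
# Hypothetical sketch of the router-side fix: when the fallback error fired
# before any tokens were generated, retry with the original messages instead
# of appending a (useless, token-wasting) continuation prompt.

def build_fallback_messages(original_messages: list,
                            generated_content: str,
                            is_pre_first_chunk: bool) -> list:
    if is_pre_first_chunk or not generated_content:
        # Nothing was streamed yet (e.g. a 429 before the first chunk):
        # reuse the original request verbatim.
        return list(original_messages)
    # Otherwise ask the fallback model to continue from the partial text.
    return list(original_messages) + [
        {"role": "system",
         "content": f"continue from this text: {generated_content}"}
    ]
```

The empty-content branch covers both the `is_pre_first_chunk=True` flag and the edge case where the flag is unset but `generated_content` is empty, matching the behavioral change the updated router test exercises.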