Description
Summary
When a 429 (rate-limit) error occurs during async streaming, it is raised directly to the caller instead of being wrapped in MidStreamFallbackError. This prevents the Router's _acompletion_streaming_iterator from catching it and triggering the fallback chain.
```python
router = litellm.Router(
    model_list=[
        {"model_name": "gemini", "litellm_params": {"model": "vertex_ai/gemini-3-flash"}},
        {"model_name": "openai", "litellm_params": {"model": "gpt-4o"}},
    ],
    fallbacks=[{"gemini": ["openai"]}],
)

# When Vertex AI returns 429 during streaming:
response = await router.acompletion(model="gemini", messages=[...], stream=True)
async for chunk in response:  # ← RateLimitError raised here, no fallback to "openai"
    print(chunk)
```

Expected: Router falls back to the `openai` model group.

Actual: `RateLimitError` is raised directly; no fallback is attempted.
Steps to Reproduce
- Configure Router with fallbacks: `{"gemini": ["openai"]}`
- Send a streaming request that triggers a 429 from the primary model
- Observe that `RateLimitError` is raised without fallback
Root Cause
PR #18698 added a blanket 4xx filter in `CustomStreamWrapper.__anext__()` (streaming_handler.py:2159-2162):

```python
if mapped_status_code is not None and 400 <= mapped_status_code < 500:
    raise mapped_exception  # ← 429 hits this, skips MidStreamFallbackError
```

This correctly prevents non-retriable client errors (400, 401, 403, 404) from triggering fallbacks. But 429 is fundamentally different: it is transient and retriable, and should trigger the Router's fallback chain.
Additionally, when MidStreamFallbackError fires before any content is generated (e.g. rate-limit on the very first chunk during lazy stream initialization), _acompletion_streaming_iterator still appends a continuation prompt with empty generated_content, wasting ~100 tokens. The is_pre_first_chunk flag is already available on the exception but not checked.
Suggested Fix
- streaming_handler.py: Exclude 429 from the 4xx filter by adding `and mapped_status_code != 429` to the condition
- router.py: Check `is_pre_first_chunk` / empty `generated_content` in `_acompletion_streaming_iterator` and skip the continuation prompt when no content was generated
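A sketch of the router-side half of the fix; the function and argument names here are hypothetical, not litellm's actual internals:

```python
def build_resume_messages(original_messages, generated_content, continuation_prompt):
    """Only append the assistant's partial output and a continuation
    prompt when the failed stream actually produced content. On a
    pre-first-chunk failure (empty generated_content), fall back with
    the original messages unchanged, avoiding ~100 wasted tokens."""
    if not generated_content:  # pre-first-chunk failure: plain retry
        return list(original_messages)
    return list(original_messages) + [
        {"role": "assistant", "content": generated_content},
        {"role": "user", "content": continuation_prompt},
    ]

msgs = [{"role": "user", "content": "hi"}]
print(len(build_resume_messages(msgs, "", "Please continue.")))         # 1
print(len(build_resume_messages(msgs, "partial…", "Please continue.")))  # 3
```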
Impact
Any user relying on Router fallbacks for streaming completions will not get fallback behavior when their primary model returns 429. The request fails immediately instead of falling back to an alternative model group.
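Until a fix lands, affected callers can work around this by catching the rate-limit error themselves and restarting the stream on a fallback group. A self-contained sketch of that pattern; `StreamRateLimitError` and `fake_stream` are stand-ins for litellm's `RateLimitError` and a real `router.acompletion` stream:

```python
import asyncio

class StreamRateLimitError(Exception):
    """Stand-in for litellm.RateLimitError raised mid-stream."""

async def stream_with_fallback(stream_factory, model_groups):
    """Try each model group in order; on a mid-stream rate limit,
    discard partial output and restart on the next group."""
    last_err = None
    for group in model_groups:
        chunks = []
        try:
            async for chunk in stream_factory(group):
                chunks.append(chunk)
            return group, chunks
        except StreamRateLimitError as err:
            last_err = err  # rate-limited: fall through to next group
    raise last_err

async def fake_stream(group):
    # Simulate: "gemini" is rate-limited after one chunk, "openai" succeeds.
    if group == "gemini":
        yield "partial"
        raise StreamRateLimitError("429 from vertex_ai")
    yield "hello"
    yield " world"

used, out = asyncio.run(stream_with_fallback(fake_stream, ["gemini", "openai"]))
print(used, out)  # openai ['hello', ' world']
```

Note this restarts the stream from scratch rather than resuming with a continuation prompt, which is the simpler (and for pre-first-chunk failures, equivalent) behavior.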
What part of LiteLLM is this about?
SDK (litellm Python package)
What LiteLLM version are you on?
Latest main (post PR #18698, merged 2026-01-06)
Related Issues
- [Bug]: Fallbacks do not work with Pydantic-AI when Gemini 3 Preview throws 429 rate limit error #20870 — Gemini 429 during streaming doesn't trigger fallbacks (user report, same root cause)
- Allow disabling or customizing mid-stream fallback continuation prompt #18229 — Empty continuation prompt on pre-first-chunk errors
- [Feature]: Improving Retry Mechanism Consistency and Logging for Streamed Responses in LiteLLM Proxy #8648 — Streaming retry inconsistency for 429 errors
- [Bug]: NO FALLBACK when streaming and litellm.InternalServerError: AnthropicException - Overloaded. Handle with litellm.InternalServerError. #6532 / #6957 — Streaming fallbacks not working