fix: improve model read retry logic for eventual consistency #22

Open
shin-bot-litellm wants to merge 1 commit into BerriAI:main from shin-bot-litellm:litellm_fix_model_create_race_condition

@shin-bot-litellm

Summary

Fixes #16 - litellm_model resource intermittently fails with "Root object was present, but now absent" on create

Problem

When creating multiple litellm_model resources in parallel, some randomly fail because:

  1. The Create API call succeeds
  2. The immediate Read to verify returns empty/nil due to eventual consistency
  3. Terraform interprets this as "resource disappeared"

Root Cause

The LiteLLM database is eventually consistent: a newly written model is not always immediately readable, especially under concurrent writes.

Solution

Improved the retry mechanism in retryModelRead():

| Before | After |
| --- | --- |
| Fixed error string matching | Flexible pattern matching via isRetryableModelError() |
| No initial delay | 200ms initial delay for database sync |
| 1s starting retry delay | 500ms starting delay (faster first retry) |
| 5 retries max | 8 retries max |
| ~31s max wait | ~45s max wait |
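
The "After" behaviour could look roughly like the sketch below. It is not the provider's actual retryModelRead(): the names retryRead, sleepRecorder and errNotFound are hypothetical, the read and sleep functions are injected so the loop can be exercised without real waiting, and it assumes "8 retries" means one initial read followed by up to eight further reads.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotFound stands in for the "model not found" error returned by the API
// (hypothetical; the real error type is not shown in this PR description).
var errNotFound = errors.New("model not found")

// sleepRecorder records requested sleeps instead of actually sleeping.
type sleepRecorder struct{ total time.Duration }

func (r *sleepRecorder) Sleep(d time.Duration) { r.total += d }

// retryRead waits an initial delay, then calls read until it succeeds,
// returns a non-retryable error, or the retry budget is exhausted.
// It returns the number of reads performed and the final error.
func retryRead(read func() error, retryable func(error) bool, sleep func(time.Duration)) (int, error) {
	const (
		initialDelay = 200 * time.Millisecond // pause before the first read
		startDelay   = 500 * time.Millisecond // delay before the first retry
		maxDelay     = 10 * time.Second       // backoff cap
		maxRetries   = 8
	)

	sleep(initialDelay)
	delay := startDelay
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		err = read()
		if err == nil || !retryable(err) {
			return attempt + 1, err
		}
		if attempt == maxRetries {
			break // budget exhausted; do not sleep again
		}
		sleep(delay)
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
	return maxRetries + 1, fmt.Errorf("model still not readable after %d retries: %w", maxRetries, err)
}

func main() {
	rec := &sleepRecorder{}
	reads := 0
	// Simulate a model that becomes readable on the third read.
	attempts, err := retryRead(
		func() error {
			reads++
			if reads < 3 {
				return errNotFound
			}
			return nil
		},
		func(err error) bool { return errors.Is(err, errNotFound) },
		rec.Sleep,
	)
	fmt.Println(attempts, err, rec.total)
}
```

In real provider code the sleep argument would be time.Sleep. Injecting it keeps the timing logic testable and makes the worst-case wait easy to check.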

Error Patterns Now Handled

  • "not found" messages
  • "Model id = X" patterns
  • "model_not_found" sentinel values
  • Cleared resource IDs
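
As an illustration of the patterns listed above, a classifier along these lines would cover them. This is a sketch, not the real isRetryableModelError(): the function names, the apiError type, and the exact error wording being matched are all assumptions.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// apiError is a hypothetical error type used only to build sample errors.
type apiError string

func (e apiError) Error() string { return string(e) }

// modelIDPattern matches lower-cased messages of the form "model id = <id>".
// The exact wording returned by the API is an assumption.
var modelIDPattern = regexp.MustCompile(`model id\s*=\s*\S+`)

// isRetryableError reports whether err looks like a read that raced ahead
// of the write, based on its message rather than an exact string match.
func isRetryableError(err error) bool {
	if err == nil {
		return false
	}
	msg := strings.ToLower(err.Error())
	return strings.Contains(msg, "not found") ||
		strings.Contains(msg, "model_not_found") ||
		modelIDPattern.MatchString(msg)
}

// shouldRetryRead also treats a cleared (empty) resource ID as retryable,
// assuming the read handler clears the ID when the model is missing.
func shouldRetryRead(resourceID string, err error) bool {
	return resourceID == "" || isRetryableError(err)
}

func main() {
	samples := []error{
		apiError("Model not found in database"),
		apiError("Model id = abc123 does not exist"),
		apiError("error code: model_not_found"),
		apiError("401 unauthorized"),
	}
	for _, err := range samples {
		fmt.Printf("%q retryable=%v\n", err.Error(), isRetryableError(err))
	}
	fmt.Println("cleared ID retryable:", shouldRetryRead("", nil))
}
```

Substring and pattern checks tolerate small changes in message wording, but they can also match unrelated errors that happen to contain "not found", so the retry cap still matters.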

Retry Timeline

200ms initial delay
500ms → 1s → 2s → 4s → 8s → 10s → 10s → 10s
(exponential backoff with 10s cap)
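
The timeline above can be reproduced with a few lines of arithmetic. The sketch below uses the delay values from this description; the identifier names are hypothetical and do not come from the provider source.

```go
package main

import (
	"fmt"
	"time"
)

// Values taken from the PR description; the names are hypothetical.
const (
	initialDelay = 200 * time.Millisecond // pause before the first read
	startDelay   = 500 * time.Millisecond // delay before the first retry
	maxDelay     = 10 * time.Second       // backoff cap
	maxRetries   = 8
)

// backoffSchedule returns the delay before each retry: the delay doubles
// after every retry and is clamped to limit.
func backoffSchedule(start, limit time.Duration, retries int) []time.Duration {
	delays := make([]time.Duration, 0, retries)
	d := start
	for i := 0; i < retries; i++ {
		if d > limit {
			d = limit
		}
		delays = append(delays, d)
		d *= 2
	}
	return delays
}

// totalWait sums the initial delay and every retry delay.
func totalWait(initial time.Duration, delays []time.Duration) time.Duration {
	total := initial
	for _, d := range delays {
		total += d
	}
	return total
}

func main() {
	delays := backoffSchedule(startDelay, maxDelay, maxRetries)
	fmt.Println("retry delays:", delays)
	fmt.Println("worst-case wait:", totalWait(initialDelay, delays))
}
```

The eight retry delays sum to 45.5s, and adding the 200ms initial delay gives 45.7s, which matches the "~45s max wait" figure (time spent in the API calls themselves is not counted).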

Testing

The fix addresses the intermittent failure by:

  1. Allowing more time for database replication
  2. Using pattern-based error detection that is less likely to break when API error messages change
  3. Providing better debug logging

Fixes BerriAI#16 - Model resource intermittently fails with 'Root object was present, but now absent'

The issue occurs when creating multiple models in parallel due to eventual
consistency in the LiteLLM database - the model is created successfully but
the immediate read-back verification returns empty/nil.

Changes:
- Add isRetryableModelError() helper for flexible error pattern matching
- Add 200ms initial delay before first read to allow database sync
- Start with 500ms delay (was 1s) for faster first retry
- Increase retry count from 5 to 8 (total max wait ~45s)
- More robust error detection using pattern matching instead of exact string matching
- Better logging for debugging eventual consistency issues

The retry pattern now handles:
- 'not found' error messages
- 'Model id = X' patterns
- 'model_not_found' sentinel values
- Cleared resource IDs

Development

Successfully merging this pull request may close these issues.

litellm_model resource intermittently fails with "Root object was present, but now absent" on create