Description
When deploying LiteLLM (v1.81.9) on Kubernetes with an AWS RDS PostgreSQL instance, intermittent connection errors occur:
{"level":"ERROR","fields":{"message":"Error in PostgreSQL connection: Error { kind: Closed, cause: None }"},"target":"quaint::connector::postgres"}
This manifests as failed API requests that succeed on retry (typically after a few attempts), causing visible P95/P99 latency spikes and server error rates of roughly 50% under moderate load.
Root Cause
RDS silently drops idle TCP connections after a period of inactivity. Quaint (Prisma's database driver) holds connections in its pool without validating them, and hands a dead connection to the next incoming request. The request fails with kind: Closed, retries on a fresh connection, and succeeds.
LiteLLM does not set a default value for max_idle_connection_lifetime in the DATABASE_URL, meaning connections can sit idle indefinitely and be silently dropped by RDS before LiteLLM recycles them.
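The failure mode described above can be illustrated with a toy model. This is only a sketch of the reasoning, not quaint's actual pool implementation: the `PooledConn` class, the `checkout` function, and the 300-second server-side idle timeout are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PooledConn:
    # Timestamp (in seconds) at which the connection was last used.
    last_used: float


def checkout(
    conn: PooledConn,
    now: float,
    server_idle_timeout: float,
    max_idle_lifetime: Optional[float] = None,
) -> str:
    """Simulate handing a pooled connection to a request.

    Returns "recycled" if the pool replaced the connection proactively,
    "closed" if the server had already dropped it (the request fails),
    or "ok" if the connection was still alive.
    """
    idle = now - conn.last_used
    # With an idle lifetime configured, the pool replaces old connections
    # before handing them out.
    if max_idle_lifetime is not None and idle >= max_idle_lifetime:
        conn.last_used = now
        return "recycled"
    # Without it, the pool hands out whatever it holds, even if the
    # server side dropped the connection long ago.
    if idle >= server_idle_timeout:
        return "closed"
    conn.last_used = now
    return "ok"


# Hypothetical numbers: the server drops connections idle for 300 s.
print(checkout(PooledConn(0.0), now=400.0, server_idle_timeout=300.0))
print(checkout(PooledConn(0.0), now=400.0, server_idle_timeout=300.0,
               max_idle_lifetime=60.0))
```

In this model the first call hands out a dead connection and the second replaces it first, which matches the behaviour observed with and without max_idle_connection_lifetime.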
Solution
Adding the following parameters to the DATABASE_URL resolves the issue completely:
postgresql://user:password@host:5432/dbname?max_idle_connection_lifetime=60&socket_timeout=10
- max_idle_connection_lifetime=60: quaint proactively closes connections idle for 60 seconds, before RDS drops them
- socket_timeout=10: if a stale connection slips through, the request fails fast rather than hanging
These are quaint-native URL parameters, confirmed supported in quaint's PostgreSQL connector.
Suggestion
LiteLLM should set a sensible default for max_idle_connection_lifetime when initialising the Prisma/quaint connection, given that managed cloud databases (RDS, Cloud SQL, Azure Database) routinely drop idle connections. Leaving it unconfigured means any LiteLLM deployment on a managed PostgreSQL instance will hit this error without a clear path to resolution.
A default of 60 seconds for max_idle_connection_lifetime would prevent this for most cloud deployments without meaningful performance impact.
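One way such a default could be applied is sketched below. This is not existing LiteLLM code: `with_default_params` is a hypothetical helper, and it assumes the default should only take effect when the operator has not already set the parameter in DATABASE_URL.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_default_params(database_url: str, defaults: dict) -> str:
    """Return database_url with each default added only if it is not already set."""
    parts = urlsplit(database_url)
    # Keep existing query parameters in their original order.
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    for key, value in defaults.items():
        # Values set explicitly by the operator always win over the default.
        params.setdefault(key, str(value))
    return urlunsplit(parts._replace(query=urlencode(params)))


url = with_default_params(
    "postgresql://user:password@host:5432/dbname",
    {"max_idle_connection_lifetime": 60},
)
print(url)
```

A URL that already carries max_idle_connection_lifetime passes through with its value unchanged, so existing deployments that tuned the setting would not be affected. Note that the query string is re-encoded by `urlencode`, so unusual characters in existing parameter values may come out in a different but equivalent percent-encoded form.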
Environment
- LiteLLM version: 1.81.9
- Deployment: Kubernetes (2 replicas)
- Database: AWS RDS PostgreSQL (db.t4g.small)
- Default connection pool: 10 per pod (20 total)