
PostgreSQL connections silently dropped by RDS due to missing default idle connection lifetime config #22289

@nikogio

Description


When deploying LiteLLM (v1.81.9) on Kubernetes with an AWS RDS PostgreSQL instance, intermittent connection errors occur:

{"level":"ERROR","fields":{"message":"Error in PostgreSQL connection: Error { kind: Closed, cause: None }"},"target":"quaint::connector::postgres"}

This manifests as failed API requests that succeed on retry (typically after a few attempts), causing visible latency spikes (P95/P99) and roughly 50% server error rates under moderate load.

Root Cause

RDS silently drops idle TCP connections after a period of inactivity. Quaint (Prisma's database driver) holds connections in its pool without validating them, and hands a dead connection to the next incoming request. The request fails with kind: Closed, retries on a fresh connection, and succeeds.

LiteLLM does not set a default value for max_idle_connection_lifetime in the DATABASE_URL, meaning connections can sit idle indefinitely and be silently dropped by RDS before LiteLLM recycles them.

Solution

Adding the following parameters to the DATABASE_URL resolves the issue completely:

postgresql://user:password@host:5432/dbname?max_idle_connection_lifetime=60&socket_timeout=10
  • max_idle_connection_lifetime=60 — quaint proactively closes connections idle for 60 seconds, before RDS drops them
  • socket_timeout=10 — if a stale connection slips through, the request fails fast rather than hanging

These are quaint-native URL parameters, confirmed supported in quaint's PostgreSQL connector.
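For deployments where DATABASE_URL is assembled from secrets or env vars, the parameters can be merged in programmatically rather than hand-edited. A minimal Python sketch (the helper name `with_pool_params` is my own; the two URL parameters are the quaint ones described above) that preserves any parameters already present:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def with_pool_params(database_url: str,
                     idle_lifetime: int = 60,
                     socket_timeout: int = 10) -> str:
    """Append quaint pool parameters to a DATABASE_URL.

    Existing query parameters are preserved and take precedence,
    so a value the operator already set is never overwritten.
    """
    parts = urlsplit(database_url)
    query = dict(parse_qsl(parts.query))
    query.setdefault("max_idle_connection_lifetime", str(idle_lifetime))
    query.setdefault("socket_timeout", str(socket_timeout))
    return urlunsplit(parts._replace(query=urlencode(query)))

# Example: a bare RDS URL gains both parameters.
url = with_pool_params("postgresql://user:password@host:5432/dbname")
```

This could run in an init container or entrypoint script before the LiteLLM pod starts, so the fix does not depend on every operator remembering the parameters.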

Suggestion

LiteLLM should set a sensible default for max_idle_connection_lifetime when initialising the Prisma/quaint connection, given that managed cloud databases (RDS, Cloud SQL, Azure Database) routinely drop idle connections. Leaving it unconfigured means any LiteLLM deployment on a managed PostgreSQL instance will hit this error without a clear path to resolution.

A default of 60 seconds for max_idle_connection_lifetime would prevent this for most cloud deployments without meaningful performance impact.
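As a sketch of what such a default could look like inside LiteLLM's startup path (the function name `ensure_idle_lifetime` and the hook point are hypothetical, not existing LiteLLM code), the patch would only apply when the user has not set the parameter themselves:

```python
from urllib.parse import urlsplit, parse_qsl

# Assumed default, per this report: recycle idle connections after 60s,
# before managed databases (RDS, Cloud SQL, etc.) silently drop them.
DEFAULT_IDLE_LIFETIME = "60"

def ensure_idle_lifetime(url: str, default: str = DEFAULT_IDLE_LIFETIME) -> str:
    """Return url with max_idle_connection_lifetime set, unless the
    operator already configured it explicitly in DATABASE_URL."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query))
    if "max_idle_connection_lifetime" in params:
        return url  # respect the operator's explicit choice
    sep = "&" if parts.query else "?"
    return f"{url}{sep}max_idle_connection_lifetime={default}"
```

Because the default is only appended when absent, operators who deliberately tuned the lifetime (or who run a database that never drops idle connections) see no behaviour change.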

Environment

  • LiteLLM version: 1.81.9
  • Deployment: Kubernetes (2 replicas)
  • Database: AWS RDS PostgreSQL (db.t4g.small)
  • Default connection pool: 10 per pod (20 total)
