[Control Plane] No rate limiting on hot endpoints #429

@santoshkumarradha

Summary

No rate-limiting middleware is registered anywhere in the control plane, so a single misbehaving client can exhaust the async worker pool and database connections on hot endpoints.

Context

A search for rate.NewLimiter, tollbooth, or httprate in control-plane/internal/server/ returns no matches. The endpoints most at risk are /api/v1/execute (spawns goroutines + DB writes), /api/v1/discovery (fan-out resolution), and /api/v1/nodes/status/bulk (N storage reads per call). Without throttling, a runaway agent or a scanning client can trigger connection-pool exhaustion and starve legitimate traffic. There is no circuit breaker or backpressure mechanism downstream to compensate.

Scope

In Scope

  • Add per-API-key token-bucket rate limiting on at minimum: execute, discovery, and bulk-status endpoints.
  • Add per-IP rate limiting for any unauthenticated public paths (e.g. health, registration before auth).
  • Make limits configurable via agentfield.yaml / environment variables with sensible defaults.
  • Return HTTP 429 with a Retry-After header when a limit is exceeded.

Out of Scope

  • Distributed rate limiting across multiple control-plane replicas — in-process limiting is sufficient for v1.
  • Adding circuit breakers on storage calls — that is a separate resilience concern.
  • Rate limiting internal agent-to-agent calls that bypass the HTTP layer.

Files

  • control-plane/internal/server/middleware/ — add new ratelimit.go middleware using golang.org/x/time/rate or a Gin-compatible library
  • control-plane/internal/server/routes.go — apply rate-limit middleware to execute, discovery, and bulk-status route groups
  • control-plane/internal/config/config.go — add RateLimit config struct with per-endpoint RequestsPerSecond and BurstSize fields
  • control-plane/internal/server/middleware/ratelimit_test.go — tests: limit is enforced, 429 returned, Retry-After header present
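One possible shape for the RateLimit config struct is sketched below. Only the field names RequestsPerSecond and BurstSize come from this issue; the grouping into Execute, Discovery, and BulkStatus, the environment variable naming, the helper names, and the default numbers are all assumptions for illustration, and the defaults are placeholders rather than recommendations. YAML loading from agentfield.yaml is left out because it depends on the config loader already in use.

```go
package main

import (
	"fmt"
	"strconv"
)

// EndpointLimit holds the two per-endpoint fields named in the issue.
type EndpointLimit struct {
	RequestsPerSecond float64
	BurstSize         int
}

// RateLimit groups the limits for the three hot endpoints (hypothetical layout).
type RateLimit struct {
	Execute    EndpointLimit
	Discovery  EndpointLimit
	BulkStatus EndpointLimit
}

// defaultRateLimit returns placeholder defaults; real values should be
// chosen from observed agent traffic and documented.
func defaultRateLimit() RateLimit {
	return RateLimit{
		Execute:    EndpointLimit{RequestsPerSecond: 50, BurstSize: 100},
		Discovery:  EndpointLimit{RequestsPerSecond: 20, BurstSize: 40},
		BulkStatus: EndpointLimit{RequestsPerSecond: 10, BurstSize: 20},
	}
}

// applyEnv overrides l from <prefix>_RPS and <prefix>_BURST when they are set.
// getenv is injected so the function can be tested; pass os.Getenv in production.
func applyEnv(l *EndpointLimit, prefix string, getenv func(string) string) error {
	if v := getenv(prefix + "_RPS"); v != "" {
		f, err := strconv.ParseFloat(v, 64)
		if err != nil || !(f > 0) {
			return fmt.Errorf("%s_RPS: invalid value %q", prefix, v)
		}
		l.RequestsPerSecond = f
	}
	if v := getenv(prefix + "_BURST"); v != "" {
		n, err := strconv.Atoi(v)
		if err != nil || n < 1 {
			return fmt.Errorf("%s_BURST: invalid value %q", prefix, v)
		}
		l.BurstSize = n
	}
	return nil
}
```

Rejecting non-positive values at load time keeps a typo in the environment from silently disabling or zeroing a limit.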

Acceptance Criteria

  • The execute endpoint returns HTTP 429 when a single API key exceeds the configured rate
  • The 429 response includes a Retry-After header
  • Rate limits are configurable per endpoint via config file / env vars
  • Default limits allow normal agent traffic without throttling (document the defaults)
  • Tests pass (go test ./control-plane/...)
  • Linting passes (make lint)

Notes for Contributors

Severity: MEDIUM

golang.org/x/time/rate is already a transitive dependency via otel — check go.sum before adding a new library. Use a sync.Map keyed by API key to hold per-key rate.Limiter instances. The limiter map should evict idle entries (e.g. via a background sweep) to avoid unbounded growth in deployments with many short-lived API keys.

Labels: area:control-plane, bug, enhancement
