Production Stack

This project provides a reference implementation of an inference stack built on top of Kaito.

Architecture

Istio Gateway — Entry point for all inference requests. Routes client requests (e.g., POST /completions) through the stack.
Body-based Routing — Parses the request body to extract the model name and injects the x-gateway-model-name header, enabling model-level routing.
GAIE EPP (Gateway API Inference Extension Endpoint Picker) — Performs KV-cache aware routing by injecting the x-gateway-destination-endpoint header, directing requests to the optimal inference pod.
Kaito InferenceSet — Manages groups of vLLM inference pods. Multiple InferenceSets (e.g., Model-A, Model-B) can run different models simultaneously.
vLLM Inference Pods — Serve model inference requests using vLLM.
Kaito-Keda-Scaler — Metric-based autoscaler built on KEDA that scales vLLM inference pods up and down based on workload metrics.
Mocked GPU Nodes / CPU Nodes — Infrastructure layer providing compute resources for inference workloads.

All component versions are centralized in versions.env. This file is the single source of truth used by both CI and local E2E scripts.
