Concurrent code security audit system powered by quantized LLMs. Uses HyDE (Hypothetical Document Embeddings) to detect vulnerabilities by comparing original code against LLM-regenerated code from a specification.
┌──────────────────────────────────────┐
│ FastAPI Gateway (:8000) │
│ Least-Load Dispatcher + Affinity Pin │
├──────┬──────┬──────┬─────────────────┤
│/audit│/hyde │/comp │ /metrics │
└──┬───┴──┬──┴──┬───┴────┬────────────┘
│ │ │ │
┌───────────┼──────┼─────┼────────┘
│ │ │ │
┌────▼───┐ ┌─────▼──┐ ┌─▼────┐ ┌──────┐
│ vLLM:2 │ │ vLLM:3 │ │vLLM:6│ │vLLM:7│
│ :8100 │ │ :8101 │ │:8102 │ │:8103 │
│ GPU 2 │ │ GPU 3 │ │GPU 6 │ │GPU 7 │
└────┬───┘ └────┬───┘ └──┬───┘ └──┬───┘
└──PIX──┘ └──PIX──┘
Pair 2/3 (NUMA 0) Pair 6/7 (NUMA 1)
4x RTX A6000 (48GB each) out of 8 available. GPUs selected by PCIe topology (nvidia-smi topo -m): PIX-paired GPUs on separate NUMA nodes avoid cross-socket memory bandwidth contention.
3-Stage Diff Analysis (POST /v1/audit):
project_description + source_code
→ [Stage 1] Logic Summary
→ [Stage 2] Refactored Code (security fixes applied)
→ [Stage 3] Diff Analysis → CWE-classified Vulnerability Report
HyDE Discrepancy Detection (POST /v1/hyde/audit):
original_code
→ [LLM] Generate technical spec (what it SHOULD do)
→ [LLM] Regenerate code from spec (without seeing original)
→ diff(original, regenerated) → Verdict: OK / Failed / Unmatch
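The verdict step above can be sketched as a similarity check between the original and regenerated code. This is a minimal illustration using `difflib`; the thresholds and the exact comparison logic in `hyde_audit.py` are assumptions, not the shipped implementation.

```python
import difflib

def hyde_verdict(original: str, regenerated: str) -> str:
    """Classify HyDE regeneration divergence. Thresholds are illustrative."""
    ratio = difflib.SequenceMatcher(None, original, regenerated).ratio()
    if ratio >= 0.9:
        return "OK"        # regenerated code closely matches the original
    if ratio >= 0.5:
        return "Failed"    # partial match: divergent hunks merit review
    return "Unmatch"       # regeneration bears little resemblance to original

print(hyde_verdict("x = a + b", "x = a + b"))  # → OK
```

In practice a line-level unified diff (rather than a raw ratio) is what feeds the vulnerability report, since the divergent hunks are the audit signal.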
# Prerequisites: 4x NVIDIA RTX A6000, CUDA 12.8, Python 3.10+
nvidia-smi && python3 --version
# Install
cd ./HyDiff
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# Key deps: vllm>=0.6, torch>=2.4, llmcompressor>=0.10 (GPTQ), transformers>=4.44
# Download pre-quantized model (~18GB, ~5 min)
# Output defaults to ~/models/... (override with MODEL_PATH env var)
python3 src/fast_quant_download.py \
--model "Qwen/Qwen2.5-Coder-32B-Instruct-GPTQ-Int4" \
--output ~/models/Qwen2.5-Coder-32B-Instruct-GPTQ-W4A16 \
--threads 8
# Launch (2-5 min startup, wait for "Gateway ready with 4/4 backends")
bash scripts/launch.sh --gpus 2,3,6,7
# Or with custom model path:
# MODEL_PATH=/path/to/model bash scripts/launch.sh --gpus 2,3,6,7
# Verify
curl -s http://127.0.0.1:8000/health | python3 -m json.tool
# → {"status": "healthy", "backends_total": 4, "backends_healthy": 4}
bash examples/01_health_check.sh # Gateway health + metrics
python3 examples/02_single_audit.py # 3-stage audit on flask_sqli.py
python3 examples/03_hyde_audit.py # HyDE audit on race_condition.py
python3 examples/04_batch_audit.py # Batch HyDE audit
python3 examples/06_concurrent_stress.py --concurrency 10
# Audit your own code
python3 examples/02_single_audit.py --file /path/to/code.py --language python
python3 examples/03_hyde_audit.py --file /path/to/code.c --language c
bash benchmarks/benchmark_report.sh http://127.0.0.1:8000 50 100
# Or: launch + benchmark in one command
bash scripts/launch.sh --benchmark --gpus 2,3,6,7
├── src/
│ ├── infra_server.py # FastAPI gateway + vLLM backend manager (779 LOC)
│ ├── quantization_lab.py # GPTQ W4A16 quantization pipeline (927 LOC)
│ ├── hyde_audit.py # HyDE audit: code → doc → regen → diff (736 LOC)
│ └── fast_quant_download.py # Parallel 8-thread model downloader (337 LOC)
├── configs/
│ └── vllm_config.yaml # Per-instance vLLM configuration
├── scripts/
│ └── launch.sh # One-command launcher (298 LOC)
├── benchmarks/
│ └── benchmark_report.sh # 50-concurrent-user load test (324 LOC)
├── calibration_data/ # Generated calibration cache + quantization estimates
├── logs/ # Per-GPU vLLM instance logs (created at runtime)
├── examples/
│ ├── vulnerable_samples/ # Test targets (flask_sqli.py, race_condition.py)
│ └── 01-06_*.py/sh # Usage examples
└── requirements.txt # Python dependencies
GPTQ minimizes per-layer reconstruction error during quantization. The calibration data determines which weight configurations the optimizer prioritizes preserving.
Problem: generic calibration (Pile/C4 text) exercises different weight pathways than security-critical code. Pointer arithmetic, bounds checking, malloc/free pairing, and SQL query construction activate specific attention heads that generic calibration under-prioritizes.
Solution: 4-stage pipeline:
Stage 1: Build calibration dataset
CodeSearchNet (Python/Go/Java/JS) + optional local code (C/C++/Rust)
→ Filter by security keywords, complexity score, min 15 lines
→ Rank: security_score × 3.0 + complexity_score
→ Wrap in chat template (matches production prompt format)
→ 256 calibration samples (configurable)
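The Stage 1 ranking can be sketched as follows. The scoring functions here are simplified stand-ins (keyword counting and a line-count proxy), not the `quantization_lab.py` implementations; only the `security_score × 3.0 + complexity_score` formula is from the pipeline description.

```python
# Hypothetical keyword set; the real filter is per-language (see table below).
SECURITY_KEYWORDS = {"malloc", "free", "memcpy", "strcpy", "execute(", "subprocess"}

def security_score(code: str) -> float:
    # Count how many security-relevant keywords appear in the sample.
    return float(sum(kw in code for kw in SECURITY_KEYWORDS))

def complexity_score(code: str) -> float:
    # Crude proxy: non-empty line count, normalized to the 15-line minimum.
    lines = [l for l in code.splitlines() if l.strip()]
    return len(lines) / 15.0

def rank(code: str) -> float:
    # The pipeline's ranking formula: security dominates 3:1 over complexity.
    return security_score(code) * 3.0 + complexity_score(code)

samples = ["buf = malloc(n); memcpy(buf, src, n);", "print('hello')"]
samples.sort(key=rank, reverse=True)  # security-heavy samples rank first
```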
Stage 2: GPTQ W4A16 quantization via llmcompressor
→ ~64GB FP16 → ~18GB INT4 (3.5x compression)
Stage 3: Perplexity evaluation
→ Sliding-window PPL on held-out code, acceptable: < 0.5 PPL delta
Stage 4: VRAM budget analysis
Security keyword filter by language family:
| Language | Keywords |
|---|---|
| C/C++/Rust | malloc, free, buffer, overflow, strcpy, memcpy, unsafe, pointer ops |
| Python | ctypes, struct.pack, subprocess, execute(, SQL formatting |
| Go | unsafe.Pointer, cgo, race conditions |
| JS | eval(, innerHTML, __proto__, prototype pollution |
| Java | Runtime.exec, ObjectInputStream, JNDI, deserialization |
Chat template wrapping: every calibration sample is wrapped in the same <|im_start|> chat template used in production via tokenizer.apply_chat_template(). Without this, attention patterns on special tokens and role markers would be miscalibrated.
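For illustration, here is the shape of the wrapped sample, reproducing Qwen's `<|im_start|>` format by hand; the pipeline itself uses `tokenizer.apply_chat_template()`, which should always be preferred over manual string assembly. The system prompt is an assumed placeholder.

```python
def wrap_calibration_sample(code: str, language: str) -> str:
    """Wrap a raw code sample in the Qwen chat-template layout (sketch)."""
    system = "You are a security auditor."  # assumed production system prompt
    user = f"Analyze this {language} code for vulnerabilities:\n{code}"
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"  # generation starts after this marker
    )

sample = wrap_calibration_sample("cursor.execute(q)", "python")
```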
# Run full pipeline
python3 src/quantization_lab.py --stage all --gpu 2
# Individual stages
python3 src/quantization_lab.py --stage calibrate --gpu 2
python3 src/quantization_lab.py --stage quantize --gpu 2
python3 src/quantization_lab.py --stage evaluate --gpu 2
python3 src/quantization_lab.py --stage analysis
# Calibrate with your own codebase (recommended for domain-specific auditing)
python3 src/quantization_lab.py --stage all --gpu 2 \
--calibration-source mixed \
--local-code-dir /path/to/your/project \
--local-ratio 0.5
Qwen2.5-Coder-32B: 64 layers, 8 KV heads (GQA), 40 query heads, head_dim=128
Per-token KV memory (note: uses GQA KV heads, NOT query heads):
FP16: 2 × 64 × 8 × 128 × 2B = 0.25 MB/token
INT8: 2 × 64 × 8 × 128 × 1B = 0.125 MB/token (50% savings)
VRAM budget per A6000 (GPTQ W4A16):
Total : 48 GB
Model weights (INT4) : ~18 GB
Overhead : ~2 GB
Available for KV : ~28 GB
INT8 KV @ 32K tokens : 4 GB → ~7 concurrent 32K streams
INT8 KV @ 8K tokens : 1 GB → ~26 concurrent 8K streams
FP16 KV @ 8K tokens : 2 GB → ~13 concurrent 8K streams (half of INT8)
INT8 KV doubles the concurrent capacity (e.g., 26 vs 13 streams at 8K context) on a single 48GB card with no measurable quality loss for code audit tasks.
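The budget arithmetic above is easy to reproduce. This sketch uses the document's own figures (64 layers, 8 GQA KV heads, head_dim 128, 28 GB KV budget); the raw division gives 28 streams at 8K/INT8, and the README's ~26 presumably reflects additional runtime margin (allocator overhead, `gpu_memory_utilization` headroom).

```python
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128  # GQA: 8 KV heads, NOT 40 query heads

def kv_bytes_per_token(dtype_bytes: int) -> int:
    # K and V tensors (factor 2), across all layers and KV heads.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * dtype_bytes

fp16 = kv_bytes_per_token(2)  # 262144 B = 0.25 MB/token
int8 = kv_bytes_per_token(1)  # 131072 B = 0.125 MB/token

kv_budget_gb = 48 - 18 - 2    # total - INT4 weights - overhead ≈ 28 GB
streams_8k_int8 = (kv_budget_gb * 1024**3) // (8192 * int8)
print(fp16, int8, streams_8k_int8)  # → 262144 131072 28
```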
32B model at INT4 ≈ 18GB, well within single A6000 capacity. We run 4 independent vLLM instances instead of TP=4:
- 4× throughput: 4 independent instances serve 4 requests in parallel. TP=4 gives one instance with faster single-request latency but lower aggregate throughput.
- Fault isolation: single GPU failure doesn't take down the whole system.
- No inter-GPU communication: avoids PCIe bandwidth bottleneck (even PIX pairs are ~32 GB/s vs ~900 GB/s HBM).
Level 1 — Intra-pipeline (within a single /v1/audit call): all 3 stages are pinned to the same vLLM backend. Stage 3 reuses prefix KV blocks from Stage 1 without re-prefill:
Stage 1: [system_prompt] + [project_desc + source_code] → summary
Stage 3: [system_prompt] + [source_code + ...] → cache HIT on source_code prefix
Level 2 — Inter-request (across API calls): SHA-256 hash of (project_description + source_code) maps to a backend index. Same project always hits the same GPU:
affinity_key = sha256(project_description + source_code)
preferred_idx = int(affinity_key, 16) % len(backends)
if backends[preferred_idx].healthy and not overloaded:
    return backends[preferred_idx]   # prefix cache warm
else:
    return ring_walk_or_least_load() # graceful fallback
Overload threshold: active requests > 2× fleet average triggers ring walk, then least-load fallback.
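The routing rule can be made concrete as a runnable sketch. The `Backend` type and tie-breaking details are illustrative, not the `infra_server.py` implementation; only the SHA-256 affinity key, the 2× overload threshold, and the ring-walk/least-load fallback order come from the description above.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    healthy: bool = True
    active: int = 0  # in-flight requests

def route(project_description: str, source_code: str, backends: list) -> Backend:
    # Level 2 affinity: same project always hashes to the same backend index.
    key = hashlib.sha256((project_description + source_code).encode()).hexdigest()
    idx = int(key, 16) % len(backends)
    avg = sum(b.active for b in backends) / len(backends)
    preferred = backends[idx]
    if preferred.healthy and preferred.active <= 2 * avg:
        return preferred  # prefix cache warm
    # Ring walk: next healthy, non-overloaded backend after the preferred one.
    for step in range(1, len(backends)):
        cand = backends[(idx + step) % len(backends)]
        if cand.healthy and cand.active <= 2 * avg:
            return cand
    # Last resort: least-loaded backend regardless of affinity.
    return min(backends, key=lambda b: b.active)

fleet = [Backend(f"gpu{i}") for i in (2, 3, 6, 7)]
assert route("proj", "code", fleet) is route("proj", "code", fleet)  # sticky
```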
| Scenario | Without | With | Speedup |
|---|---|---|---|
| Stage 3 TTFT (8K input) | ~4.5s | ~0.8s | ~5.6× |
| Repeated audit (same project) | Full prefill | Cache HIT | ~30-40% E2E |
| 10 analysts, same codebase | 10× prefill | 1× prefill + 9× HIT | ~9× savings |
Code audit inputs often reach 8K-16K tokens. Without chunking, a 16K prefill monopolizes the GPU for ~200ms, starving concurrent decode-phase requests (head-of-line blocking).
With --enable-chunked-prefill at 8192 tokens per chunk: the 16K input splits into 2 iterations, allowing decode requests to execute between chunks. Result: TTFT increases slightly for large requests, but TPOT stays stable for all concurrent users.
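The chunk arithmetic is straightforward; this sketch just shows how many prefill iterations a prompt needs at a given chunk size, which is where the scheduler gains its decode-interleaving opportunities.

```python
import math

def prefill_chunks(prompt_tokens: int, chunk_size: int = 8192) -> int:
    """Number of prefill iterations for a prompt under chunked prefill."""
    return math.ceil(prompt_tokens / chunk_size)

print(prefill_chunks(16384))  # → 2: decode requests can run between the chunks
```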
All paths are configurable via environment variables with sensible defaults. No hard-coded user paths in source code.
| Variable | Default | Description |
|---|---|---|
| `MODEL_PATH` | `~/models/Qwen2.5-Coder-32B-Instruct-GPTQ-W4A16` | Local path to quantized model weights. Falls back to HuggingFace `Qwen/Qwen2.5-Coder-32B-Instruct-GPTQ-Int4` if not found. |
| `VLLM_CONFIG` | `configs/vllm_config.yaml` | Path to the vLLM YAML configuration file |
| `HF_CACHE` | `~/.cache/huggingface` | HuggingFace cache directory (model downloads) |
| `QUANT_OUTPUT` | `~/models/Qwen2.5-Coder-32B-Instruct-GPTQ-W4A16` | Output directory for the self-quantized model |
# Example: override model path for a different mount point
MODEL_PATH=/mnt/nvme/models/Qwen2.5-Coder-32B-GPTQ bash scripts/launch.sh --gpus 2,3,6,7
# Example: use a shared NFS cache for model downloads
HF_CACHE=/shared/hf_cache python3 src/fast_quant_download.py --threads 8
All vLLM parameters live in configs/vllm_config.yaml (uses the gptq_marlin quantization backend).
| Parameter | Default | Notes |
|---|---|---|
| `gpu_memory_utilization` | `0.88` | 48 × 0.88 = 42.2 GB usable. Increase to 0.92 for more KV cache. |
| `max_model_len` | `32768` | Reduce to 16384 to double concurrent batch capacity |
| `max_num_seqs` | `8` | Max concurrent sequences per instance |
| `kv_cache.dtype` | `int8` | `auto` for FP16 KV (halves context capacity) |
| `block_size` | `16` | PagedAttention block size (16 optimal for A6000) |
| `chunked_prefill.max_num_batched_tokens` | `8192` | Prefill chunk size; smaller = less HOL blocking, higher TTFT |
Default GPUs are 2,3,6,7 (PIX pairs on separate NUMA nodes). To use different GPUs:
python3 -m src.infra_server --gpus 0,1,4,5 --port 8000
# Check your topology first:
nvidia-smi topo -m
# Look for PIX (same PCIe switch) pairs on separate NUMA nodes
The gateway exposes Prometheus metrics at GET /metrics:
| Metric | Description |
|---|---|
| `gateway_requests_total` | Request count by endpoint/status |
| `gateway_request_latency_seconds` | E2E latency histogram (buckets: 0.1s–300s) |
| `gateway_total_generation_throughput_tps` | Aggregated token throughput |
| `gateway_gpu_kv_cache_usage_pct` | KV cache utilization per backend |
| `gateway_active_requests` | In-flight requests per backend |
| `gateway_affinity_routing_total` | Affinity routing hit/miss counters |
| `gateway_backend_healthy` | Per-backend health (1/0) |
Each vLLM instance also exposes native vLLM metrics (vllm:avg_generation_throughput_toks_per_s, vllm:time_per_output_token_seconds, vllm:gpu_cache_usage_perc, etc.).
Backend status and per-GPU health.
{
"project_description": "A Flask REST API for user management...",
"source_code": "from flask import Flask...",
"language": "python",
"max_tokens": 4096,
"temperature": 0.3,
"focus_areas": ["injection", "auth_bypass", "overflow"]
}
{
"prompt": "Analyze this code for vulnerabilities: ...",
"system_prompt": "You are a security auditor.",
"max_tokens": 2048,
"temperature": 0.3
}
{
"code": "void parse_packet(char *raw, int len) { ... }",
"language": "c",
"context": "IoT packet parser"
}
{
"samples": [
{"name": "app.py", "code": "...", "language": "python"},
{"name": "parser.c", "code": "...", "language": "c"}
],
"max_concurrency": 4
}
Only needed to reproduce the GPTQ pipeline with custom calibration. Takes 30-90 min on a single A6000.
Output path defaults to $QUANT_OUTPUT (or ~/models/Qwen2.5-Coder-32B-Instruct-GPTQ-W4A16).
# Full pipeline
python3 src/quantization_lab.py --stage all --gpu 2
# With your own codebase (recommended)
python3 src/quantization_lab.py --stage all --gpu 2 \
--calibration-source mixed \
--local-code-dir /path/to/your/project \
--local-ratio 0.5
# Via launch.sh (quantize + serve)
bash scripts/launch.sh --quantize-local --gpus 2,3,6,7
bash scripts/launch.sh --quantize-local \
--calib-source mixed \
--local-code-dir /path/to/project \
--local-ratio 0.6 --gpus 2,3,6,7
| Calibration Source | Flag | Use Case |
|---|---|---|
| CodeSearchNet (default) | `--calib-source codesearchnet` | General code audit |
| Local codebase only | `--calib-source local --local-code-dir DIR` | Domain-specific (e.g., IoT firmware) |
| Mixed (recommended) | `--calib-source mixed --local-code-dir DIR --local-ratio 0.5` | Best of both |
# No healthy backends
tail -n 50 logs/vllm_gpu2.log
# Common: CUDA OOM → reduce gpu_memory_utilization in configs/vllm_config.yaml
# Model not found → check MODEL_PATH env var, re-run fast_quant_download.py
# Port conflict → lsof -i :8100 -i :8101 -i :8102 -i :8103
# Slow first request → normal (CUDA graph capture + KV cache alloc)
# Check which model path is being used
echo $MODEL_PATH # should point to your GPTQ model directory
ls $MODEL_PATH   # should contain config.json, model.safetensors, etc.
- Speculative Decoding: Qwen2.5-Coder-0.5B as draft model for ~1.5× decode speedup
- Guided Generation: Outlines integration for constrained exploit payload formatting
- Tensor Parallelism: TP=2 across PIX pairs for 64K+ context support