cubicYYY/HyDiff
HyDiff: Hypothetical Diff Analysis

Concurrent code security audit system powered by quantized LLMs. Uses HyDE (Hypothetical Document Embeddings) to detect vulnerabilities by comparing original code against LLM-regenerated code from a specification.

Architecture

                ┌──────────────────────────────────────┐
                │     FastAPI Gateway (:8000)           │
                │  Least-Load Dispatcher + Affinity Pin │
                ├──────┬──────┬──────┬─────────────────┤
                │/audit│/hyde │/comp │ /metrics         │
                └──┬───┴──┬──┴──┬───┴────┬────────────┘
                   │      │     │        │
       ┌───────────┼──────┼─────┼────────┘
       │           │      │     │
  ┌────▼───┐ ┌─────▼──┐ ┌─▼────┐ ┌──────┐
  │ vLLM:2 │ │ vLLM:3 │ │vLLM:6│ │vLLM:7│
  │ :8100  │ │ :8101  │ │:8102 │ │:8103 │
  │ GPU 2  │ │ GPU 3  │ │GPU 6 │ │GPU 7 │
  └────┬───┘ └────┬───┘ └──┬───┘ └──┬───┘
       └──PIX──┘            └──PIX──┘
    Pair 2/3 (NUMA 0)    Pair 6/7 (NUMA 1)

Runs on 4 of 8 available RTX A6000s (48 GB each). GPUs are selected by PCIe topology (nvidia-smi topo -m): PIX-paired GPUs on separate NUMA nodes avoid cross-socket memory bandwidth contention.

Two Audit Pipelines

3-Stage Diff Analysis (POST /v1/audit):

project_description + source_code
  → [Stage 1] Logic Summary
  → [Stage 2] Refactored Code (security fixes applied)
  → [Stage 3] Diff Analysis → CWE-classified Vulnerability Report
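
The three stages chain into a single call sequence. A minimal sketch of that chain, where `llm` stands in for a request to a vLLM backend and the prompt wording is illustrative rather than the shipped prompts:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AuditResult:
    summary: str
    refactored: str
    report: str

def three_stage_audit(llm: Callable[[str], str],
                      project_description: str,
                      source_code: str) -> AuditResult:
    # Stage 1: summarize the intended logic
    summary = llm(f"Summarize the logic of this project.\n"
                  f"{project_description}\n{source_code}")
    # Stage 2: regenerate the code with security fixes applied
    refactored = llm(f"Rewrite this code with security fixes applied.\n"
                     f"{source_code}")
    # Stage 3: diff-driven, CWE-classified vulnerability report
    report = llm(f"Compare the original and refactored code; report CWEs.\n"
                 f"{source_code}\n---\n{refactored}\nContext: {summary}")
    return AuditResult(summary, refactored, report)
```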

HyDE Discrepancy Detection (POST /v1/hyde/audit):

original_code
  → [LLM] Generate technical spec (what it SHOULD do)
  → [LLM] Regenerate code from spec (without seeing original)
  → diff(original, regenerated) → Verdict: OK / Failed / Unmatch
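
The HyDE loop can be sketched as follows; `llm` is again a stand-in for a backend call, and the verdict heuristic (diff-ratio thresholds) is an assumption for illustration, not the logic in hyde_audit.py:

```python
import difflib

def hyde_audit(llm, original_code: str) -> dict:
    # Step 1: derive a spec of what the code SHOULD do
    spec = llm(f"Write a technical spec for this code:\n{original_code}")
    # Step 2: regenerate code from the spec alone (original withheld)
    regenerated = llm(f"Implement this spec:\n{spec}")
    # Step 3: diff original vs regenerated and score similarity
    ratio = difflib.SequenceMatcher(None, original_code, regenerated).ratio()
    diff = "\n".join(difflib.unified_diff(
        original_code.splitlines(), regenerated.splitlines(), lineterm=""))
    if ratio > 0.9:
        verdict = "OK"        # close match: code agrees with its spec
    elif ratio > 0.5:
        verdict = "Failed"    # partial match: suspicious divergences
    else:
        verdict = "Unmatch"   # regeneration diverged entirely
    return {"verdict": verdict, "ratio": ratio, "diff": diff}
```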

Quick Start

# Prerequisites: 4x NVIDIA RTX A6000, CUDA 12.8, Python 3.10+
nvidia-smi && python3 --version

# Install
cd ./HyDiff
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# Key deps: vllm>=0.6, torch>=2.4, llmcompressor>=0.10 (GPTQ), transformers>=4.44

# Download pre-quantized model (~18GB, ~5 min)
# Output defaults to ~/models/... (override with MODEL_PATH env var)
python3 src/fast_quant_download.py \
    --model "Qwen/Qwen2.5-Coder-32B-Instruct-GPTQ-Int4" \
    --output ~/models/Qwen2.5-Coder-32B-Instruct-GPTQ-W4A16 \
    --threads 8

# Launch (2-5 min startup, wait for "Gateway ready with 4/4 backends")
bash scripts/launch.sh --gpus 2,3,6,7
# Or with custom model path:
# MODEL_PATH=/path/to/model bash scripts/launch.sh --gpus 2,3,6,7

# Verify
curl -s http://127.0.0.1:8000/health | python3 -m json.tool
# → {"status": "healthy", "backends_total": 4, "backends_healthy": 4}

Running Examples

bash examples/01_health_check.sh              # Gateway health + metrics
python3 examples/02_single_audit.py            # 3-stage audit on flask_sqli.py
python3 examples/03_hyde_audit.py              # HyDE audit on race_condition.py
python3 examples/04_batch_audit.py             # Batch HyDE audit
python3 examples/06_concurrent_stress.py --concurrency 10

# Audit your own code
python3 examples/02_single_audit.py --file /path/to/code.py --language python
python3 examples/03_hyde_audit.py --file /path/to/code.c --language c

Benchmarks

bash benchmarks/benchmark_report.sh http://127.0.0.1:8000 50 100
# Or: launch + benchmark in one command
bash scripts/launch.sh --benchmark --gpus 2,3,6,7

Project Structure

├── src/
│   ├── infra_server.py        # FastAPI gateway + vLLM backend manager (779 LOC)
│   ├── quantization_lab.py    # GPTQ W4A16 quantization pipeline (927 LOC)
│   ├── hyde_audit.py          # HyDE audit: code → doc → regen → diff (736 LOC)
│   └── fast_quant_download.py # Parallel 8-thread model downloader (337 LOC)
├── configs/
│   └── vllm_config.yaml       # Per-instance vLLM configuration
├── scripts/
│   └── launch.sh              # One-command launcher (298 LOC)
├── benchmarks/
│   └── benchmark_report.sh    # 50-concurrent-user load test (324 LOC)
├── calibration_data/          # Generated calibration cache + quantization estimates
├── logs/                      # Per-GPU vLLM instance logs (created at runtime)
├── examples/
│   ├── vulnerable_samples/    # Test targets (flask_sqli.py, race_condition.py)
│   └── 01-06_*.py/sh          # Usage examples
└── requirements.txt           # Python dependencies

Technical Decisions

Quantization: CodeSearchNet Calibration (quantization_lab.py)

GPTQ minimizes per-layer reconstruction error during quantization. The calibration data determines which weight configurations the optimizer prioritizes preserving.

Problem: generic calibration (Pile/C4 text) exercises different weight pathways than security-critical code. Pointer arithmetic, bounds checking, malloc/free pairing, and SQL query construction activate specific attention heads that generic calibration under-prioritizes.

Solution: 4-stage pipeline:

Stage 1: Build calibration dataset
  CodeSearchNet (Python/Go/Java/JS) + optional local code (C/C++/Rust)
  → Filter by security keywords, complexity score, min 15 lines
  → Rank: security_score × 3.0 + complexity_score
  → Wrap in chat template (matches production prompt format)
  → 256 calibration samples (configurable)

Stage 2: GPTQ W4A16 quantization via llmcompressor
  → ~64GB FP16 → ~18GB INT4 (3.5x compression)

Stage 3: Perplexity evaluation
  → Sliding-window PPL on held-out code, acceptable: < 0.5 PPL delta

Stage 4: VRAM budget analysis
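
Stage 1's filter-and-rank step can be sketched like this; the keyword list and scoring proxies are illustrative stand-ins for the actual heuristics in quantization_lab.py, but the ranking formula matches the one above:

```python
# Hypothetical keyword subset; the real filter is per language family.
SECURITY_KEYWORDS = ("malloc", "free", "strcpy", "memcpy", "subprocess",
                     "execute(", "eval(", "unsafe")

def security_score(code: str) -> float:
    return sum(code.count(kw) for kw in SECURITY_KEYWORDS)

def complexity_score(code: str) -> float:
    # crude proxy: branching density
    return sum(code.count(tok) for tok in ("if ", "for ", "while ")) / 10.0

def rank_samples(samples: list[str], n: int = 256) -> list[str]:
    # Filter: min 15 lines; Rank: security_score x 3.0 + complexity_score
    eligible = [s for s in samples if len(s.splitlines()) >= 15]
    eligible.sort(key=lambda s: security_score(s) * 3.0 + complexity_score(s),
                  reverse=True)
    return eligible[:n]
```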

Security keyword filter by language family:

| Language | Keywords |
| --- | --- |
| C/C++/Rust | malloc, free, buffer, overflow, strcpy, memcpy, unsafe, pointer ops |
| Python | ctypes, struct.pack, subprocess, execute(, SQL formatting |
| Go | unsafe.Pointer, cgo, race conditions |
| JS | eval(, innerHTML, __proto__, prototype pollution |
| Java | Runtime.exec, ObjectInputStream, JNDI, deserialization |

Chat template wrapping: every calibration sample is wrapped in the same <|im_start|> chat template used in production via tokenizer.apply_chat_template(). Without this, attention patterns on special tokens and role markers would be miscalibrated.
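
For illustration, a hand-built version of the ChatML shape each sample takes (the real pipeline calls tokenizer.apply_chat_template(); the system/user wording here is assumed):

```python
def wrap_chatml(code_sample: str,
                system: str = "You are a security auditor.") -> str:
    # Mirrors the <|im_start|>/<|im_end|> structure the tokenizer emits
    return (f"<|im_start|>system\n{system}<|im_end|>\n"
            f"<|im_start|>user\nAudit this code:\n{code_sample}<|im_end|>\n"
            f"<|im_start|>assistant\n")
```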

# Run full pipeline
python3 src/quantization_lab.py --stage all --gpu 2

# Individual stages
python3 src/quantization_lab.py --stage calibrate --gpu 2
python3 src/quantization_lab.py --stage quantize --gpu 2
python3 src/quantization_lab.py --stage evaluate --gpu 2
python3 src/quantization_lab.py --stage analysis

# Calibrate with your own codebase (recommended for domain-specific auditing)
python3 src/quantization_lab.py --stage all --gpu 2 \
    --calibration-source mixed \
    --local-code-dir /path/to/your/project \
    --local-ratio 0.5

INT8 KV Cache: VRAM Analysis

Qwen2.5-Coder-32B: 64 layers, 8 KV heads (GQA), 40 query heads, head_dim=128

Per-token KV memory (note: uses GQA KV heads, NOT query heads):
  FP16: 2 × 64 × 8 × 128 × 2B = 0.25 MB/token
  INT8: 2 × 64 × 8 × 128 × 1B = 0.125 MB/token  (50% savings)

VRAM budget per A6000 (GPTQ W4A16):
  Total                : 48 GB
  Model weights (INT4) : ~18 GB
  Overhead             : ~2 GB
  Available for KV     : ~28 GB

  INT8 KV @ 32K tokens : 4 GB   → ~7 concurrent 32K streams
  INT8 KV @ 8K tokens  : 1 GB   → ~26 concurrent 8K streams
  FP16 KV @ 8K tokens  : 2 GB   → ~13 concurrent 8K streams (half of INT8)

INT8 KV doubles the concurrent capacity (e.g., 26 vs 13 streams at 8K context) on a single 48GB card with no measurable quality loss for code audit tasks.
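
The arithmetic above reproduces directly (the ~26/~13 figures leave headroom for runtime overhead below the theoretical ceiling):

```python
# Qwen2.5-Coder-32B: GQA means 8 KV heads, not 40 query heads
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128

def kv_mb_per_token(bytes_per_elem: int) -> float:
    # factor 2 = K cache + V cache
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_elem / 2**20

fp16_mb = kv_mb_per_token(2)   # 0.25 MB/token
int8_mb = kv_mb_per_token(1)   # 0.125 MB/token

kv_budget_mb = 28 * 1024       # ~28 GB free after weights + overhead
streams_8k_int8 = int(kv_budget_mb // (int8_mb * 8192))   # 28 (ceiling)
streams_8k_fp16 = int(kv_budget_mb // (fp16_mb * 8192))   # 14 (ceiling)
```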

Multi-Instance vs Tensor Parallel

32B model at INT4 ≈ 18GB, well within single A6000 capacity. We run 4 independent vLLM instances instead of TP=4:

  • 4× throughput: 4 independent instances serve 4 requests in parallel. TP=4 gives one instance with faster single-request latency but lower aggregate throughput.
  • Fault isolation: single GPU failure doesn't take down the whole system.
  • No inter-GPU communication: avoids PCIe bandwidth bottleneck (even PIX pairs are ~32 GB/s vs ~900 GB/s HBM).

Prefix Caching: Two-Level Affinity

Level 1 — Intra-pipeline (within a single /v1/audit call): all 3 stages are pinned to the same vLLM backend. Stage 3 reuses prefix KV blocks from Stage 1 without re-prefill:

Stage 1: [system_prompt] + [project_desc + source_code] → summary
Stage 3: [system_prompt] + [source_code + ...] → cache HIT on source_code prefix

Level 2 — Inter-request (across API calls): SHA-256 hash of (project_description + source_code) maps to a backend index. Same project always hits the same GPU:

affinity_key = sha256(project_description + source_code)
preferred_idx = int(affinity_key, 16) % len(backends)
if backends[preferred_idx].healthy and not overloaded:
    return backends[preferred_idx]    # prefix cache warm
else:
    return ring_walk_or_least_load()  # graceful fallback

Overload threshold: active requests > 2× fleet average triggers ring walk, then least-load fallback.
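
A runnable sketch of this routing policy; the Backend class and the exact fallback order are simplified stand-ins for infra_server.py:

```python
import hashlib

class Backend:
    def __init__(self, name: str):
        self.name, self.healthy, self.active = name, True, 0

def pick_backend(backends, project_description: str,
                 source_code: str) -> Backend:
    key = hashlib.sha256(
        (project_description + source_code).encode()).hexdigest()
    preferred = int(key, 16) % len(backends)
    avg = sum(b.active for b in backends) / len(backends)
    b = backends[preferred]
    if b.healthy and b.active <= 2 * avg:   # overload: > 2x fleet average
        return b                            # prefix cache warm
    # ring walk from the preferred index, then least-load fallback
    for i in range(1, len(backends)):
        cand = backends[(preferred + i) % len(backends)]
        if cand.healthy and cand.active <= 2 * avg:
            return cand
    return min((c for c in backends if c.healthy), key=lambda c: c.active)
```

The same (description, code) pair always hashes to the same index, so repeat audits land on a warm prefix cache unless that backend is down or overloaded.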

| Scenario | Without | With | Speedup |
| --- | --- | --- | --- |
| Stage 3 TTFT (8K input) | ~4.5s | ~0.8s | ~5.6× |
| Repeated audit (same project) | Full prefill | Cache HIT | ~30-40% E2E |
| 10 analysts, same codebase | 10× prefill | 1× prefill + 9× HIT | ~9× savings |

Chunked Prefill

Code audit inputs often reach 8K-16K tokens. Without chunking, a 16K prefill monopolizes the GPU for ~200ms, starving concurrent decode-phase requests (head-of-line blocking).

With --enable-chunked-prefill at 8192 tokens per chunk: the 16K input splits into 2 iterations, allowing decode requests to execute between chunks. Result: TTFT increases slightly for large requests, but TPOT stays stable for all concurrent users.
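
A toy scheduler illustrating the interleaving (the real vLLM scheduler is far more involved; this only shows how decode steps slot between prefill chunks):

```python
def schedule(prefill_tokens: int, chunk: int = 8192, decode_reqs: int = 3):
    # Returns a flat timeline of ("prefill", n_tokens) / ("decode", 1) steps
    timeline = []
    remaining = prefill_tokens
    while remaining > 0:
        step = min(chunk, remaining)
        timeline.append(("prefill", step))
        remaining -= step
        # concurrent decode-phase requests each emit one token between chunks
        timeline.extend(("decode", 1) for _ in range(decode_reqs))
    return timeline
```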


Environment Variables

All paths are configurable via environment variables with sensible defaults. No hard-coded user paths in source code.

| Variable | Default | Description |
| --- | --- | --- |
| MODEL_PATH | ~/models/Qwen2.5-Coder-32B-Instruct-GPTQ-W4A16 | Local path to quantized model weights; falls back to HuggingFace Qwen/Qwen2.5-Coder-32B-Instruct-GPTQ-Int4 if not found |
| VLLM_CONFIG | configs/vllm_config.yaml | Path to vLLM YAML configuration file |
| HF_CACHE | ~/.cache/huggingface | HuggingFace cache directory (model downloads) |
| QUANT_OUTPUT | ~/models/Qwen2.5-Coder-32B-Instruct-GPTQ-W4A16 | Output directory for self-quantized model |

# Example: override model path for a different mount point
MODEL_PATH=/mnt/nvme/models/Qwen2.5-Coder-32B-GPTQ bash scripts/launch.sh --gpus 2,3,6,7

# Example: use a shared NFS cache for model downloads
HF_CACHE=/shared/hf_cache python3 src/fast_quant_download.py --threads 8

Configuration Reference

All vLLM parameters live in configs/vllm_config.yaml (uses gptq_marlin quantization backend).

| Parameter | Default | Notes |
| --- | --- | --- |
| gpu_memory_utilization | 0.88 | 48 × 0.88 = 42.2 GB usable; increase to 0.92 for more KV cache |
| max_model_len | 32768 | Reduce to 16384 to double concurrent batch capacity |
| max_num_seqs | 8 | Max concurrent sequences per instance |
| kv_cache.dtype | int8 | Set to auto for FP16 KV (halves context capacity) |
| block_size | 16 | PagedAttention block size (16 is optimal for A6000) |
| chunked_prefill.max_num_batched_tokens | 8192 | Prefill chunk size; smaller chunks reduce HOL blocking but raise TTFT for long prompts |

GPU Selection

Default GPUs are 2,3,6,7 (PIX pairs on separate NUMA nodes). To use different GPUs:

python3 -m src.infra_server --gpus 0,1,4,5 --port 8000
# Check your topology first:
nvidia-smi topo -m
# Look for PIX (same PCIe switch) pairs on separate NUMA nodes

Observability

Gateway exposes Prometheus metrics at GET /metrics:

| Metric | Description |
| --- | --- |
| gateway_requests_total | Request count by endpoint/status |
| gateway_request_latency_seconds | E2E latency histogram (buckets: 0.1s–300s) |
| gateway_total_generation_throughput_tps | Aggregated token throughput |
| gateway_gpu_kv_cache_usage_pct | KV cache utilization per backend |
| gateway_active_requests | In-flight requests per backend |
| gateway_affinity_routing_total | Affinity routing hit/miss counters |
| gateway_backend_healthy | Per-backend health (1/0) |

Each vLLM instance also exposes native vLLM metrics (vllm:avg_generation_throughput_toks_per_s, vllm:time_per_output_token_seconds, vllm:gpu_cache_usage_perc, etc.).


API

GET /health

Backend status and per-GPU health.

POST /v1/audit

{
  "project_description": "A Flask REST API for user management...",
  "source_code": "from flask import Flask...",
  "language": "python",
  "max_tokens": 4096,
  "temperature": 0.3,
  "focus_areas": ["injection", "auth_bypass", "overflow"]
}
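
A stdlib-only client sketch for this endpoint; the field values mirror the example payload above, and no response schema beyond JSON is assumed:

```python
import json
import urllib.request

def build_audit_payload(project_description: str, source_code: str,
                        language: str = "python") -> dict:
    return {
        "project_description": project_description,
        "source_code": source_code,
        "language": language,
        "max_tokens": 4096,
        "temperature": 0.3,
        "focus_areas": ["injection", "auth_bypass", "overflow"],
    }

def audit(project_description: str, source_code: str,
          base_url: str = "http://127.0.0.1:8000") -> dict:
    req = urllib.request.Request(
        f"{base_url}/v1/audit",
        data=json.dumps(build_audit_payload(
            project_description, source_code)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.loads(resp.read())
```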

POST /v1/completion

{
  "prompt": "Analyze this code for vulnerabilities: ...",
  "system_prompt": "You are a security auditor.",
  "max_tokens": 2048,
  "temperature": 0.3
}

POST /v1/hyde/audit

{
  "code": "void parse_packet(char *raw, int len) { ... }",
  "language": "c",
  "context": "IoT packet parser"
}

POST /v1/hyde/audit/batch

{
  "samples": [
    {"name": "app.py", "code": "...", "language": "python"},
    {"name": "parser.c", "code": "...", "language": "c"}
  ],
  "max_concurrency": 4
}

Self-Quantize from Scratch

Only needed to reproduce the GPTQ pipeline with custom calibration. Takes 30-90 min on a single A6000. Output path defaults to $QUANT_OUTPUT (or ~/models/Qwen2.5-Coder-32B-Instruct-GPTQ-W4A16).

# Full pipeline
python3 src/quantization_lab.py --stage all --gpu 2

# With your own codebase (recommended)
python3 src/quantization_lab.py --stage all --gpu 2 \
    --calibration-source mixed \
    --local-code-dir /path/to/your/project \
    --local-ratio 0.5

# Via launch.sh (quantize + serve)
bash scripts/launch.sh --quantize-local --gpus 2,3,6,7
bash scripts/launch.sh --quantize-local \
    --calib-source mixed \
    --local-code-dir /path/to/project \
    --local-ratio 0.6 --gpus 2,3,6,7

| Calibration Source | Flag | Use Case |
| --- | --- | --- |
| CodeSearchNet (default) | --calib-source codesearchnet | General code audit |
| Local codebase only | --calib-source local --local-code-dir DIR | Domain-specific (e.g., IoT firmware) |
| Mixed (recommended) | --calib-source mixed --local-code-dir DIR --local-ratio 0.5 | Best of both |

Troubleshooting

# No healthy backends
tail -n 50 logs/vllm_gpu2.log
# Common: CUDA OOM → reduce gpu_memory_utilization in configs/vllm_config.yaml
#         Model not found → check MODEL_PATH env var, re-run fast_quant_download.py
#         Port conflict → lsof -i :8100 -i :8101 -i :8102 -i :8103

# Slow first request → normal (CUDA graph capture + KV cache alloc)

# Check which model path is being used
echo $MODEL_PATH  # should point to your GPTQ model directory
ls $MODEL_PATH     # should contain config.json, model.safetensors, etc.

Future Work

  • Speculative Decoding: Qwen2.5-Coder-0.5B as draft model for ~1.5× decode speedup
  • Guided Generation: Outlines integration for constrained exploit payload formatting
  • Tensor Parallelism: TP=2 across PIX pairs for 64K+ context support
