This project evaluates Python code generation across three LLMs — GPT-5.4, Claude Opus 4.6, and Gemini 2.5 Pro — using a multi-prompt iterative methodology. The core research question is whether models produce over-engineered solutions for tasks that could be solved with simpler code.
TestingSuite/
├── prompts/                          # 9 task prompts (3 domains × 3 iterations)
├── outputs/                          # Generated code (5 runs × 3 models × 9 prompts)
│   ├── run_1/                        # Each generation run is independent
│   │   ├── gpt/
│   │   │   └── *_solution.py
│   │   ├── claude/
│   │   └── gemini/
│   ├── run_2/
│   └── ...                           # NUM_RUNS total (default 5)
├── results/
│   ├── run_1/                        # Per-run results
│   │   ├── retry_log.json
│   │   ├── test_results.txt
│   │   ├── test_results.json
│   │   ├── analysis_results.txt
│   │   ├── analysis_results.json
│   │   └── requirement_scores.json
│   ├── run_2/
│   ├── aggregated_results.json       # Mean ± std across all runs
│   ├── requirement_coverage.json     # Coverage per model per script
│   └── normalised_results.json       # Complexity ÷ coverage
├── outputs_verify/                   # Verification batch (5 runs, same structure)
│   └── run_N/                        # One run per invocation
├── results_verify/
│   ├── run_N/                        # Per-run verification results
│   ├── aggregated_results.json       # Mean ± std across verify runs
│   └── comparison.json               # Original vs verify comparison
├── requirements/
│   └── requirements.json             # 45 formal requirements (9 prompts × 4–8 each)
├── scripts/
│   ├── generate_code.py              # Step 1: Generate + execute + retry (N runs)
│   ├── test_harness.py               # Step 2: Execute + measure performance
│   ├── analyze_results.py            # Step 3: Static analysis + aggregation
│   ├── score_requirements.py         # Step 4: Requirement coverage + normalised metrics
│   └── verify_run.py                 # Incremental verification runner
├── requirements.txt
└── README.md
pip install -r requirements.txt

Create a .env file in the project root with your API keys (see .env.example for the template):
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=AIza...
The .env file is git-ignored and loaded automatically by the scripts via python-dotenv.
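Loading the keys amounts to something like this minimal sketch (the environment variable names match the .env entries above; the exact loading code inside each script may differ):

```python
import os

from dotenv import load_dotenv  # provided by python-dotenv

# Read key=value pairs from the project-root .env into the process environment.
load_dotenv()

REQUIRED_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY")
missing = [name for name in REQUIRED_KEYS if not os.getenv(name)]
if missing:
    raise SystemExit(f"Missing API keys in .env: {', '.join(missing)}")
```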
The requirements/requirements.json file contains 45 formal requirements extracted from the 9 prompts (4–8 per prompt), used by score_requirements.py for automated requirement coverage evaluation. This file is checked into the repository — no setup needed.
All scripts use relative paths and must be run from the scripts/ directory:
cd scripts
python generate_code.py # Step 1 — generates code across N runs
python test_harness.py # Step 2 — tests all runs (3 perf runs per script)
python analyze_results.py # Step 3 — static analysis + aggregation
python score_requirements.py # Step 4 — requirement coverage scoring
python score_requirements.py --normalize # Step 4 + normalised complexity metrics
# Incremental verification (one run per invocation)
python verify_run.py # Run one verification iteration
python verify_run.py --compare # Run + compare against original results
python verify_run.py --compare-only # Just compare (no new run)
python verify_run.py --aggregate-only # Just re-aggregate verify runs
python verify_run.py --run-number N # Force specific run number

To account for LLM non-determinism, the pipeline runs the full generation process multiple times (default: 5 runs). Each run produces independent code from each model for each prompt. Results are reported as mean ± standard deviation across runs.
| Stage | Repetitions | Purpose |
|---|---|---|
| Code generation (generate_code.py) | 5 runs (configurable via NUM_RUNS) | Captures variance in model output across independent generations |
| Runtime/memory (test_harness.py) | 3 executions per script (configurable via PERF_RUNS) | Eliminates OS scheduling noise; reports median per run |
| Static analysis (analyze_results.py) | 1 per script | Deterministic — same code always yields the same metrics |
Final results in results/aggregated_results.json contain mean, std, and sample count for every metric across all runs.
Reads each prompt from prompts/, sends it to all three LLMs with a system instruction requesting raw Python code only (no markdown or explanations). Runs the full generation pipeline NUM_RUNS times (default 5), saving each run to outputs/run_{n}/.
After generating each script, it is immediately executed to check for errors. If execution fails (non-zero exit code), the model receives the error message and is given up to 3 retries to correct the issue. Each retry uses a multi-turn conversation so the model can see its previous code and the resulting error — testing its ability to diagnose and fix its own mistakes.
- Only the final passing version (or the last attempt if all retries fail) is saved as the `*_solution.py` file — failed intermediate attempts are not kept.
- A structured retry log is written to `results/run_{n}/retry_log.json` with per-attempt exit codes, error snippets, and pass/fail status.
The retry mechanism uses multi-turn conversations so each model can see its full history of attempts and errors. This tests contextual error correction, not just single-shot code generation.
Attempt 1 (initial):
    messages = [
        {role: "user", content: "<original prompt>"}
    ]
    → Model generates code → Execute it → FAILS

Attempt 2 (retry_1):
    messages = [
        {role: "user", content: "<original prompt>"},
        {role: "assistant", content: "<the code that failed>"},   ← model sees its own code
        {role: "user", content: "The code you provided produced
                                  the following error:\n\n
                                  <stderr up to 1500 chars>\n\n   ← model sees the actual
                                  Please provide the complete       stderr/traceback
                                  corrected Python code..."}
    ]
    → Model generates fixed code → Execute it → FAILS again

Attempt 3 (retry_2):
    messages = [
        {role: "user", content: "<original prompt>"},
        {role: "assistant", content: "<attempt 1 code>"},
        {role: "user", content: "<error from attempt 1>"},
        {role: "assistant", content: "<attempt 2 code>"},          ← full history preserved
        {role: "user", content: "<error from attempt 2>"},
    ]
    → Model can see ALL previous attempts and errors

Attempt 4 (retry_3 — final):
    Same pattern, with the full conversation history from all 3 prior attempts.
Scripts that pass on any attempt stop immediately — retries only happen on failure. The execution timeout is 60 seconds per attempt. All models share a uniform output token cap of 16,384 tokens (MAX_OUTPUT_TOKENS) to ensure no model is advantaged or truncated by differing generation limits.
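The loop below is a condensed sketch of that mechanism. `call_model` stands in for the per-provider API wrapper and is a hypothetical placeholder, not the actual function in generate_code.py:

```python
import subprocess
import sys
import tempfile

MAX_RETRIES = 3          # retries after the initial attempt
TIMEOUT_SECONDS = 60     # execution timeout per attempt
STDERR_LIMIT = 1500      # characters of stderr fed back to the model


def run_script(code: str):
    """Execute generated code as a standalone script; return (exit_code, stderr)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], capture_output=True,
                              text=True, timeout=TIMEOUT_SECONDS)
        return proc.returncode, proc.stderr
    except subprocess.TimeoutExpired:
        return 1, f"Timed out after {TIMEOUT_SECONDS}s"


def generate_with_retries(prompt: str, call_model):
    """Multi-turn retry loop: the model always sees its own failed code and errors."""
    messages = [{"role": "user", "content": prompt}]
    retry_log = []
    code = ""
    for attempt in range(MAX_RETRIES + 1):       # 1 initial attempt + up to 3 retries
        code = call_model(messages)              # hypothetical provider wrapper
        exit_code, stderr = run_script(code)
        retry_log.append({"attempt": attempt + 1, "exit_code": exit_code,
                          "error": stderr[:STDERR_LIMIT], "passed": exit_code == 0})
        if exit_code == 0:
            break                                # stop immediately on success
        # Append the failed code and the error so the full history is preserved.
        messages.append({"role": "assistant", "content": code})
        messages.append({"role": "user", "content":
                         "The code you provided produced the following error:\n\n"
                         f"{stderr[:STDERR_LIMIT]}\n\n"
                         "Please provide the complete corrected Python code..."})
    return code, retry_log
```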
Models used:
- `gpt-5.4-2026-03-05` (OpenAI)
- `claude-opus-4-6` (Anthropic)
- `gemini-2.5-pro` (Google)
Iterates over all outputs/run_*/ directories and executes each generated .py file as a standalone subprocess. Each script is run 3 times (PERF_RUNS) to eliminate OS scheduling noise, and the median runtime and peak memory are reported.
| Metric | What it measures | How |
|---|---|---|
| Runtime | Wall-clock execution time (median of 3) | time.time() around subprocess |
| Peak Memory | Maximum resident set size (median of 3) | psutil polling at 50ms intervals |
| Exit Code | Whether the script crashed (non-zero = failure) | subprocess.Popen.returncode |
Per-run results are written to results/run_{n}/test_results.txt and results/run_{n}/test_results.json.
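A simplified sketch of the measurement loop, assuming psutil is available (the real test_harness.py may organise this differently):

```python
import statistics
import subprocess
import sys
import time

import psutil

PERF_RUNS = 3
POLL_INTERVAL = 0.05  # 50 ms


def measure_once(script_path: str) -> dict:
    """Run one script as a subprocess, polling its RSS while it executes."""
    start = time.time()
    proc = subprocess.Popen([sys.executable, script_path],
                            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    ps = psutil.Process(proc.pid)
    peak_rss = 0
    while proc.poll() is None:
        try:
            peak_rss = max(peak_rss, ps.memory_info().rss)
        except psutil.NoSuchProcess:
            break
        time.sleep(POLL_INTERVAL)
    return {"runtime_s": time.time() - start,
            "peak_memory_mb": peak_rss / (1024 * 1024),
            "exit_code": proc.returncode}


def measure(script_path: str) -> dict:
    """Median of PERF_RUNS executions to smooth out OS scheduling noise."""
    runs = [measure_once(script_path) for _ in range(PERF_RUNS)]
    return {"runtime_s": statistics.median(r["runtime_s"] for r in runs),
            "peak_memory_mb": statistics.median(r["peak_memory_mb"] for r in runs),
            "exit_code": runs[-1]["exit_code"]}
```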
Iterates over all outputs/run_*/ directories and runs six analysis tools on every generated script. Per-run results are written to results/run_{n}/analysis_results.txt and .json.
After all runs are analysed, the script aggregates results across runs into results/aggregated_results.json, computing mean ± standard deviation for every metric per model per script.
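Conceptually, the aggregation reduces to collecting one value per run and computing mean and standard deviation; the nesting shown below is illustrative and may not match the real JSON schema:

```python
import json
import statistics
from pathlib import Path


def aggregate(results_dir: Path = Path("../results")) -> dict:
    """Collapse per-run metric values into mean/std/n per model per script."""
    per_run = [json.loads(p.read_text())
               for p in sorted(results_dir.glob("run_*/analysis_results.json"))]
    collected: dict = {}
    for run in per_run:
        for model, scripts in run.items():            # e.g. "gpt" -> {...}   (assumed layout)
            for script, metrics in scripts.items():   # e.g. "api_client_1" -> {...}
                for metric, value in metrics.items():
                    collected.setdefault(model, {}).setdefault(script, {}) \
                             .setdefault(metric, []).append(value)
    summary: dict = {}
    for model, scripts in collected.items():
        for script, metrics in scripts.items():
            for metric, values in metrics.items():
                summary.setdefault(model, {}).setdefault(script, {})[metric] = {
                    "mean": statistics.mean(values),
                    "std": statistics.stdev(values) if len(values) > 1 else 0.0,
                    "n": len(values),
                }
    return summary
```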
What it measures: The number of independent execution paths through the code. Every if, elif, for, while, except, and, or adds a path.
Output: A score and letter grade per function:
| Grade | CC Score | Meaning |
|---|---|---|
| A | 1–5 | Simple, low risk |
| B | 6–10 | Moderate complexity |
| C | 11–15 | Complex, higher risk |
| D | 16–20 | Too complex |
| E | 21–30 | Unstable |
| F | 31+ | Error-prone, untestable |
Why it matters for your research: A function with CC of 25 that could be written with CC of 5 is a direct signal of over-engineering — unnecessary branching and edge-case handling that wasn't asked for.
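For reference, per-function CC scores and letter grades can be obtained from radon's Python API roughly like this (a sketch; analyze_results.py may invoke radon differently, e.g. via its CLI):

```python
from radon.complexity import cc_rank, cc_visit

source = open("some_solution.py").read()   # hypothetical path to a generated script
for block in cc_visit(source):
    # block.complexity is the CC score; cc_rank() maps it onto the A–F scale above
    print(f"{block.name}: CC={block.complexity}, grade={cc_rank(block.complexity)}")
```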
What it measures: Line-level counts for each file:
| Metric | Meaning |
|---|---|
| LOC | Total lines (including blanks and comments) |
| LLOC | Logical lines of code |
| SLOC | Source lines (non-blank, non-comment) |
| Comments | Number of comment lines |
| Multi | Multi-line string lines |
| Blank | Blank lines |
Why it matters: High SLOC with low CC = lots of code that doesn't actually do much branching logic. This is the clearest indicator of over-engineering — the model generated verbose abstractions (extra classes, wrapper functions, excessive error handling) for something simple.
What it measures: Cognitive complexity based on the operators and operands in the code:
| Metric | Formula / Meaning |
|---|---|
| Vocabulary | Distinct operators + distinct operands |
| Length | Total operators + total operands |
| Volume | Length × log2(Vocabulary) — "size" of the algorithm |
| Difficulty | (distinct operators / 2) × (total operands / distinct operands) |
| Effort | Difficulty × Volume — total mental effort to understand |
| Time | Effort / 18 — estimated seconds to understand the code |
| Bugs | Volume / 3000 — estimated number of delivered bugs |
Why it matters: Two scripts can have the same CC but very different Halstead effort. If a model uses 50 unique operators where 15 would suffice, the effort to understand that code is measurably higher — even if it technically works. The "estimated bugs" metric is also useful: more complexity = more places for bugs.
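A worked example of the formulas above, using hypothetical counts rather than measured values:

```python
import math

# Hypothetical operator/operand counts for one script
n1, n2 = 20, 35       # distinct operators, distinct operands
N1, N2 = 110, 140     # total operators, total operands

vocabulary = n1 + n2                         # 55
length = N1 + N2                             # 250
volume = length * math.log2(vocabulary)      # ≈ 1,445
difficulty = (n1 / 2) * (N2 / n2)            # = 40
effort = difficulty * volume                 # ≈ 57,800
time_s = effort / 18                         # ≈ 3,212 seconds to understand
bugs = volume / 3000                         # ≈ 0.48 estimated delivered bugs
print(volume, difficulty, effort, time_s, bugs)
```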
What it measures: A single composite score (0–100) combining Cyclomatic Complexity, Halstead Volume, and Lines of Code:
| Grade | Score | Meaning |
|---|---|---|
| A | 20–100 | Very maintainable |
| B | 10–19 | Moderately maintainable |
| C | 0–9 | Difficult to maintain |
Why it matters: This is your best single-number comparison metric across models. If GPT produces MI=75 and Claude produces MI=45 for the same task, that's a quantifiable difference in maintainability regardless of whether both solutions are functionally correct.
What it measures: Code quality and PEP 8 conformance on a scale of 0.00 to 10.00. Checks for:
- Convention violations (naming, spacing)
- Refactoring opportunities (duplicated code, too many arguments)
- Warnings (unused variables, unreachable code)
- Errors (undefined variables, wrong types)
Missing docstring warnings (C0114, C0115, C0116) are disabled since generated code rarely includes docstrings and these would dominate the results.
Why it matters: Measures readability and adherence to Python conventions. A working script with a pylint score of 3.0 is harder to maintain than one scoring 8.5, even if both pass tests.
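One way to collect the per-file score is to run pylint as a subprocess with the docstring checks disabled and parse the rating line from its text report — a sketch under that assumption (the actual invocation in analyze_results.py may differ):

```python
import re
import subprocess
import sys


def pylint_score(path: str) -> float:
    """Run pylint with docstring checks (C0114/C0115/C0116) disabled; return the 0–10 score."""
    proc = subprocess.run(
        [sys.executable, "-m", "pylint", "--disable=C0114,C0115,C0116", path],
        capture_output=True, text=True)
    match = re.search(r"rated at (-?[\d.]+)/10", proc.stdout)
    return float(match.group(1)) if match else 0.0
```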
What it measures: Python-specific security vulnerabilities:
- Hard-coded passwords and secrets
- Use of `eval()`, `exec()`, `pickle` (code injection risk)
- Insecure `subprocess` usage (shell injection)
- Weak cryptographic choices (MD5, SHA1 for hashing passwords)
- SQL injection patterns
- Path traversal vulnerabilities
Each finding is classified by severity (LOW / MEDIUM / HIGH) and confidence level.
Why it matters: Directly maps to your paper's security evaluation criterion. Models that generate code using os.system() or hard-coded keys produce functionally correct but insecure code.
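Severity counts can be pulled from bandit's JSON report using standard CLI flags — a sketch, not necessarily the exact invocation used here:

```python
import json
import subprocess
from collections import Counter


def bandit_severities(path: str) -> Counter:
    """Run bandit on one file and tally findings by severity (LOW/MEDIUM/HIGH)."""
    proc = subprocess.run(["bandit", "-f", "json", "-q", path],
                          capture_output=True, text=True)
    report = json.loads(proc.stdout or "{}")
    return Counter(issue["issue_severity"] for issue in report.get("results", []))
```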
Addresses the distinction between structural complexity and over-engineering. Raw metrics (SLOC, CC, Halstead effort) measure how complex the code is, but not whether that complexity is justified by the requirements. This script adds the missing link: a formal Requirement Coverage Score (RCS) that measures how many explicit prompt requirements each solution fulfils.
How it works:
- Loads `requirements/requirements.json`, which contains 45 formal requirements extracted from the 9 prompts (4–8 per prompt). Each requirement is traced back to a specific sentence in the prompt text.
- For each generated solution, runs 45 fully automated checks that verify structural properties via AST analysis and source pattern matching — function/class existence, parameter names, import usage, algorithmic patterns, error handling structures — without executing the code. No manual or subjective scoring is used, eliminating evaluator bias (see the sketch after this list).
- Computes Requirement Coverage Score: `requirements passed / total requirements` per model per script.
- When `--normalize` is used, divides raw complexity metrics by coverage to produce normalised metrics — complexity per unit of requirement fulfilment.
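A minimal sketch of one such check — confirming that a required function exists and accepts a required parameter — using only the standard-library ast module (requirement IDs, file names, and the check structure in score_requirements.py may differ):

```python
import ast


def has_function_with_param(source: str, func_name: str, param: str) -> bool:
    """Structural check only: the code is parsed, never executed."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == func_name:
            names = [a.arg for a in node.args.args + node.args.kwonlyargs]
            return param in names
    return False


# Example: the api_client prompt requires fetch_all_stations(urls)
source = open("../outputs/run_1/gpt/api_client_1_solution.py").read()  # hypothetical filename
print(has_function_with_param(source, "fetch_all_stations", "urls"))
```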
Key output — Normalised Complexity:
| Metric | Formula | What it reveals |
|---|---|---|
| SLOC / Coverage | Raw SLOC ÷ RCS | Lines of code per requirement fulfilled — separates verbosity from thoroughness |
| CC / Coverage | Raw CC ÷ RCS | Branching complexity per requirement — separates justified branches from unnecessary ones |
| Halstead Effort / Coverage | Raw Effort ÷ RCS | Cognitive effort per requirement — the most direct over-engineering measure |
A model that scores high on raw complexity but also high on coverage may not be over-engineered — its complexity is justified. A model with high complexity and low coverage is genuinely over-engineered (adding complexity beyond what the requirements demand — known in SE literature as "gold plating").
Usage:
cd scripts
# Run automated scoring on all runs
python score_requirements.py
# Run + compute normalised complexity metrics
python score_requirements.py --normalize
# Score a specific run only
python score_requirements.py --run run_1

All 45 requirements are evaluated via automated AST and source pattern analysis. No manual scoring is used — the evaluation is fully deterministic and reproducible, eliminating evaluator bias. Check types include function/class existence, parameter verification, import detection, algorithmic pattern matching (e.g., bitmask DP vs brute-force), and behavioural pattern detection (e.g., wrong-passphrase testing, corruption simulation).
Orchestrates the full pipeline (generate → test → analyze) one run at a time into separate outputs_verify/ and results_verify/ directories. This supports reproducibility validation: accumulate verification runs incrementally (e.g. twice daily over several days), then compare against the original 5-run batch to demonstrate result consistency in the FYP paper.
Each invocation auto-detects the next run number (e.g. run_1, run_2, ...) and calls the existing pipeline scripts with overridden output paths. After each run, all existing verify runs are re-aggregated into results_verify/aggregated_results.json.
The existing scripts (generate_code.py, test_harness.py, analyze_results.py) accept optional path parameters but retain their original defaults, so standalone usage is unchanged.
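The run-number auto-detection amounts to something like the following sketch (paths follow the directory layout above; the actual implementation may differ):

```python
import re
from pathlib import Path


def next_run_number(verify_dir: Path = Path("../outputs_verify")) -> int:
    """Return 1 + the highest existing run_N directory number (1 if none exist)."""
    numbers = [int(m.group(1))
               for p in verify_dir.glob("run_*")
               if (m := re.fullmatch(r"run_(\d+)", p.name))]
    return max(numbers, default=0) + 1
```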
All commands must be run from the scripts/ directory:
cd scripts

1. Run one verification iteration (most common — use this to accumulate data):
python verify_run.py

Auto-detects the next run number (run_1 on first use, run_2 next time, etc.). Executes the full pipeline: generate → test → analyze → aggregate across all existing verify runs.
2. Run + compare against original results:
python verify_run.py --compare

Same as above, but after finishing prints a side-by-side table showing original vs verification metrics with % differences for every model and script. Saves results_verify/comparison.json.
3. Compare without running anything new:
python verify_run.py --compare-only

Loads existing results/aggregated_results.json and results_verify/aggregated_results.json, prints the comparison table, and saves results_verify/comparison.json. No code is generated, tested, or analysed.
4. Re-aggregate existing verify runs:
python verify_run.py --aggregate-only

Recomputes results_verify/aggregated_results.json from all existing per-run JSON files. Useful if you manually deleted a bad run directory and want to recalculate statistics.
5. Force a specific run number:
python verify_run.py --run-number N

Overrides auto-detection and uses run number N. Useful to re-do a specific run (e.g. if run_3 had an API outage). Can be combined with --compare:
python verify_run.py --run-number 3 --compare

A typical accumulation schedule:

# Day 1 — morning & evening
python verify_run.py # creates outputs_verify/run_1/
python verify_run.py # creates outputs_verify/run_2/
# Day 2 — continue accumulating
python verify_run.py # creates run_3
python verify_run.py # creates run_4
# ... repeat until enough data collected ...
# Final check — compare all verify runs against original 5-run batch
python verify_run.py --compare-only

The --compare and --compare-only modes run Welch's t-test (two-sided, unequal variance) on the per-run raw values from both batches. For each model × script × metric combination, the output includes:
| Column | Meaning |
|---|---|
| Orig Mean | Mean across original runs |
| Ver Mean | Mean across verification runs |
| Diff % | Percentage difference between means |
| p-value | Welch's t-test p-value. p > 0.05 = no significant difference |
| Cohen d | Effect size (how large the difference is in practice) |
| Effect | Plain-language interpretation: negligible / small / medium / large |
Cohen's d interpretation:
| d | Effect Size | Meaning |
|---|---|---|
| < 0.2 | Negligible | No meaningful difference |
| 0.2–0.5 | Small | Minor difference |
| 0.5–0.8 | Medium | Moderate difference |
| > 0.8 | Large | Substantial difference |
A summary line reports how many metrics showed significant differences (p < 0.05). If this count is 0 or within the expected false-positive rate (~5%), results are considered reproducible. The full comparison data (including t-statistics, p-values, and effect sizes) is saved to results_verify/comparison.json.
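For a single model × script × metric combination, the comparison boils down to something like this sketch (the per-run values below are placeholders; SciPy provides Welch's t-test):

```python
import statistics

from scipy import stats


def compare_metric(original: list[float], verify: list[float]) -> dict:
    """Welch's t-test (two-sided, unequal variance) plus Cohen's d with pooled std."""
    t_stat, p_value = stats.ttest_ind(original, verify, equal_var=False)
    pooled_std = ((statistics.variance(original) + statistics.variance(verify)) / 2) ** 0.5
    d = abs(statistics.mean(original) - statistics.mean(verify)) / pooled_std if pooled_std else 0.0
    effect = ("negligible" if d < 0.2 else "small" if d < 0.5
              else "medium" if d < 0.8 else "large")
    return {"t": float(t_stat), "p_value": float(p_value), "cohen_d": d, "effect": effect}


# e.g. SLOC for one model × script across the 5 original and 5 verification runs
print(compare_metric([182, 179, 185, 181, 180], [183, 178, 184, 182, 179]))
```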
| Criterion (Paper §3.2) | Tool | Key Output |
|---|---|---|
| Correctness | test_harness.py | Exit code (0 = pass), pass rate across runs |
| Resilience | generate_code.py | Retry attempts, error correction rate (retry_log.json) |
| Performance | test_harness.py | Runtime (seconds), Peak Memory (MB) — median of 3 |
| Requirement Coverage | score_requirements.py | RCS (0–1), 45 automated checks against formal requirements |
| Structural Complexity | radon cc | Average cyclomatic complexity per function |
| Structural Complexity | radon raw | SLOC, LOC, function/class count |
| Structural Complexity | radon hal | Halstead effort, difficulty, est. bugs |
| Structural Complexity | radon mi | Maintainability Index (0–100) |
| Over-engineering | score_requirements.py | Normalised metrics: SLOC/Coverage, CC/Coverage, Effort/Coverage |
| Readability | pylint | Score (0–10), issue counts by type |
| Security | bandit | Findings by severity (LOW/MEDIUM/HIGH) |
Raw complexity metrics (CC, SLOC, Halstead) measure structural complexity — how complex the code is. The over-engineering determination requires normalising these by requirement coverage — how much of that complexity is justified by prompt requirements. A model with high complexity and high coverage is thorough; a model with high complexity and low coverage is over-engineered.
All metrics are reported as mean ± std across runs in results/aggregated_results.json. Requirement coverage is reported in results/requirement_coverage.json and normalised metrics in results/normalised_results.json.
Each prompt explicitly specifies Python, includes concrete function signatures, and requires a main() function so every generated script is directly runnable and comparable across models.
Each task domain has 3 iterations that mirror a real developer workflow:
- Iteration 1 — Build the base functionality (correctness focus)
- Iteration 2 — Handle errors, edge cases, constraints (robustness focus). Instructs the model to build upon the previous iteration.
- Iteration 3 — Scale up and optimize (performance focus). Instructs the model to build upon the previous iteration.
| Domain | Iteration 1 | Iteration 2 | Iteration 3 |
|---|---|---|---|
| api_client | Fetch readings from weather station endpoints → `fetch_all_stations(urls) -> list[dict]` | Add retry logic, rate limiting, error reporting → returns `{"readings": [...], "errors": [...]}` | Concurrent fetching with `max_concurrent` parameter |
| dynamic_programming | TSP-style drone routing via cost matrix → `find_optimal_route(cost_matrix) -> (route, energy)` | Add time-window constraints per station → returns feasibility, arrival times | Optimise for up to 20 stations within reasonable time |
| secure_storage | Encrypt/decrypt files with passphrase → `store_file()` / `retrieve_file()` | Add integrity checks, wrong-passphrase detection, corruption handling | `SecureStorage` class with multi-user, multi-file support at scale |
All results below are aggregated across 5 independent generation runs (mean ± std) to account for LLM non-determinism. Each run generates 27 scripts (9 prompts × 3 models), tested 3 times each for performance, and analysed with all 6 static analysis tools. All models used a uniform output token cap of 16,384 tokens.
| Model | Pass Rate | Avg Attempts per Script | Total Retries (across 5 runs) |
|---|---|---|---|
| GPT-5.4 | 100% | 1.09 | 4 |
| Claude Opus 4.6 | 100% | 1.09 | 4 |
| Gemini 2.5 Pro | 100% | 1.22 | 10 |
- All three models achieved 100% pass rates across all 5 runs. Every script ultimately ran successfully (exit code 0), though some required retries.
- GPT and Claude are nearly identical in reliability — both averaging 1.09 attempts per script, with just 4 total retries across 45 scripts each.
- Gemini required the most error correction — 10 total retries across 5 runs (2.5× more than GPT or Claude). This suggests Gemini's initial code generation is less robust, particularly for complex prompts, though it always recovers within the retry budget.
- The retry mechanism itself is a useful research signal: models that need retries are generating code they cannot verify internally before outputting.
To distinguish structural complexity from over-engineering, 45 formal requirements were extracted from the 9 prompts and evaluated via fully automated AST and source pattern analysis. Each requirement traces back to a specific sentence in the prompt text. All checks are deterministic and reproducible — no manual or subjective scoring was used, eliminating evaluator bias.
| Model | Avg Coverage | Failures (of 225 checks) | Failure Location |
|---|---|---|---|
| Claude Opus 4.6 | 99.6% | 1 (0.4%) | SS3-R5: multi-user timing pattern in 1 run |
| GPT-5.4 | 99.6% | 1 (0.4%) | DP2-R3: omitted speed=1 default in 1 run |
| Gemini 2.5 Pro | 97.4% | 5 (2.2%) | All in dynamic programming (iterations 2–3) |
- Claude and GPT are tied at 99.6% requirement coverage — near-perfect fulfilment of all prompt specifications across all 5 runs. Both models consistently implement every function signature, parameter, return structure, error handling pattern, and demonstration scenario specified in the prompts.
- Gemini drops to 97.4%, with all failures concentrated in the dynamic programming domain. In Run 1, Gemini's DP iteration 3 failed all 4 requirements: it did not preserve `time_windows`/`inspection_time` parameters from iteration 2, did not use a scalable algorithm, did not generate the requested 15-station test case, and did not report elapsed time. This means Gemini's lower SLOC is partially explained by skipping requirements, not just writing more efficient code.
- The per-script breakdown reveals that all three models achieve 100% coverage on 7 of 9 prompts. Differences only emerge on the hardest prompts (DP iterations 2–3 and SS iteration 3), where requirement complexity tests whether models carry forward constraints from earlier iterations.
Note: The raw SLOC figures below measure structural size. See the Over-Engineering Verdict section for SLOC normalised by requirement coverage, which separates justified verbosity from over-engineering.
| Model | Total SLOC (mean) | Avg SLOC Std per Script |
|---|---|---|
| Claude Opus 4.6 | 1,154 | 11.5 |
| GPT-5.4 | 1,050 | 19.1 |
| Gemini 2.5 Pro | 834 | 18.1 |
- Claude consistently produces the most code — 38% more than Gemini and 10% more than GPT for identical prompts. This is the strongest signal for over-engineering: more code to achieve the same functional outcome.
- Claude's output is the most consistent across runs (lowest std of 11.5 per script), meaning its verbosity is systematic, not random. GPT shows the most variation (std 19.1), suggesting it sometimes generates compact solutions and sometimes more elaborate ones.
- Gemini is the most concise, producing the least code overall.
| Model | Avg CC per Script | Avg Std |
|---|---|---|
| Claude Opus 4.6 | 8.26 | 1.52 |
| Gemini 2.5 Pro | 5.92 | 1.71 |
| GPT-5.4 | 5.43 | 1.18 |
- Claude has the highest branching complexity — averaging 8.26 per function, which falls in the "moderate complexity" range (B grade). This aligns with the SLOC finding: Claude is not just writing more lines, it is adding more decision paths (if/else, error handling, edge cases).
- GPT produces the simplest control flow at 5.43 — well within the "simple" range (A grade). Combined with its moderate SLOC, this suggests GPT tends toward straightforward, linear code rather than deeply nested branching.
- Claude's dynamic_programming iteration 3 showed the highest individual CC across all models (mean 22.27 ± 6.63), placing it in the "unstable" E range — more than double GPT's 9.35 for the same prompt.
| Model | Avg Pylint Score |
|---|---|
| GPT-5.4 | 9.37 |
| Gemini 2.5 Pro | 8.40 |
| Claude Opus 4.6 | 7.29 |
- GPT produces the cleanest, most PEP-8-compliant code by a wide margin (9.37 vs Claude's 7.29). GPT's scores are consistently high across all scripts with low variance.
- Claude scores lowest in code quality despite writing the most code. The additional code Claude generates introduces more style violations, naming convention issues, and refactoring opportunities flagged by pylint.
- This creates an interesting tension: Claude writes more code (potential over-engineering) that is also less clean. A human developer would likely need to refactor Claude's output more than GPT's.
| Model | Avg MI |
|---|---|
| Claude Opus 4.6 | 58.45 |
| Gemini 2.5 Pro | 55.57 |
| GPT-5.4 | 36.78 |
- Claude has the highest maintainability — this is the paradox of the results. Claude writes the most code with the highest complexity, yet scores best on maintainability. This is because the MI formula rewards modularity: many small functions, comments, and lower complexity per function. Claude's approach of breaking tasks into many small pieces inflates the MI even though total code volume is high.
- Gemini is close behind at 55.57, benefiting from its concise code and moderate modularity.
- GPT has the lowest maintainability (36.78) despite the highest pylint score. GPT tends to pack logic into fewer, denser functions. Each function is well-formatted (high pylint) but individually harder to maintain (lower MI).
- This distinction is important for the over-engineering argument: Claude's code is easier to modify in isolation (high MI) but requires reading more total code. GPT's code is compact but denser to understand per function.
| Model | Total Effort | Avg Effort per Script |
|---|---|---|
| Claude Opus 4.6 | 57,780 | 6,420 |
| GPT-5.4 | 47,285 | 5,254 |
| Gemini 2.5 Pro | 32,664 | 3,629 |
- Claude requires the most cognitive effort to understand (57,780 total) — 22% more than GPT and 77% more than Gemini. This is the clearest quantitative measure of over-engineering: Claude's code demands significantly more mental effort to comprehend for the same set of tasks.
- Gemini requires the least mental effort (32,664 total) — 43% less than Claude. This aligns with its minimal code size and suggests Gemini produces the most straightforward implementations.
- Claude's dynamic_programming iteration 3 alone averages 21,013 Halstead effort — more than Gemini's entire secure_storage domain combined.
| Model | Total Findings (mean) |
|---|---|
| Gemini 2.5 Pro | 11.2 |
| Claude Opus 4.6 | 14.8 |
| GPT-5.4 | 15.2 |
- Gemini produces the most secure code with the fewest Bandit findings (11.2 total). Its concise implementations have less surface area for security issues.
- GPT has the most security findings (15.2), with dynamic_programming iteration 3 alone averaging 5.6 findings per run — the highest single-script security count in the dataset.
- Claude and GPT are close (14.8 vs 15.2). Claude's secure_storage iteration 2 is a notable outlier at 5.4 findings — the most in the secure_storage domain across all models.
- The secure_storage domain produces the most security findings across all models, which is expected — encryption code inherently triggers Bandit rules around cryptographic operations (e.g., use of `os.urandom`, `hashlib`, `subprocess` for key derivation).
API Client domain:
- Claude's api_client iteration 2 shows the longest runtimes (mean 18.50s) — it tends to implement real HTTP requests with retry logic that actually executes against live endpoints, leading to variable runtimes. Gemini's iteration 2 is even more extreme at 20.87s mean.
- GPT's api_client implementations are more conservative with simulated/mocked data, resulting in faster execution (3.08s mean for iteration 2).
- Gemini's api_client SLOC has the highest variance (std 35.0 for iteration 2), indicating highly inconsistent approaches across runs.
Dynamic Programming domain:
- All three models produce the most complex code here (highest CC across the board), as expected for algorithmically demanding TSP-style problems.
- Claude's DP iteration 3 reaches CC of 22.27 ± 6.63 — the highest single-script complexity in the entire dataset — suggesting it adds extensive optimisation heuristics beyond what was requested. The high std (6.63) also means Claude's approach varies dramatically between runs.
- Gemini's DP iteration 3 has the highest SLOC variance in the entire dataset (std 44.9), meaning it sometimes generates compact algorithms (~35 lines) and sometimes elaborate multi-heuristic solutions (~125 lines).
- GPT's DP iteration 3 is notable for generating the most Bandit findings of any script (5.6 mean), despite not being a security-focused domain.
Secure Storage domain:
- All models are 100% reliable here (no failures across any run), but this domain produces the most Bandit findings across all models, which is inherent to cryptographic code.
- Claude's secure_storage code is the most consistent (low SLOC std of 4.2 for iteration 1) — it has a reliable template for encryption/decryption that it applies each run.
- Claude's secure_storage iteration 2 generates the most security findings (5.4 mean) — its additional abstractions (wrapper classes, key management utilities) introduce more flagged patterns.
Raw complexity metrics alone cannot determine over-engineering — a model that writes more code may simply be implementing more of the prompt's requirements. To address this, all complexity metrics are normalised by Requirement Coverage Score (RCS): complexity per unit of requirement fulfilment. This separates justified complexity (implementing what was asked) from genuine over-engineering (adding complexity beyond what was asked — known in software engineering literature as "gold plating").
All 45 requirements were evaluated via fully automated AST and source pattern analysis, eliminating evaluator bias. The evaluation is deterministic and reproducible.
Normalised complexity metrics:
| Model | Coverage | SLOC / Coverage | CC / Coverage | Halstead Effort / Coverage |
|---|---|---|---|---|
| Claude Opus 4.6 | 99.6% | 129.0 | 8.28 | 6,443 |
| GPT-5.4 | 99.6% | 117.2 | 5.48 | 5,294 |
| Gemini 2.5 Pro | 97.4% | 95.3 | 6.31 | 3,997 |
Claude Opus 4.6 is the most over-engineered, even after normalisation. Claude and GPT have identical requirement coverage (99.6%), so the complexity gap between them is pure over-engineering — Claude writes 10% more code per requirement than GPT, with 51% more branching complexity and 22% more Halstead effort per requirement, for no additional functional benefit.
Claude's modular style (high MI of 58.45) means the extra complexity takes the form of unnecessary abstractions: additional classes, wrapper functions, and defensive error handling beyond what the prompts specify. Claude's DP iteration 3 exemplifies this: CC of 22.27 with Halstead effort of 21,013 for a problem the other models solve with far less complexity, despite all three achieving the same functional outcome.
GPT-5.4 is the most balanced. It achieves the same 99.6% coverage as Claude while writing 10% less code and requiring 18% less cognitive effort. GPT's approach is fewer but denser functions, resulting in the lowest maintainability (36.78) despite the best code quality (pylint 9.37). This is a trade-off rather than over-engineering: GPT's code is compact but individually harder to maintain per function.
Gemini 2.5 Pro is the most efficient but least thorough. Gemini produces the least complexity per requirement (SLOC/Coverage of 95.3 vs Claude's 129.0), confirming it is genuinely concise rather than merely incomplete. However, its 97.4% coverage reveals that some of its efficiency comes at the cost of requirement fulfilment — it skips constraints on the hardest prompts (DP iterations 2–3), particularly failing to carry forward parameters from earlier iterations. Gemini also has the lowest initial reliability (10 retries vs 4), suggesting its minimalism and correctness exist in tension.
Key finding: Normalising for requirement coverage reduces the SLOC gap between Claude and Gemini from 38% (raw) to 35% (normalised). The gap narrows only slightly because Claude and Gemini's coverage difference is small (2.2%). This confirms that the majority of Claude's additional complexity is genuine over-engineering rather than more thorough requirement fulfilment.
To validate that the above results are not artefacts of a single testing session, an independent verification batch of 5 runs was conducted across separate sessions using verify_run.py. This produced a second complete dataset in outputs_verify/ and results_verify/, which was compared against the original 5-run batch using Welch's t-test (two-sided, unequal variance) and Cohen's d (pooled standard deviation).
Summary of verification runs:
| Run | GPT Retries | Claude Issues | Gemini Retries | Notes |
|---|---|---|---|---|
| 1 | 0 | 0 | 0 | All 27 scripts passed first attempt |
| 2 | 0 | 1 (api_client_2 timeout) | 1 (secure_storage_2) | 2 retries total |
| 3 | 0 | 0 | 0 | All 27 scripts passed first attempt |
| 4 | 1 (dp_iter_2) | 1 API 500 error (dp_iter_1 skipped) | 0 | Anthropic API outage lost 1 script |
| 5 | 1 (secure_storage_1) | 0 | 0 | 1 retry total |
Statistical comparison results:
| Statistic | Value |
|---|---|
| Total tests conducted | 351 (3 models × 9 scripts × 13 metrics) |
| Significant at p < 0.05 | 12 (3.4%) |
| Expected false positives at 5% | ~17.5 |
| Metrics with negligible effect size (d < 0.2) | Majority |
12 out of 351 metrics showed statistically significant differences (p < 0.05) — 3.4%. This is below the expected 5% false-positive rate for this many independent tests, meaning the number of "significant" results is consistent with random chance alone. No systematic differences were detected between the original and verification batches.
The few metrics that showed large Cohen's d values (e.g., memory usage on individual scripts, some bandit findings) all had p-values above 0.05 — the effect sizes were large in magnitude but not statistically significant given the sample variance, indicating run-to-run variability rather than genuine batch-level differences.
Conclusion: The Welch's t-test comparison confirms that the reported results are reproducible. The original 5-run batch and the verification 5-run batch produced statistically indistinguishable outcomes across all models, scripts, and metrics. Full comparison data (t-statistics, p-values, effect sizes) is saved in results_verify/comparison.json.