Long-term roadmap for proving in Lean 4 that the Aiur compiler in Ix/Aiur/** translates source programs to bytecode without changing their observable behavior, where "observable" includes the IOBuffer.
1. The Theorem
1.1 What "verified compiler" means
The end goal is one theorem combining progress (every well-formed program compiles successfully) and preservation (whenever compilation succeeds, the output matches source semantics):
theorem Toplevel.compile_correct
(t : Toplevel) (hwf : WellFormed t) :
-- (a) Progress: compilation succeeds.
(∃ ct decls, t.mkDecls = .ok decls ∧ t.compile = .ok ct)
∧
-- (b) Preservation: any compilation output is semantically equivalent.
(∀ ct decls,
t.mkDecls = .ok decls →
t.compile = .ok ct →
∀ name funIdx (hname : ct.getFuncIdx name = some funIdx)
args io₀ fuel,
InterpResultEq decls ct.getFuncIdx
(Concrete.Eval.runFunction decls ⟨name⟩ args io₀ fuel)
(Bytecode.Eval.runFunction ct.bytecode funIdx
(Flatten.args decls ct.getFuncIdx args)
io₀ fuel))
where Flatten.args is the natural list-to-flat coercion, defined in Semantics/Flatten.lean alongside flattenValue:
def Flatten.args (decls : Decls) (funcIdx : Global → Option Nat)
(args : List Value) : Array G :=
args.foldl (fun acc v => acc ++ flattenValue decls funcIdx v) #[]
The two conjuncts are independent statements and both are load-bearing:
- (a) alone is vacuous for meaning. A compiler that produces a no-op
Bytecode.Toplevel satisfies (a) trivially.
- (b) alone is vacuous for usefulness. A compiler that always returns
.error satisfies (b) vacuously (the premise t.compile = .ok ct is never met).
Only the conjunction guarantees that well-formed source produces a compilation output and that output is behaviorally equivalent to the source.
(Because compile and mkDecls are deterministic functions, (a) and (b) together are logically equivalent to the single-quantifier form ∃ ct decls, ... ∧ InterpResultEq ..., but stating them as a conjunction makes the progress claim textually visible instead of hidden inside the existential.)
1.2 The two halves, named for the proof structure
For §5–§6 and §8 it's useful to name the halves. Conjunct (b) of §1.1 is Toplevel.compile_preservation; conjunct (a) is Toplevel.compile_progress. They are proved separately — preservation by simultaneous induction with the simulation invariant (§5.2), progress by composing per-pass lemmas (§6.1). compile_correct is compile_progress followed by compile_preservation applied to the witness from progress.
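The composition described here can be sketched in Lean; `Progress t` and `Preservation t` are hypothetical abbreviations for conjuncts (a) and (b) of §1.1, and the proof bodies are elided:

```lean
-- Sketch only: compile_correct as progress followed by preservation.
-- `Progress` / `Preservation` abbreviate conjuncts (a) and (b) of §1.1;
-- the real statement inlines them as in §1.1.
theorem Toplevel.compile_correct (t : Toplevel) (hwf : WellFormed t) :
    Progress t ∧ Preservation t := by
  have hprog : Progress t := Toplevel.compile_progress t hwf
  have hpres : Preservation t := Toplevel.compile_preservation t hwf
  exact ⟨hprog, hpres⟩
```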
1.3 WellFormed — the precondition for progress
def WellFormed (t : Toplevel) : Prop :=
(∃ decls, t.mkDecls = .ok decls) ∧
(∃ typedDecls, t.checkAndSimplify = .ok typedDecls) ∧
MonoTerminates t ∧
ConstrainedCallGraphAcyclic t
No ghost predicates. Every conjunct is a computable observation about t. A user proves WellFormed t by literally running mkDecls, checkAndSimplify, the monomorphization-termination check, and the call-graph acyclicity check, and observing that all four succeed. decide discharges it for any concrete program.
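For a concrete program the obligation is a closed decidable proposition; a hedged sketch, where `myToplevel` is a hypothetical program and the `Decidable` instances for the four conjuncts are assumed to be derived from the checks themselves:

```lean
-- Sketch: WellFormed for a concrete Toplevel is discharged by running
-- the four checks. `decide` (or `native_decide` for large programs)
-- evaluates mkDecls, checkAndSimplify, and the two static checks.
example : WellFormed myToplevel := by decide
```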
Each conjunct corresponds to exactly one range of passes:
| Passes | Success condition | WellFormed clause |
| --- | --- | --- |
| mkDecls, wellFormedDecls | no alias cycles, no duplicates, every type reference resolves | first ∃ |
| checkFunction ×2, simplifyTerm | every body well-typed, no non-exhaustive match | checkAndSimplify = .ok |
| concretize | no polymorphic recursion | MonoTerminates |
| toBytecode | no unresolved mvars; layouts computable | implied by concretize |
| deduplicate, needsCircuit | total | — |
| (simulation precondition) | constrained call graph is a DAG | ConstrainedCallGraphAcyclic |
Because checkAndSimplify is the bridge between Term and TypedTerm, its output is what downstream passes consume. The preservation proof relies on the type annotations in that output being semantically coherent (each annotated type agrees with what the term actually evaluates to). Proving that coherence — checker soundness — is M9.5. It is not optional: without it, the downstream preservation proofs (especially Lower, which consults typSize on the annotations) rest on an unjustified assumption.
1.4 What InterpResultEq actually says
InterpResultEq pins three observables:
- Return values agree under flattening.
flattenValue decls funcIdx v == gs for source value v and bytecode flat output gs. The flattening is today's Interpret.lean:574 definition, relocated to Ix/Aiur/Semantics/Flatten.lean and made total in M2.
- Final
IOBuffer is structurally equal. Both data : Array G and map : HashMap (Array G) IOKeyInfo (from Protocol.lean:43).
- Errors match.
Except.error _ on the source side iff Except.error _ on the bytecode side. v1 does not require message equality; v2 may refine.
1.5 Why the IOBuffer is in the theorem
A theorem that only related returned values would admit a compiler that silently dropped ioWrites or reordered ioSetInfos relative to computation. For a zkDSL where the IOBuffer is the public witness of a proof, that is catastrophic. Observability must include it.
1.6 Two scope clarifications about the statement
hname is surface-level. It hypothesizes that the entry function has a FunIdx — unavoidable, because the bytecode evaluator needs some funIdx to start at. It does not hypothesize that every recursive sub-call has a matching FunIdx. That property holds too, but is derived from the hypothesis t.compile = .ok ct via the StructCompatible invariant (§5.1), which is threaded through the simulation proof.
Fuel is passed through unchanged. Both evaluators decrement fuel only on function calls, never inside a single function body. Every compiler pass preserves the call count exactly (§5.3), so source and bytecode share a single fuel budget 1-for-1. The relation maps outOfFuel to outOfFuel.
1.7 What the plan does not prove
- Non-termination. Evaluators are fuel-indexed and total. Divergent programs are observed as
outOfFuel; nothing stronger is claimed.
- STARK / constraint-system soundness.
Compiler/Layout.lean computes selectors, auxiliaries, lookups; these affect trace generation only. STARK soundness is a separate theorem stacked on top of compile_correct.
- Rust executor ≡ Lean bytecode reference. The Lean evaluator we build (§4.2) is the specification; Rust's
src/aiur/execute.rs is a performance implementation cross-tested against it (§2.4, M1). A formal Rust↔Lean equivalence is left for a follow-up.
- Surface parser correctness.
Ix/Aiur/Meta.lean elaborates the surface language to Term. Macros like fold(i..j, init, |acc, @v| body) (Meta.lean:293-302) and template-call type-arg dropping (Meta.lean:101, 304-326) happen at parse time. The verified compiler operates from Term onwards. Parser bugs are out of scope, the same way CompCert excludes the C parser.
- debug observability. .debug nodes call dbg_trace; treated as transparent in the semantic relation.
- Pretty-printers, CLI, statistics, benchmarks.
- needsCircuit minimality. It's an optimization; semantics are indifferent.
- Apps consuming the verified compiler. Currently
Apps/ZKVoting/ uses Aiur via the Lean API. The theorem covers the compilation of the Aiur source they emit; their own correctness is separate.
2. The Code as It Stands
2.1 Compile pipeline
From Ix/Aiur/Compiler.lean:80-91:
Toplevel -- Ix/Aiur/Term.lean:143
├─ mkDecls (Check.lean:113) expand type aliases
├─ wellFormedDecls (Check.lean:695)
├─ checkFunction ×1 (Check.lean:743) Term → TypedTerm (pre-simplify)
├─ simplifyTerm (Simple.lean:12) match compiler + decisionToTerm
└─ checkFunction ×2 (Check.lean:743) TypedTerm post-simplify
↓
TypedDecls -- Ix/Aiur/TypedTerm.lean:77
└─ concretize (Concretize.lean:361) monomorphize; erase params/mvars
↓
TypedDecls (monomorphic)
└─ toBytecode (Lower.lean:622) + Layout.lean
↓
Bytecode.Toplevel -- Ix/Aiur/Bytecode.lean:88
├─ deduplicate (Dedup.lean:115) bisimulation / partition refinement
└─ needsCircuit (Compiler.lean:59) reachability over constrained calls
↓
CompiledToplevel
2.2 Source semantics — Ix/Aiur/Interpret.lean
Interpret.lean is the debug interpreter. Its role is fast interactive execution and error diagnosis, not proof reference. It has three features that the reference semantics deliberately omits:
- callCache : HashMap (Global × List Value) Value — memoizes (fun, args) to make repeated calls fast.
- Interrupt.error carrying a full call-stack trace — built up via callSite (:243-246), surfaced on any runtime error for debug messages like "in foo(1, 2)".
- partial recursion — can loop on non-terminating programs, which is fine for an interactive tool.
Structure:
- Value inductive (:10-21) — unit / field / tuple / array / ctor / fn / pointer.
- InterpState = { store, ioBuffer, callCache } (:199).
- InterpM = StateT InterpState (Except Interrupt) (:206).
- Entry:
runFunction decls funcName inputs ioBuffer (:646).
- 9
partial defs; cross-function recursion via applyGlobal.
The plan keeps Interpret.lean in this role. It is not refactored into the reference source semantics. The M0 bug fixes that align it with Rust (ioSetInfo overwrite, ioRead OOB) are applied here because the debug interpreter should agree with production on observable behavior, but the cache, stack trace, and partial machinery all stay. The proof-bearing reference semantics lives in separate new files: Ix/Aiur/Source/Eval.lean (M0) and Ix/Aiur/Concrete/Eval.lean (M3), introduced in §4.1.
2.3 Bytecode semantics — src/aiur/execute.rs (Rust)
- 436-line stack-machine interpreter over
Function.body : Block.
- Computes
QueryRecord (multiplicities, memory queries, byte queries) — trace-side artifacts, not part of the observable semantics we will formalize.
- The Lean side has no native bytecode evaluator yet;
Ix/Aiur/Protocol.lean:63-80 exposes Bytecode.Toplevel.execute as an opaque FFI to rs_aiur_toplevel_execute. Building a Lean-native reference is M1.
2.4 Existing test infrastructure
Tests/Aiur/Common.lean:79-103 runTestCase already implements a structural cross-check between the Lean source interpreter and the Rust bytecode executor:
- Run Rust:
compiled.bytecode.execute funIdx input inputIOBuffer.
- Run source interpreter:
Aiur.runFunction decls funcName inputs inputIOBuffer.
- Assert
flattenValue decls funcIdx output == execOutput and state.ioBuffer == execIOBuffer (Common.lean:73-76).
- Run the prover, verify the proof against the expected output.
Tests/Aiur/Aiur.lean (884 lines) drives this on ~70 programs: arithmetic, matches (tail and non-tail), enums with mutual recursion through pointers, parametric templates, fold macros, IO, byte/u32 ops, pattern destructuring.
Implications for the plan:
- M1 extends
runTestCase with a third party (the Lean-native bytecode reference); it does not create the cross-test from scratch.
- The latent bugs in §2.6 are precisely the cases the existing corpus doesn't exercise. M0 must add tests that trigger them; otherwise the fixes will be invisible to CI.
- The cross-check already verifies
IOBuffer equality between source and Rust. The verified-compiler theorem asks for the same property, proved instead of tested.
2.5 Surface language vs Term
Ix/Aiur/Term.lean:69 Term is the post-elaboration IR. The surface language users write is parsed by Ix/Aiur/Meta.lean, which handles a richer surface:
- fold(i..j, init, |acc, @v| body) is a parse-time macro (Meta.lean:293-302). It unrolls into a chain of let acc = ... using literal indices substituted for @v. There is no fold constructor in Term. @v outside fold is a parse error (Meta.lean:327-328).
- Template type/function calls (
Wrapper‹G›.Mk(42)) drop their type arguments at parse time (Meta.lean:101, 304-326). The elaborated Term.app carries no type arguments; the type checker re-infers them and stores them on TypedTerm.app : Global → Array Typ → .... So the polymorphic-name dispatch problem (§5) is a TypedTerm phenomenon, not a Term phenomenon.
The verified compiler operates from Term onwards. Meta.lean is an untrusted parser (§1.7).
2.6 Latent bugs and representational gaps
These are the cases where source and Rust diverge today, with disposition for each. Every one is latent in the existing test corpus — the corpus passes today because it doesn't exercise the buggy case. Each fix lands together with a test that exercises it.
| # | Location | Observation | Disposition |
| --- | --- | --- | --- |
| 1 | Interpret.lean:530 vs execute.rs:62-67 | Source ioSetInfo overwrites existing key via HashMap.insert; Rust panics on Entry::Occupied. | M0. Source: error on existing key. |
| 2 | Interpret.lean:533-537 vs execute.rs:228-235 | Source ioRead uses Array.extract (silent short read on OOB); Rust slices (panic on OOB). | M0. Source: error on OOB. |
| 3 | Interpret.lean:199-206, 260-263 and execute.rs:152-171 | Both interpreters memoize (fun, args) calls and skip body execution on a hit, eliding any IO effects inside. Caches are consistent between source and Rust today, so the existing cross-check passes. | M0 / policy. Not a "fix" to Interpret.lean — the cache is part of its debug-interpreter contract and stays. The separately-created Ix/Aiur/Source/Eval.lean (M0) and Ix/Aiur/Bytecode/Eval.lean (M1) are cache-free reference semantics. Production Rust keeps its cache; a follow-up theorem can prove "cache-having Rust ≈ cache-free reference" under a purity assumption on IO. |
| 4 | Compiler/Simple.lean:22 | decisionToTerm returns none on non-exhaustive match, bubbling to unreachable!. | M0. Non-exhaustive match becomes a CheckError; Simple.Term (§3.4) never carries an "impossible" leaf. |
| 5 | Concretize.lean:361 | Monomorphization worklist has no termination guarantee; polymorphic recursion at a strictly larger type diverges. | M0. Static check rejecting polymorphic recursion. |
| 6 | Interpret.lean:86, 391-398 vs execute.rs:36-40, 173-205 | Source Store is one global index space holding structured Values; Rust memory_queries is bucketed by width holding flattened Vec<G>. Pointer values diverge under .ptrVal (Interpret.lean:412). grep ptr_val Tests/Aiur finds nothing — fully latent. | M3. Requires Value.pointer : Nat → Nat → Value and runtime types at store time; that only becomes available when the reference evaluator moves from Source.Term to Concrete.Term (i.e., Concrete/Eval.lean). Interpret.lean is not touched. M3 also adds the first ptr_val-using test. |
| 7 | Compiler/Lower.lean:31-40, 575 and :45-73 | Layout.gadget / GadgetLayout exist and gadgetIdx is threaded through layoutMap, but never populated. Stale scaffolding for an unshipped feature. | M0 cleanup. Remove all three; simpler types for later proofs. |
Two notes on prior misdiagnoses:
- .load returning vs[0]! (Interpret.lean:407) is not data loss. Source stores a structured value wrapped in a single-element array (store.insert #[v] () at :397); vs[0]! unwraps the wrapper. The divergence with Rust is representational (item 6), not data loss.
- dataTypeFlatSize (Interpret.lean:565-568) has no cycle check. Direct datatype recursion (without .pointer indirection) diverges. Today's programs use .pointer for recursion so the bug is latent. Fix in M2 by adding a visited set, mirroring Layout.DataType.size:49-57.
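The M2 fix can mirror that shape; a hedged sketch, where the Decls lookup API, the tag-plus-widest-payload encoding, and the visited-set threading through typFlatSize are all assumptions:

```lean
-- Sketch (M2): dataTypeFlatSize with a visited set. A datatype seen
-- again without intervening .pointer indirection is direct recursion;
-- return none instead of diverging, mirroring Layout.DataType.size.
def dataTypeFlatSize (decls : Decls) (g : Global)
    (visited : List Global := []) : Option Nat :=
  if g ∈ visited then
    none  -- cycle without pointer indirection: reject
  else do
    let ctors ← decls.constructorsOf? g  -- hypothetical lookup
    let sizes ← ctors.mapM fun c =>
      c.fields.foldlM
        (fun acc t => return acc + (← typFlatSize decls t (g :: visited))) 0
    return 1 + sizes.foldl max 0  -- tag slot + widest payload
```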
2.7 unreachable! sites: where invariants leak through
26 sites total — 24 in Compiler/**, 2 in Check.lean. Each is a pass relying on an invariant established by an earlier pass but not tracked in the type. These are exactly the places where staged datatypes (§3) pay off; in the new design each becomes impossible by construction.
The most informative ones:
- Compiler/Simple.lean:22 — match compiler non-exhaustive (M0 bug 4 makes this a CheckError instead).
- Compiler/Lower.lean:135 — .ret in toIndex (non-tail position).
- Compiler/Lower.lean:136 — .match in toIndex. Comment says "Non-tail match not yet implemented"; after the recent commit, this is still reached for matches outside let-RHS chains. See §3.4.
- Compiler/Lower.lean:161, 468 — non-trivial .let after simplify.
- Compiler/Lower.lean:194, 207, 575 — .ref / .app / .gadget resolving to a non-callable layout.
- Compiler/Lower.lean:502, 519, 586, 599 — non-.var subpatterns.
- Compiler/Layout.lean:41, 93 — unresolved .mvar after concretize.
- Compiler/Match.lean:117-118, 215 — popNamespace and spatternToPattern .pointer impossibilities.
- Check.lean:595, 604 — constructor / datatype lookups already resolved by the type checker.
3. Staged Datatypes
The plan introduces a sequence of typed IRs. Each pass is a total function between adjacent stages, and the stages shrink monotonically: constructors disappear when the prior pass has eliminated them. This kills unreachable! arms by construction.
3.1 Pipeline of stages
Source.Term          Source IR      Stage 1 (rename of current Term)
    ↓ check
Typed.Term           Typed IR       Stage 2 (rename of current TypedTerm)
    ↓ simplify
Simple.Term          Match-free     Stage 3 (NEW)
    ↓ concretize
Concrete.Term        Monomorphic    Stage 4 (NEW)
    ↓ lower
Bytecode.Toplevel    Flat IR        Stage 5
    ↓ dedup
Bytecode.Toplevel    Bisimilar      Stage 6
All four *.Term types live under Ix/Aiur/<Stage>/Term.lean with matching namespaces: Aiur.Source.Term, Aiur.Typed.Term, Aiur.Simple.Term, Aiur.Concrete.Term. The rename of the current Ix/Aiur/Term.lean → Ix/Aiur/Source/Term.lean and Ix/Aiur/TypedTerm.lean → Ix/Aiur/Typed/Term.lean is part of M3.
3.2 Stage 1 — Source.Term
Today's Ix/Aiur/Term.lean:69 Term, renamed in M3. Full pattern language (.or, .pointer, nested). Types carry .mvar and .app with parameters. As §2.5 notes, this is post-elaboration; surface macros never appear here, and Source.Term.app carries no type arguments.
Refactor: flatten the Data mutual child into Source.Term.tuple / Source.Term.array constructors. Mechanical; cleaner induction principles.
3.3 Stage 2 — Typed.Term
Today's Ix/Aiur/TypedTerm.lean:48 TypedTerm, renamed in M3. Each subterm carries typ : Typ and escapes : Bool. Same pattern shapes as Stage 1; the type checker has inferred type arguments and stored them on app and ref.
Design choice. Keep fielded typing for v1 — a plain inductive annotated with types. An indexed alternative — inductive Typed.Term : Ctx → Typ → Type — would make Lower.expectIdx a refl and turn every type invariant into a type checker obligation rather than a lowering pass obligation. The win is huge for later proofs but the checker has to produce dependently-typed output. Defer to v2; revisit only if M5/M6 stall.
3.4 Stage 3 — Simple.Term (NEW)
Produced by the (rewritten) match compiler. Grammar:
Simple.Pattern ::= .wildcard
| .field G
| .ref Global (Array Local)
| .tuple (Array Local)
| .array (Array Local)
Simple.Term ::= ... leaves as in Typed.Term ...
| .letVar Local Simple.Term Simple.Term
| .letWild Simple.Term Simple.Term
| .letLoad Local Local Simple.Term
| .match Local (Array (Simple.Pattern × Simple.Term))
(Option Simple.Term)
Type-level invariants:
- No
.or subpatterns (match compiler eliminated them).
- No
.pointer subpatterns (lowered to .letLoad).
- No nested subpatterns — pattern children are bare
Locals.
- .let binds only a Local or is a wildcard.
- .match scrutinee is a Local.
- No non-exhaustive matches (after M0 fix 4).
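As a Lean inductive, the grammar and invariants above become structural; a sketch (leaf constructors elided as in the grammar, the keyword match escaped as «match», field names hypothetical):

```lean
-- Sketch of Stage 3. The invariants hold by construction: patterns have
-- no .or, no .pointer, and only bare Local children; let binds a Local
-- or nothing; the match scrutinee is a Local.
inductive Simple.Pattern where
  | wildcard
  | field    (g : G)
  | ref      (name : Global) (children : Array Local)
  | tuple    (children : Array Local)
  | array    (children : Array Local)

inductive Simple.Term where
  | letVar  (x : Local) (rhs body : Simple.Term)
  | letWild (rhs body : Simple.Term)
  | letLoad (dst src : Local) (body : Simple.Term)
  | «match» (scrut : Local)
            (arms : Array (Simple.Pattern × Simple.Term))
            (dflt : Option Simple.Term)
  -- ... leaves as in Typed.Term ...
```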
Non-tail match scope. Lower.lean:136 still throws "Non-tail match not yet implemented" whenever .match appears in a value-producing position other than a .let RHS chain. The recent non-tail-match commit added support only via findNonTailMatch (Lower.lean:347-367), which walks chains of .let (.var _) / .let .wildcard and lifts an inner match into a Ctrl.matchContinue. Matches inside tuples, arrays, function arguments, etc. still throw.
For v1 of the plan, Simple.Term's .match admits the same restriction: it appears only in let-RHS positions. Generalizing this is an additive milestone, not in scope here.
Simplify consumes Typed.Term, not Source.Term. Today's checkAndSimplify (Compiler/Simple.lean:38-54) runs checkFunction twice — once on the raw Term (discarding the result, only used to surface errors at original source positions), then simplify on the untyped Term, then checkFunction again to produce the final TypedTerm. The double pass goes away when the match compiler is rewritten to consume Typed.Term and produce Simple.Term directly, threading types through the decision tree.
3.5 Stage 4 — Concrete.Term (NEW)
Produced by concretize. Same constructors as Simple.Term but over a Concrete.Typ with no .mvar and no parametric .app:
Concrete.Typ ::= .unit | .field
| .tuple (Array Concrete.Typ)
| .array Concrete.Typ Nat
| .pointer Concrete.Typ
| .ref MonoGlobal
| .function (List Concrete.Typ) Concrete.Typ
Concrete.Function has no params. Every parametric decl has been specialised and renamed via concretizeName (Concretize.lean:43). The lowering pass toBytecode consumes only Concrete.Term — typSize (Layout.lean:23) no longer has an .mvar arm to reject.
3.6 Stage 5 — Bytecode.Toplevel
Already at Bytecode.lean:88. Two cleanups:
- Make
partial defs on Block / Ctrl total via termination_by on structural subterm (Dedup.lean:30-92, Layout.lean:192, Lower.lean walks).
- FunctionLayout is non-semantic; the bytecode evaluator ignores it. Keep it on Function but the semantic relation is blind to it.
3.7 Stage 6 — Post-dedup Bytecode.Toplevel
Same datatype, post-deduplicate. The proof obligation is bisimulation preservation: for every original funIdx, the original toplevel at funIdx observes the same behavior as the deduplicated toplevel at remap funIdx.
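Stated as a Lean theorem, the obligation reads roughly as follows (a sketch; the remap-returning signature of deduplicate is an assumption, and the proof body is elided):

```lean
-- Sketch (M4): dedup preserves per-function behavior up to the index
-- renaming `remap`. Proved by bisimulation on Block, with cycles
-- handled by well-founded induction on fuel.
theorem deduplicate_preserves
    (tl tl' : Bytecode.Toplevel) (remap : FunIdx → FunIdx)
    (h : deduplicate tl = (tl', remap)) :
    ∀ fi args io fuel, fi < tl.functions.size →
      Bytecode.Eval.runFunction tl fi args io fuel =
      Bytecode.Eval.runFunction tl' (remap fi) args io fuel := by
  sorry
```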
4. Reference Evaluators
Both evaluators are new Lean code. Both are total, fuel-indexed (call-only accounting), error-returning, and cache-free — no callCache, no function_queries. Caching is an orthogonal optimization layer that the production Rust executor uses for multiplicity bookkeeping in the STARK trace.
4.1 Source reference evaluators — Source/Eval.lean, Concrete/Eval.lean
The proof-bearing source reference lives in new files, separate from the debug interpreter Interpret.lean (§2.2). There are two stages:
- Ix/Aiur/Source/Eval.lean (M0) — reference evaluator on Source.Term. Used by M1's cross-test to validate against the Lean bytecode reference and Rust, before the stage refactor lands.
- Ix/Aiur/Concrete/Eval.lean (M3) — reference evaluator on Concrete.Term. This is the form that participates in compile_preservation. Supersedes Source/Eval.lean in the proof chain once M3 is done.
Both are fresh implementations. They may mirror Interpret.lean's structure for diffability, but they do not share code — Interpret.lean persists unchanged in its debug role, and keeping the two evaluators independent avoids accidental entanglement of debug features with the spec.
Design common to both (M0 and M3 forms).
- No cache. The cache is an optimization of the debug interpreter (and of Rust's
function_queries for STARK multiplicity tracking), not part of the semantic model. The reference evaluators always re-execute a body.
- No stack trace. Errors are
Except.error SourceError with a small tag enum; no call-stack construction. Debugging is Interpret.lean's job.
- Fuel-indexed, call-only accounting. A
fuel : Nat parameter decrements only at applyGlobal. Intra-body recursion is structural on Term / Concrete.Term. Exhaustion returns Except.error .outOfFuel.
- Total. Every recursive def carries an explicit termination measure (
matchPattern structural on Pattern, interp / applyGlobal lexicographic on (fuel, term), flattenValue / unflattenValue structural on Value / Typ, typFlatSize / dataTypeFlatSize with a visited : HashSet Global argument mirroring Layout.DataType.size:49-57).
- Errors return, never panic.
ioSetInfo on existing key, ioRead OOB, pattern-match failure, type mismatch, .ret outside a function body — all produce Except.error.
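The error type can be a flat tag enum; a sketch, where the constructor names are assumptions covering the failure modes just listed:

```lean
-- Sketch: SourceError carries no stack trace — one small tag per
-- failure mode. Debugging with full traces remains Interpret.lean's job.
inductive SourceError where
  | outOfFuel
  | ioKeyExists        -- ioSetInfo on an existing key (M0, §2.6 item 1)
  | ioReadOutOfBounds  -- ioRead past the buffer (M0, §2.6 item 2)
  | matchFailure       -- no arm matched, no default
  | typeMismatch
  | retOutsideFunction
  deriving Repr, DecidableEq
```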
M0 form (on Source.Term).
Signature:
Source.Eval.runFunction :
Decls → Global → List Value → IOBuffer → Nat →
Except SourceError (Value × IOBuffer)
Covers every Source.Term constructor. Does not solve the pointer representation divergence (§2.6 item 6): source pointers are still a single global counter, so a program that uses ptr_val across different flat widths would disagree with Rust. No current test triggers this.
M3 form (on Concrete.Term).
With types available at runtime, fix the pointer representation and align with Rust's bucketed memory:
- Value.pointer : Nat → Nat → Value carrying (width, index). flattenValue on .pointer w i = #[.ofNat i] (width is type-level, never in the flat encoding).
- Store : IndexMap Nat (IndexMap (Array G) Unit) — width buckets keyed over flat Array G. Matches Rust's memory_queries exactly.
- .store t computes the flat width of t.typ via typFlatSize, flattens t's value, inserts into the right bucket.
- .load ptr reads the width from the typed .load node, dispatches to the right bucket, unflattens via the element type.
- .ptrVal (if kept) returns the within-bucket index as a field element. Source and bytecode now agree on pointer numeric values. (See the discussion of whether to keep ptrVal at all in §10 open decisions.)
After M3, Concrete.Eval and Bytecode.Eval produce identical IOBuffers and identical flat results on every program.
Signature:
Concrete.Eval.runFunction :
Concrete.Decls → Global → List Value → IOBuffer → Nat →
Except SourceError (Value × IOBuffer)
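The bucketed store operation can be sketched as follows (the IndexMap API used here — get?, insert, size — is an assumption; the real names may differ):

```lean
-- Sketch (M3): width-bucketed store, mirroring Rust's memory_queries.
abbrev Store := IndexMap Nat (IndexMap (Array G) Unit)

-- .store: the flat width selects the bucket; the within-bucket index
-- becomes the pointer's numeric value, matching Rust.
def Store.storeFlat (s : Store) (flat : Array G) : Store × Nat :=
  let bucket := (s.get? flat.size).getD default
  (s.insert flat.size (bucket.insert flat ()), bucket.size)
```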
4.2 Bytecode reference evaluator — NEW, Ix/Aiur/Bytecode/Eval.lean
Mirrors src/aiur/execute.rs in big-step style: mutually recursive evalBlock / evalCtrl / evalOp, total, fuel-indexed.
structure EvalState where
map : Array G
memory : IndexMap Nat (IndexMap (Array G) Unit)
ioBuffer : IOBuffer
Bytecode.Eval.evalBlock :
Bytecode.Toplevel → Block → EvalState → Nat →
Except BytecodeError (Array G × EvalState)
memory has one bucket per width in Toplevel.memorySizes (Bytecode.lean:90), mirroring Rust's QueryRecord.memory_queries (execute.rs:36-40). Each Op.store values uses values.size as the width key.
Differences from execute.rs:
- No
QueryRecord. Drop multiplicities, byte queries, per-function call queries — trace-side bookkeeping.
- No call cache. §4.1 explanation applies symmetrically.
- No
unconstrained branching. Both branches of every if unconstrained produce the same value; they differ only in whether a query is logged.
- No stack machine. Direct big-step. Easier to reason about than small-step, and the two coincide here because the language is deterministic.
- Call-only fuel. Decrements only on
Op.call.
- Errors return, never panic. All failures
Except.error.
Bytecode.Eval.runFunction :
Bytecode.Toplevel → FunIdx → Array G → IOBuffer → Nat →
Except BytecodeError (Array G × IOBuffer)
4.3 The equivalence relation
Ix/Aiur/Semantics/Relation.lean (NEW):
inductive ValueEq (decls : Decls) (funcIdx : Global → Option Nat) :
Value → Typ → Array G → Prop
| unit : ValueEq .. .unit .unit #[]
| field : ValueEq .. (.field g) .field #[g]
| pointer : ValueEq .. (.pointer w i) (.pointer _) #[.ofNat i]
| fn : ValueEq .. (.fn g) _ #[.ofNat (funcIdx g).getD 0]
| tuple : ... -- concat of element encodings
| array : ...
| ctor : ... -- tag + args + padding, matching dataTypeFlatSize
def InterpResultEq decls funcIdx
(src : Except SourceError (Value × IOBuffer))
(bc : Except BytecodeError (Array G × IOBuffer)) (retTyp : Typ) : Prop :=
match src, bc with
| .ok (v, io₁), .ok (gs, io₂) => ValueEq decls funcIdx v retTyp gs ∧ io₁ = io₂
| .error _, .error _ => True
| _, _ => False
ValueEq is a propositional mirror of flattenValue. The equivalence proof for Lower (M6) goes by induction on Concrete.Term and maintains the local invariant that the bytecode map slice at the compiled indices contains the flattened form of every live binding.
5. The Simulation Invariant
5.1 StructCompatible
The simulation between source and bytecode must thread through every sub-call. We split this into a structural part — pure facts about the compilation output, provable as a standalone lemma — and a semantic part, established by simultaneous induction with compile_preservation.
-- Ix/Aiur/Semantics/Compatible.lean (NEW)
structure StructCompatible
(decls : Decls)
(bc : Bytecode.Toplevel)
(nameMap : Global → Option Bytecode.FunIdx) : Prop where
total_on_reachable :
∀ name f, decls.getByKey name = some (.function f) →
reachableFromEntries decls name →
∃ fi, nameMap name = some fi
funIdx_valid :
∀ name fi, nameMap name = some fi → fi < bc.functions.size
input_layout_matches :
∀ name fi f, nameMap name = some fi →
decls.getByKey name = some (.function f) →
(f.inputs.map Prod.snd).foldl
(fun acc t => acc + typFlatSize decls t) 0 =
bc.functions[fi]!.layout.inputSize
call_graph_closed :
∀ fi, fi < bc.functions.size →
∀ callee, callee ∈ collectAllCallees bc.functions[fi]!.body →
callee < bc.functions.size
theorem compile_ok_implies_struct_compatible :
t.compile = .ok ct →
StructCompatible t.decls ct.bytecode ct.getFuncIdx
This lemma is proved standalone, by induction on the compilation passes.
5.2 The full simulation pattern
The semantic clause — for every compiled function pair, source and bytecode bodies produce matching observations — references the theorem we're proving. So it cannot be proved standalone; it is proved simultaneously with compile_preservation by induction on fuel.
compile_preservation (see the full signature in §1.2) goes by strong induction on fuel. At the outer call site, we use StructCompatible (§5.1) to look up funIdx, equate flat input sizes, and bound the callee index. At every inner Op.call, the induction hypothesis at fuel - 1 discharges the recursive equivalence.
This is CompCert's match_states / fsim_match_states pattern. The adaptation uses big-step semantics with fuel as the well-founded measure, but the structure is identical: StructCompatible plays the global part of match_states, and the per-state "register file matches flattened env" condition (§4.3) plays the local part.
5.3 Fuel discipline
Claim: the fuel transformer φ at every pass is id. Source and bytecode share a single fuel budget 1-for-1. Why:
- Call-only accounting. Both evaluators decrement fuel only at function-call boundaries. Intra-body recursion is structural.
- Calls are preserved 1-to-1 by every pass.
  - simplify touches the pattern language; never adds or removes calls.
  - concretize is name substitution: every polymorphic call becomes exactly one monomorphic call.
  - lower (Compiler/Lower.lean:177-207) compiles every .app to exactly one Op.call.
  - dedup merges bisimilar bodies but leaves call sites intact.
  - needsCircuit toggles a flag the evaluator ignores.
For every source evaluation that makes k calls before terminating, the compiled bytecode makes exactly k calls. The two terminate at the same fuel bound.
If a future pass blows up calls (e.g., inlining), introduce a distinct φ only for that pass — do not weaken the global invariant.
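The call-only discipline both evaluators follow has this shape (a sketch; getFunction and interpBody are hypothetical helpers):

```lean
-- Sketch (§5.3): fuel is consumed only at the call boundary. The body
-- is interpreted at the decremented budget; intra-body recursion is
-- structural on the term, so it needs no fuel of its own.
def applyGlobal (decls : Decls) (g : Global) (args : List Value) :
    Nat → Except SourceError Value
  | 0 => .error .outOfFuel
  | fuel + 1 => do
    let f ← getFunction decls g   -- hypothetical lookup
    interpBody decls f args fuel  -- body runs at the smaller budget
```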
6. The Proof
6.1 Pass-by-pass obligations
For each compiler pass p : StageN → StageN+1, prove two companion lemmas:
-- preservation
theorem pass_preserves {src : StageN} {tgt : StageN+1}
(h : p src = .ok tgt) :
∀ env args io fuel,
evalN src env args io fuel ≈ evalN+1 tgt env args io fuel
-- progress
theorem pass_progress {src : StageN} (hwf : Stage_N_WellFormed src) :
∃ tgt, p src = .ok tgt
The two compose by transitivity of InterpResultEq for preservation and by bind chaining on Except for progress. Each per-pass progress lemma is typically ≤ 30% of the corresponding preservation proof: the case analysis is identical, only the conclusion changes.
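The progress half of the composition is ordinary bind chaining on Except; a self-contained sketch with schematic pass names:

```lean
-- Sketch (§6.1): composing progress across two adjacent passes.
-- `hwf` says p₁ sends well-formed input to well-formed output,
-- which is itself a small per-pass lemma.
theorem compose_progress {α β γ ε : Type}
    {WF₁ : α → Prop} {WF₂ : β → Prop}
    (p₁ : α → Except ε β) (p₂ : β → Except ε γ)
    (h₁ : ∀ s, WF₁ s → ∃ t, p₁ s = .ok t)
    (h₂ : ∀ t, WF₂ t → ∃ u, p₂ t = .ok u)
    (hwf : ∀ s t, WF₁ s → p₁ s = .ok t → WF₂ t) :
    ∀ s, WF₁ s → ∃ u, (p₁ s >>= p₂) = .ok u := by
  intro s hs
  obtain ⟨t, ht⟩ := h₁ s hs
  obtain ⟨u, hu⟩ := h₂ t (hwf s t hs ht)
  exact ⟨u, by rw [ht]; exact hu⟩
```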
6.2 Proof order — easy to hard
- needsCircuit semantic irrelevance. Sets a constrained flag the bytecode evaluator ignores. One-liner, absorbed into M9 as a trivial rfl-level step — no milestone of its own.
- deduplicate preservation. Bytecode-only pass; bisimulation over Block up to call-index renaming, with cycles handled by well-founded induction on fuel. (M4)
- toBytecode (Lower). The largest single proof. Source is Concrete.Term; target is a flat Block with a map : Array G register file. Local invariant: the bytecode map slice at the compiled indices contains the flattened form of every live binding. (M5 straight-line, M6 full)
- concretize. Values are type-erased; substitution changes types but not value denotation. Short proof. (M7)
- simplify / match compiler. Adapt Maranget's correctness proof; CompCert Cminorgenproof and CakeML pat_to_dec are the reference adaptations. Hardest single proof. (M8)
- Checker soundness. The type annotations on
Typed.Term must agree with what the evaluator actually produces — otherwise Lower cannot trust typSize during layout. This is M9.5; full statement in §1.3 and §8. With intrinsic typing (v2 of §3.3) the milestone vanishes because well-typedness becomes structural.
The order and dependencies above are realized as a concrete milestone schedule in §8; its dependency DAG (§8.1) shows how the obligations parallelize and which of them sit on the critical path.
7. Refactoring Work (Pre-Proof)
7.1 File layout
Ix/Aiur/
Source/Term.lean (Stage 1; renamed from Term.lean; flatten Data layer)
Source/Eval.lean NEW (M0; reference evaluator on Source.Term)
Typed/Term.lean (Stage 2; renamed from TypedTerm.lean; flatten TypedData layer)
Simple/Term.lean NEW (Stage 3; typed)
Concrete/Term.lean NEW (Stage 4 + Concrete.Typ)
Concrete/Eval.lean NEW (M3; reference evaluator on Concrete.Term; proof-bearing)
Bytecode.lean (Stage 5; unchanged)
Bytecode/Eval.lean NEW (M1; cache-free Lean reference)
Bytecode/ExecuteFfi.lean NEW (opaque Rust FFI; split from Protocol.lean)
Interpret.lean UNCHANGED in role (debug interpreter with cache +
stack trace); M0 applies io bug fixes for Rust parity
Semantics/
Relation.lean NEW (ValueEq, InterpResultEq)
Compatible.lean NEW (StructCompatible)
Flatten.lean NEW (total flattenValue / unflattenValue)
Compiler/
Check.lean unchanged API; total internals
Match.lean consumes Typed.Term, produces Simple.Term
Simple.lean Typed.Term → Simple.Term bridge
Concretize.lean Simple.Term → Concrete.Term
Lower.lean Concrete.Term → Bytecode.Toplevel
Dedup.lean unchanged; total
Layout.lean gadget arm removed
Proofs/
ValueEqFlatten.lean
StructCompatible.lean compile_ok_implies_struct_compatible
WellFormed.lean
DedupSound.lean M4
LowerSoundCore.lean M5
LowerSoundControl.lean M6
ConcretizeSound.lean M7
SimplifySound.lean M8
CompilerPreservation.lean M9
CheckSound.lean M9.5 (checker soundness; required)
CompilerProgress.lean M10
CompilerCorrect.lean M10 (final theorem)
7.2 Killing partial
Grep finds 92 partial defs across Ix/Aiur/**. They fall into three termination categories, each handled differently in M2:
- Structural recursion on the term / block / type tree. The bulk (80+). Become def ... termination_by on the obvious structural measure. The pain point is Lean's derived sizeOf not seeing through nested Array.map / foldl; expect ~1 AI-day on a small Ix/Array/Induction.lean helper library before the rest can be discharged.
- Cross-function recursion in the reference evaluators. Today's pattern Interpret.lean:251 applyGlobal → :279 interp → applyGlobal is genuinely unbounded. The reference evaluators (Source/Eval.lean, Concrete/Eval.lean) mirror this structure and handle it with an explicit fuel parameter that strictly decreases on every applyGlobal, plus a termination proof on the (fuel, term) lexicographic measure.
- Genuinely non-trivial termination. Two cases:
  - Concretize.lean:361 TypedDecls.concretize — worklist over newly discovered instantiations, no finiteness guarantee today. The M0 static check rejects polymorphic recursion; termination of the worklist then follows by structural induction on the type-argument DAG.
  - matchPattern through .pointer recurses into the store, but is bounded by pattern depth (the recursion sees a structurally smaller pattern at every step). Already def (total) in Interpret.lean:138; the reference evaluators reuse the same approach.
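The category-2 fuel discipline can be modeled in miniature. A self-contained toy (T, eval, apply, and body are invented names, not the real codebase; decreasing_by may need help on some toolchains):

```lean
-- Toy model of the fuel discipline: fuel is spent only at the call
-- boundary, and termination is by the lexicographic measure
-- (fuel, sizeOf term).
inductive T where
  | lit  : Nat → T
  | add  : T → T → T
  | call : T            -- re-enters the single recursive "function body"

def body : T := .add (.lit 1) .call

mutual
  def eval (fuel : Nat) (t : T) : Option Nat :=
    match t with
    | .lit n   => some n
    | .add a b => return (← eval fuel a) + (← eval fuel b)
    | .call    => apply fuel
  termination_by (fuel, sizeOf t)

  def apply (fuel : Nat) : Option Nat :=
    match fuel with
    | 0         => none        -- the toy analogue of outOfFuel
    | fuel' + 1 => eval fuel' body
  end
  termination_by (fuel, 0)
end
```

Within a body, recursion is structural (second lex component); crossing the call boundary strictly decreases fuel (first component), so the mutual pair is total.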
Disposition by file:

| File | Partials | Fate |
| --- | --- | --- |
| Check.lean | 40 | Total in M2 (category 1). |
| Compiler/** | 39 | Total in M2 (category 1, plus Concretize in category 3). |
| Compiler.lean | 1 | Total in M2. |
| Interpret.lean utilities (flattenValue, unflattenValue, typFlatSize, dataTypeFlatSize) | 4 | Moved to Semantics/Flatten.lean in M2 and made total there (category 1, with a visited set for dataTypeFlatSize). |
| Interpret.lean core (interp, applyGlobal, applyLocal, ppValue, natToHexGo) | 5 | Stay partial. Debug tool, may loop. |
| Meta.lean | 3 | Stay partial. Parser, not proof-load-bearing. |
After M2, the proof stack is partial-free. The 8 remaining partials (5 in Interpret.lean, 3 in Meta.lean) participate in no proof.
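The nested-Array pain point in miniature (Tree and size are toy names; the attach idiom is the standard workaround the Ix/Array/Induction.lean helpers would package):

```lean
-- A nested inductive through Array, shaped like the real term types.
inductive Tree where
  | node : Array Tree → Tree

-- Naive recursion via `ts.map size` fails the termination check: inside
-- the lambda, Lean cannot see that t is smaller than `.node ts`.
-- Array.attach carries the membership proof the checker needs:
def size : Tree → Nat
  | .node ts => 1 + (ts.attach.map (fun ⟨t, _⟩ => size t)).foldl (· + ·) 0
```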
7.3 Incidental refactors
These land alongside their driving milestones but are worth calling out so reviewers know they're planned:
- Flatten the
Data / TypedData mutual children into direct Source.Term.tuple / Source.Term.array (and Typed.Term equivalents). Cleaner induction principles. Lands with the M3 rename.
- Split
Ix/Aiur/Protocol.lean into Protocol.lean (keeps AiurSystem, Proof, FRI parameters, buildClaim) and a new Ix/Aiur/Bytecode/ExecuteFfi.lean (keeps the opaque Rust executor). Makes room for Bytecode/Eval.lean (M1) to stand alone.
- Delete the double
checkFunction pass. Today checkAndSimplify (Compiler/Simple.lean:38-54) runs the checker twice because simplify operates on untyped Term. Once Match.lean consumes Typed.Term (M3), the second check disappears.
All three are folded into the milestone that enables them; there is no separate "refactor milestone".
8. Milestones
Time unit: AI-day. One AI-day is a focused interactive day driving a Lean proof session, including edit/elaborate/iterate cycles. It is not equivalent to an engineer-day: compile latency and proof iteration depth dominate, and much of the "thinking" time is compressed or amortized across parallel attempts.
Dependencies form the DAG in §8.1 below. Headline numbers:
| Metric | Value |
| --- | --- |
| Sequential sum (all milestones back-to-back) | ~26–42 AI-days |
| Critical path (M0→M1→M2→M3→M5→M6→M9→M9.5→M10) | ~20–32 AI-days |
| Wall clock with parallelism (M4/M7/M8 alongside) | ~5–7 weeks |
| Risk-adjusted (M3, M6, M8, or M9.5 stalling) | ~8 weeks |
M3, M6, M8, and M9.5 are the four milestones most likely to slip. M9.5 is on the critical path and mandatory — skipping it would leave a hole in the simulation proof at M6 (Lower consults typSize on type annotations that nothing guarantees are accurate).
8.1 Dependency graph
┌─→ M4 ───────────────────┐
│ │
M0 ─→ M1 ─→ M2 ────┤ │
│ │
└─→ M3 ─┬─→ M5 ─→ M6 ─────┼─→ M9 ─→ M9.5 ─→ M10
├─→ M7 ───────────┤
└─→ M8 ───────────┘
- M4 (dedup) needs
Bytecode.Eval (M1) and totality (M2), but not the staged datatypes. It starts as soon as M2 is done, running in parallel with M3.
- M5/M6, M7, M8 all need M3 (staged datatypes,
Concrete.Eval). They run concurrently after M3.
- M9 composes preservation from M4 + M6 + M7 + M8 using
StructCompatible (the standalone invariant from §5.1).
- M9.5 proves
checkAndSimplify sound — every type annotation in its output is semantically accurate. Required by M6 (Lower consults typSize on annotations) and by M10 (WellFormed decomposition).
- M10 adds progress across all passes and yields
compile_correct.
Each milestone below ships independently and keeps the tree green.
M0 — Semantic-parity fixes and source reference evaluator — ~1–2 AI-days
Two concerns land together in M0:
Part A: fix the latent bugs in §2.6. Each fix lives in whichever file originally contained the bug, and each lands with the test that exercises it — none of these bugs is observable on the existing corpus, so without new tests the PRs would be invisible to CI.
| File | Fix | New test |
| --- | --- | --- |
| Ix/Aiur/Interpret.lean:530 | ioSetInfo errors on existing key | Set [0] then set [0] again. |
| Ix/Aiur/Interpret.lean:533-537 | ioRead errors on OOB | Read len=4 past data.size. |
| Ix/Aiur/Compiler/Simple.lean:22 + Compiler/Match.lean | Non-exhaustive match becomes a CheckError instead of bubbling to unreachable! | Program with non-exhaustive match on G. |
| Ix/Aiur/Compiler/Concretize.lean:361 | Static monomorphization-termination check (rejects polymorphic recursion) | Polymorphic-recursive template. |
| Ix/Aiur/Compiler/Lower.lean:31-40, 575 + :45-73 | Remove dead Layout.gadget / GadgetLayout / gadgetIdx | None new — code deletion. |
The two Interpret.lean fixes keep Interpret.lean aligned with Rust on observable behavior. The callCache and stack trace are not touched — they are part of the debug interpreter's contract. The other three fixes are in the compiler pipeline, not in the interpreter.
Part B: create Ix/Aiur/Source/Eval.lean. Fresh cache-free, stack-trace-free, fuel-indexed reference evaluator on Source.Term. Structure may mirror Interpret.lean for diffability but the two files share no code. §4.1 lists the design.
partial still permitted in this file through M0; totality comes in M2 (where Source/Eval.lean is one of the files made total, Interpret.lean is not).
Exit: every Part A fix has a test that fails on main and passes on its branch. Ix/Aiur/Source/Eval.lean compiles and agrees with Interpret.lean on the full corpus (on value and IOBuffer) via a new cross-test added to Tests/Aiur/Common.lean. lake test green.
M1 — Lean-native bytecode reference — ~1–2 AI-days
Ix/Aiur/Bytecode/Eval.lean: total, fuel-indexed, error-returning, big-step, cache-free. Per-width memory buckets keyed from Toplevel.memorySizes. partial still allowed; totality in M2.
Extends the cross-test. Tests/Aiur/Common.lean:runTestCase already validates Interpret.lean against Rust (on value + IOBuffer). M0 added a check that Source.Eval agrees with Interpret.lean. M1 adds the fourth node: Bytecode.Eval against the other three.
The full cross-test at M1:
Rust.executeFfi
      │ (existing)
Interpret.lean
      │ (M0 test)
Source.Eval
      │ (M1 test)
Bytecode.Eval
Every test asserts that the four agree on value and IOBuffer. Any divergence is a ticket.
Expected divergence at M1: a program that calls an IO-effecting function twice with the same args would show Interpret.lean / Rust.executeFfi (cached, one write) disagreeing with Source.Eval / Bytecode.Eval (cache-free, two writes). No current test exercises this; if one does, it becomes a known divergence rather than a bug.
Exit: four-way agreement on the M0-extended corpus, or documented divergences with tickets.
M2 — Totality — ~3–5 AI-days
Make every proof-load-bearing partial def total. Scope:
- Ix/Aiur/Check.lean (40 partials) — the largest mechanical lift.
- Ix/Aiur/Compiler/** (39 partials across Concretize, Dedup, Lower, Layout, Match, Simple).
- Ix/Aiur/Source/Eval.lean (the reference evaluator created in M0).
- Ix/Aiur/Bytecode/Eval.lean (the reference evaluator created in M1).
- Move the flattening utilities out of Interpret.lean into a new Ix/Aiur/Semantics/Flatten.lean: flattenValue, unflattenValue, typFlatSize, dataTypeFlatSize. These are used by the reference evaluators and by the theorem statement, so they must be total. Add a visited : HashSet Global parameter to dataTypeFlatSize / typFlatSize mirroring Layout.DataType.size:49-57. Update Tests/Aiur/Common.lean and Interpret.lean to import from the new location.
- Fuel where genuinely needed (§7.2 category 2); everything else is structural. Check.lean's 40 partials recurse through nested arrays and need real induction support: budget ~1 AI-day on Ix/Array/Induction.lean helpers before the bulk can be discharged.
Out of scope: Ix/Aiur/Interpret.lean itself. After M2 it still contains partial def interp, partial def applyGlobal, partial def applyLocal, partial def ppValue, partial def natToHexGo (5 partials; matchPattern is already def). The core interpreter loop can loop on divergent programs — that is a feature of the debug tool, not a bug. The pretty printers are not semantic. None of these participate in any proof.
Exit: grep -rn "^partial def" Ix/Aiur/ returns only:
- Interpret.lean: interp, applyGlobal, applyLocal, ppValue, natToHexGo — 5 deliberate partials (debug interpreter core + pretty printers; matchPattern is already def).
- Meta.lean: 3 partials (parser, not proof-load-bearing).
Every other file in Ix/Aiur/** is partial-free.
M3 — Stage datatypes and typed source reference — ~4–6 AI-days
The largest refactor milestone. Two coupled changes, each with sub-tasks:
Part A: staged datatypes.
- Rename
Ix/Aiur/Term.lean → Source/Term.lean and Ix/Aiur/TypedTerm.lean → Typed/Term.lean; update every import.
- Introduce new files
Simple/Term.lean (§3.4) and Concrete/Term.lean + Concrete/Typ (§3.5).
- Rewrite
Compiler/Match.lean to consume Typed.Term and produce Simple.Term directly (pure decision-tree compiler, see §9 risk 3).
- Rewrite
Compiler/Concretize.lean from TypedDecls → TypedDecls to Simple.Term → Concrete.Term.
- Rewrite
Compiler/Lower.lean (toIndex, TypedTerm.compile) to consume Concrete.Term.
- Delete the double
checkFunction pass in Compiler/Simple.lean:38-54.
- Delete the 20+
unreachable! / Impossible / Should not happen sites in Compiler/** now ruled out by types (§2.7).
- Flatten
Data / TypedData mutual children into direct constructors (§7.3).
- Split
Ix/Aiur/Protocol.lean into Protocol.lean + Bytecode/ExecuteFfi.lean (§7.3).
Part B: typed reference evaluator.
- Create
Ix/Aiur/Concrete/Eval.lean, a fresh evaluator on Concrete.Term (§4.1 M3 form). Width-bucketed Store, Value.pointer (w, i), typed .store / .load / .ptrVal.
- Add the first
ptr_val-using test to lock the new pointer semantics (skip if ptr_val is removed per §10 decision 3).
Concrete.Eval.lean supersedes Source/Eval.lean in the proof chain; Source/Eval.lean stays around only as the M1 cross-test witness.
This is the milestone most likely to slip. Total code churn is on the order of 2,000–4,000 lines, spread across ~15 files. The budget assumes that M2 totality work has already smoothed out Check.lean and Compiler/** — if any residual partial def blocks a type refactor, expect extra iteration.
Exit: Toplevel.compile type-checks through the new stages; existing tests pass; grep -rn "unreachable!\|Impossible\|Should not happen" Ix/Aiur/Compiler/ returns ≤ a handful with documented reasons; Concrete.Eval and Bytecode.Eval agree pointer-for-pointer on every program in the cross-test.
M4 — Dedup soundness — ~1–2 AI-days — parallel with M5/M7
Preservation: Bytecode.Eval.run t idx args io fuel ≈ Bytecode.Eval.run t.dedup.1 (t.dedup.2 idx) args io fuel. Progress: deduplicate is total (trivial). Bisimulation up to call-index renaming, cycles handled by well-founded induction on fuel.
Exit: Proofs/DedupSound.lean compiles.
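To make "up to call-index renaming" concrete, a toy version (Op, Block, and their fields are invented here; the real Bytecode types differ):

```lean
-- dedup produces a renaming ρ of function indices (merged bodies share
-- one index). The M4 relation compares blocks up to ρ at call sites:
-- every Op.call in the source block maps to a ρ-renamed call in the
-- deduplicated toplevel, and all other ops are untouched.
inductive Op where
  | call  (callee : Nat) : Op
  | other : Op

abbrev Block := List Op

def Op.rename (ρ : Nat → Nat) : Op → Op
  | .call f => .call (ρ f)
  | .other  => .other

def Block.rename (ρ : Nat → Nat) (b : Block) : Block :=
  b.map (Op.rename ρ)
```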
M5 — Lower proof, straight-line subset — ~2–3 AI-days
Preservation + progress for Lower.toIndex on the fragment of Concrete.Term excluding .match, function calls, .store / .load, IO, and u8/u32 ops. Establishes the local simulation relation: bytecode map slice at compiled indices = flattened bound value.
Exit: lemma covers every non-control constructor.
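A hedged sketch of that local relation (LocalInv and the bindings representation are inventions of this sketch; flattenValue and its signature follow the §1.1 statement):

```lean
-- Each live binding, compiled to a start index i, must appear flattened
-- in the bytecode register file `map` at exactly the slice [i, i + size).
def LocalInv (decls : Decls) (funcIdx : Global → Option Nat)
    (bindings : List (Nat × Value))   -- (compiled start index, source value)
    (map : Array G) : Prop :=
  ∀ i v, (i, v) ∈ bindings →
    let flat := flattenValue decls funcIdx v
    map.extract i (i + flat.size) = flat
```

Each straight-line constructor case then shows the invariant is re-established after the op it compiles to.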
M6 — Lower proof, full — ~3–5 AI-days — parallel with M8
Extend M5 to:
- .match in tail position.
- Non-tail
.match via Ctrl.matchContinue, only for the let-RHS form matching findNonTailMatch. Matches in other positions are excluded by the Concrete.Term shape established in M3.
- .ret / Ctrl.return / Ctrl.yield.
- Function calls (
Op.call); the unconstrained flag is semantically equivalent in both branches (§4.2).
.store / .load with width-bucketed memory (M3 payoff).
- IO operations — the
IOBuffer clause of the main theorem is discharged here.
- u8/u32 ops —
bv_decide or a library of bitvector lemmas.
Exit: Lower.compile_preservation and Lower.compile_progress for every Concrete.Term.
M7 — Concretize soundness — ~1 AI-day — parallel with M5/M6
Preservation: substInTypedTerm subst body has the same value denotation as body under the original type environment (type erasure). Progress: the M0 termination check bounds the worklist.
Exit: Proofs/ConcretizeSound.lean compiles.
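One way to phrase the type-erasure core of the proof, abstracting over the actual types (every name below is schematic, not from the codebase):

```lean
section
variable {TypSubst TypedTerm Untyped : Type}
variable (substInTypedTerm : TypSubst → TypedTerm → TypedTerm)
variable (erase : TypedTerm → Untyped)

-- Key lemma shape: type substitution is invisible after erasing the
-- annotations. Preservation follows if both evaluators' value
-- denotations factor through erase.
def ErasureInvariant : Prop :=
  ∀ (σ : TypSubst) (t : TypedTerm),
    erase (substInTypedTerm σ t) = erase t
end
```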
M8 — Match compiler soundness — ~3–5 AI-days — parallel with M6
Adapt Maranget's correctness proof for decision-tree pattern compilation. For every Simple.Term produced from a Typed.Term by the rewritten simplifyTerm, the decision tree matches exactly when the original nested pattern would have matched, and binds the same locals to the same values.
References: CompCert Cminorgenproof, CakeML pat_to_dec, Maranget 2008. Largest single proof; iteration depth is the bottleneck.
Exit: Proofs/SimplifySound.lean compiles.
M9 — Top-level preservation — ~1 AI-day
Toplevel.compile_preservation from M4 + M6 + M7 + M8 by transitivity of InterpResultEq, threading fuel / StructCompatible / typing assumptions through the composition.
Exit: Proofs/CompilerPreservation.lean compiles.
M9.5 — Checker soundness — ~3–5 AI-days
Prove that checkAndSimplify produces semantically coherent annotations:
theorem checkAndSimplify_sound
    {t : Toplevel} {decls : Decls} (hdecls : t.mkDecls = .ok decls)
    {typedDecls : TypedDecls} (h : t.checkAndSimplify = .ok typedDecls) :
    ∀ {name f}, typedDecls.get? name = some (.function f) →
    ∀ funcIdx env, WellFormedEnv env →
      let ⟨v, _⟩ := Source.Eval.runTypedBody decls f env
      ValueShapeMatches v f.output ∧
      (flattenValue decls funcIdx v).size = typSize decls f.output
This is the property Lower.preservation (M6) needs when it consults typSize on TypedTerm.typ. Without it, the Lower proof has no grounds to claim its layout computation matches the runtime value.
The proof is by induction on the Check.lean inferTerm structure (40 mutual arms) plus the simplify pass (which preserves annotations through the match compiler). Heavy use of the M2 totality proofs for the recursion structure; each arm has a bespoke soundness argument.
Required, not optional. Without M9.5, M6 has a hole: "type annotations are accurate" is an unjustified assumption, and the claim that Lower preserves semantics cannot be discharged. The full-stack theorem is a chain that breaks at this link.
If intrinsic typing is adopted (risk 1 / §3.3), M9.5 vanishes: the Typed.Term : Ctx → Typ → Type datatype makes coherence structural. This is the one path that avoids the work.
Exit: Proofs/CheckSound.lean compiles. M6 and M7 can cite checkAndSimplify_sound in their preservation proofs.
M10 — Progress and compile_correct — ~1–2 AI-days
- Define
WellFormed (§1.3) in Proofs/WellFormed.lean. Since every conjunct is a computable .ok observation, no ghost predicates appear.
- Per-pass progress lemmas (most are byproducts of preservation). The
mkDecls and checkAndSimplify lemmas are trivial: their success is already hypothesized by WellFormed.
- Compose into
compile_progress.
- Combine with M9 (preservation) and M9.5 (checker soundness) into
compile_correct in Proofs/CompilerCorrect.lean.
Exit: the §1.1 theorem compiles, with no sorry and no trusted components beyond Lean's kernel and the Rust FFI boundary.
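A hedged sketch of that WellFormed shape (the two check-function names are placeholders; only mkDecls and checkAndSimplify are real identifiers):

```lean
-- Every conjunct is a Bool-valued observation of running the compiler
-- front half, so WellFormed is decidable and `decide` closes it for any
-- concrete program.
def WellFormed (t : Toplevel) : Prop :=
  (t.mkDecls matches .ok _)
  ∧ (t.checkAndSimplify matches .ok _)
  ∧ (t.monoTerminationCheck matches .ok _)          -- placeholder name
  ∧ (t.constrainedCallGraphAcyclic matches .ok _)   -- placeholder name
```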
9. Risks
- Indexed Stage 2 fallback. If M5/M6 stall on the simulation relation, intrinsic typing (
inductive Typed.Term : Ctx → Typ → Type) becomes attractive. Budget ~3 AI-days to reboot the checker — the largest single contingency.
- Monomorphization-termination check rejects real programs. Run the M0 check against the existing corpus before merging. Fall back to a bounded-instance-count policy if anything legitimate breaks.
- runMatchCompiler is stateful (id supply, diagnostics). Pure versions are easier to reason about. Pure-ify in M3 when rewriting it to produce Simple.Term directly.
- Array induction ergonomics. Lean 4's derived
sizeOf is finicky around nested Arrays. Budget ~1 AI-day in M2 for an Ix.Array.Induction helper library — already included in the M2 estimate but worth calling out.
- u8/u32 byte-level reasoning. The u8/u32 ops in
Interpret.lean:427-505 and the Rust gadgets use native bit operations. The Lower proof at M6 needs either bv_decide or a hand-built library of byte-level lemmas. Budget within M6.
- M9.5 (checker soundness) is open-ended. 3–5 AI-days is a point estimate; the real cost depends on whether the fielded
TypedTerm supports a clean soundness proof or whether the checker has to be partially rewritten. If the latter, the Stage-2 intrinsic-typing rewrite (risk 1) may end up being cheaper overall — both obviate M9.5 and fix the same root cause. Decide after spiking the first few inferTerm arms.
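The bv_decide route mentioned in the u8/u32 risk, in miniature (assumes a toolchain recent enough to ship bv_decide and its bundled SAT backend):

```lean
-- Representative fixed-width facts of the kind the M6 byte-op cases
-- need; bv_decide bit-blasts the goal and calls the SAT solver.
example (x y : BitVec 32) : x &&& y = y &&& x := by bv_decide
example (x y : BitVec 8)  : x + y = y + x := by bv_decide
```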
10. Open Decisions
Sign-off needed before M0:
- Indexed
Typed.Term — now or v2. Recommend v2 (§3.3). If adopted now, M9.5 vanishes entirely (intrinsic well-typedness), at the cost of a dependently-typed checker rewrite. The plan currently assumes v1 (fielded typing) + M9.5.
- Monomorphization-termination policy — static reject vs bounded count. Recommend static reject (§2.6 item 5).
- ptrVal — keep or remove? Zero call sites today. Keeping it costs ~25 lines of proof work across the pipeline; removing it costs a handful of surface-language deletions. Recommend remove in M0 cleanup; §4.1 M3 step and §2.6 item 6 become no-ops.
All other design choices are already folded into the plan:
- Fuel-indexed evaluation, call-only accounting. §1.6, §4.1, §4.2, §5.3.
- callCache absent from the Lean reference evaluators, kept in Interpret.lean and production Rust. §2.2, §4.1.
- Non-exhaustive match is a
CheckError. M0 bug fix 4.
- .debug is transparent in the semantic relation. §1.7.
- Constrained-call-graph acyclicity is a
WellFormed clause, not a checker obligation. §1.3.
- Non-tail match in
Simple.Term is restricted to let-RHS chains (matches the current compiler). §3.4.
- Store bucketing is deferred to M3 (requires runtime types). §2.6 item 6, §4.1 M3 step.
11. Appendix — Concrete Numbers
Verified by grep over Ix/Aiur/**:
- 92 total
partial defs — per-file breakdown and disposition in §7.2.
- 26
unreachable! / throw "Impossible ..." / "Should not happen ..." sites — 24 in Compiler/**, 2 in Check.lean; Lower.lean alone has 20. Enumerated in §2.7.
- Source LOC to touch: ~3,500 (pipeline) + ~2,800 (tests to extend).
- New LOC estimate: ~1,200 (datatypes + two reference evaluators +
StructCompatible + WellFormed) + ~5,000–7,000 (proofs).
For §5–§6 and §8 it's useful to name the halves. Conjunct (b) of §1.1 is
Toplevel.compile_preservation; conjunct (a) isToplevel.compile_progress. They are proved separately — preservation by simultaneous induction with the simulation invariant (§5.2), progress by composing per-pass lemmas (§6.1).compile_correctiscompile_progressfollowed bycompile_preservationapplied to the witness from progress.1.3
WellFormed— the precondition for progressNo ghost predicates. Every conjunct is a computable observation about
t. A user provesWellFormed tby literally runningmkDecls,checkAndSimplify, the monomorphization-termination check, and the call-graph acyclicity check, and observing that all four succeed.decidedischarges it for any concrete program.Each conjunct corresponds to exactly one range of passes:
WellFormedclausemkDecls,wellFormedDeclscheckFunction ×2,simplifyTermcheckAndSimplify = .okconcretizeMonoTerminatestoBytecodededuplicate,needsCircuitConstrainedCallGraphAcyclicBecause
checkAndSimplifyis the bridge betweenTermandTypedTerm, its output is what downstream passes consume. The preservation proof relies on the type annotations in that output being semantically coherent (each annotated type agrees with what the term actually evaluates to). Proving that coherence — checker soundness — is M9.5. It is not optional: without it, the downstream preservation proofs (especiallyLower, which consultstypSizeon the annotations) rest on an unjustified assumption.1.4 What
InterpResultEqactually saysInterpResultEqpins three observables:flattenValue decls funcIdx v == gsfor source valuevand bytecode flat outputgs. The flattening is today'sInterpret.lean:574definition, relocated toIx/Aiur/Semantics/Flatten.leanand made total in M2.IOBufferis structurally equal. Bothdata : Array Gandmap : HashMap (Array G) IOKeyInfo(fromProtocol.lean:43).Except.error _on the source side iffExcept.error _on the bytecode side. v1 does not require message equality; v2 may refine.1.5 Why the
IOBufferis in the theoremA theorem that only related returned values would admit a compiler that silently dropped
ioWrites or reorderedioSetInfos relative to computation. For a zkDSL where theIOBufferis the public witness of a proof, that is catastrophic. Observability must include it.1.6 Two scope clarifications about the statement
hnameis surface-level. It hypothesizes that the entry function has aFunIdx— unavoidable, because the bytecode evaluator needs somefunIdxto start at. It does not hypothesize that every recursive sub-call has a matchingFunIdx. That property holds too, but is derived fromhctvia theStructCompatibleinvariant (§5.1) which is threaded through the simulation proof.Fuel is passed through unchanged. Both evaluators decrement fuel only on function calls, never inside a single function body. Every compiler pass preserves the call count exactly (§5.3), so source and bytecode share a single fuel budget 1-for-1. The relation maps
outOfFueltooutOfFuel.1.7 What the plan does not prove
outOfFuel; nothing stronger is claimed.Compiler/Layout.leancomputes selectors, auxiliaries, lookups; these affect trace generation only. STARK soundness is a separate theorem stacked on top ofcompile_correct.src/aiur/execute.rsis a performance implementation cross-tested against it (§2.4, M1). A formal Rust↔Lean equivalence is left for a follow-up.Ix/Aiur/Meta.leanelaborates the surface language toTerm. Macros likefold(i..j, init, |acc, @v| body)(Meta.lean:293-302) and template-call type-arg dropping (Meta.lean:101, 304-326) happen at parse time. The verified compiler operates fromTermonwards. Parser bugs are out of scope, the same way CompCert excludes the C parser.debugobservability..debugnodes calldbg_trace; treated as transparent in the semantic relation.needsCircuitminimality. It's an optimization; semantics are indifferent.Apps/ZKVoting/uses Aiur via the Lean API. The theorem covers the compilation of the Aiur source they emit; their own correctness is separate.2. The Code as It Stands
2.1 Compile pipeline
From
Ix/Aiur/Compiler.lean:80-91:2.2 Source semantics —
Ix/Aiur/Interpret.leanInterpret.leanis the debug interpreter. Its role is fast interactive execution and error diagnosis, not proof reference. It has three features that the reference semantics deliberately omits:callCache : HashMap (Global × List Value) Value— memoizes(fun, args)to make repeated calls fast.Interrupt.errorcarrying a full call-stack trace — built up viacallSite(:243-246), surfaced on any runtime error for debug messages like"in foo(1, 2)".partialrecursion — can loop on non-terminating programs, which is fine for an interactive tool.Structure:
Valueinductive (:10-21) — unit / field / tuple / array / ctor / fn / pointer.InterpState = { store, ioBuffer, callCache }(:199).InterpM = StateT InterpState (Except Interrupt)(:206).runFunction decls funcName inputs ioBuffer(:646).partial defs; cross-function recursion viaapplyGlobal.The plan keeps
Interpret.leanin this role. It is not refactored into the reference source semantics. The M0 bug fixes that align it with Rust (ioSetInfooverwrite,ioReadOOB) are applied here because the debug interpreter should agree with production on observable behavior, but the cache, stack trace, andpartialmachinery all stay. The proof-bearing reference semantics lives in separate new files:Ix/Aiur/Source/Eval.lean(M0) andIx/Aiur/Concrete/Eval.lean(M3), introduced in §4.1.2.3 Bytecode semantics —
src/aiur/execute.rs(Rust)Function.body : Block.QueryRecord(multiplicities, memory queries, byte queries) — trace-side artifacts, not part of the observable semantics we will formalize.Ix/Aiur/Protocol.lean:63-80exposesBytecode.Toplevel.executeas an opaque FFI tors_aiur_toplevel_execute. Building a Lean-native reference is M1.2.4 Existing test infrastructure
Tests/Aiur/Common.lean:79-103 runTestCasealready implements a structural cross-check between the Lean source interpreter and the Rust bytecode executor:compiled.bytecode.execute funIdx input inputIOBuffer.Aiur.runFunction decls funcName inputs inputIOBuffer.flattenValue decls funcIdx output == execOutputandstate.ioBuffer == execIOBuffer(Common.lean:73-76).Tests/Aiur/Aiur.lean(884 lines) drives this on ~70 programs: arithmetic, matches (tail and non-tail), enums with mutual recursion through pointers, parametric templates,foldmacros, IO, byte/u32 ops, pattern destructuring.Implications for the plan:
runTestCasewith a third party (the Lean-native bytecode reference); it does not create the cross-test from scratch.IOBufferequality between source and Rust. The verified-compiler theorem asks for the same property, proved instead of tested.2.5 Surface language vs
TermIx/Aiur/Term.lean:69 Termis the post-elaboration IR. The surface language users write is parsed byIx/Aiur/Meta.lean, which handles a richer surface:fold(i..j, init, |acc, @v| body)is a parse-time macro (Meta.lean:293-302). It unrolls into a chain oflet acc = ...using literal indices substituted for@v. There is nofoldconstructor inTerm.@voutsidefoldis a parse error (Meta.lean:327-328).Wrapper‹G›.Mk(42)) drop their type arguments at parse time (Meta.lean:101, 304-326). The elaboratedTerm.appcarries no type arguments; the type checker re-infers them and stores them onTypedTerm.app : Global → Array Typ → .... So the polymorphic-name dispatch problem (§5) is a TypedTerm phenomenon, not aTermphenomenon.The verified compiler operates from
Termonwards.Meta.leanis an untrusted parser (§1.7).2.6 Latent bugs and representational gaps
These are the cases where source and Rust diverge today, with disposition for each. Every one is latent in the existing test corpus — the corpus passes today because it doesn't exercise the buggy case. Each fix lands together with a test that exercises it.
Interpret.lean:530vsexecute.rs:62-67ioSetInfooverwrites existing key viaHashMap.insert; Rust panics onEntry::Occupied.Interpret.lean:533-537vsexecute.rs:228-235ioReadusesArray.extract(silent short read on OOB); Rust slices (panic on OOB).Interpret.lean:199-206, 260-263andexecute.rs:152-171(fun, args)calls and skip body execution on a hit, eliding any IO effects inside. Caches are consistent between source and Rust today, so the existing cross-check passes.Interpret.lean— the cache is part of its debug-interpreter contract and stays. The separately-createdIx/Aiur/Source/Eval.lean(M0) andIx/Aiur/Bytecode/Eval.lean(M1) are cache-free reference semantics. Production Rust keeps its cache; a follow-up theorem can prove "cache-having Rust ≈ cache-free reference" under a purity assumption on IO.Compiler/Simple.lean:22decisionToTermreturnsnoneon non-exhaustive match, bubbling tounreachable!.CheckError;Simple.Term(§3.4) never carries an "impossible" leaf.Concretize.lean:361Interpret.lean:86, 391-398vsexecute.rs:36-40, 173-205Storeis one global index space holding structuredValues; Rustmemory_queriesis bucketed by width holding flattenedVec<G>. Pointer values diverge under.ptrVal(Interpret.lean:412).grep ptr_val Tests/Aiurfinds nothing — fully latent.Value.pointer : Nat → Nat → Valueand runtime types at store time; that only becomes available when the reference evaluator moves fromSource.TermtoConcrete.Term(i.e.,Concrete/Eval.lean).Interpret.leanis not touched. M3 also adds the firstptr_val-using test.Compiler/Lower.lean:31-40, 575and:45-73Layout.gadget/GadgetLayoutexist andgadgetIdxis threaded throughlayoutMap, but never populated. Stale scaffolding for an unshipped feature.Two notes on prior misdiagnoses:
- .load returning vs[0]! (Interpret.lean:407) is not data loss. Source stores a structured value wrapped in a single-element array (store.insert #[v] () at :397); vs[0]! unwraps the wrapper. The divergence with Rust is representational (item 6), not data loss.
- dataTypeFlatSize (Interpret.lean:565-568) has no cycle check. Direct datatype recursion (without .pointer indirection) diverges. Today's programs use .pointer for recursion, so the bug is latent. Fix in M2 by adding a visited set, mirroring Layout.DataType.size:49-57.

2.7 unreachable! sites: where invariants leak through

26 sites total — 24 in Compiler/**, 2 in Check.lean. Each is a pass relying on an invariant established by an earlier pass but not tracked in the type. These are exactly the places where staged datatypes (§3) pay off; in the new design each becomes impossible by construction.
The most informative ones:
- Compiler/Simple.lean:22 — match compiler non-exhaustive (M0 bug 4 makes this a CheckError instead).
- Compiler/Lower.lean:135 — .ret in toIndex (non-tail position).
- Compiler/Lower.lean:136 — .match in toIndex. Comment says "Non-tail match not yet implemented"; after the recent commit, this is still reached for matches outside let-RHS chains. See §3.4.
- Compiler/Lower.lean:161, 468 — non-trivial .let after simplify.
- Compiler/Lower.lean:194, 207, 575 — .ref/.app/.gadget resolving to a non-callable layout.
- Compiler/Lower.lean:502, 519, 586, 599 — non-.var subpatterns.
- Compiler/Layout.lean:41, 93 — unresolved .mvar after concretize.
- Compiler/Match.lean:117-118, 215 — popNamespace and patternToPattern .pointer impossibilities.
- Check.lean:595, 604 — constructor / datatype lookups already resolved by the type checker.

3. Staged Datatypes
The plan introduces a sequence of typed IRs. Each pass is a total function between adjacent stages, and the stages shrink monotonically: constructors disappear when the prior pass has eliminated them. This kills unreachable! arms by construction.

3.1 Pipeline of stages
All four *.Term types live under Ix/Aiur/<Stage>/Term.lean with matching namespaces: Aiur.Source.Term, Aiur.Typed.Term, Aiur.Simple.Term, Aiur.Concrete.Term. The rename of the current Ix/Aiur/Term.lean → Ix/Aiur/Source/Term.lean and Ix/Aiur/TypedTerm.lean → Ix/Aiur/Typed/Term.lean is part of M3.

3.2 Stage 1 — Source.Term
Today's Ix/Aiur/Term.lean:69 Term, renamed in M3. Full pattern language (.or, .pointer, nested). Types carry .mvar and .app with parameters. As §2.5 notes, this is post-elaboration; surface macros never appear here, and Source.Term.app carries no type arguments.
Refactor: flatten the Data mutual child into Source.Term.tuple/Source.Term.array constructors. Mechanical; cleaner induction principles.

3.3 Stage 2 — Typed.Term
Today's Ix/Aiur/TypedTerm.lean:48 TypedTerm, renamed in M3. Each subterm carries typ : Typ and escapes : Bool. Same pattern shapes as Stage 1; the type checker has inferred type arguments and stored them on app and ref.
Design choice. Keep fielded typing for v1 — a plain inductive annotated with types. An indexed alternative — inductive Typed.Term : Ctx → Typ → Type — would make Lower.expectIdx a rfl and turn every type invariant into a type checker obligation rather than a lowering pass obligation. The win is huge for later proofs but the checker has to produce dependently-typed output. Defer to v2; revisit only if M5/M6 stall.

3.4 Stage 3 — Simple.Term (NEW)
Produced by the (rewritten) match compiler. Grammar:
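The grammar is not pinned down by this plan; as a sketch only, with every constructor name hypothetical and only the §3.4 invariants taken as given, the stage could look like:

```lean
-- Sketch of a possible Simple.Term shape; all names here are illustrative,
-- not the plan's final grammar. Local/Global stand in for the real identifiers.
abbrev Local  := Nat
abbrev Global := String

inductive Pat where
  | wildcard
  | var  (l : Local)                       -- bind a Local
  | ctor (c : Global) (args : List Local)  -- flat constructor pattern
  -- no .or, no .pointer, no nesting: simplify eliminated them by construction

inductive Term where
  | var    (l : Local)
  | let_   (pat : Pat) (rhs : Term) (body : Term)          -- pat is a var or wildcard
  | match_ (scrutinee : Local) (arms : List (Pat × Term))  -- scrutinee is a Local
  | app    (f : Global) (args : List Local)
  | ret    (l : Local)
```

Because the "impossible" leaf constructor is gone, a non-exhaustive decision tree simply cannot be expressed at this stage.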
Type-level invariants:
- No .or subpatterns (the match compiler eliminated them).
- No .pointer subpatterns (lowered to .let plus a load).
- Locals: .let binds only a Local or is a wildcard; the .match scrutinee is a Local.
Non-tail match scope.
Lower.lean:136 still throws "Non-tail match not yet implemented" whenever .match appears in a value-producing position other than a .let RHS chain. The recent non-tail-match commit added support only via findNonTailMatch (Lower.lean:347-367), which walks chains of .let (.var _)/.let .wildcard and lifts an inner match into a Ctrl.matchContinue. Matches inside tuples, arrays, function arguments, etc. still throw.
For v1 of the plan, Simple.Term's .match admits the same restriction: it appears only in let-RHS positions. Generalizing this is an additive milestone, not in scope here.
Simplify consumes Typed.Term, not Source.Term. Today's checkAndSimplify (Compiler/Simple.lean:38-54) runs checkFunction twice — once on the raw Term (discarding the result, only used to surface errors at original source positions), then simplify on the untyped Term, then checkFunction again to produce the final TypedTerm. The double pass goes away when the match compiler is rewritten to consume Typed.Term and produce Simple.Term directly, threading types through the decision tree.

3.5 Stage 4 — Concrete.Term (NEW)
Produced by concretize. Same constructors as Simple.Term but over a Concrete.Typ with no .mvar and no parametric .app: Concrete.Function has no params. Every parametric decl has been specialised and renamed via concretizeName (Concretize.lean:43). The lowering pass toBytecode consumes only Concrete.Term — typSize (Layout.lean:23) no longer has an .mvar arm to reject.

3.6 Stage 5 — Bytecode.Toplevel
Already at Bytecode.lean:88. Two cleanups:
- Make the partial defs on Block/Ctrl total via termination_by on a structural subterm (Dedup.lean:30-92, Layout.lean:192, the Lower.lean walks).
- FunctionLayout is non-semantic; the bytecode evaluator ignores it. Keep it on Function but the semantic relation is blind to it.

3.7 Stage 6 — Post-dedup Bytecode.Toplevel
Same datatype, post-deduplicate. The proof obligation is bisimulation preservation: for every original funIdx, the original toplevel at funIdx observes the same behavior as the deduplicated toplevel at remap funIdx.

4. Reference Evaluators
Both evaluators are new Lean code. Both are total, fuel-indexed (call-only accounting), error-returning, and cache-free — no callCache, no function_queries. Caching is an orthogonal optimization layer that the production Rust executor uses for multiplicity bookkeeping in the STARK trace.

4.1 Source reference evaluators — Source/Eval.lean, Concrete/Eval.lean
The proof-bearing source reference lives in new files, separate from the debug interpreter Interpret.lean (§2.2). There are two stages:
- Ix/Aiur/Source/Eval.lean (M0) — reference evaluator on Source.Term. Used by M1's cross-test to validate against the Lean bytecode reference and Rust, before the stage refactor lands.
- Ix/Aiur/Concrete/Eval.lean (M3) — reference evaluator on Concrete.Term. This is the form that participates in compile_preservation. Supersedes Source/Eval.lean in the proof chain once M3 is done.
Both are fresh implementations. They may mirror Interpret.lean's structure for diffability, but they do not share code — Interpret.lean persists unchanged in its debug role, and keeping the two evaluators independent avoids accidental entanglement of debug features with the spec.
Design common to both (M0 and M3 forms).
- Cache-free: the production Rust executor keeps function_queries for STARK multiplicity tracking; that is not part of the semantic model. The reference evaluators always re-execute a body.
- Errors: Except.error SourceError with a small tag enum; no call-stack construction. Debugging is Interpret.lean's job.
- Fuel: a fuel : Nat parameter decrements only at applyGlobal. Intra-body recursion is structural on Term/Concrete.Term. Exhaustion returns Except.error .outOfFuel.
- Termination measures: matchPattern structural on Pattern; interp/applyGlobal lexicographic on (fuel, term); flattenValue/unflattenValue structural on Value/Typ; typFlatSize/dataTypeFlatSize with a visited : HashSet Global argument (mirroring Layout.DataType.size:49-57).
- Error cases: ioSetInfo on an existing key, ioRead OOB, pattern-match failure, type mismatch, .ret outside a function body — all produce Except.error.
M0 form (on Source.Term). Signature:
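The exact signature is not fixed by this document; a plausible shape, with SourceError and the result tuple taken as assumptions, is:

```lean
-- Sketch only: a fuel-indexed, cache-free, error-returning evaluator on
-- Source.Term. `SourceError` and the argument order are assumptions, not
-- settled API; the names follow the §1.1 theorem statement.
def Source.Eval.runFunction
    (decls : Decls)          -- all source declarations
    (name  : Global)         -- entry function
    (args  : List Value)     -- structured argument values
    (io    : IOBuffer)       -- IO state, threaded through
    (fuel  : Nat)            -- decremented only at applyGlobal
    : Except SourceError (Value × IOBuffer) :=
  sorry
```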
Covers every Source.Term constructor. Does not solve the pointer representation divergence (§2.6 item 6): source pointers are still a single global counter, so a program that uses ptr_val across different flat widths would disagree with Rust. No current test triggers this.
M3 form (on Concrete.Term).
With types available at runtime, fix the pointer representation and align with Rust's bucketed memory:
- Value.pointer : Nat → Nat → Value carrying (width, index).
- flattenValue on .pointer w i = #[.ofNat i] (the width is type-level, never in the flat encoding).
- Store : IndexMap Nat (IndexMap (Array G) Unit) — width buckets keyed over flat Array G. Matches Rust's memory_queries exactly.
- .store t computes the flat width of t.typ via typFlatSize, flattens t's value, inserts into the right bucket.
- .load ptr reads the width from the typed .load node, dispatches to the right bucket, unflattens via the element type.
- .ptrVal (if kept) returns the within-bucket index as a field element. Source and bytecode now agree on pointer numeric values. (See the discussion of whether to keep ptrVal at all in §10 open decisions.)
After M3,
Concrete.Eval and Bytecode.Eval produce identical IOBuffers and identical flat results on every program.
Signature:
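Again as an assumption-laden sketch — Concrete.Decls and the error type are placeholders; only the bucketed store shape comes from the list above:

```lean
-- Width-bucketed store from §4.1: width ↦ set of flat values, mirroring
-- Rust's memory_queries. IndexMap/G are the document's own types.
abbrev Store := IndexMap Nat (IndexMap (Array G) Unit)

-- Sketch only: same contract as the M0 form, but over Concrete.Term, so
-- .store/.load can consult types at runtime.
def Concrete.Eval.runFunction
    (decls : Concrete.Decls) (name : Global)
    (args : List Value) (io : IOBuffer) (fuel : Nat)
    : Except SourceError (Value × IOBuffer) :=
  sorry
```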
4.2 Bytecode reference evaluator — NEW, Ix/Aiur/Bytecode/Eval.lean

Mirrors src/aiur/execute.rs in big-step style: mutually recursive evalBlock/evalCtrl/evalOp, total, fuel-indexed. memory has one bucket per width in Toplevel.memorySizes (Bytecode.lean:90), mirroring Rust's QueryRecord.memory_queries (execute.rs:36-40). Each Op.store values uses values.size as the width key.
Differences from execute.rs:
- No QueryRecord. Drop multiplicities, byte queries, per-function call queries — trace-side bookkeeping.
- No unconstrained branching. Both branches of every if unconstrained produce the same value; they differ only in whether a query is logged.
- Fuel decrements only at Op.call.
- Panics become Except.error.

4.3 The equivalence relation
Ix/Aiur/Semantics/Relation.lean (NEW): ValueEq is a propositional mirror of flattenValue. The equivalence proof for Lower (M6) goes by induction on Concrete.Term and maintains the local invariant that the bytecode map slice at the compiled indices contains the flattened form of every live binding.

5. The Simulation Invariant
5.1 StructCompatible

The simulation between source and bytecode must thread through every sub-call. We split this into a structural part — pure facts about the compilation output, provable as a standalone lemma — and a semantic part, established by simultaneous induction with compile_preservation.
This lemma is proved standalone, by induction on the compilation passes.
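A sketch of what the structural part might contain — the field names and the helpers CompiledToplevel, inputSize, and flatInputSize are invented for illustration; the §5.2 uses (look up funIdx, equate flat input sizes, bound the callee index) are what is grounded:

```lean
-- Sketch only. Pure facts about the compilation output; no evaluation involved.
structure StructCompatible (decls : Decls) (ct : CompiledToplevel) : Prop where
  -- every source function resolves to a compiled index
  lookup_total : ∀ name, decls.contains name →
    ∃ idx, ct.getFuncIdx name = some idx
  -- compiled indices are in range for the bytecode function table
  idx_bound : ∀ name idx, ct.getFuncIdx name = some idx →
    idx < ct.bytecode.functions.size
  -- the compiled function's flat input size matches the flattened source arity
  arity_flat : ∀ name idx, ct.getFuncIdx name = some idx →
    ct.bytecode.functions[idx]!.inputSize = flatInputSize decls name
```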
5.2 The full simulation pattern
The semantic clause — for every compiled function pair, source and bytecode bodies produce matching observations — references the theorem we're proving. So it cannot be proved standalone; it is proved simultaneously with compile_preservation by induction on fuel.
compile_preservation (see the full signature in §1.2) goes by strong induction on fuel. At the outer call site, we use StructCompatible (§5.1) to look up funIdx, equate flat input sizes, and bound the callee index. At every inner Op.call, the induction hypothesis at fuel - 1 discharges the recursive equivalence.
This is CompCert's match_states/fsim_match_states pattern. The adaptation uses big-step semantics with fuel as the well-founded measure, but the structure is identical: StructCompatible plays the global part of match_states, and the per-state "register file matches flattened env" condition (§4.3) plays the local part.

5.3 Fuel discipline
Claim: the fuel transformer φ at every pass is id. Source and bytecode share a single fuel budget 1-for-1. Why:
- simplify touches the pattern language; it never adds or removes calls.
- concretize is name substitution: every polymorphic call becomes exactly one monomorphic call.
- lower (Compiler/Lower.lean:177-207) compiles every .app to exactly one Op.call.
- dedup merges bisimilar bodies but leaves call sites intact.
- needsCircuit toggles a flag the evaluator ignores.
For every source evaluation that makes k calls before terminating, the compiled bytecode makes exactly k calls. The two terminate at the same fuel bound.
If a future pass blows up calls (e.g., inlining), introduce a distinct φ only for that pass — do not weaken the global invariant.

6. The Proof
6.1 Pass-by-pass obligations
For each compiler pass p : StageN → StageN+1, prove two companion lemmas:
The two compose by transitivity of InterpResultEq for preservation and by bind chaining on Except for progress. Each per-pass progress lemma is typically ≤ 30% of the corresponding preservation proof: the case analysis is identical, only the conclusion changes.

6.2 Proof order — easy to hard
1. needsCircuit semantic irrelevance. Sets a constrained flag the bytecode evaluator ignores. One-liner, absorbed into M9 as a trivial rfl-level step — no milestone of its own.
2. deduplicate preservation. Bytecode-only pass; bisimulation over Block up to call-index renaming, with cycles handled by well-founded induction on fuel. (M4)
3. toBytecode (Lower). The largest single proof. Source is Concrete.Term; target is a flat Block with a map : Array G register file. Local invariant: the bytecode map slice at the compiled indices contains the flattened form of every live binding. (M5 straight-line, M6 full)
4. concretize. Values are type-erased; substitution changes types but not value denotation. Short proof. (M7)
5. simplify / match compiler. Adapt Maranget's correctness proof; CompCert Cminorgenproof and CakeML pat_to_dec are the reference adaptations. Hardest single proof. (M8)
6. Checker soundness: the type annotations on Typed.Term must agree with what the evaluator actually produces — otherwise Lower cannot trust typSize during layout. This is M9.5; full statement in §1.3 and §8. With intrinsic typing (v2 of §3.3) the milestone vanishes because well-typedness becomes structural.
The order and dependencies above are realized as a concrete milestone schedule in §8. How the proof obligations sequence into parallel work — and where each one sits on the critical path — is captured by the DAG in §8.
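The §6.1 companion-lemma shape can be sketched as follows — StageN, StageN1, evalN, WellFormedN, and InterpResult are all hypothetical stand-ins for the per-pass instantiations:

```lean
-- Sketch of the two obligations every pass carries, under assumed names.
variable {StageN StageN1 Error InterpResult : Type}
variable (p : StageN → Except Error StageN1)
variable (InterpResultEq : InterpResult → InterpResult → Prop)
variable (evalN  : StageN  → List Value → IOBuffer → Nat → InterpResult)
variable (evalN1 : StageN1 → List Value → IOBuffer → Nat → InterpResult)

-- Progress: well-formed input compiles.
def ProgressLemma (WellFormedN : StageN → Prop) : Prop :=
  ∀ t, WellFormedN t → ∃ t', p t = .ok t'

-- Preservation: any successful output is observationally equivalent.
def PreservationLemma : Prop :=
  ∀ t t', p t = .ok t' →
    ∀ args io fuel,
      InterpResultEq (evalN t args io fuel) (evalN1 t' args io fuel)
```

Composing PreservationLemma instances pass-by-pass is exactly the transitivity chain M9 performs; ProgressLemma instances chain by bind on Except.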
7. Refactoring Work (Pre-Proof)
7.1 File layout
7.2 Killing partial

Grep finds 92 partial defs across Ix/Aiur/**. They fall into three termination categories, each handled differently in M2:
1. Structural recursion on the term / block / type tree. The bulk (80+). Become def ... termination_by on the obvious structural measure. The pain point is Lean's derived sizeOf not seeing through nested Array.map/foldl; expect ~1 AI-day on a small Ix/Array/Induction.lean helper library before the rest can be discharged.
2. Cross-function recursion in the reference evaluators. Today's pattern Interpret.lean:251 applyGlobal → :279 interp → applyGlobal is genuinely unbounded. The reference evaluators (Source/Eval.lean, Concrete/Eval.lean) mirror this structure and handle it with an explicit fuel parameter that strictly decreases on every applyGlobal, plus a termination proof on (fuel, term) lexicographic.
3. Genuinely non-trivial termination. Two cases:
   - Concretize.lean:361 TypedDecls.concretize — worklist over newly discovered instantiations, no finiteness guarantee today. The M0 static check rejects polymorphic recursion; termination of the worklist then follows by structural induction on the type-argument DAG.
   - matchPattern through .pointer recurses into the store, but is bounded by pattern depth (the recursion sees a structurally smaller pattern at every step). Already def (total) in Interpret.lean:138; the reference evaluators reuse the same approach.
Disposition by file:
- Check.lean — made total in M2 (category 1).
- Compiler/** — made total in M2 (Concretize in category 3).
- Compiler.lean — made total in M2.
- Interpret.lean utilities (flattenValue, unflattenValue, typFlatSize, dataTypeFlatSize) — moved to Semantics/Flatten.lean in M2 and made total there (category 1, with a visited set for dataTypeFlatSize).
- Interpret.lean core (interp, applyGlobal, applyLocal, ppValue, natToHexGo) — stays partial. Debug tool, may loop.
- Meta.lean — stays partial. Parser, not proof-load-bearing.
After M2, the proof stack is partial-free. The 8 remaining partials (5 in Interpret.lean, 3 in Meta.lean) participate in no proof.

7.3 Incidental refactors
These land alongside their driving milestones but are worth calling out so reviewers know they're planned:
- Fold the Data/TypedData mutual children into direct Source.Term.tuple/Source.Term.array constructors (and Typed.Term equivalents). Cleaner induction principles. Lands with the M3 rename.
- Split Ix/Aiur/Protocol.lean into Protocol.lean (keeps AiurSystem, Proof, FRI parameters, buildClaim) and a new Ix/Aiur/Bytecode/ExecuteFfi.lean (keeps the opaque Rust executor). Makes room for Bytecode/Eval.lean (M1) to stand alone.
- Drop the double checkFunction pass. Today checkAndSimplify (Compiler/Simple.lean:38-54) runs the checker twice because simplify operates on untyped Term. Once Match.lean consumes Typed.Term (M3), the second check disappears.
8. Milestones
Time unit: AI-day. One AI-day is a focused interactive day driving a Lean proof session, including edit/elaborate/iterate cycles. It is not equivalent to an engineer-day: compile latency and proof iteration depth dominate, and much of the "thinking" time is compressed or amortized across parallel attempts.
Dependencies form the DAG in §8.1 below. Headline numbers:
M3, M6, M8, and M9.5 are the four milestones most likely to slip. M9.5 is on the critical path and mandatory — skipping it would leave a hole in the simulation proof at M6 (Lower consults typSize on type annotations that nothing guarantees are accurate).

8.1 Dependency graph
- M4 needs Bytecode.Eval (M1) and totality (M2), but not the staged datatypes. It starts as soon as M2 is done, running in parallel with M3.
- The pass proofs M5–M8 need M3's staged datatypes and typed reference evaluator (Concrete.Eval). They run concurrently after M3.
- M9 additionally needs StructCompatible (the standalone invariant from §5.1).
- M9.5 proves checkAndSimplify sound — every type annotation in its output is semantically accurate. Required by M6 (Lower consults typSize on annotations) and by M10 (WellFormed decomposition).
- M10 closes with compile_correct.
Each milestone below ships independently and keeps the tree green.
M0 — Semantic-parity fixes and source reference evaluator — ~1–2 AI-days
Two concerns land together in M0:
Part A: fix the latent bugs in §2.6. Each fix lives in whichever file originally contained the bug, and each lands with the test that exercises it — none of these bugs is observable on the existing corpus, so without new tests the PRs would be invisible to CI.
- Ix/Aiur/Interpret.lean:530 — ioSetInfo errors on an existing key. Test: set [0], then set [0] again.
- Ix/Aiur/Interpret.lean:533-537 — ioRead errors on OOB. Test: len=4 past data.size.
- Ix/Aiur/Compiler/Simple.lean:22 + Compiler/Match.lean — non-exhaustive match becomes a CheckError instead of bubbling to unreachable!.
- Ix/Aiur/Compiler/Concretize.lean:361
- Ix/Aiur/Compiler/Lower.lean:31-40, 575 + :45-73 — the stale Layout.gadget/GadgetLayout/gadgetIdx scaffolding (§2.6).
The two
Interpret.lean fixes keep Interpret.lean aligned with Rust on observable behavior. The callCache and stack trace are not touched — they are part of the debug interpreter's contract. The other three fixes are in the compiler pipeline, not in the interpreter.
Part B: create
Ix/Aiur/Source/Eval.lean. Fresh cache-free, stack-trace-free, fuel-indexed reference evaluator on Source.Term. Structure may mirror Interpret.lean for diffability but the two files share no code. §4.1 lists the design. partial is still permitted in this file through M0; totality comes in M2 (where Source/Eval.lean is one of the files made total, Interpret.lean is not).
Exit: every Part A fix has a test that fails on
main and passes on its branch. Ix/Aiur/Source/Eval.lean compiles and agrees with Interpret.lean on the full corpus (on value and IOBuffer) via a new cross-test added to Tests/Aiur/Common.lean. lake test green.

M1 — Lean-native bytecode reference — ~1–2 AI-days
Ix/Aiur/Bytecode/Eval.lean: total, fuel-indexed, error-returning, big-step, cache-free. Per-width memory buckets keyed from Toplevel.memorySizes. partial still allowed; totality in M2.
Extends the cross-test. Tests/Aiur/Common.lean:runTestCase already validates Interpret.lean against Rust (on value + IOBuffer). M0 added a check that Source.Eval agrees with Interpret.lean. M1 adds the fourth node: Bytecode.Eval against the other three.
The full cross-test at M1:
Every test asserts that the four agree on value and
IOBuffer. Any divergence is a ticket.
Expected divergence at M1: a program that calls an IO-effecting function twice with the same args would show Interpret.lean/Rust.executeFfi (cached, one write) disagreeing with Source.Eval/Bytecode.Eval (cache-free, two writes). No current test exercises this; if one does, it becomes a known divergence rather than a bug.
Exit: four-way agreement on the M0-extended corpus, or documented divergences with tickets.
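The four-way agreement check can be sketched as a single assertion — the runner names below are hypothetical wrappers, each returning a comparable (value, IOBuffer) result:

```lean
-- Sketch of the M1 cross-check; wrapper names are illustrative, not the
-- actual Tests/Aiur/Common.lean API.
def crossCheck (tc : TestCase) : IO Unit := do
  let r₁ := Interpret.run tc        -- debug interpreter (cached)
  let r₂ ← Bytecode.executeFfi tc   -- production Rust executor via FFI (cached)
  let r₃ := Source.Eval.run tc      -- M0 reference (cache-free)
  let r₄ := Bytecode.Eval.run tc    -- M1 reference (cache-free)
  -- value and IOBuffer must agree four ways; any divergence is a ticket
  assert! (r₁ == r₂ && r₂ == r₃ && r₃ == r₄)
```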
M2 — Totality — ~3–5 AI-days
Make every proof-load-bearing partial def total. Scope:
- Ix/Aiur/Check.lean (40 partials) — the largest mechanical lift.
- Ix/Aiur/Compiler/** (39 partials across Concretize, Dedup, Lower, Layout, Match, Simple).
- Ix/Aiur/Source/Eval.lean (the reference evaluator created in M0).
- Ix/Aiur/Bytecode/Eval.lean (the reference evaluator created in M1).
- Move the shared utilities out of Interpret.lean into a new Ix/Aiur/Semantics/Flatten.lean: flattenValue, unflattenValue, typFlatSize, dataTypeFlatSize. These are used by the reference evaluators and by the theorem statement, so they must be total. Add a visited : HashSet Global parameter to dataTypeFlatSize/typFlatSize mirroring Layout.DataType.size:49-57. Update Tests/Aiur/Common.lean and Interpret.lean to import from the new location.
Check.lean's 40partials through nested arrays need real iteration; budget ~1 AI-day onIx/Array/Induction.leanhelpers before the bulk can be discharged.Out of scope:
Ix/Aiur/Interpret.lean itself. After M2 it still contains partial def interp, partial def applyGlobal, partial def applyLocal, partial def ppValue, partial def natToHexGo (5 partials; matchPattern is already def). The core interpreter loop can loop on divergent programs — that is a feature of the debug tool, not a bug. The pretty printers are not semantic. None of these participate in any proof.
Exit:
grep -n "^partial def" Ix/Aiur/ returns only:
- Interpret.lean: interp, applyGlobal, applyLocal, ppValue, natToHexGo — 5 deliberate partials (debug interpreter core + pretty printers; matchPattern is already def).
- Meta.lean: 3 partials (parser, not proof-load-bearing).
Every other file in
Ix/Aiur/** is partial-free.

M3 — Stage datatypes and typed source reference — ~4–6 AI-days
The largest refactor milestone. Two coupled changes, each with sub-tasks:
Part A: staged datatypes.
- Rename Ix/Aiur/Term.lean → Source/Term.lean and Ix/Aiur/TypedTerm.lean → Typed/Term.lean; update every import.
- New Simple/Term.lean (§3.4) and Concrete/Term.lean + Concrete.Typ (§3.5).
- Rewrite Compiler/Match.lean to consume Typed.Term and produce Simple.Term directly (pure decision-tree compiler, see §9 risk 3).
- Retarget Compiler/Concretize.lean from TypedDecls → TypedDecls to Simple.Term → Concrete.Term.
- Retarget Compiler/Lower.lean (toIndex, TypedTerm.compile) to consume Concrete.Term.
- Drop the double checkFunction pass in Compiler/Simple.lean:38-54.
- Remove the unreachable!/Impossible/Should not happen sites in Compiler/** now ruled out by types (§2.7).
- Flatten the Data/TypedData mutual children into direct constructors (§7.3).
- Split Ix/Aiur/Protocol.lean into Protocol.lean + Bytecode/ExecuteFfi.lean (§7.3).
Part B: typed reference evaluator.
- Ix/Aiur/Concrete/Eval.lean, a fresh evaluator on Concrete.Term (§4.1 M3 form). Width-bucketed Store, Value.pointer (w, i), typed .store/.load/.ptrVal.
- A first ptr_val-using test to lock the new pointer semantics (skip if ptr_val is removed per §10 decision 3).
- Concrete/Eval.lean supersedes Source/Eval.lean in the proof chain; Source/Eval.lean stays around only as the M1 cross-test witness.
This is the milestone most likely to slip. Total code churn is on the order of 2,000–4,000 lines, spread across ~15 files. The budget assumes that M2 totality work has already smoothed out
Check.lean and Compiler/** — if any residual partial def blocks a type refactor, expect extra iteration.
Exit:
Toplevel.compile type-checks through the new stages; existing tests pass; grep -rn "unreachable!\|Impossible\|Should not happen" Ix/Aiur/Compiler/ returns ≤ a handful with documented reasons; Concrete.Eval and Bytecode.Eval agree pointer-for-pointer on every program in the cross-test.

M4 — Dedup soundness — ~1–2 AI-days — parallel with M5/M7
Preservation:
Bytecode.Eval.run t idx args io fuel ≈ Bytecode.Eval.run t.dedup.1 (t.dedup.2 idx) args io fuel. Progress: deduplicate is total (trivial). Bisimulation up to call-index renaming, cycles handled by well-founded induction on fuel.
Proofs/DedupSound.leancompiles.M5 — Lower proof, straight-line subset — ~2–3 AI-days
Preservation + progress for
Lower.toIndexon the fragment ofConcrete.Termexcluding.match, function calls,.store/.load, IO, and u8/u32 ops. Establishes the local simulation relation: bytecodemapslice at compiled indices = flattened bound value.Exit: lemma covers every non-control constructor.
M6 — Lower proof, full — ~3–5 AI-days — parallel with M8
Extend M5 to:
.matchin tail position..matchviaCtrl.matchContinue, only for the let-RHS form matchingfindNonTailMatch. Matches in other positions are excluded by theConcrete.Termshape established in M3..ret/Ctrl.return/Ctrl.yield.Op.call); theunconstrainedflag is semantically equivalent in both branches (§4.2)..store/.loadwith width-bucketed memory (M3 payoff).IOBufferclause of the main theorem is discharged here.bv_decideor a library of bitvector lemmas.Exit:
Lower.compile_preservationandLower.compile_progressfor everyConcrete.Term.M7 — Concretize soundness — ~1 AI-day — parallel with M5/M6
Preservation:
substInTypedTerm subst bodyhas the same value denotation asbodyunder the original type environment (type erasure). Progress: the M0 termination check bounds the worklist.Exit:
Proofs/ConcretizeSound.leancompiles.M8 — Match compiler soundness — ~3–5 AI-days — parallel with M6
Adapt Maranget's correctness proof for decision-tree pattern compilation. For every
Simple.Termproduced from aTyped.Termby the rewrittensimplifyTerm, the decision tree matches exactly when the original nested pattern would have matched, and binds the same locals to the same values.References: CompCert
Cminorgenproof, CakeMLpat_to_dec, Maranget2008. Largest single proof; iteration depth is the bottleneck.
Exit:
Proofs/SimplifySound.leancompiles.M9 — Top-level preservation — ~1 AI-day
Toplevel.compile_preservationfrom M4 + M6 + M7 + M8 by transitivity ofInterpResultEq, threading fuel /StructCompatible/ typing assumptions through the composition.Exit:
Proofs/CompilerPreservation.leancompiles.M9.5 — Checker soundness — ~3–5 AI-days
Prove that
checkAndSimplifyproduces semantically coherent annotations:This is the property
Lower.preservation(M6) needs when it consultstypSizeonTypedTerm.typ. Without it, the Lower proof has no grounds to claim its layout computation matches the runtime value.The proof is by induction on the
Check.leaninferTermstructure (40 mutual arms) plus the simplify pass (which preserves annotations through the match compiler). Heavy use of the M2 totality proofs for the recursion structure; each arm has a bespoke soundness argument.Required, not optional. Without M9.5, M6 has a hole: "type annotations are accurate" is an unjustified assumption, and the claim that Lower preserves semantics cannot be discharged. The full-stack theorem is a chain that breaks at this link.
If intrinsic typing is adopted (risk 1 / §3.3), M9.5 vanishes: the
Typed.Term : Ctx → Typ → Typedatatype makes coherence structural. This is the one path that avoids the work.Exit:
Proofs/CheckSound.leancompiles. M6 and M7 can citecheckAndSimplify_soundin their preservation proofs.M10 — Progress and
compile_correct — ~1–2 AI-days
- Define WellFormed (§1.3) in Proofs/WellFormed.lean. Since every conjunct is a computable .ok observation, no ghost predicates appear.
- The mkDecls and checkAndSimplify lemmas are trivial: their success is already hypothesized by WellFormed.
- compile_progress.
- compile_correct in Proofs/CompilerCorrect.lean.
Exit: the §1.1 theorem compiles, with no
sorry and no trusted components beyond Lean's kernel and the Rust FFI boundary.

9. Risks
- If M5/M6 stall, the intrinsic-typing alternative (inductive Typed.Term : Ctx → Typ → Type) becomes attractive. Budget ~3 AI-days to reboot the checker — the largest single contingency.
- runMatchCompiler is stateful (id supply, diagnostics). Pure versions are easier to reason about. Pure-ify in M3 when rewriting it to produce Simple.Term directly.
- Lean's derived sizeOf is finicky around nested Arrays. Budget ~1 AI-day in M2 for an Ix.Array.Induction helper library — already included in the M2 estimate but worth calling out.
- Interpret.lean:427-505 and the Rust gadgets use native bit operations. The Lower proof at M6 needs either bv_decide or a hand-built library of byte-level lemmas. Budget within M6.
- It is unknown whether the current TypedTerm supports a clean soundness proof or whether the checker has to be partially rewritten. If the latter, the Stage-2 intrinsic-typing rewrite (risk 1) may end up being cheaper overall — both obviate M9.5 and fix the same root cause. Decide after spiking the first few inferTerm arms.

10. Open Decisions
Sign-off needed before M0:
1. Intrinsic Typed.Term — now or v2. Recommend v2 (§3.3). If adopted now, M9.5 vanishes entirely (intrinsic well-typedness), at the cost of a dependently-typed checker rewrite. The plan currently assumes v1 (fielded typing) + M9.5.
2. ptrVal — keep or remove? Zero call sites today. Keeping it costs ~25 lines of proof work across the pipeline; removing it costs a handful of surface-language deletions. Recommend remove in M0 cleanup; the §4.1 M3 step and §2.6 item 6 become no-ops.
callCacheabsent from the Lean reference evaluators, kept inInterpret.leanand production Rust. §2.2, §4.1.CheckError. M0 bug fix 4..debugis transparent in the semantic relation. §1.7.WellFormedclause, not a checker obligation. §1.3.Simple.Termis restricted to let-RHS chains (matches the current compiler). §3.4.11. Appendix — Concrete Numbers
Verified by
grep over Ix/Aiur/**:
- 92 partial defs — per-file breakdown and disposition in §7.2.
- 26 unreachable!/throw "Impossible ..."/"Should not happen ..." sites — 24 in Compiler/**, 2 in Check.lean; Lower.lean alone has 20. Enumerated in §2.7.
- Estimated new lines: spec (StructCompatible + WellFormed) + ~5,000–7,000 (proofs).