Long-term roadmap for proving in Lean 4 that the Aiur compiler in Ix/Aiur/** translates source programs to bytecode without changing their observable behavior, where "observable" includes the IOBuffer.
1. The Theorem
1.1 What "verified compiler" means
The end goal is one theorem combining progress (every well-formed program compiles successfully) and preservation (whenever compilation succeeds, the output matches source semantics):
theorem Toplevel.compile_correct
(t : Toplevel) (hwf : WellFormed t) :
-- (a) Progress: compilation succeeds.
(∃ ct decls, t.mkDecls = .ok decls ∧ t.compile = .ok ct)
∧
-- (b) Preservation: any compilation output is semantically equivalent.
(∀ ct decls,
t.mkDecls = .ok decls →
t.compile = .ok ct →
∀ name funIdx (hname : ct.getFuncIdx name = some funIdx)
args io₀ fuel,
InterpResultEq decls ct.getFuncIdx
(Concrete.Eval.runFunction decls ⟨name⟩ args io₀ fuel)
(Bytecode.Eval.runFunction ct.bytecode funIdx
(Flatten.args decls ct.getFuncIdx args)
io₀ fuel))
where Flatten.args is the natural list-to-flat coercion, defined in Semantics/Flatten.lean alongside flattenValue:
def Flatten.args (decls : Decls) (funcIdx : Global → Option Nat)
(args : List Value) : Array G :=
args.foldl (fun acc v => acc ++ flattenValue decls funcIdx v) #[]
The two conjuncts are independent statements and both are load-bearing:
- (a) alone is vacuous for meaning. A compiler that produces a no-op
Bytecode.Toplevel satisfies (a) trivially.
- (b) alone is vacuous for usefulness. A compiler that always returns
.error satisfies (b) vacuously (the premise t.compile = .ok ct is never met).
Only the conjunction guarantees that well-formed source produces a compilation output and that output is behaviorally equivalent to the source.
(Because compile and mkDecls are deterministic functions, (a) and (b) together are logically equivalent to the single-quantifier form ∃ ct decls, ... ∧ InterpResultEq ..., but stating them as a conjunction makes the progress claim textually visible instead of hidden inside the existential.)
1.2 The two halves, named for the proof structure
For §5–§6 and §8 it's useful to name the halves. Conjunct (b) of §1.1 is Toplevel.compile_preservation; conjunct (a) is Toplevel.compile_progress. They are proved separately — preservation by simultaneous induction with the simulation invariant (§5.2), progress by composing per-pass lemmas (§6.1). compile_correct is compile_progress followed by compile_preservation applied to the witness from progress.
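The composition described here can be sketched in Lean; `Progress t` and `Preservation t` are hypothetical abbreviations for conjuncts (a) and (b) of §1.1, and the proof bodies are elided:

```lean
-- Sketch only: compile_correct as progress followed by preservation.
-- `Progress` / `Preservation` abbreviate conjuncts (a) and (b) of §1.1;
-- the real statement inlines them as in §1.1.
theorem Toplevel.compile_correct (t : Toplevel) (hwf : WellFormed t) :
    Progress t ∧ Preservation t := by
  have hprog : Progress t := Toplevel.compile_progress t hwf
  have hpres : Preservation t := Toplevel.compile_preservation t hwf
  exact ⟨hprog, hpres⟩
```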
1.3 WellFormed — the precondition for progress
def WellFormed (t : Toplevel) : Prop :=
(∃ decls, t.mkDecls = .ok decls) ∧
(∃ typedDecls, t.checkAndSimplify = .ok typedDecls) ∧
MonoTerminates t ∧
ConstrainedCallGraphAcyclic t
No ghost predicates. Every conjunct is a computable observation about t. A user proves WellFormed t by literally running mkDecls, checkAndSimplify, the monomorphization-termination check, and the call-graph acyclicity check, and observing that all four succeed. decide discharges it for any concrete program.
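For a concrete program the obligation is a closed decidable proposition; a hedged sketch, where `myToplevel` is a hypothetical program and the `Decidable` instances for the four conjuncts are assumed to be derived from the checks themselves:

```lean
-- Sketch: WellFormed for a concrete Toplevel is discharged by running
-- the four checks. `decide` (or `native_decide` for large programs)
-- evaluates mkDecls, checkAndSimplify, and the two static checks.
example : WellFormed myToplevel := by decide
```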
Each conjunct corresponds to exactly one range of passes:
| Passes | Success condition | WellFormed clause |
| --- | --- | --- |
| mkDecls, wellFormedDecls | no alias cycles, no duplicates, every type reference resolves | first ∃ |
| checkFunction ×2, simplifyTerm | every body well-typed, no non-exhaustive match | checkAndSimplify = .ok |
| concretize | no polymorphic recursion | MonoTerminates |
| toBytecode | no unresolved mvars; layouts computable | implied by concretize |
| deduplicate, needsCircuit | total | — |
| (simulation precondition) | constrained call graph is a DAG | ConstrainedCallGraphAcyclic |
Because checkAndSimplify is the bridge between Term and TypedTerm, its output is what downstream passes consume. The preservation proof relies on the type annotations in that output being semantically coherent (each annotated type agrees with what the term actually evaluates to). Proving that coherence — checker soundness — is M9.5. It is not optional: without it, the downstream preservation proofs (especially Lower, which consults typSize on the annotations) rest on an unjustified assumption.
1.4 What InterpResultEq actually says
InterpResultEq pins three observables:
- Return values agree under flattening.
flattenValue decls funcIdx v == gs for source value v and bytecode flat output gs. The flattening is today's Interpret.lean:574 definition, relocated to Ix/Aiur/Semantics/Flatten.lean and made total in M2.
- Final
IOBuffer is structurally equal. Both data : Array G and map : HashMap (Array G) IOKeyInfo (from Protocol.lean:43).
- Errors match.
Except.error _ on the source side iff Except.error _ on the bytecode side. v1 does not require message equality; v2 may refine.
1.5 Why the IOBuffer is in the theorem
A theorem that only related returned values would admit a compiler that silently dropped ioWrites or reordered ioSetInfos relative to computation. For a zkDSL where the IOBuffer is the public witness of a proof, that is catastrophic. Observability must include it.
1.6 Two scope clarifications about the statement
hname is surface-level. It hypothesizes that the entry function has a FunIdx — unavoidable, because the bytecode evaluator needs some funIdx to start at. It does not hypothesize that every recursive sub-call has a matching FunIdx. That property holds too, but is derived from the hypothesis t.compile = .ok ct via the StructCompatible invariant (§5.1), which is threaded through the simulation proof.
Fuel is passed through unchanged. Both evaluators decrement fuel only on function calls, never inside a single function body. Every compiler pass preserves the call count exactly (§5.3), so source and bytecode share a single fuel budget 1-for-1. The relation maps outOfFuel to outOfFuel.
1.7 What the plan does not prove
- Non-termination. Evaluators are fuel-indexed and total. Divergent programs are observed as
outOfFuel; nothing stronger is claimed.
- STARK / constraint-system soundness.
Compiler/Layout.lean computes selectors, auxiliaries, lookups; these affect trace generation only. STARK soundness is a separate theorem stacked on top of compile_correct.
- Rust executor ≡ Lean bytecode reference. The Lean evaluator we build (§4.2) is the specification; Rust's
src/aiur/execute.rs is a performance implementation cross-tested against it (§2.4, M1). A formal Rust↔Lean equivalence is left for a follow-up.
- Surface parser correctness.
Ix/Aiur/Meta.lean elaborates the surface language to Term. Macros like fold(i..j, init, |acc, @v| body) (Meta.lean:293-302) and template-call type-arg dropping (Meta.lean:101, 304-326) happen at parse time. The verified compiler operates from Term onwards. Parser bugs are out of scope, the same way CompCert excludes the C parser.
- debug observability. .debug nodes call dbg_trace; treated as transparent in the semantic relation.
- Pretty-printers, CLI, statistics, benchmarks.
- needsCircuit minimality. It's an optimization; semantics are indifferent.
- Apps consuming the verified compiler. Currently
Apps/ZKVoting/ uses Aiur via the Lean API. The theorem covers the compilation of the Aiur source they emit; their own correctness is separate.
2. The Code as It Stands
2.1 Compile pipeline
From Ix/Aiur/Compiler.lean:80-91:
Toplevel -- Ix/Aiur/Term.lean:143
├─ mkDecls (Check.lean:113) expand type aliases
├─ wellFormedDecls (Check.lean:695)
├─ checkFunction ×1 (Check.lean:743) Term → TypedTerm (pre-simplify)
├─ simplifyTerm (Simple.lean:12) match compiler + decisionToTerm
└─ checkFunction ×2 (Check.lean:743) TypedTerm post-simplify
↓
TypedDecls -- Ix/Aiur/TypedTerm.lean:77
└─ concretize (Concretize.lean:361) monomorphize; erase params/mvars
↓
TypedDecls (monomorphic)
└─ toBytecode (Lower.lean:622) + Layout.lean
↓
Bytecode.Toplevel -- Ix/Aiur/Bytecode.lean:88
├─ deduplicate (Dedup.lean:115) bisimulation / partition refinement
└─ needsCircuit (Compiler.lean:59) reachability over constrained calls
↓
CompiledToplevel
2.2 Source semantics — Ix/Aiur/Interpret.lean
Interpret.lean is the debug interpreter. Its role is fast interactive execution and error diagnosis, not proof reference. It has three features that the reference semantics deliberately omits:
- callCache : HashMap (Global × List Value) Value — memoizes (fun, args) to make repeated calls fast.
- Interrupt.error carrying a full call-stack trace — built up via callSite (:243-246), surfaced on any runtime error for debug messages like "in foo(1, 2)".
- partial recursion — can loop on non-terminating programs, which is fine for an interactive tool.
Structure:
- Value inductive (:10-21) — unit / field / tuple / array / ctor / fn / pointer.
- InterpState = { store, ioBuffer, callCache } (:199).
- InterpM = StateT InterpState (Except Interrupt) (:206).
- Entry:
runFunction decls funcName inputs ioBuffer (:646).
- 9
partial defs; cross-function recursion via applyGlobal.
The plan keeps Interpret.lean in this role. It is not refactored into the reference source semantics. The M0 bug fixes that align it with Rust (ioSetInfo overwrite, ioRead OOB) are applied here because the debug interpreter should agree with production on observable behavior, but the cache, stack trace, and partial machinery all stay. The proof-bearing reference semantics lives in separate new files: Ix/Aiur/Source/Eval.lean (M0) and Ix/Aiur/Concrete/Eval.lean (M3), introduced in §4.1.
2.3 Bytecode semantics — src/aiur/execute.rs (Rust)
- 436-line stack-machine interpreter over
Function.body : Block.
- Computes
QueryRecord (multiplicities, memory queries, byte queries) — trace-side artifacts, not part of the observable semantics we will formalize.
- The Lean side has no native bytecode evaluator yet;
Ix/Aiur/Protocol.lean:63-80 exposes Bytecode.Toplevel.execute as an opaque FFI to rs_aiur_toplevel_execute. Building a Lean-native reference is M1.
2.4 Existing test infrastructure
Tests/Aiur/Common.lean:79-103 runTestCase already implements a structural cross-check between the Lean source interpreter and the Rust bytecode executor:
- Run Rust:
compiled.bytecode.execute funIdx input inputIOBuffer.
- Run source interpreter:
Aiur.runFunction decls funcName inputs inputIOBuffer.
- Assert
flattenValue decls funcIdx output == execOutput and state.ioBuffer == execIOBuffer (Common.lean:73-76).
- Run the prover, verify the proof against the expected output.
Tests/Aiur/Aiur.lean (884 lines) drives this on ~70 programs: arithmetic, matches (tail and non-tail), enums with mutual recursion through pointers, parametric templates, fold macros, IO, byte/u32 ops, pattern destructuring.
Implications for the plan:
- M1 extends
runTestCase with a third party (the Lean-native bytecode reference); it does not create the cross-test from scratch.
- The latent bugs in §2.6 are precisely the cases the existing corpus doesn't exercise. M0 must add tests that trigger them; otherwise the fixes will be invisible to CI.
- The cross-check already verifies
IOBuffer equality between source and Rust. The verified-compiler theorem asks for the same property, proved instead of tested.
2.5 Surface language vs Term
Ix/Aiur/Term.lean:69 Term is the post-elaboration IR. The surface language users write is parsed by Ix/Aiur/Meta.lean, which handles a richer surface:
- fold(i..j, init, |acc, @v| body) is a parse-time macro (Meta.lean:293-302). It unrolls into a chain of let acc = ... using literal indices substituted for @v. There is no fold constructor in Term. @v outside fold is a parse error (Meta.lean:327-328).
- Template type/function calls (
Wrapper‹G›.Mk(42)) drop their type arguments at parse time (Meta.lean:101, 304-326). The elaborated Term.app carries no type arguments; the type checker re-infers them and stores them on TypedTerm.app : Global → Array Typ → .... So the polymorphic-name dispatch problem (§5) is a TypedTerm phenomenon, not a Term phenomenon.
The verified compiler operates from Term onwards. Meta.lean is an untrusted parser (§1.7).
2.6 Latent bugs and representational gaps
These are the cases where source and Rust diverge today, with disposition for each. Every one is latent in the existing test corpus — the corpus passes today because it doesn't exercise the buggy case. Each fix lands together with a test that exercises it.
| # | Location | Observation | Disposition |
| --- | --- | --- | --- |
| 1 | Interpret.lean:530 vs execute.rs:62-67 | Source ioSetInfo overwrites existing key via HashMap.insert; Rust panics on Entry::Occupied. | M0. Source: error on existing key. |
| 2 | Interpret.lean:533-537 vs execute.rs:228-235 | Source ioRead uses Array.extract (silent short read on OOB); Rust slices (panic on OOB). | M0. Source: error on OOB. |
| 3 | Interpret.lean:199-206, 260-263 and execute.rs:152-171 | Both interpreters memoize (fun, args) calls and skip body execution on a hit, eliding any IO effects inside. Caches are consistent between source and Rust today, so the existing cross-check passes. | M0 / policy. Not a "fix" to Interpret.lean — the cache is part of its debug-interpreter contract and stays. The separately-created Ix/Aiur/Source/Eval.lean (M0) and Ix/Aiur/Bytecode/Eval.lean (M1) are cache-free reference semantics. Production Rust keeps its cache; a follow-up theorem can prove "cache-having Rust ≈ cache-free reference" under a purity assumption on IO. |
| 4 | Compiler/Simple.lean:22 | decisionToTerm returns none on non-exhaustive match, bubbling to unreachable!. | M0. Non-exhaustive match becomes a CheckError; Simple.Term (§3.4) never carries an "impossible" leaf. |
| 5 | Concretize.lean:361 | Monomorphization worklist has no termination guarantee; polymorphic recursion at a strictly larger type diverges. | M0. Static check rejecting polymorphic recursion. |
| 6 | Interpret.lean:86, 391-398 vs execute.rs:36-40, 173-205 | Source Store is one global index space holding structured Values; Rust memory_queries is bucketed by width holding flattened Vec<G>. Pointer values diverge under .ptrVal (Interpret.lean:412). grep ptr_val Tests/Aiur finds nothing — fully latent. | M3. Requires Value.pointer : Nat → Nat → Value and runtime types at store time; that only becomes available when the reference evaluator moves from Source.Term to Concrete.Term (i.e., Concrete/Eval.lean). Interpret.lean is not touched. M3 also adds the first ptr_val-using test. |
| 7 | Compiler/Lower.lean:31-40, 575 and :45-73 | Layout.gadget / GadgetLayout exist and gadgetIdx is threaded through layoutMap, but never populated. Stale scaffolding for an unshipped feature. | M0 cleanup. Remove all three; simpler types for later proofs. |
Two notes on prior misdiagnoses:
- .load returning vs[0]! (Interpret.lean:407) is not data loss. Source stores a structured value wrapped in a single-element array (store.insert #[v] () at :397); vs[0]! unwraps the wrapper. The divergence with Rust is representational (item 6), not data loss.
- dataTypeFlatSize (Interpret.lean:565-568) has no cycle check. Direct datatype recursion (without .pointer indirection) diverges. Today's programs use .pointer for recursion so the bug is latent. Fix in M2 by adding a visited set, mirroring Layout.DataType.size:49-57.
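The M2 fix can mirror that shape; a hedged sketch, where the Decls lookup API, the tag-plus-widest-payload encoding, and the visited-set threading through typFlatSize are all assumptions:

```lean
-- Sketch (M2): dataTypeFlatSize with a visited set. A datatype seen
-- again without intervening .pointer indirection is direct recursion;
-- return none instead of diverging, mirroring Layout.DataType.size.
def dataTypeFlatSize (decls : Decls) (g : Global)
    (visited : List Global := []) : Option Nat :=
  if g ∈ visited then
    none  -- cycle without pointer indirection: reject
  else do
    let ctors ← decls.constructorsOf? g  -- hypothetical lookup
    let sizes ← ctors.mapM fun c =>
      c.fields.foldlM
        (fun acc t => return acc + (← typFlatSize decls t (g :: visited))) 0
    return 1 + sizes.foldl max 0  -- tag slot + widest payload
```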
2.7 unreachable! sites: where invariants leak through
26 sites total — 24 in Compiler/**, 2 in Check.lean. Each is a pass relying on an invariant established by an earlier pass but not tracked in the type. These are exactly the places where staged datatypes (§3) pay off; in the new design each becomes impossible by construction.
The most informative ones:
- Compiler/Simple.lean:22 — match compiler non-exhaustive (M0 bug 4 makes this a CheckError instead).
- Compiler/Lower.lean:135 — .ret in toIndex (non-tail position).
- Compiler/Lower.lean:136 — .match in toIndex. Comment says "Non-tail match not yet implemented"; after the recent commit, this is still reached for matches outside let-RHS chains. See §3.4.
- Compiler/Lower.lean:161, 468 — non-trivial .let after simplify.
- Compiler/Lower.lean:194, 207, 575 — .ref / .app / .gadget resolving to a non-callable layout.
- Compiler/Lower.lean:502, 519, 586, 599 — non-.var subpatterns.
- Compiler/Layout.lean:41, 93 — unresolved .mvar after concretize.
- Compiler/Match.lean:117-118, 215 — popNamespace and spatternToPattern .pointer impossibilities.
- Check.lean:595, 604 — constructor / datatype lookups already resolved by the type checker.
3. Staged Datatypes
The plan introduces a sequence of typed IRs. Each pass is a total function between adjacent stages, and the stages shrink monotonically: constructors disappear when the prior pass has eliminated them. This kills unreachable! arms by construction.
3.1 Pipeline of stages
Source.Term          Source IR      Stage 1 (rename of current Term)
    ↓ check
Typed.Term           Typed IR       Stage 2 (rename of current TypedTerm)
    ↓ simplify
Simple.Term          Match-free     Stage 3 (NEW)
    ↓ concretize
Concrete.Term        Monomorphic    Stage 4 (NEW)
    ↓ lower
Bytecode.Toplevel    Flat IR        Stage 5
    ↓ dedup
Bytecode.Toplevel    Bisimilar      Stage 6
All four *.Term types live under Ix/Aiur/<Stage>/Term.lean with matching namespaces: Aiur.Source.Term, Aiur.Typed.Term, Aiur.Simple.Term, Aiur.Concrete.Term. The rename of the current Ix/Aiur/Term.lean → Ix/Aiur/Source/Term.lean and Ix/Aiur/TypedTerm.lean → Ix/Aiur/Typed/Term.lean is part of M3.
3.2 Stage 1 — Source.Term
Today's Ix/Aiur/Term.lean:69 Term, renamed in M3. Full pattern language (.or, .pointer, nested). Types carry .mvar and .app with parameters. As §2.5 notes, this is post-elaboration; surface macros never appear here, and Source.Term.app carries no type arguments.
Refactor: flatten the Data mutual child into Source.Term.tuple / Source.Term.array constructors. Mechanical; cleaner induction principles.
3.3 Stage 2 — Typed.Term
Today's Ix/Aiur/TypedTerm.lean:48 TypedTerm, renamed in M3. Each subterm carries typ : Typ and escapes : Bool. Same pattern shapes as Stage 1; the type checker has inferred type arguments and stored them on app and ref.
Design choice. Keep fielded typing for v1 — a plain inductive annotated with types. An indexed alternative — inductive Typed.Term : Ctx → Typ → Type — would make Lower.expectIdx a refl and turn every type invariant into a type checker obligation rather than a lowering pass obligation. The win is huge for later proofs but the checker has to produce dependently-typed output. Defer to v2; revisit only if M5/M6 stall.
3.4 Stage 3 — Simple.Term (NEW)
Produced by the (rewritten) match compiler. Grammar:
Simple.Pattern ::= .wildcard
| .field G
| .ref Global (Array Local)
| .tuple (Array Local)
| .array (Array Local)
Simple.Term ::= ... leaves as in Typed.Term ...
| .letVar Local Simple.Term Simple.Term
| .letWild Simple.Term Simple.Term
| .letLoad Local Local Simple.Term
| .match Local (Array (Simple.Pattern × Simple.Term))
(Option Simple.Term)
Type-level invariants:
- No
.or subpatterns (match compiler eliminated them).
- No
.pointer subpatterns (lowered to .letLoad).
- No nested subpatterns — pattern children are bare
Locals.
- .let binds only a Local or is a wildcard.
- .match scrutinee is a Local.
- No non-exhaustive matches (after M0 fix 4).
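As a Lean inductive, the grammar and invariants above become structural; a sketch (leaf constructors elided as in the grammar, the keyword match escaped as «match», field names hypothetical):

```lean
-- Sketch of Stage 3. The invariants hold by construction: patterns have
-- no .or, no .pointer, and only bare Local children; let binds a Local
-- or nothing; the match scrutinee is a Local.
inductive Simple.Pattern where
  | wildcard
  | field    (g : G)
  | ref      (name : Global) (children : Array Local)
  | tuple    (children : Array Local)
  | array    (children : Array Local)

inductive Simple.Term where
  | letVar  (x : Local) (rhs body : Simple.Term)
  | letWild (rhs body : Simple.Term)
  | letLoad (dst src : Local) (body : Simple.Term)
  | «match» (scrut : Local)
            (arms : Array (Simple.Pattern × Simple.Term))
            (dflt : Option Simple.Term)
  -- ... leaves as in Typed.Term ...
```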
Non-tail match scope. Lower.lean:136 still throws "Non-tail match not yet implemented" whenever .match appears in a value-producing position other than a .let RHS chain. The recent non-tail-match commit added support only via findNonTailMatch (Lower.lean:347-367), which walks chains of .let (.var _) / .let .wildcard and lifts an inner match into a Ctrl.matchContinue. Matches inside tuples, arrays, function arguments, etc. still throw.
For v1 of the plan, Simple.Term's .match admits the same restriction: it appears only in let-RHS positions. Generalizing this is an additive milestone, not in scope here.
Simplify consumes Typed.Term, not Source.Term. Today's checkAndSimplify (Compiler/Simple.lean:38-54) runs checkFunction twice — once on the raw Term (discarding the result, only used to surface errors at original source positions), then simplify on the untyped Term, then checkFunction again to produce the final TypedTerm. The double pass goes away when the match compiler is rewritten to consume Typed.Term and produce Simple.Term directly, threading types through the decision tree.
3.5 Stage 4 — Concrete.Term (NEW)
Produced by concretize. Same constructors as Simple.Term but over a Concrete.Typ with no .mvar and no parametric .app:
Concrete.Typ ::= .unit | .field
| .tuple (Array Concrete.Typ)
| .array Concrete.Typ Nat
| .pointer Concrete.Typ
| .ref MonoGlobal
| .function (List Concrete.Typ) Concrete.Typ
Concrete.Function has no params. Every parametric decl has been specialised and renamed via concretizeName (Concretize.lean:43). The lowering pass toBytecode consumes only Concrete.Term — typSize (Layout.lean:23) no longer has an .mvar arm to reject.
3.6 Stage 5 — Bytecode.Toplevel
Already at Bytecode.lean:88. Two cleanups:
- Make
partial defs on Block / Ctrl total via termination_by on structural subterm (Dedup.lean:30-92, Layout.lean:192, Lower.lean walks).
- FunctionLayout is non-semantic; the bytecode evaluator ignores it. Keep it on Function but the semantic relation is blind to it.
3.7 Stage 6 — Post-dedup Bytecode.Toplevel
Same datatype, post-deduplicate. The proof obligation is bisimulation preservation: for every original funIdx, the original toplevel at funIdx observes the same behavior as the deduplicated toplevel at remap funIdx.
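Stated as a Lean theorem, the obligation reads roughly as follows (a sketch; the remap-returning signature of deduplicate is an assumption, and the proof body is elided):

```lean
-- Sketch (M4): dedup preserves per-function behavior up to the index
-- renaming `remap`. Proved by bisimulation on Block, with cycles
-- handled by well-founded induction on fuel.
theorem deduplicate_preserves
    (tl tl' : Bytecode.Toplevel) (remap : FunIdx → FunIdx)
    (h : deduplicate tl = (tl', remap)) :
    ∀ fi args io fuel, fi < tl.functions.size →
      Bytecode.Eval.runFunction tl fi args io fuel =
      Bytecode.Eval.runFunction tl' (remap fi) args io fuel := by
  sorry
```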
4. Reference Evaluators
Both evaluators are new Lean code. Both are total, fuel-indexed (call-only accounting), error-returning, and cache-free — no callCache, no function_queries. Caching is an orthogonal optimization layer that the production Rust executor uses for multiplicity bookkeeping in the STARK trace.
4.1 Source reference evaluators — Source/Eval.lean, Concrete/Eval.lean
The proof-bearing source reference lives in new files, separate from the debug interpreter Interpret.lean (§2.2). There are two stages:
- Ix/Aiur/Source/Eval.lean (M0) — reference evaluator on Source.Term. Used by M1's cross-test to validate against the Lean bytecode reference and Rust, before the stage refactor lands.
- Ix/Aiur/Concrete/Eval.lean (M3) — reference evaluator on Concrete.Term. This is the form that participates in compile_preservation. Supersedes Source/Eval.lean in the proof chain once M3 is done.
Both are fresh implementations. They may mirror Interpret.lean's structure for diffability, but they do not share code — Interpret.lean persists unchanged in its debug role, and keeping the two evaluators independent avoids accidental entanglement of debug features with the spec.
Design common to both (M0 and M3 forms).
- No cache. The cache is an optimization of the debug interpreter (and of Rust's
function_queries for STARK multiplicity tracking), not part of the semantic model. The reference evaluators always re-execute a body.
- No stack trace. Errors are
Except.error SourceError with a small tag enum; no call-stack construction. Debugging is Interpret.lean's job.
- Fuel-indexed, call-only accounting. A
fuel : Nat parameter decrements only at applyGlobal. Intra-body recursion is structural on Term / Concrete.Term. Exhaustion returns Except.error .outOfFuel.
- Total. Every recursive def carries an explicit termination measure (
matchPattern structural on Pattern, interp / applyGlobal lexicographic on (fuel, term), flattenValue / unflattenValue structural on Value / Typ, typFlatSize / dataTypeFlatSize with a visited : HashSet Global argument mirroring Layout.DataType.size:49-57).
- Errors return, never panic.
ioSetInfo on existing key, ioRead OOB, pattern-match failure, type mismatch, .ret outside a function body — all produce Except.error.
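The error type can be a flat tag enum; a sketch, where the constructor names are assumptions covering the failure modes just listed:

```lean
-- Sketch: SourceError carries no stack trace — one small tag per
-- failure mode. Debugging with full traces remains Interpret.lean's job.
inductive SourceError where
  | outOfFuel
  | ioKeyExists        -- ioSetInfo on an existing key (M0, §2.6 item 1)
  | ioReadOutOfBounds  -- ioRead past the buffer (M0, §2.6 item 2)
  | matchFailure       -- no arm matched, no default
  | typeMismatch
  | retOutsideFunction
  deriving Repr, DecidableEq
```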
M0 form (on Source.Term).
Signature:
Source.Eval.runFunction :
Decls → Global → List Value → IOBuffer → Nat →
Except SourceError (Value × IOBuffer)
Covers every Source.Term constructor. Does not solve the pointer representation divergence (§2.6 item 6): source pointers are still a single global counter, so a program that uses ptr_val across different flat widths would disagree with Rust. No current test triggers this.
M3 form (on Concrete.Term).
With types available at runtime, fix the pointer representation and align with Rust's bucketed memory:
- Value.pointer : Nat → Nat → Value carrying (width, index). flattenValue on .pointer w i = #[.ofNat i] (width is type-level, never in the flat encoding).
- Store : IndexMap Nat (IndexMap (Array G) Unit) — width buckets keyed over flat Array G. Matches Rust's memory_queries exactly.
- .store t computes the flat width of t.typ via typFlatSize, flattens t's value, inserts into the right bucket.
- .load ptr reads the width from the typed .load node, dispatches to the right bucket, unflattens via the element type.
- .ptrVal (if kept) returns the within-bucket index as a field element. Source and bytecode now agree on pointer numeric values. (See the discussion of whether to keep ptrVal at all in §10 open decisions.)
After M3, Concrete.Eval and Bytecode.Eval produce identical IOBuffers and identical flat results on every program.
Signature:
Concrete.Eval.runFunction :
Concrete.Decls → Global → List Value → IOBuffer → Nat →
Except SourceError (Value × IOBuffer)
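The bucketed store operation can be sketched as follows (the IndexMap API used here — get?, insert, size — is an assumption; the real names may differ):

```lean
-- Sketch (M3): width-bucketed store, mirroring Rust's memory_queries.
abbrev Store := IndexMap Nat (IndexMap (Array G) Unit)

-- .store: the flat width selects the bucket; the within-bucket index
-- becomes the pointer's numeric value, matching Rust.
def Store.storeFlat (s : Store) (flat : Array G) : Store × Nat :=
  let bucket := (s.get? flat.size).getD default
  (s.insert flat.size (bucket.insert flat ()), bucket.size)
```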
4.2 Bytecode reference evaluator — NEW, Ix/Aiur/Bytecode/Eval.lean
Mirrors src/aiur/execute.rs in big-step style: mutually recursive evalBlock / evalCtrl / evalOp, total, fuel-indexed.
structure EvalState where
map : Array G
memory : IndexMap Nat (IndexMap (Array G) Unit)
ioBuffer : IOBuffer
Bytecode.Eval.evalBlock :
Bytecode.Toplevel → Block → EvalState → Nat →
Except BytecodeError (Array G × EvalState)
memory has one bucket per width in Toplevel.memorySizes (Bytecode.lean:90), mirroring Rust's QueryRecord.memory_queries (execute.rs:36-40). Each Op.store values uses values.size as the width key.
Differences from execute.rs:
- No
QueryRecord. Drop multiplicities, byte queries, per-function call queries — trace-side bookkeeping.
- No call cache. §4.1 explanation applies symmetrically.
- No
unconstrained branching. Both branches of every if unconstrained produce the same value; they differ only in whether a query is logged.
- No stack machine. Direct big-step. Easier to reason about than small-step, and the two coincide here because the language is deterministic.
- Call-only fuel. Decrements only on
Op.call.
- Errors return, never panic. All failures
Except.error.
Bytecode.Eval.runFunction :
Bytecode.Toplevel → FunIdx → Array G → IOBuffer → Nat →
Except BytecodeError (Array G × IOBuffer)
4.3 The equivalence relation
Ix/Aiur/Semantics/Relation.lean (NEW):
inductive ValueEq (decls : Decls) (funcIdx : Global → Option Nat) :
Value → Typ → Array G → Prop
| unit : ValueEq .. .unit .unit #[]
| field : ValueEq .. (.field g) .field #[g]
| pointer : ValueEq .. (.pointer w i) (.pointer _) #[.ofNat i]
| fn : ValueEq .. (.fn g) _ #[.ofNat (funcIdx g).getD 0]
| tuple : ... -- concat of element encodings
| array : ...
| ctor : ... -- tag + args + padding, matching dataTypeFlatSize
def InterpResultEq decls funcIdx
(src : Except SourceError (Value × IOBuffer))
(bc : Except BytecodeError (Array G × IOBuffer)) (retTyp : Typ) : Prop :=
match src, bc with
| .ok (v, io₁), .ok (gs, io₂) => ValueEq decls funcIdx v retTyp gs ∧ io₁ = io₂
| .error _, .error _ => True
| _, _ => False
ValueEq is a propositional mirror of flattenValue. The equivalence proof for Lower (M6) goes by induction on Concrete.Term and maintains the local invariant that the bytecode map slice at the compiled indices contains the flattened form of every live binding.
5. The Simulation Invariant
5.1 StructCompatible
The simulation between source and bytecode must thread through every sub-call. We split this into a structural part — pure facts about the compilation output, provable as a standalone lemma — and a semantic part, established by simultaneous induction with compile_preservation.
-- Ix/Aiur/Semantics/Compatible.lean (NEW)
structure StructCompatible
(decls : Decls)
(bc : Bytecode.Toplevel)
(nameMap : Global → Option Bytecode.FunIdx) : Prop where
total_on_reachable :
∀ name f, decls.getByKey name = some (.function f) →
reachableFromEntries decls name →
∃ fi, nameMap name = some fi
funIdx_valid :
∀ name fi, nameMap name = some fi → fi < bc.functions.size
input_layout_matches :
∀ name fi f, nameMap name = some fi →
decls.getByKey name = some (.function f) →
(f.inputs.map Prod.snd).foldl
(fun acc t => acc + typFlatSize decls t) 0 =
bc.functions[fi]!.layout.inputSize
call_graph_closed :
∀ fi, fi < bc.functions.size →
∀ callee, callee ∈ collectAllCallees bc.functions[fi]!.body →
callee < bc.functions.size
theorem compile_ok_implies_struct_compatible :
t.compile = .ok ct →
StructCompatible t.decls ct.bytecode ct.getFuncIdx
This lemma is proved standalone, by induction on the compilation passes.
5.2 The full simulation pattern
The semantic clause — for every compiled function pair, source and bytecode bodies produce matching observations — references the theorem we're proving. So it cannot be proved standalone; it is proved simultaneously with compile_preservation by induction on fuel.
compile_preservation (see the full signature in §1.2) goes by strong induction on fuel. At the outer call site, we use StructCompatible (§5.1) to look up funIdx, equate flat input sizes, and bound the callee index. At every inner Op.call, the induction hypothesis at fuel - 1 discharges the recursive equivalence.
This is CompCert's match_states / fsim_match_states pattern. The adaptation uses big-step semantics with fuel as the well-founded measure, but the structure is identical: StructCompatible plays the global part of match_states, and the per-state "register file matches flattened env" condition (§4.3) plays the local part.
5.3 Fuel discipline
Claim: the fuel transformer φ at every pass is id. Source and bytecode share a single fuel budget 1-for-1. Why:
- Call-only accounting. Both evaluators decrement fuel only at function-call boundaries. Intra-body recursion is structural.
- Calls are preserved 1-to-1 by every pass.
  - simplify touches the pattern language; never adds or removes calls.
  - concretize is name substitution: every polymorphic call becomes exactly one monomorphic call.
  - lower (Compiler/Lower.lean:177-207) compiles every .app to exactly one Op.call.
  - dedup merges bisimilar bodies but leaves call sites intact.
  - needsCircuit toggles a flag the evaluator ignores.
For every source evaluation that makes k calls before terminating, the compiled bytecode makes exactly k calls. The two terminate at the same fuel bound.
If a future pass blows up calls (e.g., inlining), introduce a distinct φ only for that pass — do not weaken the global invariant.
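The call-only discipline both evaluators follow has this shape (a sketch; getFunction and interpBody are hypothetical helpers):

```lean
-- Sketch (§5.3): fuel is consumed only at the call boundary. The body
-- is interpreted at the decremented budget; intra-body recursion is
-- structural on the term, so it needs no fuel of its own.
def applyGlobal (decls : Decls) (g : Global) (args : List Value) :
    Nat → Except SourceError Value
  | 0 => .error .outOfFuel
  | fuel + 1 => do
    let f ← getFunction decls g   -- hypothetical lookup
    interpBody decls f args fuel  -- body runs at the smaller budget
```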
6. The Proof
6.1 Pass-by-pass obligations
For each compiler pass p : StageN → StageN+1, prove two companion lemmas:
-- preservation
theorem pass_preserves {src : StageN} {tgt : StageN+1}
(h : p src = .ok tgt) :
∀ env args io fuel,
evalN src env args io fuel ≈ evalN+1 tgt env args io fuel
-- progress
theorem pass_progress {src : StageN} (hwf : Stage_N_WellFormed src) :
∃ tgt, p src = .ok tgt
The two compose by transitivity of InterpResultEq for preservation and by bind chaining on Except for progress. Each per-pass progress lemma is typically ≤ 30% of the corresponding preservation proof: the case analysis is identical, only the conclusion changes.
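The progress half of the composition is ordinary bind chaining on Except; a self-contained sketch with schematic pass names:

```lean
-- Sketch (§6.1): composing progress across two adjacent passes.
-- `hwf` says p₁ sends well-formed input to well-formed output,
-- which is itself a small per-pass lemma.
theorem compose_progress {α β γ ε : Type}
    {WF₁ : α → Prop} {WF₂ : β → Prop}
    (p₁ : α → Except ε β) (p₂ : β → Except ε γ)
    (h₁ : ∀ s, WF₁ s → ∃ t, p₁ s = .ok t)
    (h₂ : ∀ t, WF₂ t → ∃ u, p₂ t = .ok u)
    (hwf : ∀ s t, WF₁ s → p₁ s = .ok t → WF₂ t) :
    ∀ s, WF₁ s → ∃ u, (p₁ s >>= p₂) = .ok u := by
  intro s hs
  obtain ⟨t, ht⟩ := h₁ s hs
  obtain ⟨u, hu⟩ := h₂ t (hwf s t hs ht)
  exact ⟨u, by rw [ht]; exact hu⟩
```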
6.2 Proof order — easy to hard
- needsCircuit semantic irrelevance. Sets a constrained flag the bytecode evaluator ignores. One-liner, absorbed into M9 as a trivial rfl-level step — no milestone of its own.
- deduplicate preservation. Bytecode-only pass; bisimulation over Block up to call-index renaming, with cycles handled by well-founded induction on fuel. (M4)
- toBytecode (Lower). The largest single proof. Source is Concrete.Term; target is a flat Block with a map : Array G register file. Local invariant: the bytecode map slice at the compiled indices contains the flattened form of every live binding. (M5 straight-line, M6 full)
- concretize. Values are type-erased; substitution changes types but not value denotation. Short proof. (M7)
- simplify / match compiler. Adapt Maranget's correctness proof; CompCert Cminorgenproof and CakeML pat_to_dec are the reference adaptations. Hardest single proof. (M8)
- Checker soundness. The type annotations on
Typed.Term must agree with what the evaluator actually produces — otherwise Lower cannot trust typSize during layout. This is M9.5; full statement in §1.3 and §8. With intrinsic typing (v2 of §3.3) the milestone vanishes because well-typedness becomes structural.
The order and dependencies above are realized as a concrete milestone schedule in §8; its dependency DAG (§8.1) shows how the obligations parallelize and which of them sit on the critical path.
7. Refactoring Work (Pre-Proof)
7.1 File layout
Ix/Aiur/
Source/Term.lean (Stage 1; renamed from Term.lean; flatten Data layer)
Source/Eval.lean NEW (M0; reference evaluator on Source.Term)
Typed/Term.lean (Stage 2; renamed from TypedTerm.lean; flatten TypedData layer)
Simple/Term.lean NEW (Stage 3; typed)
Concrete/Term.lean NEW (Stage 4 + Concrete.Typ)
Concrete/Eval.lean NEW (M3; reference evaluator on Concrete.Term; proof-bearing)
Bytecode.lean (Stage 5; unchanged)
Bytecode/Eval.lean NEW (M1; cache-free Lean reference)
Bytecode/ExecuteFfi.lean NEW (opaque Rust FFI; split from Protocol.lean)
Interpret.lean UNCHANGED in role (debug interpreter with cache +
stack trace); M0 applies io bug fixes for Rust parity
Semantics/
Relation.lean NEW (ValueEq, InterpResultEq)
Compatible.lean NEW (StructCompatible)
Flatten.lean NEW (total flattenValue / unflattenValue)
Compiler/
Check.lean unchanged API; total internals
Match.lean consumes Typed.Term, produces Simple.Term
Simple.lean Typed.Term → Simple.Term bridge
Concretize.lean Simple.Term → Concrete.Term
Lower.lean Concrete.Term → Bytecode.Toplevel
Dedup.lean unchanged; total
Layout.lean gadget arm removed
Proofs/
ValueEqFlatten.lean
StructCompatible.lean compile_ok_implies_struct_compatible
WellFormed.lean
DedupSound.lean M4
LowerSoundCore.lean M5
LowerSoundControl.lean M6
ConcretizeSound.lean M7
SimplifySound.lean M8
CompilerPreservation.lean M9
CheckSound.lean M9.5 (checker soundness; required)
CompilerProgress.lean M10
CompilerCorrect.lean M10 (final theorem)
7.2 Killing partial
Grep finds 92 partial defs across Ix/Aiur/**. They fall into three termination categories, each handled differently in M2:
- Structural recursion on the term / block / type tree. The bulk (80+). Become def ... termination_by on the obvious structural measure. The pain point is Lean's derived sizeOf not seeing through nested Array.map / foldl; expect ~1 AI-day on a small Ix/Array/Induction.lean helper library before the rest can be discharged.
- Cross-function recursion in the reference evaluators. Today's pattern Interpret.lean:251 applyGlobal → :279 interp → applyGlobal is genuinely unbounded. The reference evaluators (Source/Eval.lean, Concrete/Eval.lean) mirror this structure and handle it with an explicit fuel parameter that strictly decreases on every applyGlobal, plus a termination proof on the (fuel, term) lexicographic measure.
- Genuinely non-trivial termination. Two cases:
  - Concretize.lean:361 TypedDecls.concretize — worklist over newly discovered instantiations, no finiteness guarantee today. The M0 static check rejects polymorphic recursion; termination of the worklist then follows by structural induction on the type-argument DAG.
  - matchPattern through .pointer recurses into the store, but is bounded by pattern depth (the recursion sees a structurally smaller pattern at every step). Already def (total) in Interpret.lean:138; the reference evaluators reuse the same approach.
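The category-2 fuel discipline can be modeled in miniature. A self-contained toy (T, eval, apply, and body are invented names, not the real codebase; decreasing_by may need help on some toolchains):

```lean
-- Toy model of the fuel discipline: fuel is spent only at the call
-- boundary, and termination is by the lexicographic measure
-- (fuel, sizeOf term).
inductive T where
  | lit  : Nat → T
  | add  : T → T → T
  | call : T            -- re-enters the single recursive "function body"

def body : T := .add (.lit 1) .call

mutual
  def eval (fuel : Nat) (t : T) : Option Nat :=
    match t with
    | .lit n   => some n
    | .add a b => return (← eval fuel a) + (← eval fuel b)
    | .call    => apply fuel
  termination_by (fuel, sizeOf t)

  def apply (fuel : Nat) : Option Nat :=
    match fuel with
    | 0         => none        -- the toy analogue of outOfFuel
    | fuel' + 1 => eval fuel' body
  end
  termination_by (fuel, 0)
end
```

Within a body, recursion is structural (second lex component); crossing the call boundary strictly decreases fuel (first component), so the mutual pair is total.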
Disposition by file:

| File | Partials | Fate |
| --- | --- | --- |
| Check.lean | 40 | Total in M2 (category 1). |
| Compiler/** | 39 | Total in M2 (category 1, plus Concretize in category 3). |
| Compiler.lean | 1 | Total in M2. |
| Interpret.lean utilities (flattenValue, unflattenValue, typFlatSize, dataTypeFlatSize) | 4 | Moved to Semantics/Flatten.lean in M2 and made total there (category 1, with a visited set for dataTypeFlatSize). |
| Interpret.lean core (interp, applyGlobal, applyLocal, ppValue, natToHexGo) | 5 | Stay partial. Debug tool, may loop. |
| Meta.lean | 3 | Stay partial. Parser, not proof-load-bearing. |
After M2, the proof stack is partial-free. The 8 remaining partials (5 in Interpret.lean, 3 in Meta.lean) participate in no proof.
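The nested-Array pain point in miniature (Tree and size are toy names; the attach idiom is the standard workaround the Ix/Array/Induction.lean helpers would package):

```lean
-- A nested inductive through Array, shaped like the real term types.
inductive Tree where
  | node : Array Tree → Tree

-- Naive recursion via `ts.map size` fails the termination check: inside
-- the lambda, Lean cannot see that t is smaller than `.node ts`.
-- Array.attach carries the membership proof the checker needs:
def size : Tree → Nat
  | .node ts => 1 + (ts.attach.map (fun ⟨t, _⟩ => size t)).foldl (· + ·) 0
```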
7.3 Incidental refactors
These land alongside their driving milestones but are worth calling out so reviewers know they're planned:
- Flatten the
Data / TypedData mutual children into direct Source.Term.tuple / Source.Term.array (and Typed.Term equivalents). Cleaner induction principles. Lands with the M3 rename.
- Split
Ix/Aiur/Protocol.lean into Protocol.lean (keeps AiurSystem, Proof, FRI parameters, buildClaim) and a new Ix/Aiur/Bytecode/ExecuteFfi.lean (keeps the opaque Rust executor). Makes room for Bytecode/Eval.lean (M1) to stand alone.
- Delete the double
checkFunction pass. Today checkAndSimplify (Compiler/Simple.lean:38-54) runs the checker twice because simplify operates on untyped Term. Once Match.lean consumes Typed.Term (M3), the second check disappears.
All three are folded into the milestone that enables them; there is no separate "refactor milestone".
8. Milestones
Time unit: AI-day. One AI-day is a focused interactive day driving a Lean proof session, including edit/elaborate/iterate cycles. It is not equivalent to an engineer-day: compile latency and proof iteration depth dominate, and much of the "thinking" time is compressed or amortized across parallel attempts.
Dependencies form the DAG in §8.1 below. Headline numbers:
| Metric | Value |
| --- | --- |
| Sequential sum (all milestones back-to-back) | ~26–42 AI-days |
| Critical path (M0→M1→M2→M3→M5→M6→M9→M9.5→M10) | ~20–32 AI-days |
| Wall clock with parallelism (M4/M7/M8 alongside) | ~5–7 weeks |
| Risk-adjusted (M3, M6, M8, or M9.5 stalling) | ~8 weeks |
M3, M6, M8, and M9.5 are the four milestones most likely to slip. M9.5 is on the critical path and mandatory — skipping it would leave a hole in the simulation proof at M6 (Lower consults typSize on type annotations that nothing guarantees are accurate).
8.1 Dependency graph
┌─→ M4 ───────────────────┐
│ │
M0 ─→ M1 ─→ M2 ────┤ │
│ │
└─→ M3 ─┬─→ M5 ─→ M6 ─────┼─→ M9 ─→ M9.5 ─→ M10
├─→ M7 ───────────┤
└─→ M8 ───────────┘
- M4 (dedup) needs
Bytecode.Eval (M1) and totality (M2), but not the staged datatypes. It starts as soon as M2 is done, running in parallel with M3.
- M5/M6, M7, M8 all need M3 (staged datatypes,
Concrete.Eval). They run concurrently after M3.
- M9 composes preservation from M4 + M6 + M7 + M8 using
StructCompatible (the standalone invariant from §5.1).
- M9.5 proves
checkAndSimplify sound — every type annotation in its output is semantically accurate. Required by M6 (Lower consults typSize on annotations) and by M10 (WellFormed decomposition).
- M10 adds progress across all passes and yields
compile_correct.
Each milestone below ships independently and keeps the tree green.
M0 — Semantic-parity fixes and source reference evaluator — ~1–2 AI-days
Two concerns land together in M0:
Part A: fix the latent bugs in §2.6. Each fix lives in whichever file originally contained the bug, and each lands with the test that exercises it — none of these bugs is observable on the existing corpus, so without new tests the PRs would be invisible to CI.
| File | Fix | New test |
| --- | --- | --- |
| Ix/Aiur/Interpret.lean:530 | ioSetInfo errors on existing key | Set [0] then set [0] again. |
| Ix/Aiur/Interpret.lean:533-537 | ioRead errors on OOB | Read len=4 past data.size. |
| Ix/Aiur/Compiler/Simple.lean:22 + Compiler/Match.lean | Non-exhaustive match becomes a CheckError instead of bubbling to unreachable! | Program with non-exhaustive match on G. |
| Ix/Aiur/Compiler/Concretize.lean:361 | Static monomorphization-termination check (rejects polymorphic recursion) | Polymorphic-recursive template. |
| Ix/Aiur/Compiler/Lower.lean:31-40, 575 + :45-73 | Remove dead Layout.gadget / GadgetLayout / gadgetIdx | None new — code deletion. |
The two Interpret.lean fixes keep Interpret.lean aligned with Rust on observable behavior. The callCache and stack trace are not touched — they are part of the debug interpreter's contract. The other three fixes are in the compiler pipeline, not in the interpreter.
Part B: create Ix/Aiur/Source/Eval.lean. Fresh cache-free, stack-trace-free, fuel-indexed reference evaluator on Source.Term. Structure may mirror Interpret.lean for diffability but the two files share no code. §4.1 lists the design.
partial still permitted in this file through M0; totality comes in M2 (where Source/Eval.lean is one of the files made total, Interpret.lean is not).
Exit: every Part A fix has a test that fails on main and passes on its branch. Ix/Aiur/Source/Eval.lean compiles and agrees with Interpret.lean on the full corpus (on value and IOBuffer) via a new cross-test added to Tests/Aiur/Common.lean. lake test green.
M1 — Lean-native bytecode reference — ~1–2 AI-days
Ix/Aiur/Bytecode/Eval.lean: total, fuel-indexed, error-returning, big-step, cache-free. Per-width memory buckets keyed from Toplevel.memorySizes. partial still allowed; totality in M2.
Extends the cross-test. Tests/Aiur/Common.lean:runTestCase already validates Interpret.lean against Rust (on value + IOBuffer). M0 added a check that Source.Eval agrees with Interpret.lean. M1 adds the fourth node: Bytecode.Eval against the other three.
The full cross-test at M1:
Rust.executeFfi
      │ (existing)
Interpret.lean
      │ (M0 test)
Source.Eval
      │ (M1 test)
Bytecode.Eval
Every test asserts that the four agree on value and IOBuffer. Any divergence is a ticket.
Expected divergence at M1: a program that calls an IO-effecting function twice with the same args would show Interpret.lean / Rust.executeFfi (cached, one write) disagreeing with Source.Eval / Bytecode.Eval (cache-free, two writes). No current test exercises this; if one does, it becomes a known divergence rather than a bug.
Exit: four-way agreement on the M0-extended corpus, or documented divergences with tickets.
M2 — Totality — ~3–5 AI-days
Make every proof-load-bearing partial def total. Scope:
- Ix/Aiur/Check.lean (40 partials) — the largest mechanical lift.
- Ix/Aiur/Compiler/** (39 partials across Concretize, Dedup, Lower, Layout, Match, Simple).
- Ix/Aiur/Source/Eval.lean (the reference evaluator created in M0).
- Ix/Aiur/Bytecode/Eval.lean (the reference evaluator created in M1).
- Move the flattening utilities out of Interpret.lean into a new Ix/Aiur/Semantics/Flatten.lean: flattenValue, unflattenValue, typFlatSize, dataTypeFlatSize. These are used by the reference evaluators and by the theorem statement, so they must be total. Add a visited : HashSet Global parameter to dataTypeFlatSize / typFlatSize mirroring Layout.DataType.size:49-57. Update Tests/Aiur/Common.lean and Interpret.lean to import from the new location.
- Fuel where genuinely needed (§7.2 category 2); everything else is structural. Check.lean's 40 partials recurse through nested arrays and need real induction support: budget ~1 AI-day on Ix/Array/Induction.lean helpers before the bulk can be discharged.
Out of scope: Ix/Aiur/Interpret.lean itself. After M2 it still contains partial def interp, partial def applyGlobal, partial def applyLocal, partial def ppValue, partial def natToHexGo (5 partials; matchPattern is already def). The core interpreter loop can loop on divergent programs — that is a feature of the debug tool, not a bug. The pretty printers are not semantic. None of these participate in any proof.
Exit: grep -rn "^partial def" Ix/Aiur/ returns only:
- Interpret.lean: interp, applyGlobal, applyLocal, ppValue, natToHexGo — 5 deliberate partials (debug interpreter core + pretty printers; matchPattern is already def).
- Meta.lean: 3 partials (parser, not proof-load-bearing).
Every other file in Ix/Aiur/** is partial-free.
M3 — Stage datatypes and typed source reference — ~4–6 AI-days
The largest refactor milestone. Two coupled changes, each with sub-tasks:
Part A: staged datatypes.
- Rename
Ix/Aiur/Term.lean → Source/Term.lean and Ix/Aiur/TypedTerm.lean → Typed/Term.lean; update every import.
- Introduce new files
Simple/Term.lean (§3.4) and Concrete/Term.lean + Concrete/Typ (§3.5).
- Rewrite
Compiler/Match.lean to consume Typed.Term and produce Simple.Term directly (pure decision-tree compiler, see §9 risk 3).
- Rewrite
Compiler/Concretize.lean from TypedDecls → TypedDecls to Simple.Term → Concrete.Term.
- Rewrite
Compiler/Lower.lean (toIndex, TypedTerm.compile) to consume Concrete.Term.
- Delete the double
checkFunction pass in Compiler/Simple.lean:38-54.
- Delete the 20+
unreachable! / Impossible / Should not happen sites in Compiler/** now ruled out by types (§2.7).
- Flatten
Data / TypedData mutual children into direct constructors (§7.3).
- Split
Ix/Aiur/Protocol.lean into Protocol.lean + Bytecode/ExecuteFfi.lean (§7.3).
Part B: typed reference evaluator.
- Create
Ix/Aiur/Concrete/Eval.lean, a fresh evaluator on Concrete.Term (§4.1 M3 form). Width-bucketed Store, Value.pointer (w, i), typed .store / .load / .ptrVal.
- Add the first
ptr_val-using test to lock the new pointer semantics (skip if ptr_val is removed per §10 decision 3).
Concrete.Eval.lean supersedes Source/Eval.lean in the proof chain; Source/Eval.lean stays around only as the M1 cross-test witness.
This is the milestone most likely to slip. Total code churn is on the order of 2,000–4,000 lines, spread across ~15 files. The budget assumes that M2 totality work has already smoothed out Check.lean and Compiler/** — if any residual partial def blocks a type refactor, expect extra iteration.
Exit: Toplevel.compile type-checks through the new stages; existing tests pass; grep -rn "unreachable!\|Impossible\|Should not happen" Ix/Aiur/Compiler/ returns ≤ a handful with documented reasons; Concrete.Eval and Bytecode.Eval agree pointer-for-pointer on every program in the cross-test.
M4 — Dedup soundness — ~1–2 AI-days — parallel with M5/M7
Preservation: Bytecode.Eval.run t idx args io fuel ≈ Bytecode.Eval.run t.dedup.1 (t.dedup.2 idx) args io fuel. Progress: deduplicate is total (trivial). Bisimulation up to call-index renaming, cycles handled by well-founded induction on fuel.
Exit: Proofs/DedupSound.lean compiles.
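To make "up to call-index renaming" concrete, a toy version (Op, Block, and their fields are invented here; the real Bytecode types differ):

```lean
-- dedup produces a renaming ρ of function indices (merged bodies share
-- one index). The M4 relation compares blocks up to ρ at call sites:
-- every Op.call in the source block maps to a ρ-renamed call in the
-- deduplicated toplevel, and all other ops are untouched.
inductive Op where
  | call  (callee : Nat) : Op
  | other : Op

abbrev Block := List Op

def Op.rename (ρ : Nat → Nat) : Op → Op
  | .call f => .call (ρ f)
  | .other  => .other

def Block.rename (ρ : Nat → Nat) (b : Block) : Block :=
  b.map (Op.rename ρ)
```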
M5 — Lower proof, straight-line subset — ~2–3 AI-days
Preservation + progress for Lower.toIndex on the fragment of Concrete.Term excluding .match, function calls, .store / .load, IO, and u8/u32 ops. Establishes the local simulation relation: bytecode map slice at compiled indices = flattened bound value.
Exit: lemma covers every non-control constructor.
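A hedged sketch of that local relation (LocalInv and the bindings representation are inventions of this sketch; flattenValue and its signature follow the §1.1 statement):

```lean
-- Each live binding, compiled to a start index i, must appear flattened
-- in the bytecode register file `map` at exactly the slice [i, i + size).
def LocalInv (decls : Decls) (funcIdx : Global → Option Nat)
    (bindings : List (Nat × Value))   -- (compiled start index, source value)
    (map : Array G) : Prop :=
  ∀ i v, (i, v) ∈ bindings →
    let flat := flattenValue decls funcIdx v
    map.extract i (i + flat.size) = flat
```

Each straight-line constructor case then shows the invariant is re-established after the op it compiles to.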
M6 — Lower proof, full — ~3–5 AI-days — parallel with M8
Extend M5 to:
- .match in tail position.
- Non-tail
.match via Ctrl.matchContinue, only for the let-RHS form matching findNonTailMatch. Matches in other positions are excluded by the Concrete.Term shape established in M3.
- .ret / Ctrl.return / Ctrl.yield.
- Function calls (
Op.call); the unconstrained flag is semantically equivalent in both branches (§4.2).
.store / .load with width-bucketed memory (M3 payoff).
- IO operations — the
IOBuffer clause of the main theorem is discharged here.
- u8/u32 ops —
bv_decide or a library of bitvector lemmas.
Exit: Lower.compile_preservation and Lower.compile_progress for every Concrete.Term.
M7 — Concretize soundness — ~1 AI-day — parallel with M5/M6
Preservation: substInTypedTerm subst body has the same value denotation as body under the original type environment (type erasure). Progress: the M0 termination check bounds the worklist.
Exit: Proofs/ConcretizeSound.lean compiles.
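One way to phrase the type-erasure core of the proof, abstracting over the actual types (every name below is schematic, not from the codebase):

```lean
section
variable {TypSubst TypedTerm Untyped : Type}
variable (substInTypedTerm : TypSubst → TypedTerm → TypedTerm)
variable (erase : TypedTerm → Untyped)

-- Key lemma shape: type substitution is invisible after erasing the
-- annotations. Preservation follows if both evaluators' value
-- denotations factor through erase.
def ErasureInvariant : Prop :=
  ∀ (σ : TypSubst) (t : TypedTerm),
    erase (substInTypedTerm σ t) = erase t
end
```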
M8 — Match compiler soundness — ~3–5 AI-days — parallel with M6
Adapt Maranget's correctness proof for decision-tree pattern compilation. For every Simple.Term produced from a Typed.Term by the rewritten simplifyTerm, the decision tree matches exactly when the original nested pattern would have matched, and binds the same locals to the same values.
References: CompCert Cminorgenproof, CakeML pat_to_dec, Maranget 2008. Largest single proof; iteration depth is the bottleneck.
Exit: Proofs/SimplifySound.lean compiles.
M9 — Top-level preservation — ~1 AI-day
Toplevel.compile_preservation from M4 + M6 + M7 + M8 by transitivity of InterpResultEq, threading fuel / StructCompatible / typing assumptions through the composition.
Exit: Proofs/CompilerPreservation.lean compiles.
M9.5 — Checker soundness — ~3–5 AI-days
Prove that checkAndSimplify produces semantically coherent annotations:
theorem checkAndSimplify_sound
    {t : Toplevel} {decls : Decls} (hdecls : t.mkDecls = .ok decls)
    {typedDecls : TypedDecls} (h : t.checkAndSimplify = .ok typedDecls) :
    ∀ {name f}, typedDecls.get? name = some (.function f) →
    ∀ funcIdx env, WellFormedEnv env →
      let ⟨v, _⟩ := Source.Eval.runTypedBody decls f env
      ValueShapeMatches v f.output ∧
      (flattenValue decls funcIdx v).size = typSize decls f.output
This is the property Lower.preservation (M6) needs when it consults typSize on TypedTerm.typ. Without it, the Lower proof has no grounds to claim its layout computation matches the runtime value.
The proof is by induction on the Check.lean inferTerm structure (40 mutual arms) plus the simplify pass (which preserves annotations through the match compiler). Heavy use of the M2 totality proofs for the recursion structure; each arm has a bespoke soundness argument.
Required, not optional. Without M9.5, M6 has a hole: "type annotations are accurate" is an unjustified assumption, and the claim that Lower preserves semantics cannot be discharged. The full-stack theorem is a chain that breaks at this link.
If intrinsic typing is adopted (risk 1 / §3.3), M9.5 vanishes: the Typed.Term : Ctx → Typ → Type datatype makes coherence structural. This is the one path that avoids the work.
Exit: Proofs/CheckSound.lean compiles. M6 and M7 can cite checkAndSimplify_sound in their preservation proofs.
M10 — Progress and compile_correct — ~1–2 AI-days
- Define
WellFormed (§1.3) in Proofs/WellFormed.lean. Since every conjunct is a computable .ok observation, no ghost predicates appear.
- Per-pass progress lemmas (most are byproducts of preservation). The
mkDecls and checkAndSimplify lemmas are trivial: their success is already hypothesized by WellFormed.
- Compose into
compile_progress.
- Combine with M9 (preservation) and M9.5 (checker soundness) into
compile_correct in Proofs/CompilerCorrect.lean.
Exit: the §1.1 theorem compiles, with no sorry and no trusted components beyond Lean's kernel and the Rust FFI boundary.
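A hedged sketch of that WellFormed shape (the two check-function names are placeholders; only mkDecls and checkAndSimplify are real identifiers):

```lean
-- Every conjunct is a Bool-valued observation of running the compiler
-- front half, so WellFormed is decidable and `decide` closes it for any
-- concrete program.
def WellFormed (t : Toplevel) : Prop :=
  (t.mkDecls matches .ok _)
  ∧ (t.checkAndSimplify matches .ok _)
  ∧ (t.monoTerminationCheck matches .ok _)          -- placeholder name
  ∧ (t.constrainedCallGraphAcyclic matches .ok _)   -- placeholder name
```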
9. Risks
- Indexed Stage 2 fallback. If M5/M6 stall on the simulation relation, intrinsic typing (
inductive Typed.Term : Ctx → Typ → Type) becomes attractive. Budget ~3 AI-days to reboot the checker — the largest single contingency.
- Monomorphization-termination check rejects real programs. Run the M0 check against the existing corpus before merging. Fall back to a bounded-instance-count policy if anything legitimate breaks.
- runMatchCompiler is stateful (id supply, diagnostics). Pure versions are easier to reason about. Pure-ify in M3 when rewriting it to produce Simple.Term directly.
- Array induction ergonomics. Lean 4's derived
sizeOf is finicky around nested Arrays. Budget ~1 AI-day in M2 for an Ix.Array.Induction helper library — already included in the M2 estimate but worth calling out.
- u8/u32 byte-level reasoning. The u8/u32 ops in
Interpret.lean:427-505 and the Rust gadgets use native bit operations. The Lower proof at M6 needs either bv_decide or a hand-built library of byte-level lemmas. Budget within M6.
- M9.5 (checker soundness) is open-ended. 3–5 AI-days is a point estimate; the real cost depends on whether the fielded
TypedTerm supports a clean soundness proof or whether the checker has to be partially rewritten. If the latter, the Stage-2 intrinsic-typing rewrite (risk 1) may end up being cheaper overall — both obviate M9.5 and fix the same root cause. Decide after spiking the first few inferTerm arms.
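The bv_decide route mentioned in the u8/u32 risk, in miniature (assumes a toolchain recent enough to ship bv_decide and its bundled SAT backend):

```lean
-- Representative fixed-width facts of the kind the M6 byte-op cases
-- need; bv_decide bit-blasts the goal and calls the SAT solver.
example (x y : BitVec 32) : x &&& y = y &&& x := by bv_decide
example (x y : BitVec 8)  : x + y = y + x := by bv_decide
```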
10. Open Decisions
Sign-off needed before M0:
- Indexed
Typed.Term — now or v2. Recommend v2 (§3.3). If adopted now, M9.5 vanishes entirely (intrinsic well-typedness), at the cost of a dependently-typed checker rewrite. The plan currently assumes v1 (fielded typing) + M9.5.
- Monomorphization-termination policy — static reject vs bounded count. Recommend static reject (§2.6 item 5).
- ptrVal — keep or remove? Zero call sites today. Keeping it costs ~25 lines of proof work across the pipeline; removing it costs a handful of surface-language deletions. Recommend remove in M0 cleanup; §4.1 M3 step and §2.6 item 6 become no-ops.
All other design choices are already folded into the plan:
- Fuel-indexed evaluation, call-only accounting. §1.6, §4.1, §4.2, §5.3.
- callCache absent from the Lean reference evaluators, kept in Interpret.lean and production Rust. §2.2, §4.1.
- Non-exhaustive match is a
CheckError. M0 bug fix 4.
- .debug is transparent in the semantic relation. §1.7.
- Constrained-call-graph acyclicity is a
WellFormed clause, not a checker obligation. §1.3.
- Non-tail match in
Simple.Term is restricted to let-RHS chains (matches the current compiler). §3.4.
- Store bucketing is deferred to M3 (requires runtime types). §2.6 item 6, §4.1 M3 step.
11. Appendix — Concrete Numbers
Verified by grep over Ix/Aiur/**:
- 92 total
partial defs — per-file breakdown and disposition in §7.2.
- 26
unreachable! / throw "Impossible ..." / "Should not happen ..." sites — 24 in Compiler/**, 2 in Check.lean; Lower.lean alone has 20. Enumerated in §2.7.
- Source LOC to touch: ~3,500 (pipeline) + ~2,800 (tests to extend).
- New LOC estimate: ~1,200 (datatypes + two reference evaluators +
StructCompatible + WellFormed) + ~5,000–7,000 (proofs).
For §5–§6 and §8 it's useful to name the halves. Conjunct (b) of §1.1 is
Toplevel.compile_preservation; conjunct (a) isToplevel.compile_progress. They are proved separately — preservation by simultaneous induction with the simulation invariant (§5.2), progress by composing per-pass lemmas (§6.1).compile_correctiscompile_progressfollowed bycompile_preservationapplied to the witness from progress.1.3
WellFormed— the precondition for progressNo ghost predicates. Every conjunct is a computable observation about
t. A user provesWellFormed tby literally runningmkDecls,checkAndSimplify, the monomorphization-termination check, and the call-graph acyclicity check, and observing that all four succeed.decidedischarges it for any concrete program.Each conjunct corresponds to exactly one range of passes:
WellFormedclausemkDecls,wellFormedDeclscheckFunction ×2,simplifyTermcheckAndSimplify = .okconcretizeMonoTerminatestoBytecodededuplicate,needsCircuitConstrainedCallGraphAcyclicBecause
checkAndSimplifyis the bridge betweenTermandTypedTerm, its output is what downstream passes consume. The preservation proof relies on the type annotations in that output being semantically coherent (each annotated type agrees with what the term actually evaluates to). Proving that coherence — checker soundness — is M9.5. It is not optional: without it, the downstream preservation proofs (especiallyLower, which consultstypSizeon the annotations) rest on an unjustified assumption.1.4 What
InterpResultEqactually saysInterpResultEqpins three observables:flattenValue decls funcIdx v == gsfor source valuevand bytecode flat outputgs. The flattening is today'sInterpret.lean:574definition, relocated toIx/Aiur/Semantics/Flatten.leanand made total in M2.IOBufferis structurally equal. Bothdata : Array Gandmap : HashMap (Array G) IOKeyInfo(fromProtocol.lean:43).Except.error _on the source side iffExcept.error _on the bytecode side. v1 does not require message equality; v2 may refine.1.5 Why the
IOBufferis in the theoremA theorem that only related returned values would admit a compiler that silently dropped
ioWrites or reorderedioSetInfos relative to computation. For a zkDSL where theIOBufferis the public witness of a proof, that is catastrophic. Observability must include it.1.6 Two scope clarifications about the statement
hnameis surface-level. It hypothesizes that the entry function has aFunIdx— unavoidable, because the bytecode evaluator needs somefunIdxto start at. It does not hypothesize that every recursive sub-call has a matchingFunIdx. That property holds too, but is derived fromhctvia theStructCompatibleinvariant (§5.1) which is threaded through the simulation proof.Fuel is passed through unchanged. Both evaluators decrement fuel only on function calls, never inside a single function body. Every compiler pass preserves the call count exactly (§5.3), so source and bytecode share a single fuel budget 1-for-1. The relation maps
outOfFueltooutOfFuel.1.7 What the plan does not prove
outOfFuel; nothing stronger is claimed.Compiler/Layout.leancomputes selectors, auxiliaries, lookups; these affect trace generation only. STARK soundness is a separate theorem stacked on top ofcompile_correct.src/aiur/execute.rsis a performance implementation cross-tested against it (§2.4, M1). A formal Rust↔Lean equivalence is left for a follow-up.Ix/Aiur/Meta.leanelaborates the surface language toTerm. Macros likefold(i..j, init, |acc, @v| body)(Meta.lean:293-302) and template-call type-arg dropping (Meta.lean:101, 304-326) happen at parse time. The verified compiler operates fromTermonwards. Parser bugs are out of scope, the same way CompCert excludes the C parser.debugobservability..debugnodes calldbg_trace; treated as transparent in the semantic relation.needsCircuitminimality. It's an optimization; semantics are indifferent.Apps/ZKVoting/uses Aiur via the Lean API. The theorem covers the compilation of the Aiur source they emit; their own correctness is separate.2. The Code as It Stands
2.1 Compile pipeline
From
Ix/Aiur/Compiler.lean:80-91:2.2 Source semantics —
Ix/Aiur/Interpret.leanInterpret.leanis the debug interpreter. Its role is fast interactive execution and error diagnosis, not proof reference. It has three features that the reference semantics deliberately omits:callCache : HashMap (Global × List Value) Value— memoizes(fun, args)to make repeated calls fast.Interrupt.errorcarrying a full call-stack trace — built up viacallSite(:243-246), surfaced on any runtime error for debug messages like"in foo(1, 2)".partialrecursion — can loop on non-terminating programs, which is fine for an interactive tool.Structure:
Valueinductive (:10-21) — unit / field / tuple / array / ctor / fn / pointer.InterpState = { store, ioBuffer, callCache }(:199).InterpM = StateT InterpState (Except Interrupt)(:206).runFunction decls funcName inputs ioBuffer(:646).partial defs; cross-function recursion viaapplyGlobal.The plan keeps
Interpret.leanin this role. It is not refactored into the reference source semantics. The M0 bug fixes that align it with Rust (ioSetInfooverwrite,ioReadOOB) are applied here because the debug interpreter should agree with production on observable behavior, but the cache, stack trace, andpartialmachinery all stay. The proof-bearing reference semantics lives in separate new files:Ix/Aiur/Source/Eval.lean(M0) andIx/Aiur/Concrete/Eval.lean(M3), introduced in §4.1.2.3 Bytecode semantics —
src/aiur/execute.rs(Rust)Function.body : Block.QueryRecord(multiplicities, memory queries, byte queries) — trace-side artifacts, not part of the observable semantics we will formalize.Ix/Aiur/Protocol.lean:63-80exposesBytecode.Toplevel.executeas an opaque FFI tors_aiur_toplevel_execute. Building a Lean-native reference is M1.2.4 Existing test infrastructure
Tests/Aiur/Common.lean:79-103 runTestCasealready implements a structural cross-check between the Lean source interpreter and the Rust bytecode executor:compiled.bytecode.execute funIdx input inputIOBuffer.Aiur.runFunction decls funcName inputs inputIOBuffer.flattenValue decls funcIdx output == execOutputandstate.ioBuffer == execIOBuffer(Common.lean:73-76).Tests/Aiur/Aiur.lean(884 lines) drives this on ~70 programs: arithmetic, matches (tail and non-tail), enums with mutual recursion through pointers, parametric templates,foldmacros, IO, byte/u32 ops, pattern destructuring.Implications for the plan:
runTestCasewith a third party (the Lean-native bytecode reference); it does not create the cross-test from scratch.IOBufferequality between source and Rust. The verified-compiler theorem asks for the same property, proved instead of tested.2.5 Surface language vs
TermIx/Aiur/Term.lean:69 Termis the post-elaboration IR. The surface language users write is parsed byIx/Aiur/Meta.lean, which handles a richer surface:fold(i..j, init, |acc, @v| body)is a parse-time macro (Meta.lean:293-302). It unrolls into a chain oflet acc = ...using literal indices substituted for@v. There is nofoldconstructor inTerm.@voutsidefoldis a parse error (Meta.lean:327-328).Wrapper‹G›.Mk(42)) drop their type arguments at parse time (Meta.lean:101, 304-326). The elaboratedTerm.appcarries no type arguments; the type checker re-infers them and stores them onTypedTerm.app : Global → Array Typ → .... So the polymorphic-name dispatch problem (§5) is a TypedTerm phenomenon, not aTermphenomenon.The verified compiler operates from
Termonwards.Meta.leanis an untrusted parser (§1.7).2.6 Latent bugs and representational gaps
These are the cases where source and Rust diverge today, with disposition for each. Every one is latent in the existing test corpus — the corpus passes today because it doesn't exercise the buggy case. Each fix lands together with a test that exercises it.
Interpret.lean:530vsexecute.rs:62-67ioSetInfooverwrites existing key viaHashMap.insert; Rust panics onEntry::Occupied.Interpret.lean:533-537vsexecute.rs:228-235ioReadusesArray.extract(silent short read on OOB); Rust slices (panic on OOB).Interpret.lean:199-206, 260-263andexecute.rs:152-171(fun, args)calls and skip body execution on a hit, eliding any IO effects inside. Caches are consistent between source and Rust today, so the existing cross-check passes.Interpret.lean— the cache is part of its debug-interpreter contract and stays. The separately-createdIx/Aiur/Source/Eval.lean(M0) andIx/Aiur/Bytecode/Eval.lean(M1) are cache-free reference semantics. Production Rust keeps its cache; a follow-up theorem can prove "cache-having Rust ≈ cache-free reference" under a purity assumption on IO.Compiler/Simple.lean:22decisionToTermreturnsnoneon non-exhaustive match, bubbling tounreachable!.CheckError;Simple.Term(§3.4) never carries an "impossible" leaf.Concretize.lean:361Interpret.lean:86, 391-398vsexecute.rs:36-40, 173-205Storeis one global index space holding structuredValues; Rustmemory_queriesis bucketed by width holding flattenedVec<G>. Pointer values diverge under.ptrVal(Interpret.lean:412).grep ptr_val Tests/Aiurfinds nothing — fully latent.Value.pointer : Nat → Nat → Valueand runtime types at store time; that only becomes available when the reference evaluator moves fromSource.TermtoConcrete.Term(i.e.,Concrete/Eval.lean).Interpret.leanis not touched. M3 also adds the firstptr_val-using test.Compiler/Lower.lean:31-40, 575and:45-73Layout.gadget/GadgetLayoutexist andgadgetIdxis threaded throughlayoutMap, but never populated. Stale scaffolding for an unshipped feature.Two notes on prior misdiagnoses:
- .load returning vs[0]! (Interpret.lean:407) is not data loss. Source stores a structured value wrapped in a single-element array (store.insert #[v] () at :397); vs[0]! unwraps the wrapper. The divergence with Rust is representational (item 6), not data loss.
- dataTypeFlatSize (Interpret.lean:565-568) has no cycle check. Direct datatype recursion (without .pointer indirection) diverges. Today's programs use .pointer for recursion, so the bug is latent. Fix in M2 by adding a visited set, mirroring Layout.DataType.size:49-57.

2.7 unreachable! sites: where invariants leak through

26 sites total — 24 in Compiler/**, 2 in Check.lean. Each is a pass relying on an invariant established by an earlier pass but not tracked in the type. These are exactly the places where staged datatypes (§3) pay off; in the new design each becomes impossible by construction.
The most informative ones:
- Compiler/Simple.lean:22 — match compiler non-exhaustive (M0 bug 4 makes this a CheckError instead).
- Compiler/Lower.lean:135 — .ret in toIndex (non-tail position).
- Compiler/Lower.lean:136 — .match in toIndex. Comment says "Non-tail match not yet implemented"; after the recent commit, this is still reached for matches outside let-RHS chains. See §3.4.
- Compiler/Lower.lean:161, 468 — non-trivial .let after simplify.
- Compiler/Lower.lean:194, 207, 575 — .ref/.app/.gadget resolving to a non-callable layout.
- Compiler/Lower.lean:502, 519, 586, 599 — non-.var subpatterns.
- Compiler/Layout.lean:41, 93 — unresolved .mvar after concretize.
- Compiler/Match.lean:117-118, 215 — popNamespace and patternToPattern .pointer impossibilities.
- Check.lean:595, 604 — constructor / datatype lookups already resolved by the type checker.

3. Staged Datatypes
The plan introduces a sequence of typed IRs. Each pass is a total function between adjacent stages, and the stages shrink monotonically: constructors disappear when the prior pass has eliminated them. This kills unreachable! arms by construction.

3.1 Pipeline of stages
All four *.Term types live under Ix/Aiur/<Stage>/Term.lean with matching namespaces: Aiur.Source.Term, Aiur.Typed.Term, Aiur.Simple.Term, Aiur.Concrete.Term. The rename of the current Ix/Aiur/Term.lean → Ix/Aiur/Source/Term.lean and Ix/Aiur/TypedTerm.lean → Ix/Aiur/Typed/Term.lean is part of M3.

3.2 Stage 1 — Source.Term
Today's Ix/Aiur/Term.lean:69 Term, renamed in M3. Full pattern language (.or, .pointer, nested). Types carry .mvar and .app with parameters. As §2.5 notes, this is post-elaboration; surface macros never appear here, and Source.Term.app carries no type arguments.
Refactor: flatten the Data mutual child into Source.Term.tuple/Source.Term.array constructors. Mechanical; cleaner induction principles.

3.3 Stage 2 — Typed.Term
Today's Ix/Aiur/TypedTerm.lean:48 TypedTerm, renamed in M3. Each subterm carries typ : Typ and escapes : Bool. Same pattern shapes as Stage 1; the type checker has inferred type arguments and stored them on app and ref.
Design choice. Keep fielded typing for v1 — a plain inductive annotated with types. An indexed alternative — inductive Typed.Term : Ctx → Typ → Type — would make Lower.expectIdx a rfl and turn every type invariant into a type checker obligation rather than a lowering pass obligation. The win is huge for later proofs but the checker has to produce dependently-typed output. Defer to v2; revisit only if M5/M6 stall.

3.4 Stage 3 — Simple.Term (NEW)
Produced by the (rewritten) match compiler. Grammar:
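The grammar is not pinned down by this plan; as a sketch only, with every constructor name hypothetical and only the §3.4 invariants taken as given, the stage could look like:

```lean
-- Sketch of a possible Simple.Term shape; all names here are illustrative,
-- not the plan's final grammar. Local/Global stand in for the real identifiers.
abbrev Local  := Nat
abbrev Global := String

inductive Pat where
  | wildcard
  | var  (l : Local)                       -- bind a Local
  | ctor (c : Global) (args : List Local)  -- flat constructor pattern
  -- no .or, no .pointer, no nesting: simplify eliminated them by construction

inductive Term where
  | var    (l : Local)
  | let_   (pat : Pat) (rhs : Term) (body : Term)          -- pat is a var or wildcard
  | match_ (scrutinee : Local) (arms : List (Pat × Term))  -- scrutinee is a Local
  | app    (f : Global) (args : List Local)
  | ret    (l : Local)
```

Because the "impossible" leaf constructor is gone, a non-exhaustive decision tree simply cannot be expressed at this stage.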
Type-level invariants:
- No .or subpatterns (the match compiler eliminated them).
- No .pointer subpatterns (lowered to .let plus a load).
- Locals: .let binds only a Local or is a wildcard; the .match scrutinee is a Local.
Non-tail match scope.
Lower.lean:136 still throws "Non-tail match not yet implemented" whenever .match appears in a value-producing position other than a .let RHS chain. The recent non-tail-match commit added support only via findNonTailMatch (Lower.lean:347-367), which walks chains of .let (.var _)/.let .wildcard and lifts an inner match into a Ctrl.matchContinue. Matches inside tuples, arrays, function arguments, etc. still throw.
For v1 of the plan, Simple.Term's .match admits the same restriction: it appears only in let-RHS positions. Generalizing this is an additive milestone, not in scope here.
Simplify consumes Typed.Term, not Source.Term. Today's checkAndSimplify (Compiler/Simple.lean:38-54) runs checkFunction twice — once on the raw Term (discarding the result, only used to surface errors at original source positions), then simplify on the untyped Term, then checkFunction again to produce the final TypedTerm. The double pass goes away when the match compiler is rewritten to consume Typed.Term and produce Simple.Term directly, threading types through the decision tree.

3.5 Stage 4 — Concrete.Term (NEW)
Produced by concretize. Same constructors as Simple.Term but over a Concrete.Typ with no .mvar and no parametric .app: Concrete.Function has no params. Every parametric decl has been specialised and renamed via concretizeName (Concretize.lean:43). The lowering pass toBytecode consumes only Concrete.Term — typSize (Layout.lean:23) no longer has an .mvar arm to reject.

3.6 Stage 5 — Bytecode.Toplevel
Already at Bytecode.lean:88. Two cleanups:
- Make the partial defs on Block/Ctrl total via termination_by on a structural subterm (Dedup.lean:30-92, Layout.lean:192, the Lower.lean walks).
- FunctionLayout is non-semantic; the bytecode evaluator ignores it. Keep it on Function but the semantic relation is blind to it.

3.7 Stage 6 — Post-dedup Bytecode.Toplevel
Same datatype, post-deduplicate. The proof obligation is bisimulation preservation: for every original funIdx, the original toplevel at funIdx observes the same behavior as the deduplicated toplevel at remap funIdx.

4. Reference Evaluators
Both evaluators are new Lean code. Both are total, fuel-indexed (call-only accounting), error-returning, and cache-free — no callCache, no function_queries. Caching is an orthogonal optimization layer that the production Rust executor uses for multiplicity bookkeeping in the STARK trace.

4.1 Source reference evaluators — Source/Eval.lean, Concrete/Eval.lean
The proof-bearing source reference lives in new files, separate from the debug interpreter Interpret.lean (§2.2). There are two stages:
- Ix/Aiur/Source/Eval.lean (M0) — reference evaluator on Source.Term. Used by M1's cross-test to validate against the Lean bytecode reference and Rust, before the stage refactor lands.
- Ix/Aiur/Concrete/Eval.lean (M3) — reference evaluator on Concrete.Term. This is the form that participates in compile_preservation. Supersedes Source/Eval.lean in the proof chain once M3 is done.
Both are fresh implementations. They may mirror Interpret.lean's structure for diffability, but they do not share code — Interpret.lean persists unchanged in its debug role, and keeping the two evaluators independent avoids accidental entanglement of debug features with the spec.
Design common to both (M0 and M3 forms).
- Cache-free: the production Rust executor keeps function_queries for STARK multiplicity tracking; that is not part of the semantic model. The reference evaluators always re-execute a body.
- Errors: Except.error SourceError with a small tag enum; no call-stack construction. Debugging is Interpret.lean's job.
- Fuel: a fuel : Nat parameter decrements only at applyGlobal. Intra-body recursion is structural on Term/Concrete.Term. Exhaustion returns Except.error .outOfFuel.
- Termination measures: matchPattern structural on Pattern; interp/applyGlobal lexicographic on (fuel, term); flattenValue/unflattenValue structural on Value/Typ; typFlatSize/dataTypeFlatSize with a visited : HashSet Global argument (mirroring Layout.DataType.size:49-57).
- Error cases: ioSetInfo on an existing key, ioRead OOB, pattern-match failure, type mismatch, .ret outside a function body — all produce Except.error.
M0 form (on Source.Term). Signature:
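The exact signature is not fixed by this document; a plausible shape, with SourceError and the result tuple taken as assumptions, is:

```lean
-- Sketch only: a fuel-indexed, cache-free, error-returning evaluator on
-- Source.Term. `SourceError` and the argument order are assumptions, not
-- settled API; the names follow the §1.1 theorem statement.
def Source.Eval.runFunction
    (decls : Decls)          -- all source declarations
    (name  : Global)         -- entry function
    (args  : List Value)     -- structured argument values
    (io    : IOBuffer)       -- IO state, threaded through
    (fuel  : Nat)            -- decremented only at applyGlobal
    : Except SourceError (Value × IOBuffer) :=
  sorry
```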
Covers every Source.Term constructor. Does not solve the pointer representation divergence (§2.6 item 6): source pointers are still a single global counter, so a program that uses ptr_val across different flat widths would disagree with Rust. No current test triggers this.
M3 form (on Concrete.Term).
With types available at runtime, fix the pointer representation and align with Rust's bucketed memory:
- Value.pointer : Nat → Nat → Value carrying (width, index).
- flattenValue on .pointer w i = #[.ofNat i] (the width is type-level, never in the flat encoding).
- Store : IndexMap Nat (IndexMap (Array G) Unit) — width buckets keyed over flat Array G. Matches Rust's memory_queries exactly.
- .store t computes the flat width of t.typ via typFlatSize, flattens t's value, inserts into the right bucket.
- .load ptr reads the width from the typed .load node, dispatches to the right bucket, unflattens via the element type.
- .ptrVal (if kept) returns the within-bucket index as a field element. Source and bytecode now agree on pointer numeric values. (See the discussion of whether to keep ptrVal at all in §10 open decisions.)
After M3,
Concrete.Eval and Bytecode.Eval produce identical IOBuffers and identical flat results on every program.
Signature:
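Again as an assumption-laden sketch — Concrete.Decls and the error type are placeholders; only the bucketed store shape comes from the list above:

```lean
-- Width-bucketed store from §4.1: width ↦ set of flat values, mirroring
-- Rust's memory_queries. IndexMap/G are the document's own types.
abbrev Store := IndexMap Nat (IndexMap (Array G) Unit)

-- Sketch only: same contract as the M0 form, but over Concrete.Term, so
-- .store/.load can consult types at runtime.
def Concrete.Eval.runFunction
    (decls : Concrete.Decls) (name : Global)
    (args : List Value) (io : IOBuffer) (fuel : Nat)
    : Except SourceError (Value × IOBuffer) :=
  sorry
```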
4.2 Bytecode reference evaluator — NEW, Ix/Aiur/Bytecode/Eval.lean

Mirrors src/aiur/execute.rs in big-step style: mutually recursive evalBlock/evalCtrl/evalOp, total, fuel-indexed. memory has one bucket per width in Toplevel.memorySizes (Bytecode.lean:90), mirroring Rust's QueryRecord.memory_queries (execute.rs:36-40). Each Op.store values uses values.size as the width key.
Differences from execute.rs:
- No QueryRecord. Drop multiplicities, byte queries, per-function call queries — trace-side bookkeeping.
- No unconstrained branching. Both branches of every if unconstrained produce the same value; they differ only in whether a query is logged.
- Fuel decrements only at Op.call.
- Panics become Except.error.

4.3 The equivalence relation
Ix/Aiur/Semantics/Relation.lean (NEW): ValueEq is a propositional mirror of flattenValue. The equivalence proof for Lower (M6) goes by induction on Concrete.Term and maintains the local invariant that the bytecode map slice at the compiled indices contains the flattened form of every live binding.

5. The Simulation Invariant
5.1 StructCompatible

The simulation between source and bytecode must thread through every sub-call. We split this into a structural part — pure facts about the compilation output, provable as a standalone lemma — and a semantic part, established by simultaneous induction with compile_preservation.
This lemma is proved standalone, by induction on the compilation passes.
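A sketch of what the structural part might contain — the field names and the helpers CompiledToplevel, inputSize, and flatInputSize are invented for illustration; the §5.2 uses (look up funIdx, equate flat input sizes, bound the callee index) are what is grounded:

```lean
-- Sketch only. Pure facts about the compilation output; no evaluation involved.
structure StructCompatible (decls : Decls) (ct : CompiledToplevel) : Prop where
  -- every source function resolves to a compiled index
  lookup_total : ∀ name, decls.contains name →
    ∃ idx, ct.getFuncIdx name = some idx
  -- compiled indices are in range for the bytecode function table
  idx_bound : ∀ name idx, ct.getFuncIdx name = some idx →
    idx < ct.bytecode.functions.size
  -- the compiled function's flat input size matches the flattened source arity
  arity_flat : ∀ name idx, ct.getFuncIdx name = some idx →
    ct.bytecode.functions[idx]!.inputSize = flatInputSize decls name
```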
5.2 The full simulation pattern
The semantic clause — for every compiled function pair, source and bytecode bodies produce matching observations — references the theorem we're proving. So it cannot be proved standalone; it is proved simultaneously with compile_preservation by induction on fuel.
compile_preservation (see the full signature in §1.2) goes by strong induction on fuel. At the outer call site, we use StructCompatible (§5.1) to look up funIdx, equate flat input sizes, and bound the callee index. At every inner Op.call, the induction hypothesis at fuel - 1 discharges the recursive equivalence.
This is CompCert's match_states/fsim_match_states pattern. The adaptation uses big-step semantics with fuel as the well-founded measure, but the structure is identical: StructCompatible plays the global part of match_states, and the per-state "register file matches flattened env" condition (§4.3) plays the local part.

5.3 Fuel discipline
Claim: the fuel transformer φ at every pass is id. Source and bytecode share a single fuel budget 1-for-1. Why:
- simplify touches the pattern language; it never adds or removes calls.
- concretize is name substitution: every polymorphic call becomes exactly one monomorphic call.
- lower (Compiler/Lower.lean:177-207) compiles every .app to exactly one Op.call.
- dedup merges bisimilar bodies but leaves call sites intact.
- needsCircuit toggles a flag the evaluator ignores.
For every source evaluation that makes k calls before terminating, the compiled bytecode makes exactly k calls. The two terminate at the same fuel bound.
If a future pass blows up calls (e.g., inlining), introduce a distinct φ only for that pass — do not weaken the global invariant.

6. The Proof
6.1 Pass-by-pass obligations
For each compiler pass p : StageN → StageN+1, prove two companion lemmas:
The two compose by transitivity of InterpResultEq for preservation and by bind chaining on Except for progress. Each per-pass progress lemma is typically ≤ 30% of the corresponding preservation proof: the case analysis is identical, only the conclusion changes.

6.2 Proof order — easy to hard
1. needsCircuit semantic irrelevance. Sets a constrained flag the bytecode evaluator ignores. One-liner, absorbed into M9 as a trivial rfl-level step — no milestone of its own.
2. deduplicate preservation. Bytecode-only pass; bisimulation over Block up to call-index renaming, with cycles handled by well-founded induction on fuel. (M4)
3. toBytecode (Lower). The largest single proof. Source is Concrete.Term; target is a flat Block with a map : Array G register file. Local invariant: the bytecode map slice at the compiled indices contains the flattened form of every live binding. (M5 straight-line, M6 full)
4. concretize. Values are type-erased; substitution changes types but not value denotation. Short proof. (M7)
5. simplify / match compiler. Adapt Maranget's correctness proof; CompCert Cminorgenproof and CakeML pat_to_dec are the reference adaptations. Hardest single proof. (M8)
6. Checker soundness: the type annotations on Typed.Term must agree with what the evaluator actually produces — otherwise Lower cannot trust typSize during layout. This is M9.5; full statement in §1.3 and §8. With intrinsic typing (v2 of §3.3) the milestone vanishes because well-typedness becomes structural.
The order and dependencies above are realized as a concrete milestone schedule in §8. How the proof obligations sequence into parallel work — and where each one sits on the critical path — is captured by the DAG in §8.
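The §6.1 companion-lemma shape can be sketched as follows — StageN, StageN1, evalN, WellFormedN, and InterpResult are all hypothetical stand-ins for the per-pass instantiations:

```lean
-- Sketch of the two obligations every pass carries, under assumed names.
variable {StageN StageN1 Error InterpResult : Type}
variable (p : StageN → Except Error StageN1)
variable (InterpResultEq : InterpResult → InterpResult → Prop)
variable (evalN  : StageN  → List Value → IOBuffer → Nat → InterpResult)
variable (evalN1 : StageN1 → List Value → IOBuffer → Nat → InterpResult)

-- Progress: well-formed input compiles.
def ProgressLemma (WellFormedN : StageN → Prop) : Prop :=
  ∀ t, WellFormedN t → ∃ t', p t = .ok t'

-- Preservation: any successful output is observationally equivalent.
def PreservationLemma : Prop :=
  ∀ t t', p t = .ok t' →
    ∀ args io fuel,
      InterpResultEq (evalN t args io fuel) (evalN1 t' args io fuel)
```

Composing PreservationLemma instances pass-by-pass is exactly the transitivity chain M9 performs; ProgressLemma instances chain by bind on Except.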
7. Refactoring Work (Pre-Proof)
7.1 File layout
7.2 Killing partial

Grep finds 92 partial defs across Ix/Aiur/**. They fall into three termination categories, each handled differently in M2:
1. Structural recursion on the term / block / type tree. The bulk (80+). Become def ... termination_by on the obvious structural measure. The pain point is Lean's derived sizeOf not seeing through nested Array.map/foldl; expect ~1 AI-day on a small Ix/Array/Induction.lean helper library before the rest can be discharged.
2. Cross-function recursion in the reference evaluators. Today's pattern Interpret.lean:251 applyGlobal → :279 interp → applyGlobal is genuinely unbounded. The reference evaluators (Source/Eval.lean, Concrete/Eval.lean) mirror this structure and handle it with an explicit fuel parameter that strictly decreases on every applyGlobal, plus a termination proof on (fuel, term) lexicographic.
3. Genuinely non-trivial termination. Two cases:
   - Concretize.lean:361 TypedDecls.concretize — worklist over newly discovered instantiations, no finiteness guarantee today. The M0 static check rejects polymorphic recursion; termination of the worklist then follows by structural induction on the type-argument DAG.
   - matchPattern through .pointer recurses into the store, but is bounded by pattern depth (the recursion sees a structurally smaller pattern at every step). Already def (total) in Interpret.lean:138; the reference evaluators reuse the same approach.
Disposition by file:
- Check.lean — made total in M2 (category 1).
- Compiler/** — made total in M2 (Concretize in category 3).
- Compiler.lean — made total in M2.
- Interpret.lean utilities (flattenValue, unflattenValue, typFlatSize, dataTypeFlatSize) — moved to Semantics/Flatten.lean in M2 and made total there (category 1, with a visited set for dataTypeFlatSize).
- Interpret.lean core (interp, applyGlobal, applyLocal, ppValue, natToHexGo) — stays partial. Debug tool, may loop.
- Meta.lean — stays partial. Parser, not proof-load-bearing.
After M2, the proof stack is partial-free. The 8 remaining partials (5 in Interpret.lean, 3 in Meta.lean) participate in no proof.

7.3 Incidental refactors
These land alongside their driving milestones but are worth calling out so reviewers know they're planned:
- Fold the Data/TypedData mutual children into direct Source.Term.tuple/Source.Term.array constructors (and Typed.Term equivalents). Cleaner induction principles. Lands with the M3 rename.
- Split Ix/Aiur/Protocol.lean into Protocol.lean (keeps AiurSystem, Proof, FRI parameters, buildClaim) and a new Ix/Aiur/Bytecode/ExecuteFfi.lean (keeps the opaque Rust executor). Makes room for Bytecode/Eval.lean (M1) to stand alone.
- Drop the double checkFunction pass. Today checkAndSimplify (Compiler/Simple.lean:38-54) runs the checker twice because simplify operates on untyped Term. Once Match.lean consumes Typed.Term (M3), the second check disappears.
8. Milestones
Time unit: AI-day. One AI-day is a focused interactive day driving a Lean proof session, including edit/elaborate/iterate cycles. It is not equivalent to an engineer-day: compile latency and proof iteration depth dominate, and much of the "thinking" time is compressed or amortized across parallel attempts.
Dependencies form the DAG in §8.1 below. Headline numbers:
M3, M6, M8, and M9.5 are the four milestones most likely to slip. M9.5 is on the critical path and mandatory — skipping it would leave a hole in the simulation proof at M6 (Lower consults typSize on type annotations that nothing guarantees are accurate).

8.1 Dependency graph
- M4 needs Bytecode.Eval (M1) and totality (M2), but not the staged datatypes. It starts as soon as M2 is done, running in parallel with M3.
- The pass proofs M5–M8 need M3's staged datatypes and typed reference evaluator (Concrete.Eval). They run concurrently after M3.
- M9 additionally needs StructCompatible (the standalone invariant from §5.1).
- M9.5 proves checkAndSimplify sound — every type annotation in its output is semantically accurate. Required by M6 (Lower consults typSize on annotations) and by M10 (WellFormed decomposition).
- M10 closes with compile_correct.
Each milestone below ships independently and keeps the tree green.
M0 — Semantic-parity fixes and source reference evaluator — ~1–2 AI-days
Two concerns land together in M0:
Part A: fix the latent bugs in §2.6. Each fix lives in whichever file originally contained the bug, and each lands with the test that exercises it — none of these bugs is observable on the existing corpus, so without new tests the PRs would be invisible to CI.
- Ix/Aiur/Interpret.lean:530 — ioSetInfo errors on an existing key. Test: set [0], then set [0] again.
- Ix/Aiur/Interpret.lean:533-537 — ioRead errors on OOB. Test: len=4 past data.size.
- Ix/Aiur/Compiler/Simple.lean:22 + Compiler/Match.lean — non-exhaustive match becomes a CheckError instead of bubbling to unreachable!.
- Ix/Aiur/Compiler/Concretize.lean:361
- Ix/Aiur/Compiler/Lower.lean:31-40, 575 + :45-73 — the stale Layout.gadget/GadgetLayout/gadgetIdx scaffolding (§2.6).
The two
Interpret.lean fixes keep Interpret.lean aligned with Rust on observable behavior. The callCache and stack trace are not touched — they are part of the debug interpreter's contract. The other three fixes are in the compiler pipeline, not in the interpreter.
Part B: create
Ix/Aiur/Source/Eval.lean. Fresh cache-free, stack-trace-free, fuel-indexed reference evaluator on Source.Term. Structure may mirror Interpret.lean for diffability but the two files share no code. §4.1 lists the design. partial is still permitted in this file through M0; totality comes in M2 (where Source/Eval.lean is one of the files made total, Interpret.lean is not).
Exit: every Part A fix has a test that fails on
main and passes on its branch. Ix/Aiur/Source/Eval.lean compiles and agrees with Interpret.lean on the full corpus (on value and IOBuffer) via a new cross-test added to Tests/Aiur/Common.lean. lake test green.

M1 — Lean-native bytecode reference — ~1–2 AI-days
Ix/Aiur/Bytecode/Eval.lean: total, fuel-indexed, error-returning, big-step, cache-free. Per-width memory buckets keyed from Toplevel.memorySizes. partial still allowed; totality in M2.
Extends the cross-test. Tests/Aiur/Common.lean:runTestCase already validates Interpret.lean against Rust (on value + IOBuffer). M0 added a check that Source.Eval agrees with Interpret.lean. M1 adds the fourth node: Bytecode.Eval against the other three.
The full cross-test at M1:
Every test asserts that the four agree on value and
IOBuffer. Any divergence is a ticket.
Expected divergence at M1: a program that calls an IO-effecting function twice with the same args would show Interpret.lean/Rust.executeFfi (cached, one write) disagreeing with Source.Eval/Bytecode.Eval (cache-free, two writes). No current test exercises this; if one does, it becomes a known divergence rather than a bug.
Exit: four-way agreement on the M0-extended corpus, or documented divergences with tickets.
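The four-way agreement check can be sketched as a single assertion — the runner names below are hypothetical wrappers, each returning a comparable (value, IOBuffer) result:

```lean
-- Sketch of the M1 cross-check; wrapper names are illustrative, not the
-- actual Tests/Aiur/Common.lean API.
def crossCheck (tc : TestCase) : IO Unit := do
  let r₁ := Interpret.run tc        -- debug interpreter (cached)
  let r₂ ← Bytecode.executeFfi tc   -- production Rust executor via FFI (cached)
  let r₃ := Source.Eval.run tc      -- M0 reference (cache-free)
  let r₄ := Bytecode.Eval.run tc    -- M1 reference (cache-free)
  -- value and IOBuffer must agree four ways; any divergence is a ticket
  assert! (r₁ == r₂ && r₂ == r₃ && r₃ == r₄)
```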
M2 — Totality — ~3–5 AI-days
Make every proof-load-bearing partial def total. Scope:
- Ix/Aiur/Check.lean (40 partials) — the largest mechanical lift.
- Ix/Aiur/Compiler/** (39 partials across Concretize, Dedup, Lower, Layout, Match, Simple).
- Ix/Aiur/Source/Eval.lean (the reference evaluator created in M0).
- Ix/Aiur/Bytecode/Eval.lean (the reference evaluator created in M1).
- Move the shared utilities out of Interpret.lean into a new Ix/Aiur/Semantics/Flatten.lean: flattenValue, unflattenValue, typFlatSize, dataTypeFlatSize. These are used by the reference evaluators and by the theorem statement, so they must be total. Add a visited : HashSet Global parameter to dataTypeFlatSize/typFlatSize mirroring Layout.DataType.size:49-57. Update Tests/Aiur/Common.lean and Interpret.lean to import from the new location.
Check.lean's 40partials through nested arrays need real iteration; budget ~1 AI-day onIx/Array/Induction.leanhelpers before the bulk can be discharged.Out of scope:
Ix/Aiur/Interpret.lean itself. After M2 it still contains partial def interp, partial def applyGlobal, partial def applyLocal, partial def ppValue, partial def natToHexGo (5 partials; matchPattern is already def). The core interpreter loop can loop on divergent programs — that is a feature of the debug tool, not a bug. The pretty printers are not semantic. None of these participate in any proof.
Exit:
grep -n "^partial def" Ix/Aiur/ returns only:
- Interpret.lean: interp, applyGlobal, applyLocal, ppValue, natToHexGo — 5 deliberate partials (debug interpreter core + pretty printers; matchPattern is already def).
- Meta.lean: 3 partials (parser, not proof-load-bearing).
Every other file in
Ix/Aiur/** is partial-free.

M3 — Stage datatypes and typed source reference — ~4–6 AI-days
The largest refactor milestone. Two coupled changes, each with sub-tasks:
Part A: staged datatypes.
- Rename Ix/Aiur/Term.lean → Source/Term.lean and Ix/Aiur/TypedTerm.lean → Typed/Term.lean; update every import.
- New Simple/Term.lean (§3.4) and Concrete/Term.lean + Concrete.Typ (§3.5).
- Rewrite Compiler/Match.lean to consume Typed.Term and produce Simple.Term directly (pure decision-tree compiler, see §9 risk 3).
- Retarget Compiler/Concretize.lean from TypedDecls → TypedDecls to Simple.Term → Concrete.Term.
- Retarget Compiler/Lower.lean (toIndex, TypedTerm.compile) to consume Concrete.Term.
- Drop the double checkFunction pass in Compiler/Simple.lean:38-54.
- Remove the unreachable!/Impossible/Should not happen sites in Compiler/** now ruled out by types (§2.7).
- Flatten the Data/TypedData mutual children into direct constructors (§7.3).
- Split Ix/Aiur/Protocol.lean into Protocol.lean + Bytecode/ExecuteFfi.lean (§7.3).
Part B: typed reference evaluator.
- Ix/Aiur/Concrete/Eval.lean, a fresh evaluator on Concrete.Term (§4.1 M3 form). Width-bucketed Store, Value.pointer (w, i), typed .store/.load/.ptrVal.
- A first ptr_val-using test to lock the new pointer semantics (skip if ptr_val is removed per §10 decision 3).
- Concrete/Eval.lean supersedes Source/Eval.lean in the proof chain; Source/Eval.lean stays around only as the M1 cross-test witness.
This is the milestone most likely to slip. Total code churn is on the order of 2,000–4,000 lines, spread across ~15 files. The budget assumes that M2 totality work has already smoothed out
Check.lean and Compiler/** — if any residual partial def blocks a type refactor, expect extra iteration.
Exit:
Toplevel.compile type-checks through the new stages; existing tests pass; grep -rn "unreachable!\|Impossible\|Should not happen" Ix/Aiur/Compiler/ returns ≤ a handful with documented reasons; Concrete.Eval and Bytecode.Eval agree pointer-for-pointer on every program in the cross-test.

M4 — Dedup soundness — ~1–2 AI-days — parallel with M5/M7
Preservation:
Bytecode.Eval.run t idx args io fuel ≈ Bytecode.Eval.run t.dedup.1 (t.dedup.2 idx) args io fuel. Progress: deduplicate is total (trivial). Bisimulation up to call-index renaming, cycles handled by well-founded induction on fuel.
Proofs/DedupSound.leancompiles.M5 — Lower proof, straight-line subset — ~2–3 AI-days
Preservation + progress for
Lower.toIndexon the fragment ofConcrete.Termexcluding.match, function calls,.store/.load, IO, and u8/u32 ops. Establishes the local simulation relation: bytecodemapslice at compiled indices = flattened bound value.Exit: lemma covers every non-control constructor.
M6 — Lower proof, full — ~3–5 AI-days — parallel with M8
Extend M5 to:
.matchin tail position..matchviaCtrl.matchContinue, only for the let-RHS form matchingfindNonTailMatch. Matches in other positions are excluded by theConcrete.Termshape established in M3..ret/Ctrl.return/Ctrl.yield.Op.call); theunconstrainedflag is semantically equivalent in both branches (§4.2)..store/.loadwith width-bucketed memory (M3 payoff).IOBufferclause of the main theorem is discharged here.bv_decideor a library of bitvector lemmas.Exit:
Lower.compile_preservationandLower.compile_progressfor everyConcrete.Term.M7 — Concretize soundness — ~1 AI-day — parallel with M5/M6
Preservation:
substInTypedTerm subst bodyhas the same value denotation asbodyunder the original type environment (type erasure). Progress: the M0 termination check bounds the worklist.Exit:
Proofs/ConcretizeSound.leancompiles.M8 — Match compiler soundness — ~3–5 AI-days — parallel with M6
Adapt Maranget's correctness proof for decision-tree pattern compilation. For every
Simple.Termproduced from aTyped.Termby the rewrittensimplifyTerm, the decision tree matches exactly when the original nested pattern would have matched, and binds the same locals to the same values.References: CompCert
Cminorgenproof, CakeMLpat_to_dec, Maranget2008. Largest single proof; iteration depth is the bottleneck.
Exit:
Proofs/SimplifySound.leancompiles.M9 — Top-level preservation — ~1 AI-day
Toplevel.compile_preservationfrom M4 + M6 + M7 + M8 by transitivity ofInterpResultEq, threading fuel /StructCompatible/ typing assumptions through the composition.Exit:
Proofs/CompilerPreservation.leancompiles.M9.5 — Checker soundness — ~3–5 AI-days
Prove that
checkAndSimplifyproduces semantically coherent annotations:This is the property
Lower.preservation(M6) needs when it consultstypSizeonTypedTerm.typ. Without it, the Lower proof has no grounds to claim its layout computation matches the runtime value.The proof is by induction on the
Check.leaninferTermstructure (40 mutual arms) plus the simplify pass (which preserves annotations through the match compiler). Heavy use of the M2 totality proofs for the recursion structure; each arm has a bespoke soundness argument.Required, not optional. Without M9.5, M6 has a hole: "type annotations are accurate" is an unjustified assumption, and the claim that Lower preserves semantics cannot be discharged. The full-stack theorem is a chain that breaks at this link.
If intrinsic typing is adopted (risk 1 / §3.3), M9.5 vanishes: the
Typed.Term : Ctx → Typ → Typedatatype makes coherence structural. This is the one path that avoids the work.Exit:
Proofs/CheckSound.leancompiles. M6 and M7 can citecheckAndSimplify_soundin their preservation proofs.M10 — Progress and
compile_correct — ~1–2 AI-days
- Define WellFormed (§1.3) in Proofs/WellFormed.lean. Since every conjunct is a computable .ok observation, no ghost predicates appear.
- The mkDecls and checkAndSimplify lemmas are trivial: their success is already hypothesized by WellFormed.
- compile_progress.
- compile_correct in Proofs/CompilerCorrect.lean.
Exit: the §1.1 theorem compiles, with no
sorry and no trusted components beyond Lean's kernel and the Rust FFI boundary.

9. Risks
- If M5/M6 stall, the intrinsic-typing alternative (inductive Typed.Term : Ctx → Typ → Type) becomes attractive. Budget ~3 AI-days to reboot the checker — the largest single contingency.
- runMatchCompiler is stateful (id supply, diagnostics). Pure versions are easier to reason about. Pure-ify in M3 when rewriting it to produce Simple.Term directly.
- Lean's derived sizeOf is finicky around nested Arrays. Budget ~1 AI-day in M2 for an Ix.Array.Induction helper library — already included in the M2 estimate but worth calling out.
- Interpret.lean:427-505 and the Rust gadgets use native bit operations. The Lower proof at M6 needs either bv_decide or a hand-built library of byte-level lemmas. Budget within M6.
- It is unknown whether the current TypedTerm supports a clean soundness proof or whether the checker has to be partially rewritten. If the latter, the Stage-2 intrinsic-typing rewrite (risk 1) may end up being cheaper overall — both obviate M9.5 and fix the same root cause. Decide after spiking the first few inferTerm arms.

10. Open Decisions
Sign-off needed before M0:
1. Intrinsic Typed.Term — now or v2. Recommend v2 (§3.3). If adopted now, M9.5 vanishes entirely (intrinsic well-typedness), at the cost of a dependently-typed checker rewrite. The plan currently assumes v1 (fielded typing) + M9.5.
2. ptrVal — keep or remove? Zero call sites today. Keeping it costs ~25 lines of proof work across the pipeline; removing it costs a handful of surface-language deletions. Recommend remove in M0 cleanup; the §4.1 M3 step and §2.6 item 6 become no-ops.
callCacheabsent from the Lean reference evaluators, kept inInterpret.leanand production Rust. §2.2, §4.1.CheckError. M0 bug fix 4..debugis transparent in the semantic relation. §1.7.WellFormedclause, not a checker obligation. §1.3.Simple.Termis restricted to let-RHS chains (matches the current compiler). §3.4.11. Appendix — Concrete Numbers
Verified by
grep over Ix/Aiur/**:
- 92 partial defs — per-file breakdown and disposition in §7.2.
- 26 unreachable!/throw "Impossible ..."/"Should not happen ..." sites — 24 in Compiler/**, 2 in Check.lean; Lower.lean alone has 20. Enumerated in §2.7.
- Estimated new lines: spec (StructCompatible + WellFormed) + ~5,000–7,000 (proofs).