Toward a Verified Aiur Compiler #373

@arthurpaulino

Description


Long-term roadmap for proving in Lean 4 that the Aiur compiler in Ix/Aiur/** translates source programs to bytecode without changing their observable behavior, where "observable" includes the IOBuffer.

1. The Theorem

1.1 What "verified compiler" means

The end goal is one theorem combining progress (every well-formed program compiles successfully) and preservation (whenever compilation succeeds, the output matches source semantics):

theorem Toplevel.compile_correct
    (t : Toplevel) (hwf : WellFormed t) :
    -- (a) Progress: compilation succeeds.
    (∃ ct decls, t.mkDecls = .ok decls ∧ t.compile = .ok ct)
    ∧
    -- (b) Preservation: any compilation output is semantically equivalent.
    (∀ ct decls,
      t.mkDecls = .ok decls →
      t.compile = .ok ct →
      ∀ name funIdx (hname : ct.getFuncIdx name = some funIdx)
        args io₀ fuel,
      InterpResultEq decls ct.getFuncIdx
        (Concrete.Eval.runFunction decls ⟨name⟩ args io₀ fuel)
        (Bytecode.Eval.runFunction  ct.bytecode funIdx
                                    (Flatten.args decls ct.getFuncIdx args)
                                    io₀ fuel))

where Flatten.args is the natural list-to-flat coercion, defined in Semantics/Flatten.lean alongside flattenValue:

def Flatten.args (decls : Decls) (funcIdx : Global → Option Nat)
    (args : List Value) : Array G :=
  args.foldl (fun acc v => acc ++ flattenValue decls funcIdx v) #[]

The two conjuncts are independent statements and both are load-bearing:

  • (a) alone is vacuous for meaning. A compiler that produces a no-op Bytecode.Toplevel satisfies (a) trivially.
  • (b) alone is vacuous for usefulness. A compiler that always returns .error satisfies (b) vacuously (the premise t.compile = .ok ct is never met).

Only the conjunction guarantees that well-formed source produces a compilation output and that output is behaviorally equivalent to the source.

(Because compile and mkDecls are deterministic functions, (a) and (b) together are logically equivalent to the single-quantifier form ∃ ct decls, ... ∧ InterpResultEq ..., but stating them as a conjunction makes the progress claim textually visible instead of hidden inside the existential.)
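This collapse can be sketched abstractly (a sketch over placeholder types, not the planned statement; `compile`, `mkDecls`, and `P` here are stand-ins, with `P` playing the role of the InterpResultEq body):

```lean
-- Sketch only: placeholders, not the real definitions. Determinism is
-- implicit: `compile` and `mkDecls` are plain values, so the witness from
-- progress is exactly the output preservation talks about.
example {α β ε : Type} (compile : Except ε α) (mkDecls : Except ε β)
    (P : α → β → Prop)
    (progress : ∃ ct decls, mkDecls = .ok decls ∧ compile = .ok ct)
    (preservation : ∀ ct decls, mkDecls = .ok decls → compile = .ok ct → P ct decls) :
    ∃ ct decls, mkDecls = .ok decls ∧ compile = .ok ct ∧ P ct decls :=
  let ⟨ct, decls, hd, hc⟩ := progress
  ⟨ct, decls, hd, hc, preservation ct decls hd hc⟩
```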

1.2 The two halves, named for the proof structure

For §5–§6 and §8 it's useful to name the halves. Conjunct (b) of §1.1 is Toplevel.compile_preservation; conjunct (a) is Toplevel.compile_progress. They are proved separately — preservation by simultaneous induction with the simulation invariant (§5.2), progress by composing per-pass lemmas (§6.1). compile_correct is compile_progress followed by compile_preservation applied to the witness from progress.

1.3 WellFormed — the precondition for progress

def WellFormed (t : Toplevel) : Prop :=
  (∃ decls,      t.mkDecls         = .ok decls)       ∧
  (∃ typedDecls, t.checkAndSimplify = .ok typedDecls) ∧
  MonoTerminates t                                    ∧
  ConstrainedCallGraphAcyclic t

No ghost predicates. Every conjunct is a computable observation about t. A user proves WellFormed t by literally running mkDecls, checkAndSimplify, the monomorphization-termination check, and the call-graph acyclicity check, and observing that all four succeed. decide discharges it for any concrete program.
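Concretely, a user-facing proof looks like this sketch (it assumes a Decidable instance for WellFormed t, which the all-computable-conjuncts design is meant to provide; `myProgram` is a hypothetical concrete program):

```lean
-- Sketch: when every conjunct of `WellFormed` is a computable check with a
-- Decidable instance, `decide` evaluates all four and closes the goal.
example (myProgram : Toplevel) [Decidable (WellFormed myProgram)]
    (h : decide (WellFormed myProgram) = true) :
    WellFormed myProgram :=
  of_decide_eq_true h
```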

Each conjunct corresponds to exactly one range of passes:

| Passes | Success condition | WellFormed clause |
| --- | --- | --- |
| mkDecls, wellFormedDecls | no alias cycles, no duplicates, every type reference resolves | first ∃ |
| checkFunction ×2, simplifyTerm | every body well-typed, no non-exhaustive match | checkAndSimplify = .ok |
| concretize | no polymorphic recursion | MonoTerminates |
| toBytecode | no unresolved mvars; layouts computable | implied by concretize |
| deduplicate, needsCircuit | total | — |
| (simulation precondition) | constrained call graph is a DAG | ConstrainedCallGraphAcyclic |

Because checkAndSimplify is the bridge between Term and TypedTerm, its output is what downstream passes consume. The preservation proof relies on the type annotations in that output being semantically coherent (each annotated type agrees with what the term actually evaluates to). Proving that coherence — checker soundness — is M9.5. It is not optional: without it, the downstream preservation proofs (especially Lower, which consults typSize on the annotations) rest on an unjustified assumption.

1.4 What InterpResultEq actually says

InterpResultEq pins three observables:

  1. Return values agree under flattening. flattenValue decls funcIdx v == gs for source value v and bytecode flat output gs. The flattening is today's Interpret.lean:574 definition, relocated to Ix/Aiur/Semantics/Flatten.lean and made total in M2.
  2. Final IOBuffer is structurally equal. Both data : Array G and map : HashMap (Array G) IOKeyInfo (from Protocol.lean:43).
  3. Errors match. Except.error _ on the source side iff Except.error _ on the bytecode side. v1 does not require message equality; v2 may refine.

1.5 Why the IOBuffer is in the theorem

A theorem that only related returned values would admit a compiler that silently dropped ioWrites or reordered ioSetInfos relative to computation. For a zkDSL where the IOBuffer is the public witness of a proof, that is catastrophic. Observability must include it.

1.6 Two scope clarifications about the statement

hname is surface-level. It hypothesizes that the entry function has a FunIdx — unavoidable, because the bytecode evaluator needs some funIdx to start at. It does not hypothesize that every recursive sub-call has a matching FunIdx. That property holds too, but is derived from the compilation hypothesis t.compile = .ok ct via the StructCompatible invariant (§5.1), which is threaded through the simulation proof.

Fuel is passed through unchanged. Both evaluators decrement fuel only on function calls, never inside a single function body. Every compiler pass preserves the call count exactly (§5.3), so source and bytecode share a single fuel budget 1-for-1. The relation maps outOfFuel to outOfFuel.

1.7 What the plan does not prove

  • Non-termination. Evaluators are fuel-indexed and total. Divergent programs are observed as outOfFuel; nothing stronger is claimed.
  • STARK / constraint-system soundness. Compiler/Layout.lean computes selectors, auxiliaries, lookups; these affect trace generation only. STARK soundness is a separate theorem stacked on top of compile_correct.
  • Rust executor ≡ Lean bytecode reference. The Lean evaluator we build (§4.2) is the specification; Rust's src/aiur/execute.rs is a performance implementation cross-tested against it (§2.4, M1). A formal Rust↔Lean equivalence is left for a follow-up.
  • Surface parser correctness. Ix/Aiur/Meta.lean elaborates the surface language to Term. Macros like fold(i..j, init, |acc, @v| body) (Meta.lean:293-302) and template-call type-arg dropping (Meta.lean:101, 304-326) happen at parse time. The verified compiler operates from Term onwards. Parser bugs are out of scope, the same way CompCert excludes the C parser.
  • debug observability. .debug nodes call dbg_trace; treated as transparent in the semantic relation.
  • Pretty-printers, CLI, statistics, benchmarks.
  • needsCircuit minimality. It's an optimization; semantics are indifferent.
  • Apps consuming the verified compiler. Currently Apps/ZKVoting/ uses Aiur via the Lean API. The theorem covers the compilation of the Aiur source they emit; their own correctness is separate.

2. The Code as It Stands

2.1 Compile pipeline

From Ix/Aiur/Compiler.lean:80-91:

Toplevel                                         -- Ix/Aiur/Term.lean:143
  ├─ mkDecls              (Check.lean:113)        expand type aliases
  ├─ wellFormedDecls      (Check.lean:695)
  ├─ checkFunction ×1     (Check.lean:743)        Term → TypedTerm (pre-simplify)
  ├─ simplifyTerm         (Simple.lean:12)        match compiler + decisionToTerm
  └─ checkFunction ×2     (Check.lean:743)        TypedTerm post-simplify
       ↓
TypedDecls                                        -- Ix/Aiur/TypedTerm.lean:77
  └─ concretize           (Concretize.lean:361)   monomorphize; erase params/mvars
       ↓
TypedDecls (monomorphic)
  └─ toBytecode           (Lower.lean:622)        + Layout.lean
       ↓
Bytecode.Toplevel                                 -- Ix/Aiur/Bytecode.lean:88
  ├─ deduplicate          (Dedup.lean:115)        bisimulation / partition refinement
  └─ needsCircuit         (Compiler.lean:59)      reachability over constrained calls
       ↓
CompiledToplevel

2.2 Source semantics — Ix/Aiur/Interpret.lean

Interpret.lean is the debug interpreter. Its role is fast interactive execution and error diagnosis, not proof reference. It has three features that the reference semantics deliberately omits:

  • callCache : HashMap (Global × List Value) Value — memoizes (fun, args) to make repeated calls fast.
  • Interrupt.error carrying a full call-stack trace — built up via callSite (:243-246), surfaced on any runtime error for debug messages like "in foo(1, 2)".
  • partial recursion — can loop on non-terminating programs, which is fine for an interactive tool.

Structure:

  • Value inductive (:10-21) — unit / field / tuple / array / ctor / fn / pointer.
  • InterpState = { store, ioBuffer, callCache } (:199).
  • InterpM = StateT InterpState (Except Interrupt) (:206).
  • Entry: runFunction decls funcName inputs ioBuffer (:646).
  • 9 partial defs; cross-function recursion via applyGlobal.

The plan keeps Interpret.lean in this role. It is not refactored into the reference source semantics. The M0 bug fixes that align it with Rust (ioSetInfo overwrite, ioRead OOB) are applied here because the debug interpreter should agree with production on observable behavior, but the cache, stack trace, and partial machinery all stay. The proof-bearing reference semantics lives in separate new files: Ix/Aiur/Source/Eval.lean (M0) and Ix/Aiur/Concrete/Eval.lean (M3), introduced in §4.1.

2.3 Bytecode semantics — src/aiur/execute.rs (Rust)

  • 436-line stack-machine interpreter over Function.body : Block.
  • Computes QueryRecord (multiplicities, memory queries, byte queries) — trace-side artifacts, not part of the observable semantics we will formalize.
  • The Lean side has no native bytecode evaluator yet; Ix/Aiur/Protocol.lean:63-80 exposes Bytecode.Toplevel.execute as an opaque FFI to rs_aiur_toplevel_execute. Building a Lean-native reference is M1.

2.4 Existing test infrastructure

Tests/Aiur/Common.lean:79-103 runTestCase already implements a structural cross-check between the Lean source interpreter and the Rust bytecode executor:

  1. Run Rust: compiled.bytecode.execute funIdx input inputIOBuffer.
  2. Run source interpreter: Aiur.runFunction decls funcName inputs inputIOBuffer.
  3. Assert flattenValue decls funcIdx output == execOutput and state.ioBuffer == execIOBuffer (Common.lean:73-76).
  4. Run the prover, verify the proof against the expected output.

Tests/Aiur/Aiur.lean (884 lines) drives this on ~70 programs: arithmetic, matches (tail and non-tail), enums with mutual recursion through pointers, parametric templates, fold macros, IO, byte/u32 ops, pattern destructuring.

Implications for the plan:

  • M1 extends runTestCase with a third party (the Lean-native bytecode reference); it does not create the cross-test from scratch.
  • The latent bugs in §2.6 are precisely the cases the existing corpus doesn't exercise. M0 must add tests that trigger them, otherwise the fixes will be invisible to CI.
  • The cross-check already verifies IOBuffer equality between source and Rust. The verified-compiler theorem asks for the same property, proved instead of tested.

2.5 Surface language vs Term

Ix/Aiur/Term.lean:69 Term is the post-elaboration IR. The surface language users write is parsed by Ix/Aiur/Meta.lean, which handles a richer surface:

  • fold(i..j, init, |acc, @v| body) is a parse-time macro (Meta.lean:293-302). It unrolls into a chain of let acc = ... using literal indices substituted for @v. There is no fold constructor in Term. @v outside fold is a parse error (Meta.lean:327-328).
  • Template type/function calls (Wrapper‹G›.Mk(42)) drop their type arguments at parse time (Meta.lean:101, 304-326). The elaborated Term.app carries no type arguments; the type checker re-infers them and stores them on TypedTerm.app : Global → Array Typ → .... So the polymorphic-name dispatch problem (§5) is a TypedTerm phenomenon, not a Term phenomenon.
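Schematically, the fold unrolling looks like this (a hand-written illustration of the shape, not Meta.lean's literal output):

```lean
-- surface (Aiur):
--   fold(0..3, init, |acc, @v| add(acc, @v))
-- elaborates, schematically, to a let-chain with literal indices substituted
-- for @v — no fold node survives into Term:
--   let acc = add(init, 0)
--   let acc = add(acc, 1)
--   let acc = add(acc, 2)
--   acc
```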

The verified compiler operates from Term onwards. Meta.lean is an untrusted parser (§1.7).

2.6 Latent bugs and representational gaps

These are the cases where source and Rust diverge today, with disposition for each. Every one is latent in the existing test corpus — the corpus passes today because it doesn't exercise the buggy case. Each fix lands together with a test that exercises it.

| # | Location | Observation | Disposition |
| --- | --- | --- | --- |
| 1 | Interpret.lean:530 vs execute.rs:62-67 | Source ioSetInfo overwrites existing key via HashMap.insert; Rust panics on Entry::Occupied. | M0. Source: error on existing key. |
| 2 | Interpret.lean:533-537 vs execute.rs:228-235 | Source ioRead uses Array.extract (silent short read on OOB); Rust slices (panic on OOB). | M0. Source: error on OOB. |
| 3 | Interpret.lean:199-206, 260-263 and execute.rs:152-171 | Both interpreters memoize (fun, args) calls and skip body execution on a hit, eliding any IO effects inside. Caches are consistent between source and Rust today, so the existing cross-check passes. | M0 / policy. Not a "fix" to Interpret.lean — the cache is part of its debug-interpreter contract and stays. The separately-created Ix/Aiur/Source/Eval.lean (M0) and Ix/Aiur/Bytecode/Eval.lean (M1) are cache-free reference semantics. Production Rust keeps its cache; a follow-up theorem can prove "cache-having Rust ≈ cache-free reference" under a purity assumption on IO. |
| 4 | Compiler/Simple.lean:22 | decisionToTerm returns none on non-exhaustive match, bubbling to unreachable!. | M0. Non-exhaustive match becomes a CheckError; Simple.Term (§3.4) never carries an "impossible" leaf. |
| 5 | Concretize.lean:361 | Monomorphization worklist has no termination guarantee; polymorphic recursion at a strictly larger type diverges. | M0. Static check rejecting polymorphic recursion. |
| 6 | Interpret.lean:86, 391-398 vs execute.rs:36-40, 173-205 | Source Store is one global index space holding structured Values; Rust memory_queries is bucketed by width holding flattened Vec<G>. Pointer values diverge under .ptrVal (Interpret.lean:412). grep ptr_val Tests/Aiur finds nothing — fully latent. | M3. Requires Value.pointer : Nat → Nat → Value and runtime types at store time; that only becomes available when the reference evaluator moves from Source.Term to Concrete.Term (i.e., Concrete/Eval.lean). Interpret.lean is not touched. M3 also adds the first ptr_val-using test. |
| 7 | Compiler/Lower.lean:31-40, 575 and :45-73 | Layout.gadget / GadgetLayout exist and gadgetIdx is threaded through layoutMap, but never populated. Stale scaffolding for an unshipped feature. | M0 cleanup. Remove all three; simpler types for later proofs. |

Two notes on prior misdiagnoses:

  • .load returning vs[0]! (Interpret.lean:407) is not data loss. Source stores a structured value wrapped in a single-element array (store.insert #[v] () at :397); vs[0]! unwraps the wrapper. The divergence with Rust is representational (item 6), not data loss.
  • dataTypeFlatSize (Interpret.lean:565-568) has no cycle check. Direct datatype recursion (without .pointer indirection) diverges. Today's programs use .pointer for recursion so the bug is latent. Fix in M2 by adding a visited set, mirroring Layout.DataType.size:49-57.
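The cycle-checked shape could look like this sketch (toy stand-ins throughout — `Global` as String, a datatype as a list of field-type names; the real fix applies to dataTypeFlatSize, mirroring Layout.DataType.size):

```lean
-- Toy sketch of the M2 fix: thread a `visited` set so direct datatype
-- recursion is reported as an error instead of diverging.
abbrev Global := String
abbrev DataTypes := List (Global × List Global)  -- name ↦ field type names (toy)

partial def flatSizeChecked (dts : DataTypes) (name : Global)
    (visited : List Global := []) : Except String Nat := do
  if visited.contains name then
    throw s!"cyclic datatype without pointer indirection: {name}"
  match dts.find? (·.1 == name) with
  | none => throw s!"unknown datatype: {name}"
  | some (_, fields) =>
    let mut total := 1  -- tag slot (toy accounting)
    for f in fields do
      total := total + (← flatSizeChecked dts f (name :: visited))
    pure total
```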

2.7 unreachable! sites: where invariants leak through

26 sites total — 24 in Compiler/**, 2 in Check.lean. Each is a pass relying on an invariant established by an earlier pass but not tracked in the type. These are exactly the places where staged datatypes (§3) pay off; in the new design each becomes impossible by construction.

The most informative ones:

  • Compiler/Simple.lean:22 — match compiler non-exhaustive (M0 bug 4 makes this a CheckError instead).
  • Compiler/Lower.lean:135 — .ret in toIndex (non-tail position).
  • Compiler/Lower.lean:136 — .match in toIndex. Comment says "Non-tail match not yet implemented"; after the recent commit, this is still reached for matches outside let-RHS chains. See §3.4.
  • Compiler/Lower.lean:161, 468 — non-trivial .let after simplify.
  • Compiler/Lower.lean:194, 207, 575 — .ref / .app / .gadget resolving to a non-callable layout.
  • Compiler/Lower.lean:502, 519, 586, 599 — non-.var subpatterns.
  • Compiler/Layout.lean:41, 93 — unresolved .mvar after concretize.
  • Compiler/Match.lean:117-118, 215 — popNamespace and spatternToPattern .pointer impossibilities.
  • Check.lean:595, 604 — constructor / datatype lookups already resolved by the type checker.

3. Staged Datatypes

The plan introduces a sequence of typed IRs. Each pass is a total function between adjacent stages, and the stages shrink monotonically: constructors disappear when the prior pass has eliminated them. This kills unreachable! arms by construction.

3.1 Pipeline of stages

   Source.Term            Source IR      Stage 1  (rename of current Term)
       ↓ check
   Typed.Term             Typed IR       Stage 2  (rename of current TypedTerm)
       ↓ simplify
   Simple.Term            Match-free     Stage 3  (NEW)
       ↓ concretize
   Concrete.Term          Monomorphic    Stage 4  (NEW)
       ↓ lower
   Bytecode.Toplevel      Flat IR        Stage 5
       ↓ dedup
   Bytecode.Toplevel      Bisimilar      Stage 6

All four *.Term types live under Ix/Aiur/<Stage>/Term.lean with matching namespaces: Aiur.Source.Term, Aiur.Typed.Term, Aiur.Simple.Term, Aiur.Concrete.Term. The rename of the current Ix/Aiur/Term.lean → Ix/Aiur/Source/Term.lean and Ix/Aiur/TypedTerm.lean → Ix/Aiur/Typed/Term.lean is part of M3.

3.2 Stage 1 — Source.Term

Today's Ix/Aiur/Term.lean:69 Term, renamed in M3. Full pattern language (.or, .pointer, nested). Types carry .mvar and .app with parameters. As §2.5 notes, this is post-elaboration; surface macros never appear here, and Source.Term.app carries no type arguments.

Refactor: flatten the Data mutual child into Source.Term.tuple / Source.Term.array constructors. Mechanical; cleaner induction principles.

3.3 Stage 2 — Typed.Term

Today's Ix/Aiur/TypedTerm.lean:48 TypedTerm, renamed in M3. Each subterm carries typ : Typ and escapes : Bool. Same pattern shapes as Stage 1; the type checker has inferred type arguments and stored them on app and ref.

Design choice. Keep fielded typing for v1 — a plain inductive annotated with types. An indexed alternative — inductive Typed.Term : Ctx → Typ → Type — would make Lower.expectIdx a refl and turn every type invariant into a type checker obligation rather than a lowering pass obligation. The win is huge for later proofs but the checker has to produce dependently-typed output. Defer to v2; revisit only if M5/M6 stall.

3.4 Stage 3 — Simple.Term (NEW)

Produced by the (rewritten) match compiler. Grammar:

Simple.Pattern ::= .wildcard
                 | .field  G
                 | .ref    Global (Array Local)
                 | .tuple  (Array Local)
                 | .array  (Array Local)

Simple.Term    ::= ... leaves as in Typed.Term ...
                 | .letVar  Local Simple.Term Simple.Term
                 | .letWild       Simple.Term Simple.Term
                 | .letLoad Local Local Simple.Term
                 | .match  Local (Array (Simple.Pattern × Simple.Term))
                                 (Option Simple.Term)

Type-level invariants:

  • No .or subpatterns (match compiler eliminated them).
  • No .pointer subpatterns (lowered to .letLoad).
  • No nested subpatterns — pattern children are bare Locals.
  • .let binds only a Local or is a wildcard.
  • .match scrutinee is a Local.
  • No non-exhaustive matches (after M0 fix 4).

Non-tail match scope. Lower.lean:136 still throws "Non-tail match not yet implemented" whenever .match appears in a value-producing position other than a .let RHS chain. The recent non-tail-match commit added support only via findNonTailMatch (Lower.lean:347-367), which walks chains of .let (.var _) / .let .wildcard and lifts an inner match into a Ctrl.matchContinue. Matches inside tuples, arrays, function arguments, etc. still throw.

For v1 of the plan, Simple.Term's .match admits the same restriction: it appears only in let-RHS positions. Generalizing this is an additive milestone, not in scope here.

Simplify consumes Typed.Term, not Source.Term. Today's checkAndSimplify (Compiler/Simple.lean:38-54) runs checkFunction twice — once on the raw Term (discarding the result, only used to surface errors at original source positions), then simplify on the untyped Term, then checkFunction again to produce the final TypedTerm. The double pass goes away when the match compiler is rewritten to consume Typed.Term and produce Simple.Term directly, threading types through the decision tree.

3.5 Stage 4 — Concrete.Term (NEW)

Produced by concretize. Same constructors as Simple.Term but over a Concrete.Typ with no .mvar and no parametric .app:

Concrete.Typ ::= .unit | .field
              | .tuple    (Array Concrete.Typ)
              | .array    Concrete.Typ Nat
              | .pointer  Concrete.Typ
              | .ref      MonoGlobal
              | .function (List Concrete.Typ) Concrete.Typ

Concrete.Function has no params. Every parametric decl has been specialised and renamed via concretizeName (Concretize.lean:43). The lowering pass toBytecode consumes only Concrete.Term; typSize (Layout.lean:23) no longer has an .mvar arm to reject.

3.6 Stage 5 — Bytecode.Toplevel

Already at Bytecode.lean:88. Two cleanups:

  1. Make partial defs on Block / Ctrl total via termination_by on structural subterm (Dedup.lean:30-92, Layout.lean:192, Lower.lean walks).
  2. FunctionLayout is non-semantic; the bytecode evaluator ignores it. Keep it on Function but the semantic relation is blind to it.
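The totality pattern is standard; a toy model of the mutual Block/Ctrl shape (illustrative types, not the real Bytecode definitions — the actual cleanup applies the same pattern to the walks in Dedup.lean, Layout.lean, and Lower.lean):

```lean
-- Toy model: structural mutual recursion replaces `partial def`; Lean
-- discharges termination because every recursive call is on a subterm.
mutual
  inductive Block where
    | mk : Ctrl → Block
  inductive Ctrl where
    | ret    : Ctrl
    | branch : Block → Block → Ctrl
end

mutual
  def Block.size : Block → Nat
    | .mk c => 1 + Ctrl.size c
  def Ctrl.size : Ctrl → Nat
    | .ret => 1
    | .branch b₁ b₂ => 1 + Block.size b₁ + Block.size b₂
end
```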

3.7 Stage 6 — Post-dedup Bytecode.Toplevel

Same datatype, post-deduplicate. The proof obligation is bisimulation preservation: for every original funIdx, the original toplevel at funIdx observes the same behavior as the deduplicated toplevel at remap funIdx.
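The obligation could be stated along these lines (a shape sketch only: `deduplicate`'s pair-returning signature and the `remap` name are assumptions, and the real statement would likely use the §4.3 relation rather than bare equality):

```lean
-- Shape sketch of the dedup obligation; `remap` is whatever index-renaming
-- function dedup produces.
theorem dedup_preserves
    (bc bc' : Bytecode.Toplevel) (remap : FunIdx → FunIdx)
    (h : deduplicate bc = (bc', remap))
    (fi : FunIdx) (args : Array G) (io : IOBuffer) (fuel : Nat) :
    Bytecode.Eval.runFunction bc fi args io fuel =
      Bytecode.Eval.runFunction bc' (remap fi) args io fuel := by
  sorry
```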

4. Reference Evaluators

Both evaluators are new Lean code. Both are total, fuel-indexed (call-only accounting), error-returning, and cache-free — no callCache, no function_queries. Caching is an orthogonal optimization layer that the production Rust executor uses for multiplicity bookkeeping in the STARK trace.

4.1 Source reference evaluators — Source/Eval.lean, Concrete/Eval.lean

The proof-bearing source reference lives in new files, separate from the debug interpreter Interpret.lean (§2.2). There are two stages:

  • Ix/Aiur/Source/Eval.lean (M0) — reference evaluator on Source.Term. Used by M1's cross-test to validate against the Lean bytecode reference and Rust, before the stage refactor lands.
  • Ix/Aiur/Concrete/Eval.lean (M3) — reference evaluator on Concrete.Term. This is the form that participates in compile_preservation. Supersedes Source/Eval.lean in the proof chain once M3 is done.

Both are fresh implementations. They may mirror Interpret.lean's structure for diffability, but they do not share code — Interpret.lean persists unchanged in its debug role, and keeping the two evaluators independent avoids accidental entanglement of debug features with the spec.

Design common to both (M0 and M3 forms).

  • No cache. The cache is an optimization of the debug interpreter (and of Rust's function_queries for STARK multiplicity tracking), not part of the semantic model. The reference evaluators always re-execute a body.
  • No stack trace. Errors are Except.error SourceError with a small tag enum; no call-stack construction. Debugging is Interpret.lean's job.
  • Fuel-indexed, call-only accounting. A fuel : Nat parameter decrements only at applyGlobal. Intra-body recursion is structural on Term / Concrete.Term. Exhaustion returns Except.error .outOfFuel.
  • Total. Every recursive def carries an explicit termination measure (matchPattern structural on Pattern, interp / applyGlobal lexicographic on (fuel, term), flattenValue / unflattenValue structural on Value / Typ, typFlatSize / dataTypeFlatSize with a visited : HashSet Global argument mirroring Layout.DataType.size:49-57).
  • Errors return, never panic. ioSetInfo on existing key, ioRead OOB, pattern-match failure, type mismatch, .ret outside a function body — all produce Except.error.
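The call-only accounting can be made concrete on a toy language (everything here is illustrative):

```lean
-- Toy model of call-only fuel: fuel decrements exactly at `call`; all other
-- recursion is structural on the term, so the measure (fuel, sizeOf t) is
-- lexicographically decreasing and the evaluator is total.
inductive T where
  | lit  : Nat → T
  | add  : T → T → T
  | call : Nat → T   -- call a function by index (toy)

def eval (bodies : Nat → T) (t : T) (fuel : Nat) : Except String Nat :=
  match t, fuel with
  | .lit n,   _     => .ok n
  | .add a b, f     => do pure ((← eval bodies a f) + (← eval bodies b f))
  | .call _,  0     => .error "outOfFuel"
  | .call i,  f + 1 => eval bodies (bodies i) f
termination_by (fuel, sizeOf t)
```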

M0 form (on Source.Term).

Signature:

Source.Eval.runFunction :
  Decls → Global → List Value → IOBuffer → Nat →
    Except SourceError (Value × IOBuffer)

Covers every Source.Term constructor. Does not solve the pointer representation divergence (§2.6 item 6): source pointers are still a single global counter, so a program that uses ptr_val across different flat widths would disagree with Rust. No current test triggers this.

M3 form (on Concrete.Term).

With types available at runtime, fix the pointer representation and align with Rust's bucketed memory:

  1. Value.pointer : Nat → Nat → Value carrying (width, index). flattenValue on .pointer w i = #[.ofNat i] (width is type-level, never in the flat encoding).
  2. Store : IndexMap Nat (IndexMap (Array G) Unit) — width buckets keyed over flat Array G. Matches Rust's memory_queries exactly.
  3. .store t computes the flat width of t.typ via typFlatSize, flattens t's value, inserts into the right bucket.
  4. .load ptr reads the width from the typed .load node, dispatches to the right bucket, unflattens via the element type.
  5. .ptrVal (if kept) returns the within-bucket index as a field element. Source and bytecode now agree on pointer numeric values. (See the discussion of whether to keep ptrVal at all in §10 open decisions.)

After M3, Concrete.Eval and Bytecode.Eval produce identical IOBuffers and identical flat results on every program.

Signature:

Concrete.Eval.runFunction :
  Concrete.Decls → Global → List Value → IOBuffer → Nat →
    Except SourceError (Value × IOBuffer)
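The width-bucketed store discipline can be sketched with toy stand-ins (G and IndexMap replaced by simple types; helper names hypothetical):

```lean
import Std.Data.HashMap

-- Toy sketch of the M3 store: buckets keyed by flat width; a pointer is
-- (width, within-bucket index), and only the index enters the flat encoding.
abbrev G := Nat                          -- stand-in for the field element
abbrev Bucket := Array (Array G)         -- stand-in for IndexMap (Array G) Unit
abbrev Store := Std.HashMap Nat Bucket   -- stand-in for IndexMap Nat Bucket

/-- Insert a flattened value into its width bucket; return the new store and
the pointer (width, within-bucket index). -/
def storeFlat (s : Store) (gs : Array G) : Store × (Nat × Nat) :=
  let width := gs.size
  let bucket := s.getD width #[]
  (s.insert width (bucket.push gs), (width, bucket.size))

/-- Read back by (width, index); dispatches to the right bucket. -/
def loadFlat (s : Store) (p : Nat × Nat) : Option (Array G) :=
  (s.getD p.1 #[])[p.2]?
```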

4.2 Bytecode reference evaluator — NEW, Ix/Aiur/Bytecode/Eval.lean

Mirrors src/aiur/execute.rs in big-step style: mutually recursive evalBlock / evalCtrl / evalOp, total, fuel-indexed.

structure EvalState where
  map      : Array G
  memory   : IndexMap Nat (IndexMap (Array G) Unit)
  ioBuffer : IOBuffer

Bytecode.Eval.evalBlock :
  Bytecode.Toplevel → Block → EvalState → Nat →
    Except BytecodeError (Array G × EvalState)

memory has one bucket per width in Toplevel.memorySizes (Bytecode.lean:90), mirroring Rust's QueryRecord.memory_queries (execute.rs:36-40). Each Op.store values uses values.size as the width key.

Differences from execute.rs:

  • No QueryRecord. Drop multiplicities, byte queries, per-function call queries — trace-side bookkeeping.
  • No call cache. §4.1 explanation applies symmetrically.
  • No unconstrained branching. Both branches of every if unconstrained produce the same value; they differ only in whether a query is logged.
  • No stack machine. Direct big-step. Easier to reason about than small-step, and the two coincide here because the language is deterministic.
  • Call-only fuel. Decrements only on Op.call.
  • Errors return, never panic. All failures Except.error.

Signature:

Bytecode.Eval.runFunction :
  Bytecode.Toplevel → FunIdx → Array G → IOBuffer → Nat →
    Except BytecodeError (Array G × IOBuffer)

4.3 The equivalence relation

Ix/Aiur/Semantics/Relation.lean (NEW):

inductive ValueEq (decls : Decls) (funcIdx : Global → Option Nat) :
    Value → Typ → Array G → Prop
  | unit    : ValueEq .. .unit .unit #[]
  | field   : ValueEq .. (.field g) .field #[g]
  | pointer : ValueEq .. (.pointer w i) (.pointer _) #[.ofNat i]
  | fn      : ValueEq .. (.fn g) _ #[.ofNat (funcIdx g).getD 0]
  | tuple   : ... -- concat of element encodings
  | array   : ...
  | ctor    : ... -- tag + args + padding, matching dataTypeFlatSize

def InterpResultEq decls funcIdx
    (src : Except SourceError (Value × IOBuffer))
    (bc  : Except BytecodeError (Array G × IOBuffer)) (retTyp : Typ) : Prop :=
  match src, bc with
  | .ok (v, io₁),  .ok (gs, io₂)  => ValueEq decls funcIdx v retTyp gs ∧ io₁ = io₂
  | .error _,      .error _       => True
  | _,             _              => False

ValueEq is a propositional mirror of flattenValue. The equivalence proof for Lower (M6) goes by induction on Concrete.Term and maintains the local invariant that the bytecode map slice at the compiled indices contains the flattened form of every live binding.

5. The Simulation Invariant

5.1 StructCompatible

The simulation between source and bytecode must thread through every sub-call. We split this into a structural part — pure facts about the compilation output, provable as a standalone lemma — and a semantic part, established by simultaneous induction with compile_preservation.

-- Ix/Aiur/Semantics/Compatible.lean (NEW)
structure StructCompatible
    (decls   : Decls)
    (bc      : Bytecode.Toplevel)
    (nameMap : Global → Option Bytecode.FunIdx) : Prop where
  total_on_reachable :
    ∀ name f, decls.getByKey name = some (.function f) →
    reachableFromEntries decls name →
    ∃ fi, nameMap name = some fi
  funIdx_valid :
    ∀ name fi, nameMap name = some fi → fi < bc.functions.size
  input_layout_matches :
    ∀ name fi f, nameMap name = some fi →
    decls.getByKey name = some (.function f) →
    (f.inputs.map Prod.snd).foldl
      (fun acc t => acc + typFlatSize decls t) 0 =
    bc.functions[fi]!.layout.inputSize
  call_graph_closed :
    ∀ fi, fi < bc.functions.size →
    ∀ callee, callee ∈ collectAllCallees bc.functions[fi]!.body →
    callee < bc.functions.size

theorem compile_ok_implies_struct_compatible :
    t.compile = .ok ct →
    StructCompatible t.decls ct.bytecode ct.getFuncIdx

This lemma is proved standalone, by induction on the compilation passes.

5.2 The full simulation pattern

The semantic clause — for every compiled function pair, source and bytecode bodies produce matching observations — references the theorem we're proving. So it cannot be proved standalone; it is proved simultaneously with compile_preservation by induction on fuel.

compile_preservation (see the full signature in §1.2) goes by strong induction on fuel. At the outer call site, we use StructCompatible (§5.1) to look up funIdx, equate flat input sizes, and bound the callee index. At every inner Op.call, the induction hypothesis at fuel - 1 discharges the recursive equivalence.

This is CompCert's match_states / fsim_match_states pattern. The adaptation uses big-step semantics with fuel as the well-founded measure, but the structure is identical: StructCompatible plays the global part of match_states, and the per-state "register file matches flattened env" condition (§4.3) plays the local part.

5.3 Fuel discipline

Claim: the fuel transformer φ at every pass is id. Source and bytecode share a single fuel budget 1-for-1. Why:

  1. Call-only accounting. Both evaluators decrement fuel only at function-call boundaries. Intra-body recursion is structural.
  2. Calls are preserved 1-to-1 by every pass.
    • simplify touches the pattern language; never adds or removes calls.
    • concretize is name substitution: every polymorphic call becomes exactly one monomorphic call.
    • lower (Compiler/Lower.lean:177-207) compiles every .app to exactly one Op.call.
    • dedup merges bisimilar bodies but leaves call sites intact.
    • needsCircuit toggles a flag the evaluator ignores.

For every source evaluation that makes k calls before terminating, the compiled bytecode makes exactly k calls. The two terminate at the same fuel bound.
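The per-pass obligation this claim induces has the following abstract shape (every name here is a placeholder bound by `variable`):

```lean
section
variable {Src Tgt Err : Type}
variable (pass : Src → Except Err Tgt)
variable (callsSrc : Src → Nat → Nat)  -- calls a source run makes at given fuel
variable (callsTgt : Tgt → Nat → Nat)  -- calls a target run makes at given fuel

/-- "Calls preserved 1-to-1": at equal fuel, the compiled output makes exactly
as many calls as its source, so one fuel budget serves both sides. -/
def PreservesCallCount : Prop :=
  ∀ s t fuel, pass s = .ok t → callsSrc s fuel = callsTgt t fuel
end
```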

If a future pass blows up calls (e.g., inlining), introduce a distinct φ only for that pass — do not weaken the global invariant.

6. The Proof

6.1 Pass-by-pass obligations

For each compiler pass p : Stageₙ → Stageₙ₊₁, prove two companion lemmas:

-- preservation
theorem pass_preserves {src : Stageₙ} {tgt : Stageₙ₊₁}
    (h : p src = .ok tgt) :
    ∀ env args io fuel,
      evalₙ src env args io fuel ≈ evalₙ₊₁ tgt env args io fuel

-- progress
theorem pass_progress {src : Stageₙ} (hwf : Stageₙ.WellFormed src) :
    ∃ tgt, p src = .ok tgt

The two compose by transitivity of InterpResultEq for preservation and by bind chaining on Except for progress. Each per-pass progress lemma is typically ≤ 30% of the corresponding preservation proof: the case analysis is identical, only the conclusion changes.
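The bind-chaining step for progress is mechanical. A minimal self-contained sketch (the stage types, pass names, and error type below are placeholders, not the real Ix/Aiur identifiers):

```lean
-- Sketch: two per-pass progress facts chain through `Except.bind`.
-- `Stage₁`/`Stage₂`/`Stage₃`, `p₁`/`p₂`, `Err` are illustrative placeholders.
theorem pipeline_progress {Stage₁ Stage₂ Stage₃ Err : Type}
    {p₁ : Stage₁ → Except Err Stage₂} {p₂ : Stage₂ → Except Err Stage₃}
    {src : Stage₁}
    (h₁ : ∃ s₂, p₁ src = .ok s₂)
    (h₂ : ∀ s₂, p₁ src = .ok s₂ → ∃ s₃, p₂ s₂ = .ok s₃) :
    ∃ s₃, (p₁ src).bind p₂ = .ok s₃ := by
  obtain ⟨s₂, hs₂⟩ := h₁
  obtain ⟨s₃, hs₃⟩ := h₂ s₂ hs₂
  -- after rewriting with hs₂, `(.ok s₂).bind p₂` reduces to `p₂ s₂`
  exact ⟨s₃, by rw [hs₂]; exact hs₃⟩
```

Iterating this across all passes gives compile_progress; the preservation side composes analogously, via transitivity of InterpResultEq rather than via bind.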

6.2 Proof order — easy to hard

  1. needsCircuit semantic irrelevance. Sets a constrained flag the bytecode evaluator ignores. One-liner, absorbed into M9 as a trivial rfl-level step — no milestone of its own.
  2. deduplicate preservation. Bytecode-only pass; bisimulation over Block up to call-index renaming, with cycles handled by well-founded induction on fuel. (M4)
  3. toBytecode (Lower). The largest single proof. Source is Concrete.Term; target is a flat Block with a map : Array G register file. Local invariant: the bytecode map slice at the compiled indices contains the flattened form of every live binding. (M5 straight-line, M6 full)
  4. concretize. Values are type-erased; substitution changes types but not value denotation. Short proof. (M7)
  5. simplify / match compiler. Adapt Maranget's correctness proof; CompCert Cminorgenproof and CakeML pat_to_dec are the reference adaptations. Hardest single proof. (M8)
  6. Checker soundness. The type annotations on Typed.Term must agree with what the evaluator actually produces — otherwise Lower cannot trust typSize during layout. This is M9.5; full statement in §1.3 and §8. With intrinsic typing (v2 of §3.3) the milestone vanishes because well-typedness becomes structural.

The order and dependencies above are realized as a concrete milestone schedule in §8; its dependency DAG (§8.1) shows how the proof obligations parallelize and where each one sits on the critical path.

7. Refactoring Work (Pre-Proof)

7.1 File layout

Ix/Aiur/
  Source/Term.lean                (Stage 1; renamed from Term.lean; flatten Data layer)
  Source/Eval.lean                NEW (M0; reference evaluator on Source.Term)
  Typed/Term.lean                 (Stage 2; renamed from TypedTerm.lean; flatten TypedData layer)
  Simple/Term.lean                NEW (Stage 3; typed)
  Concrete/Term.lean              NEW (Stage 4 + Concrete.Typ)
  Concrete/Eval.lean              NEW (M3; reference evaluator on Concrete.Term; proof-bearing)
  Bytecode.lean                   (Stage 5; unchanged)
  Bytecode/Eval.lean              NEW (M1; cache-free Lean reference)
  Bytecode/ExecuteFfi.lean        NEW (opaque Rust FFI; split from Protocol.lean)
  Interpret.lean                  UNCHANGED in role (debug interpreter with cache +
                                  stack trace); M0 applies io bug fixes for Rust parity
  Semantics/
    Relation.lean                 NEW (ValueEq, InterpResultEq)
    Compatible.lean               NEW (StructCompatible)
    Flatten.lean                  NEW (total flattenValue / unflattenValue)
  Compiler/
    Check.lean                    unchanged API; total internals
    Match.lean                    consumes Typed.Term, produces Simple.Term
    Simple.lean                   Typed.Term → Simple.Term bridge
    Concretize.lean               Simple.Term → Concrete.Term
    Lower.lean                    Concrete.Term → Bytecode.Toplevel
    Dedup.lean                    unchanged; total
    Layout.lean                   gadget arm removed
  Proofs/
    ValueEqFlatten.lean
    StructCompatible.lean         compile_ok_implies_struct_compatible
    WellFormed.lean
    DedupSound.lean               M4
    LowerSoundCore.lean           M5
    LowerSoundControl.lean        M6
    ConcretizeSound.lean          M7
    SimplifySound.lean            M8
    CompilerPreservation.lean     M9
    CheckSound.lean               M9.5 (checker soundness; required)
    CompilerProgress.lean         M10
    CompilerCorrect.lean          M10 (final theorem)

7.2 Killing partial

Grep finds 92 partial defs across Ix/Aiur/**. They fall into three termination categories, each handled differently in M2:

  1. Structural recursion on the term / block / type tree. The bulk (80+). Become def ... termination_by on the obvious structural measure. The pain point is Lean's derived sizeOf not seeing through nested Array.map / foldl; expect ~1 AI-day on a small Ix/Array/Induction.lean helper library before the rest can be discharged.

  2. Cross-function recursion in the reference evaluators. Today's pattern Interpret.lean:251 applyGlobal → :279 interp → applyGlobal is genuinely unbounded. The reference evaluators (Source/Eval.lean, Concrete/Eval.lean) mirror this structure and handle it with an explicit fuel parameter that strictly decreases on every applyGlobal, plus a termination proof on (fuel, term) lexicographic.

  3. Genuinely non-trivial termination. Two cases:

    • Concretize.lean:361 TypedDecls.concretize — worklist over newly discovered instantiations, no finiteness guarantee today. The M0 static check rejects polymorphic recursion; termination of the worklist then follows by structural induction on the type-argument DAG.
    • matchPattern through .pointer recurses into the store, but is bounded by pattern depth (the recursion sees a structurally smaller pattern at every step). Already def (total) in Interpret.lean:138; the reference evaluators reuse the same approach.
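The fuel-plus-lexicographic pattern from category 2 can be exercised in miniature. A toy mirror of the applyGlobal ↔ interp cycle (toy datatype and names, not the real evaluator; intended to go through under recent Lean 4's default decreasing tactic):

```lean
-- Toy mirror of the interp ↔ applyGlobal cycle. Termination is by the
-- lexicographic measure (fuel, sizeOf term): `applyGlobal` strictly
-- decrements fuel at the call boundary; `interp` recurses only on
-- structurally smaller terms under the same fuel.
inductive Tm where
  | lit  : Nat → Tm
  | add  : Tm → Tm → Tm
  | call : Nat → Tm          -- call a global by index

mutual
  def interp (bodies : List Tm) (t : Tm) (fuel : Nat) : Option Nat :=
    match t with
    | .lit n   => some n
    | .add a b => do pure ((← interp bodies a fuel) + (← interp bodies b fuel))
    | .call f  => applyGlobal bodies f fuel
  termination_by (fuel, sizeOf t)

  def applyGlobal (bodies : List Tm) (f : Nat) (fuel : Nat) : Option Nat :=
    match fuel with
    | 0         => none                            -- out of fuel at the boundary
    | fuel' + 1 => do interp bodies (← bodies[f]?) fuel'
  termination_by (fuel, 0)
end
```

This is also the shape behind the §5.3 fuel discipline: only the `.call`/`applyGlobal` boundary consumes fuel, so call counts and fuel budgets agree 1-for-1 across passes that preserve call sites.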

Disposition by file:

| File | Partials | Fate |
| --- | --- | --- |
| Check.lean | 40 | Total in M2 (category 1). |
| Compiler/** | 39 | Total in M2 (category 1, plus Concretize in category 3). |
| Compiler.lean | 1 | Total in M2. |
| Interpret.lean utilities (flattenValue, unflattenValue, typFlatSize, dataTypeFlatSize) | 4 | Moved to Semantics/Flatten.lean in M2 and made total there (category 1, with a visited set for dataTypeFlatSize). |
| Interpret.lean core (interp, applyGlobal, applyLocal, ppValue, natToHexGo) | 5 | Stay partial. Debug tool, may loop. |
| Meta.lean | 3 | Stay partial. Parser, not proof-load-bearing. |

After M2, the proof stack is partial-free. The 8 remaining partials (5 in Interpret.lean, 3 in Meta.lean) participate in no proof.

7.3 Incidental refactors

These land alongside their driving milestones but are worth calling out so reviewers know they're planned:

  • Flatten the Data / TypedData mutual children into direct Source.Term.tuple / Source.Term.array (and Typed.Term equivalents). Cleaner induction principles. Lands with the M3 rename.
  • Split Ix/Aiur/Protocol.lean into Protocol.lean (keeps AiurSystem, Proof, FRI parameters, buildClaim) and a new Ix/Aiur/Bytecode/ExecuteFfi.lean (keeps the opaque Rust executor). Makes room for Bytecode/Eval.lean (M1) to stand alone.
  • Delete the double checkFunction pass. Today checkAndSimplify (Compiler/Simple.lean:38-54) runs the checker twice because simplify operates on untyped Term. Once Match.lean consumes Typed.Term (M3), the second check disappears.

All three are folded into the milestone that enables them; there is no separate "refactor milestone".

8. Milestones

Time unit: AI-day. One AI-day is a focused interactive day driving a Lean proof session, including edit/elaborate/iterate cycles. It is not equivalent to an engineer-day: compile latency and proof iteration depth dominate, and much of the "thinking" time is compressed or amortized across parallel attempts.

Dependencies form the DAG in §8.1 below. Headline numbers:

| Metric | Value |
| --- | --- |
| Sequential sum (all milestones back-to-back) | ~26–42 AI-days |
| Critical path (M0→M1→M2→M3→M5→M6→M9→M9.5→M10) | ~20–32 AI-days |
| Wall clock with parallelism (M4/M7/M8 alongside) | ~5–7 weeks |
| Risk-adjusted (M3, M6, M8, or M9.5 stalling) | ~8 weeks |

M3, M6, M8, and M9.5 are the four milestones most likely to slip. M9.5 is on the critical path and mandatory — skipping it would leave a hole in the simulation proof at M6 (Lower consults typSize on type annotations that nothing guarantees are accurate).

8.1 Dependency graph

                   ┌─→ M4 ───────────────────┐
                   │                         │
M0 ─→ M1 ─→ M2 ────┤                         │
                   │                         │
                   └─→ M3 ─┬─→ M5 ─→ M6 ─────┼─→ M9 ─→ M9.5 ─→ M10
                           ├─→ M7 ───────────┤
                           └─→ M8 ───────────┘
  • M4 (dedup) needs Bytecode.Eval (M1) and totality (M2), but not the staged datatypes. It starts as soon as M2 is done, running in parallel with M3.
  • M5/M6, M7, M8 all need M3 (staged datatypes, Concrete.Eval). They run concurrently after M3.
  • M9 composes preservation from M4 + M6 + M7 + M8 using StructCompatible (the standalone invariant from §5.1).
  • M9.5 proves checkAndSimplify sound — every type annotation in its output is semantically accurate. Required by M6 (Lower consults typSize on annotations) and by M10 (WellFormed decomposition).
  • M10 adds progress across all passes and yields compile_correct.

Each milestone below ships independently and keeps the tree green.

M0 — Semantic-parity fixes and source reference evaluator — ~1–2 AI-days

Two concerns land together in M0:

Part A: fix the latent bugs in §2.6. Each fix lives in whichever file originally contained the bug, and each lands with the test that exercises it — none of these bugs is observable on the existing corpus, so without new tests the PRs would be invisible to CI.

| File | Fix | New test |
| --- | --- | --- |
| Ix/Aiur/Interpret.lean:530 | ioSetInfo errors on existing key | Set [0] then set [0] again. |
| Ix/Aiur/Interpret.lean:533-537 | ioRead errors on OOB | Read len=4 past data.size. |
| Ix/Aiur/Compiler/Simple.lean:22 + Compiler/Match.lean | Non-exhaustive match becomes a CheckError instead of bubbling to unreachable! | Program with non-exhaustive match on G. |
| Ix/Aiur/Compiler/Concretize.lean:361 | Static monomorphization-termination check (rejects polymorphic recursion) | Polymorphic-recursive template. |
| Ix/Aiur/Compiler/Lower.lean:31-40, 575 + :45-73 | Remove dead Layout.gadget / GadgetLayout / gadgetIdx | None new — code deletion. |

The two Interpret.lean fixes keep Interpret.lean aligned with Rust on observable behavior. The callCache and stack trace are not touched — they are part of the debug interpreter's contract. The other three fixes are in the compiler pipeline, not in the interpreter.

Part B: create Ix/Aiur/Source/Eval.lean. Fresh cache-free, stack-trace-free, fuel-indexed reference evaluator on Source.Term. Structure may mirror Interpret.lean for diffability but the two files share no code. §4.1 lists the design.

partial still permitted in this file through M0; totality comes in M2 (where Source/Eval.lean is one of the files made total, Interpret.lean is not).

Exit: every Part A fix has a test that fails on main and passes on its branch. Ix/Aiur/Source/Eval.lean compiles and agrees with Interpret.lean on the full corpus (on value and IOBuffer) via a new cross-test added to Tests/Aiur/Common.lean. lake test green.

M1 — Lean-native bytecode reference — ~1–2 AI-days

Ix/Aiur/Bytecode/Eval.lean: total, fuel-indexed, error-returning, big-step, cache-free. Per-width memory buckets keyed from Toplevel.memorySizes. partial still allowed; totality in M2.

Extends the cross-test. Tests/Aiur/Common.lean:runTestCase already validates Interpret.lean against Rust (on value + IOBuffer). M0 added a check that Source.Eval agrees with Interpret.lean. M1 adds the fourth node: Bytecode.Eval against the other three.

The full cross-test at M1:

           Rust.executeFfi
          ╱
Interpret.lean ─── (existing)
          ╲
           Source.Eval ─── (M0 test)
              ╲
               Bytecode.Eval ─── (M1 test)

Every test asserts that the four agree on value and IOBuffer. Any divergence is a ticket.
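What each corpus case asserts can be sketched as follows — every identifier here (the TestCase shape, the runner signatures) is a placeholder for the real harness API in Tests/Aiur/Common.lean, not its actual surface:

```lean
-- Hypothetical four-way agreement check; all field and runner names are
-- illustrative. Each result bundles the returned value and the final
-- IOBuffer, so equality covers both observables at once.
def crossCheck (tc : TestCase) : IO Unit := do
  let rust   ← Rust.executeFfi tc.bytecode tc.funIdx tc.flatArgs tc.io₀
  let interp := Interpret.run tc.decls tc.entry tc.args tc.io₀
  let src    := Source.Eval.run tc.decls tc.entry tc.args tc.io₀ tc.fuel
  let byte   := Bytecode.Eval.run tc.bytecode tc.funIdx tc.flatArgs tc.io₀ tc.fuel
  -- agreement on value *and* IOBuffer across all four evaluators
  unless rust == interp && interp == src && src == byte do
    throw <| IO.userError s!"cross-test divergence in {tc.name}"
```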

Expected divergence at M1: a program that calls an IO-effecting function twice with the same args would show Interpret.lean / Rust.executeFfi (cached, one write) disagreeing with Source.Eval / Bytecode.Eval (cache-free, two writes). No current test exercises this; if one does, it becomes a known divergence rather than a bug.

Exit: four-way agreement on the M0-extended corpus, or documented divergences with tickets.

M2 — Totality — ~3–5 AI-days

Make every proof-load-bearing partial def total. Scope:

  • Ix/Aiur/Check.lean (40 partials) — the largest mechanical lift.
  • Ix/Aiur/Compiler/** (39 partials across Concretize, Dedup, Lower, Layout, Match, Simple).
  • Ix/Aiur/Source/Eval.lean (the reference evaluator created in M0).
  • Ix/Aiur/Bytecode/Eval.lean (the reference evaluator created in M1).
  • Move the flattening utilities out of Interpret.lean into a new Ix/Aiur/Semantics/Flatten.lean: flattenValue, unflattenValue, typFlatSize, dataTypeFlatSize. These are used by the reference evaluators and by the theorem statement, so they must be total. Add a visited : HashSet Global parameter to dataTypeFlatSize / typFlatSize mirroring Layout.DataType.size:49-57. Update Tests/Aiur/Common.lean and Interpret.lean to import from the new location.

Fuel where genuinely needed (§7.2 category 2). Check.lean's 40 partials through nested arrays need real iteration; budget ~1 AI-day on Ix/Array/Induction.lean helpers before the bulk can be discharged.

Out of scope: Ix/Aiur/Interpret.lean itself. After M2 it still contains partial def interp, partial def applyGlobal, partial def applyLocal, partial def ppValue, partial def natToHexGo (5 partials; matchPattern is already def). The core interpreter loop can loop on divergent programs — that is a feature of the debug tool, not a bug. The pretty printers are not semantic. None of these participate in any proof.

Exit: grep -rn "^partial def" Ix/Aiur/ returns only:

  • Interpret.lean: interp, applyGlobal, applyLocal, ppValue, natToHexGo — 5 deliberate partials (debug interpreter core + pretty printers; matchPattern is already def).
  • Meta.lean: 3 partials (parser, not proof-load-bearing).

Every other file in Ix/Aiur/** is partial-free.

M3 — Stage datatypes and typed source reference — ~4–6 AI-days

The largest refactor milestone. Two coupled changes, each with sub-tasks:

Part A: staged datatypes.

  • Rename Ix/Aiur/Term.lean → Source/Term.lean and Ix/Aiur/TypedTerm.lean → Typed/Term.lean; update every import.
  • Introduce new files Simple/Term.lean (§3.4) and Concrete/Term.lean + Concrete/Typ (§3.5).
  • Rewrite Compiler/Match.lean to consume Typed.Term and produce Simple.Term directly (pure decision-tree compiler, see §9 risk 3).
  • Rewrite Compiler/Concretize.lean from TypedDecls → TypedDecls to Simple.Term → Concrete.Term.
  • Rewrite Compiler/Lower.lean (toIndex, TypedTerm.compile) to consume Concrete.Term.
  • Delete the double checkFunction pass in Compiler/Simple.lean:38-54.
  • Delete the 20+ unreachable! / Impossible / Should not happen sites in Compiler/** now ruled out by types (§2.7).
  • Flatten Data / TypedData mutual children into direct constructors (§7.3).
  • Split Ix/Aiur/Protocol.lean into Protocol.lean + Bytecode/ExecuteFfi.lean (§7.3).

Part B: typed reference evaluator.

  • Create Ix/Aiur/Concrete/Eval.lean, a fresh evaluator on Concrete.Term (§4.1 M3 form). Width-bucketed Store, Value.pointer (w, i), typed .store / .load / .ptrVal.
  • Add the first ptr_val-using test to lock the new pointer semantics (skip if ptr_val is removed per §10 decision 3).

Concrete.Eval.lean supersedes Source/Eval.lean in the proof chain; Source/Eval.lean stays around only as the M1 cross-test witness.

This is the milestone most likely to slip. Total code churn is on the order of 2,000–4,000 lines, spread across ~15 files. The budget assumes that M2 totality work has already smoothed out Check.lean and Compiler/** — if any residual partial def blocks a type refactor, expect extra iteration.

Exit: Toplevel.compile type-checks through the new stages; existing tests pass; grep -rn "unreachable!\|Impossible\|Should not happen" Ix/Aiur/Compiler/ returns ≤ a handful with documented reasons; Concrete.Eval and Bytecode.Eval agree pointer-for-pointer on every program in the cross-test.

M4 — Dedup soundness — ~1–2 AI-days — parallel with M5/M7

Preservation: Bytecode.Eval.run t idx args io fuel ≈ Bytecode.Eval.run t.dedup.1 (t.dedup.2 idx) args io fuel. Progress: deduplicate is total (trivial). Bisimulation up to call-index renaming, cycles handled by well-founded induction on fuel.

Exit: Proofs/DedupSound.lean compiles.

M5 — Lower proof, straight-line subset — ~2–3 AI-days

Preservation + progress for Lower.toIndex on the fragment of Concrete.Term excluding .match, function calls, .store / .load, IO, and u8/u32 ops. Establishes the local simulation relation: bytecode map slice at compiled indices = flattened bound value.

Exit: lemma covers every non-control constructor.

M6 — Lower proof, full — ~3–5 AI-days — parallel with M8

Extend M5 to:

  • .match in tail position.
  • Non-tail .match via Ctrl.matchContinue, only for the let-RHS form matching findNonTailMatch. Matches in other positions are excluded by the Concrete.Term shape established in M3.
  • .ret / Ctrl.return / Ctrl.yield.
  • Function calls (Op.call); the unconstrained flag is semantically equivalent in both branches (§4.2).
  • .store / .load with width-bucketed memory (M3 payoff).
  • IO operations — the IOBuffer clause of the main theorem is discharged here.
  • u8/u32 ops — bv_decide or a library of bitvector lemmas.

Exit: Lower.compile_preservation and Lower.compile_progress for every Concrete.Term.

M7 — Concretize soundness — ~1 AI-day — parallel with M5/M6

Preservation: substInTypedTerm subst body has the same value denotation as body under the original type environment (type erasure). Progress: the M0 termination check bounds the worklist.

Exit: Proofs/ConcretizeSound.lean compiles.

M8 — Match compiler soundness — ~3–5 AI-days — parallel with M6

Adapt Maranget's correctness proof for decision-tree pattern compilation. For every Simple.Term produced from a Typed.Term by the rewritten simplifyTerm, the decision tree matches exactly when the original nested pattern would have matched, and binds the same locals to the same values.

References: CompCert Cminorgenproof, CakeML pat_to_dec, Maranget 2008. Largest single proof; iteration depth is the bottleneck.

Exit: Proofs/SimplifySound.lean compiles.

M9 — Top-level preservation — ~1 AI-day

Toplevel.compile_preservation from M4 + M6 + M7 + M8 by transitivity of InterpResultEq, threading fuel / StructCompatible / typing assumptions through the composition.
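Abstracting away the pass-specific content, the composition step is tiny once transitivity of the result relation is in hand (R stands in for InterpResultEq at a fixed argument profile; the real proof additionally threads StructCompatible and the typing hypotheses through each link):

```lean
-- Abstract shape of the M9 composition: pointwise preservation facts for
-- consecutive passes compose through transitivity of the result relation.
theorem preserve_compose {Result : Type} {R : Result → Result → Prop}
    (htrans : ∀ a b c, R a b → R b c → R a c)
    {evalA evalB evalC : Nat → Result}
    (hAB : ∀ fuel, R (evalA fuel) (evalB fuel))
    (hBC : ∀ fuel, R (evalB fuel) (evalC fuel)) :
    ∀ fuel, R (evalA fuel) (evalC fuel) :=
  fun fuel => htrans _ _ _ (hAB fuel) (hBC fuel)
```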

Exit: Proofs/CompilerPreservation.lean compiles.

M9.5 — Checker soundness — ~3–5 AI-days

Prove that checkAndSimplify produces semantically coherent annotations:

theorem checkAndSimplify_sound
    {t : Toplevel} {decls : Decls} (hdecls : t.mkDecls = .ok decls)
    {typedDecls : TypedDecls} (h : t.checkAndSimplify = .ok typedDecls) :
    ∀ {name f}, typedDecls.get? name = some (.function f) →
    ∀ env, WellFormedEnv env →
    let ⟨v, _⟩ := Source.Eval.runTypedBody decls f env
    ValueShapeMatches v f.output ∧
    (flattenValue decls funcIdx v).size = typSize decls f.output

This is the property Lower.preservation (M6) needs when it consults typSize on TypedTerm.typ. Without it, the Lower proof has no grounds to claim its layout computation matches the runtime value.

The proof is by induction on the Check.lean inferTerm structure (40 mutual arms) plus the simplify pass (which preserves annotations through the match compiler). Heavy use of the M2 totality proofs for the recursion structure; each arm has a bespoke soundness argument.

Required, not optional. Without M9.5, M6 has a hole: "type annotations are accurate" is an unjustified assumption, and the claim that Lower preserves semantics cannot be discharged. The full-stack theorem is a chain that breaks at this link.

If intrinsic typing is adopted (risk 1 / §3.3), M9.5 vanishes: the Typed.Term : Ctx → Typ → Type datatype makes coherence structural. This is the one path that avoids the work.

Exit: Proofs/CheckSound.lean compiles. M6 and M7 can cite checkAndSimplify_sound in their preservation proofs.

M10 — Progress and compile_correct — ~1–2 AI-days

  1. Define WellFormed (§1.3) in Proofs/WellFormed.lean. Since every conjunct is a computable .ok observation, no ghost predicates appear.
  2. Per-pass progress lemmas (most are byproducts of preservation). The mkDecls and checkAndSimplify lemmas are trivial: their success is already hypothesized by WellFormed.
  3. Compose into compile_progress.
  4. Combine with M9 (preservation) and M9.5 (checker soundness) into compile_correct in Proofs/CompilerCorrect.lean.
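Since every conjunct is the observable success of a computable check, WellFormed can be a plain structure of .ok observations. A sketch of the shape (field and helper names are illustrative; the actual clauses are fixed by §1.3):

```lean
-- Illustrative shape only: each field observes a computable check
-- succeeding, so no ghost predicates are needed and WellFormed can be
-- discharged by running the pipeline front end.
structure WellFormed (t : Toplevel) : Prop where
  declsOk   : ∃ decls, t.mkDecls = .ok decls
  checkOk   : ∃ typed, t.checkAndSimplify = .ok typed
  -- hypothetical Bool-valued check for the §1.3 acyclicity clause
  acyclicOk : constrainedCallGraphAcyclic t = true
```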

Exit: the §1.1 theorem compiles, with no sorry and no trusted components beyond Lean's kernel and the Rust FFI boundary.

9. Risks

  1. Indexed Stage 2 fallback. If M5/M6 stall on the simulation relation, intrinsic typing (inductive Typed.Term : Ctx → Typ → Type) becomes attractive. Budget ~3 AI-days to reboot the checker — the largest single contingency.
  2. Monomorphization-termination check rejects real programs. Run the M0 check against the existing corpus before merging. Fall back to a bounded-instance-count policy if anything legitimate breaks.
  3. runMatchCompiler is stateful (id supply, diagnostics). Pure versions are easier to reason about. Pure-ify in M3 when rewriting it to produce Simple.Term directly.
  4. Array induction ergonomics. Lean 4's derived sizeOf is finicky around nested Arrays. Budget ~1 AI-day in M2 for an Ix.Array.Induction helper library — already included in the M2 estimate but worth calling out.
  5. u8/u32 byte-level reasoning. The u8/u32 ops in Interpret.lean:427-505 and the Rust gadgets use native bit operations. The Lower proof at M6 needs either bv_decide or a hand-built library of byte-level lemmas. Budget within M6.
  6. M9.5 (checker soundness) is open-ended. 3–5 AI-days is a point estimate; the real cost depends on whether the fielded TypedTerm supports a clean soundness proof or whether the checker has to be partially rewritten. If the latter, the Stage-2 intrinsic-typing rewrite (risk 1) may end up being cheaper overall — both obviate M9.5 and fix the same root cause. Decide after spiking the first few inferTerm arms.

10. Open Decisions

Sign-off needed before M0:

  1. Indexed Typed.Term — now or v2. Recommend v2 (§3.3). If adopted now, M9.5 vanishes entirely (intrinsic well-typedness), at the cost of a dependently-typed checker rewrite. The plan currently assumes v1 (fielded typing) + M9.5.
  2. Monomorphization-termination policy — static reject vs bounded count. Recommend static reject (§2.6 item 5).
  3. ptrVal — keep or remove? Zero call sites today. Keeping it costs ~25 lines of proof work across the pipeline; removing it costs a handful of surface-language deletions. Recommend remove in M0 cleanup; §4.1 M3 step and §2.6 item 6 become no-ops.

All other design choices are already folded into the plan:

  • Fuel-indexed evaluation, call-only accounting. §1.6, §4.1, §4.2, §5.3.
  • callCache absent from the Lean reference evaluators, kept in Interpret.lean and production Rust. §2.2, §4.1.
  • Non-exhaustive match is a CheckError. M0 bug fix 4.
  • .debug is transparent in the semantic relation. §1.7.
  • Constrained-call-graph acyclicity is a WellFormed clause, not a checker obligation. §1.3.
  • Non-tail match in Simple.Term is restricted to let-RHS chains (matches the current compiler). §3.4.
  • Store bucketing is deferred to M3 (requires runtime types). §2.6 item 6, §4.1 M3 step.

11. Appendix — Concrete Numbers

Verified by grep over Ix/Aiur/**:

  • 92 total partial defs — per-file breakdown and disposition in §7.2.
  • 26 unreachable! / throw "Impossible ..." / "Should not happen ..." sites — 24 in Compiler/**, 2 in Check.lean; Lower.lean alone has 20. Enumerated in §2.7.
  • Source LOC to touch: ~3,500 (pipeline) + ~2,800 (tests to extend).
  • New LOC estimate: ~1,200 (datatypes + two reference evaluators + StructCompatible + WellFormed) + ~5,000–7,000 (proofs).
