Parallel eval test runner and test migration #9285

Closed
lukewilliamboswell wants to merge 30 commits into main from eval-test-runner

@lukewilliamboswell (Collaborator) commented Mar 23, 2026

Summary

src/eval/test/parallel_runner.zig is a standalone binary that runs eval tests across multiple threads using a work-stealing job queue. Each test runs the interpreter, dev backend, and wasm backend, then compares all results via Str.inspect string comparison. Crash protection (setjmp/longjmp + signal handlers) allows recovery from segfaults.

The runner also supports coverage runs through the --coverage flag and a separate coverage-eval build step.

🤖 Generated with Claude Code

lukewilliamboswell and others added 19 commits March 23, 2026 15:14
Replace the sequential Zig built-in test runner for eval tests with a
standalone parallel binary. Worker threads pull tests from a shared
atomic index, each loading its own builtins to avoid shared mutable
state. Crash protection uses threadlocal setjmp/longjmp + signal
handlers (following the snapshot tool pattern) so segfaults are
recorded and the runner continues.
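
The shared-index scheduling described above can be illustrated with a minimal Python sketch. It is not the runner's Zig code: `run_all`, `run_one`, and the result tuples are hypothetical names, and a lock stands in for the atomic fetch-and-add because Python has no direct equivalent.

```python
import threading


def run_all(tests, run_one, num_threads=4):
    """Run every test exactly once across num_threads workers."""
    results = [None] * len(tests)
    next_index = 0                 # stands in for the shared atomic index
    index_lock = threading.Lock()  # guards the index; Zig would use an atomic add

    def claim():
        # Hand out the next unclaimed test index.
        nonlocal next_index
        with index_lock:
            i = next_index
            next_index += 1
        return i

    def worker():
        while True:
            i = claim()
            if i >= len(tests):
                return  # no work left
            try:
                results[i] = ("PASS", run_one(tests[i]))
            except Exception as exc:
                # A failing test is recorded and the worker keeps going.
                results[i] = ("FAIL", str(exc))

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because each index is claimed exactly once, every worker writes to a distinct slot in `results`, so the result array itself needs no locking.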

`zig build test-eval` now builds and runs the new runner.
Supports --filter, --threads, and --verbose via run args.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nd compare results

The parallel eval test runner was only exercising the interpreter. Now each
test also runs the dev, wasm, and llvm backends via Str.inspect, then
compares all outputs to catch cross-backend mismatches.

Key changes:
- compareAllBackends() runs dev/wasm/llvm via helpers.devEvaluatorStr,
  wasmEvaluatorStr, llvmEvaluatorStr and checks agreement
- Restore eval module to zig build test (was accidentally removed)
- Wire test-eval-parallel into zig build test
- Export devEvaluatorStr/wasmEvaluatorStr/llvmEvaluatorStr as pub in helpers.zig
- Fix runTestProblem UB (was passing undefined to cleanup), fix SA.NODEFER
  portability, remove unused ThreadBuiltins, implement dev_only_str
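
As a rough illustration of the cross-backend agreement check, here is a Python sketch. The function name, the dictionary input, and the message format are assumptions for illustration; only the idea of comparing each backend's Str.inspect string comes from the commit message.

```python
def compare_backends(outputs):
    """Compare inspect-style output strings across backends.

    outputs maps a backend name to its output string, or to None if that
    backend was not run. Returns None when every backend that ran agrees,
    otherwise a human-readable mismatch message.
    """
    ran = {name: out for name, out in outputs.items() if out is not None}
    if len(ran) < 2:
        return None  # nothing to compare against
    # The first backend that ran serves as the reference.
    reference_name, reference = next(iter(ran.items()))
    for name, out in ran.items():
        if out != reference:
            return (f"{name} produced {out!r} "
                    f"but {reference_name} produced {reference!r}")
    return None
```

For example, `compare_backends({"interpreter": "42", "dev": "43"})` reports that `dev` disagrees with `interpreter`, while a backend mapped to `None` is ignored.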

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…l tests

The old eval module tests are temporarily removed from `zig build test`
while tests are ported to the new parallel runner format. The parallel
runner (test-eval) is wired into `zig build test` as the replacement.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… step

The eval runner was hanging under kcov because the dev backend uses fork()
for crash isolation, and kcov can't trace forked children properly.

- Add --coverage CLI flag: disables fork and forces single-threaded
- Add force_no_fork flag to helpers.zig devEvaluatorStr
- Move eval coverage out of `zig build coverage` into standalone
  `zig build coverage-eval` step that passes --coverage to the runner

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
kcov instrumentation skews timing measurements, so suppress the
aggregate stats table and slowest-tests ranking when --coverage is
active. Per-test breakdowns still show in --verbose for debugging.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The coverage-eval step was pulling in the full parser coverage pipeline
via a transitive dependency on mkdir_step. Fixed by giving eval its own
codesign step.

Also made CoverageSummaryStep generic: label and min_coverage are now
configurable so eval coverage prints "EVAL CODE COVERAGE SUMMARY" and
uses its own threshold (0% while tests are being ported).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…unner

- Add `skip` field to TestCase with flags for interpreter/dev/wasm/llvm,
  allowing individual backends to be disabled per test. Any test with a
  skip reports as SKIP rather than PASS to keep partial coverage visible.
- Add per-phase monotonic timing (std.time.Timer) for parse, canonicalize,
  typecheck, interpreter, dev, wasm, and llvm phases with statistical
  summary (min/max/mean/median/stddev/P95) and slowest-5 breakdown.
- Add --help/-h with documentation of all options, timing instrumentation,
  and backend coverage philosophy.
- Update MIGRATE_EVAL_TEST_PROMPT.md with skip field usage examples.
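
The statistical summary listed above (min/max/mean/median/stddev/P95 plus a slowest-N ranking) could be computed along these lines. This is a Python sketch, not the runner's code: the nearest-rank P95 definition and the use of population standard deviation are assumptions, since the commit message does not say which definitions the runner uses.

```python
import statistics


def summarize(durations_ns):
    """Return summary statistics for a non-empty list of phase durations."""
    ordered = sorted(durations_ns)
    n = len(ordered)
    # Nearest-rank P95 using integer ceiling of 0.95 * n (assumed definition).
    rank = (95 * n + 99) // 100
    return {
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "stddev": statistics.pstdev(ordered),  # population stddev (assumption)
        "p95": ordered[rank - 1],
    }


def slowest(named_durations, n=5):
    """Return the n slowest (name, duration) pairs, slowest first."""
    return sorted(named_durations, key=lambda pair: pair[1], reverse=True)[:n]
```

Calling `summarize` once per phase (parse, canonicalize, typecheck, and each backend) would give one row of the summary table per phase.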

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix use-after-free: arena-allocated failure messages are now duped to
  the GPA so they survive arena resets between test iterations.

- Fix signal handler: remove SA.NODEFER to prevent re-entrant signals
  during longjmp. After recovery, explicitly unblock SEGV/BUS/ILL via
  sigprocmask so future crashes are still caught.

- Reduce duplication: consolidate six runTest* functions into a single
  runNormalTest with a switch on Expected variant. Extract runBackend
  helper for compareAllBackends. Rewrite runTestProblem to reuse
  parseAndCanonicalizeExpr.

- Strict layout checks: remove silent fallbacks in value assertions
  (e.g., i64_val no longer silently handles Dec layout). Each Expected
  variant now validates the exact layout type before reading the value.

- Remove redundant int_dec variant (i64_val already covers integers,
  dec_val covers Dec values).

- Fix i64_val type: i128 -> i64 to match the name.

- Fix test data: untyped number literals default to Dec in Roc, so
  tests now use dec_val instead of i64_val.

- Consistent Timer.start() error handling: use catch unreachable
  everywhere.

- Document LLVM evaluator bitrot in LLVM_EVAL_ISSUE.md (MonoLlvmCodeGen
  and lirExprResultLayout reference removed APIs). Fix monomorphization
  step in llvm_evaluator.zig.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… runner

- Replace ParTestEnv with shared TestEnv (fixes alignment-unsafe realloc
  that used Allocator.realloc instead of rawAlloc+memcpy, and removes
  80 lines of duplicated host ops code)
- Remove numericStringsEqual/boolStringsEquivalent — all backends use
  Str.inspect so direct byte comparison is correct
- Fix compareBackendResults OOM path: return static error string instead
  of null (which silently swallowed real mismatches)
- Remove int_dec variant from migration guide (not implemented)
- Remove hardcoded MAX_THREADS=64, dynamically allocate thread array
  capped by CPU count
- Document signal handler setjmp/longjmp UB as TODO
- Document wasm evaluator thread safety (per-call instances + threadlocal)
- Improve --help to explain the -- separator requirement
- Delete LLVM_EVAL_ISSUE.md (belongs in a GitHub issue, not repo root)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move all tests using supported Expected variants (i64_val, dec_val,
bool_val, str_val, f32_val, f64_val, err_val, problem,
type_mismatch_crash, dev_only_str) from eval_test.zig into the
data-driven eval_tests.zig table consumed by `zig build test-eval`.

Key decision: unsuffixed numeric literals in Roc default to Dec, not
I64. The old runExpectI64 silently converted Dec→int, masking the
actual type. Migrated tests now use .dec_val for unsuffixed literals
and .i64_val only for suffixed integer types (e.g. 42.I64, 255.U8),
making the expected types accurate.

62 test blocks remain in eval_test.zig using helpers that have no
parallel runner variant yet (runExpectRecord, runExpectTuple,
runExpectListI64, runExpectListZst, runExpectEmptyListI64,
runExpectIntDec, runExpectSuccess) plus custom infrastructure tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All 53 closure tests use unsuffixed numeric literals, so numeric results
use .dec_val. String results use .str_val. The old file is deleted and
its refAllDecls removed from mod.zig.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add per-type Expected variants (u8_val, u16_val, u32_val, u64_val,
u128_val, i8_val, i16_val, i32_val, i128_val) to the parallel runner
so type-annotated expressions use the correct storage type. All integer
variants share the same handler pattern via intExpected() helper.

Covers all 10 integer types (U8-U128, I8-I128), F32, F64, Dec,
Dec.to_str, and type mismatch tests. Old file deleted.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All 11 list_refcount files migrated (10 with tests, 1 placeholder).
All tests use unsuffixed numeric literals → .dec_val. String tests
use .str_val. All files deleted and refAllDecls removed from mod.zig.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nner

Add `inspect_str` Expected variant to the parallel test runner that
compares RocValue.format() output (interpreter) and Str.inspect output
(compiled backends) against an expected string. This enables testing
records, tuples, lists, and other composite types without building
complex structured value comparisons.

Migrates record fold tests (26), list I64/ZST tests (16+6), tuple tests
(2), Dec fold/sum tests (6), literal evaluation tests (~15), and issue
regression tests to the parallel runner (987 total test cases).

5 tests remain in eval_test.zig: 2 infrastructure tests (crash callback,
ModuleEnv serialization), 3 tag-union-result tests that can't use
inspect_str (RocValue.format hits unreachable for tag_union layout).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nner

- Add TagUnionNotSupported error to interpreter and RocValue.format()
  so tag union tests can gracefully fall back to compiled-backend comparison
- Migrate 3 tag union regression tests from eval_test.zig to parallel runner
- Fix formatting/indentation across eval_tests.zig test cases
- Update dev_object snapshot hashes for nested tag codegen changes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Delete MIGRATE_EVAL_TEST_PROMPT.md (migration task complete)
- Add FUZZ_EVAL_COVERAGE_PROMPT.md for LLM-driven coverage improvement
- Add scripts/eval_coverage_gaps.py to analyze kcov output and find
  uncovered interpreter code regions
- Add SKIP_ALL constant to eval_tests.zig for bug-documenting tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Crash message test → TestEnv.zig (tests its own crash callback)
- ModuleEnv serialization + interpreter test → module_env_test.zig
  (joins existing serialization roundtrip tests)
- Remove eval_test.zig refAllDecls from eval/mod.zig

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@lukewilliamboswell changed the title from "Add parallel eval test runner and migrate 990 tests" to "Parallel eval test runner and test migration" on Mar 23, 2026
lukewilliamboswell and others added 6 commits March 23, 2026 16:25
…hmetic, and strings

Adds ~90 new eval test cases targeting uncovered interpreter paths:
- Shift operations (shift_left_by, shift_right_by, shift_right_zf_by) on I8-I64, U8-U64
- Float/Dec type conversions (F64→int, F32→int, Dec→int) - all skipped (crash)
- Typed int arithmetic (U8, U16, I8, I16, I128, U128 add/sub/mul)
- Typed int comparisons across all int types
- F64/F32 arithmetic and comparisons
- Closure/lambda tests with typed numerics
- Tag union matching with payloads
- String operations (is_empty, starts_with, ends_with, trim, count_utf8_bytes)
- to_str on typed ints (I8, I16, I32, U8, U64, F32, F64)
- SKIP_INTERP constant for interpreter-only failures

Also removes "coverage:" prefix from all test names and uses SKIP_INTERP
where only the interpreter fails vs SKIP_ALL for cross-backend crashes.
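
The per-backend skip flags and the SKIP_INTERP / SKIP_ALL constants can be modelled as below. This Python sketch mirrors the names from the commit messages, but the `Skip` class shape, `backends_to_run`, and `final_status` are hypothetical. Reporting a failed test as FAIL even when it has skips is an assumption.

```python
from dataclasses import dataclass

BACKENDS = ("interpreter", "dev", "wasm", "llvm")


@dataclass(frozen=True)
class Skip:
    """One flag per backend; True means that backend is not run."""
    interpreter: bool = False
    dev: bool = False
    wasm: bool = False
    llvm: bool = False

    def any_set(self):
        return any(getattr(self, name) for name in BACKENDS)


SKIP_INTERP = Skip(interpreter=True)  # only the interpreter fails
SKIP_ALL = Skip(interpreter=True, dev=True, wasm=True, llvm=True)


def backends_to_run(skip):
    """Return the backends that are still enabled for this test."""
    return [name for name in BACKENDS if not getattr(skip, name)]


def final_status(skip, passed):
    """A test with any skip reports SKIP rather than PASS."""
    if not passed:
        return "FAIL"  # assumption: failures outrank skips
    return "SKIP" if skip.any_set() else "PASS"
```

Reporting SKIP instead of PASS keeps partially covered tests visible in the summary, which matches the intent stated in the commit that introduced the `skip` field.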

Coverage: 50.22% → 51.66%

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Parallel runner improvements:
- Hang detection: watchdog thread polls workers every 500ms, kills tests
  that exceed the timeout (default 5s) via SIGUSR1 + child process kill
- Progress reporting: prints "running: N/M results, Xs elapsed" every 1s
- SKIP_ALL validation: tests with all backends skipped still run the
  front-end (parse/check) so syntax errors surface as INVALID_SYNTAX
  failures instead of being silently hidden
- --timeout <MS> CLI flag to configure per-test hang timeout
- New .timeout status for hung tests, reported as HANG in output
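
The hang detection described in this list can be sketched as a polling loop over per-worker start times. The Python below is illustrative only: `WorkerSlots`, `watchdog`, and `on_hang` are hypothetical names, and `on_hang` is a placeholder for whatever the runner does to a hung test (the commit message mentions SIGUSR1 plus killing the child process).

```python
import threading
import time


class WorkerSlots:
    """Tracks when each worker started its current test (None means idle)."""

    def __init__(self, num_workers):
        self._started_ms = [None] * num_workers
        self._lock = threading.Lock()

    def begin(self, worker, now_ms):
        with self._lock:
            self._started_ms[worker] = now_ms

    def finish(self, worker):
        with self._lock:
            self._started_ms[worker] = None

    def hung(self, now_ms, timeout_ms):
        """Return workers whose current test has run longer than timeout_ms."""
        with self._lock:
            return [w for w, s in enumerate(self._started_ms)
                    if s is not None and now_ms - s > timeout_ms]


def monotonic_ms():
    return time.monotonic_ns() // 1_000_000


def watchdog(slots, timeout_ms, stop, on_hang, poll_s=0.5, clock_ms=monotonic_ms):
    """Poll until stop is set, reporting each hung worker via on_hang."""
    while not stop.wait(poll_s):
        for worker in slots.hung(clock_ms(), timeout_ms):
            on_hang(worker)       # placeholder for the runner's kill logic
            slots.finish(worker)  # clear the slot so the hang is reported once
```

Workers would call `begin` before each test and `finish` after it, while the watchdog runs in its own thread with a `threading.Event` as `stop`. Injecting `clock_ms` makes the loop testable without real delays.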

New eval test cases:
- from_str: I64, I32, U64, U8, I8, F64 parsing from strings
- Tag unions: 3-variant enum matching, typed payloads, nested tags
- Num.abs on I8/I32/I64, is_zero/is_negative/is_positive
- Record field access and record update syntax
- Tuple access and destructuring via match
- Str: concat, repeat, trim, count_utf8_bytes, to_utf8
- For loop summing I64 elements
- More to_str: I128, U128, U16, U32, I64
- Skipped: Str.contains (infinite loop)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread src/eval/llvm_evaluator.zig
@lukewilliamboswell lukewilliamboswell marked this pull request as ready for review March 23, 2026 07:30
lukewilliamboswell and others added 2 commits March 23, 2026 19:51
- Use i32 millisecond timestamps instead of i64 nanoseconds for atomic
  compatibility with 32-bit x86 (std.atomic requires <= 32-bit types)
- Dup crash/timeout messages to GPA before storing in results so the
  uniform gpa.free() in main doesn't try to free static string literals
- Change default "not started" message to null (was a static string)
- Bump default hang timeout to 10s (5s too aggressive for CI)
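
To see why a 32-bit millisecond timestamp is sufficient for a test run, the arithmetic can be checked with a short Python sketch. It assumes timestamps are stored as milliseconds elapsed since the runner started, which the commit message does not state; `elapsed_ms_i32` is a hypothetical helper.

```python
I32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def elapsed_ms_i32(start_ns, now_ns):
    """Convert a nanosecond interval to milliseconds that fit in an i32."""
    ms = (now_ns - start_ns) // 1_000_000
    if not 0 <= ms <= I32_MAX:
        raise OverflowError("elapsed time does not fit in a signed 32-bit value")
    return ms


# How long a run can last before an i32 millisecond counter overflows.
MAX_RUN_DAYS = I32_MAX / (1000 * 60 * 60 * 24)
```

Under that assumption the counter only overflows after roughly 24.8 days, far beyond a 10s per-test timeout.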

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
These pass on macOS but cause infinite loops in the interpreter
on x86_64-linux (nix CI).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
lukewilliamboswell and others added 3 commits March 23, 2026 21:02
U8 and U16 arithmetic with large values causes infinite loops in the
interpreter on x86_64-linux CI. Skip all 30 U8/U16 comprehensive
arithmetic tests until the underlying interpreter bug is fixed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Brings in str_inspekt→str_inspect rename, LowLevel/MirToLir updates,
CLI simplification, and polymorphic specialization fixes from main.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@lukewilliamboswell (Collaborator, Author) commented

I'm going to land this with the interpreter replacement instead.
