Add problem automation contract and validator by saarang123 · Pull Request #63 · tensara/problems

saarang123 · 2026-03-31T04:12:14Z

Summary

- add a backward-compatible authoring contract for def.py and problem.md
- add a validation contract covering structural CI, local CUDA validation, and Modal/product-runtime validation
- add a generic scripts/validate_problem.py validator with text and JSON output
- add templates for new agent-authored problems
- add PR-time structural validation in GitHub Actions

Why

The goal is to make tensara/problems much more agent-friendly and to support reliable, automated growth of the problem corpus.

This PR is the first layer:

- stable authoring format
- stable validation format
- machine-readable diagnostics
- CI enforcement that does not break the current corpus

It is intentionally contract-first and backward-compatible, not a broad migration.

Included

- docs/problem-authoring-contract.md
- docs/problem-validation-contract.md
- docs/problem-automation-roadmap.md
- scripts/validate_problem.py
- templates/problem-template.def.py
- templates/problem-template.md
- .github/workflows/validate-problems.yml
- README updates for the new contract and validation flow

Validation Model

This PR defines a 3-tier validation model:

- structural validation in normal CI
- local CUDA validation on real GPUs such as Together H100
- Modal/product-runtime validation as the authoritative final gate

That means cheap local GPU checks are still useful, but long-term runtime truth should come from the same Modal-backed path used by the real product.
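One way to picture the three tiers is as an ordered gate, cheapest first. The sketch below is illustrative only: the `Tier` enum, `highest_tier_passed`, and the rule that a tier runs only after the cheaper ones pass are assumptions, not something this PR specifies.

```python
from __future__ import annotations

from enum import IntEnum
from typing import Callable, Mapping


class Tier(IntEnum):
    # Hypothetical names for the three tiers described above.
    STRUCTURAL = 1  # cheap checks, runs in normal CI
    LOCAL_CUDA = 2  # runs on a real GPU
    MODAL = 3       # product-runtime path, the authoritative final gate


def highest_tier_passed(checks: Mapping[Tier, Callable[[], bool]]) -> Tier | None:
    """Run tiers cheapest-first and stop at the first missing or failing one."""
    passed = None
    for tier in sorted(Tier):
        check = checks.get(tier)
        if check is None or not check():
            break
        passed = tier
    return passed


def is_authoritatively_validated(checks: Mapping[Tier, Callable[[], bool]]) -> bool:
    """Only a pass at the Modal tier counts as the final word (assumed policy)."""
    return highest_tier_passed(checks) is Tier.MODAL
```

Under this assumed policy, a problem that passes structural and local CUDA checks but has not been run through Modal is useful signal, but not yet authoritative.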

What validate_problem.py does

Structural mode checks:

- required files exist
- required frontmatter exists
- slug consistency
- Problem subclass exists
- required methods exist
- method signatures match the stable contract
- parameters/signature contract is present

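A minimal sketch of what a few of these structural checks could look like, using only the standard library. It is not the code in scripts/validate_problem.py: the `slug` frontmatter key, the `---` delimiters, the helper names, and the method names in the usage note are assumptions for illustration. It parses def.py with `ast` rather than importing it, so no GPU dependencies are needed.

```python
from __future__ import annotations

import ast
from pathlib import Path


def read_frontmatter(text: str) -> dict[str, str]:
    """Parse a simple `key: value` block delimited by `---` lines (assumed format)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip().strip("\"'")
    return {}  # no closing delimiter: treat as missing frontmatter


def check_structure(problem_dir: Path, required_methods: set[str]) -> list[str]:
    """Return human-readable findings; an empty list means these checks passed."""
    findings: list[str] = []
    def_py = problem_dir / "def.py"
    problem_md = problem_dir / "problem.md"

    # Required files exist.
    for path in (def_py, problem_md):
        if not path.is_file():
            findings.append(f"missing required file: {path.name}")
    if findings:
        return findings

    # Required frontmatter exists and its slug matches the directory name.
    frontmatter = read_frontmatter(problem_md.read_text(encoding="utf-8"))
    if not frontmatter:
        findings.append("problem.md has no frontmatter")
    elif frontmatter.get("slug") != problem_dir.name:
        findings.append(
            f"slug {frontmatter.get('slug')!r} does not match directory {problem_dir.name!r}"
        )

    # A class that directly subclasses `Problem` exists and defines the required methods.
    tree = ast.parse(def_py.read_text(encoding="utf-8"))
    subclasses = [
        node for node in tree.body
        if isinstance(node, ast.ClassDef)
        and any(isinstance(base, ast.Name) and base.id == "Problem" for base in node.bases)
    ]
    if not subclasses:
        findings.append("def.py defines no Problem subclass")
        return findings

    defined = {n.name for n in subclasses[0].body if isinstance(n, ast.FunctionDef)}
    for name in sorted(required_methods - defined):
        findings.append(f"missing required method: {name}")
    return findings
```

For example, `check_structure(Path("problems/relu"), {"reference_solution", "verify_result"})` would return an empty list for a conforming problem; the two method names here are placeholders, not the contract's actual required set.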
Runtime mode is designed to validate problem behavior, not just schema:

- load the problem
- run sample / generated cases
- run the reference path
- confirm the verifier accepts correct outputs
- confirm perturbed wrong outputs are rejected
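The accept/reject part of the runtime steps can be sketched in a few lines. This is a simplified illustration under stated assumptions: real problems operate on GPU tensors, while this sketch uses plain Python lists to stay self-contained, and `validate_runtime`, `relu`, and `close_enough` are hypothetical stand-ins rather than the validator's actual functions.

```python
from __future__ import annotations

from typing import Callable, Iterable, Sequence

Vector = Sequence[float]


def validate_runtime(
    cases: Iterable[Vector],
    reference: Callable[[Vector], Vector],
    verify: Callable[[Vector, Vector], bool],
    perturbation: float = 1.0,
) -> list[str]:
    """Check that `verify` accepts reference outputs and rejects perturbed ones."""
    failures: list[str] = []
    for index, case in enumerate(cases):
        expected = list(reference(case))
        # A correct output must be accepted.
        if not verify(expected, expected):
            failures.append(f"case {index}: verifier rejected the reference output")
        if not expected:
            continue  # nothing to perturb
        # A deliberately wrong output (first element shifted) must be rejected.
        wrong = [expected[0] + perturbation, *expected[1:]]
        if verify(expected, wrong):
            failures.append(f"case {index}: verifier accepted a perturbed output")
    return failures


def relu(xs: Vector) -> list[float]:
    """Stand-in reference path for a ReLU problem."""
    return [max(0.0, x) for x in xs]


def close_enough(expected: Vector, actual: Vector, tol: float = 1e-5) -> bool:
    """Stand-in verifier: element-wise absolute tolerance."""
    return len(expected) == len(actual) and all(
        abs(e - a) <= tol for e, a in zip(expected, actual)
    )
```

Calling `validate_runtime([[-1.0, 2.0], [0.5]], relu, close_enough)` returns an empty list, while a verifier that always returns `True` is flagged on every case. The second check is what catches a verifier that is too lenient to be useful.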

Today:

- structural validation is fully wired and CI-safe
- local runtime validation is supported
- Modal/product-runtime validation is the intended authoritative path and should be the next acceptance layer that automation relies on

Backward Compatibility

- existing published problems are not forced to adopt new metadata immediately
- optional metadata such as source, authoring, and validation is additive
- current corpus passes structural validation without breaking changes
- one legacy warning remains:
  - problems/mse-loss/problem.md uses mse_loss instead of mse-loss

Validation

Ran:

python3 scripts/validate_problem.py --runtime none --format text

Result on current main corpus:

84 problems checked
0 errors
1 warning
84 infos

Also verified:

python3 scripts/validate_problem.py relu --runtime none --format json

So the validator is working both as a human-readable structural checker and as a machine-readable surface for agents.
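To illustrate how one set of diagnostics can serve both audiences, here is a minimal sketch of a report rendered as text or JSON. The `Diagnostic` fields, the JSON layout, and the exit-code rule are assumptions for illustration; the actual output schema of validate_problem.py is not shown in this PR description.

```python
from __future__ import annotations

import json
from collections import Counter
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Diagnostic:
    problem: str
    level: str  # "error", "warning", or "info" (assumed levels)
    message: str


def summarize(diagnostics: list[Diagnostic]) -> dict[str, int]:
    """Count diagnostics per level; `problems` counts slugs with at least one diagnostic."""
    counts = Counter(d.level for d in diagnostics)
    return {
        "problems": len({d.problem for d in diagnostics}),
        "errors": counts["error"],
        "warnings": counts["warning"],
        "infos": counts["info"],
    }


def render(diagnostics: list[Diagnostic], fmt: str = "text") -> str:
    """Render the same diagnostics for humans ("text") or agents ("json")."""
    summary = summarize(diagnostics)
    if fmt == "json":
        payload = {"summary": summary, "diagnostics": [asdict(d) for d in diagnostics]}
        return json.dumps(payload, indent=2)
    if fmt != "text":
        raise ValueError(f"unknown format: {fmt}")
    lines = [f"[{d.level}] {d.problem}: {d.message}" for d in diagnostics]
    lines.append(
        "{problems} problems checked, {errors} errors, "
        "{warnings} warnings, {infos} infos".format(**summary)
    )
    return "\n".join(lines)


def exit_code(diagnostics: list[Diagnostic]) -> int:
    """Fail only on errors, so legacy warnings do not break CI (assumed policy)."""
    return 1 if any(d.level == "error" for d in diagnostics) else 0
```

Keeping warnings out of the exit code is one way to enforce the contract in CI without breaking on the existing legacy slug warning.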

Follow-up

This PR does not yet make Modal validation the default enforcement path in CI. The next step should be wiring the real Modal/sample/checker runtime into the automation flow so product-runtime validation becomes part of the actual merge pipeline.


Add problem automation contract and validator

13eb921

saarang123 wants to merge 1 commit into tensara:main from saarang123:codex/problem-automation-contract
