- Environment ID:
kbediako/tower_defence - Python package:
prime_td_env - Short description: Multi-turn macro-round tower defense environment for hosted RL.
- Tags: games, tower-defense, rl, verifiers, prime-rl, multi-turn
- Published package shape:
env.py,pyproject.toml,README.md,src/
- Primary dataset(s): Procedural tower-defense seeds generated by the environment.
- Snapshot mode: Configurable round snapshots (
dataset.snapshots) for curriculum and stability studies. - Split control: Training/eval sample volume is controlled by environment args and run config.
- Type: Multi-turn game interaction (macro-round planning).
- Interaction contract: One assistant plan maps to one in-game round progression.
- Action surface: Candidate-index planning with
{"type":"plan","actions":[{"type":"choose","index":N}, ...]}.
Run local smoke checks:
PYTHONPATH=src python3 scripts/smoke.pyRun baseline local evaluation:
PYTHONPATH=src python3 scripts/eval_baseline.py --episodes 10 --max-rounds 20 --output out/metrics.jsonRun hosted training:
prime rl run configs/lab/prime-td.toml| Group | Key args | Purpose |
|---|---|---|
wrapper |
wrapper="macro_round" |
Enables multi-turn round-by-round planning mode. |
difficulty |
max_rounds |
Controls episode horizon/curriculum cap. |
observation |
max_action_candidates, max_build_slots, max_towers, max_threats |
Bounds payload size and candidate space. |
candidate_balance |
min_build_frac, max_upgrade_candidates, by_phase.* |
Tunes build/upgrade exposure by phase. |
dataset |
policy, rollout_steps, snapshots, safe_explore_* |
Controls training observation generation. |
rules |
auto_advance_round, prep_actions_*, mask_sell |
Governs turn semantics and allowed behavior. |
| Metric | Meaning |
|---|---|
reward/mean |
Aggregate training reward over samples in a step. |
metrics/num_turns |
Episode length in environment turns. |
format_reward |
Validity of structured plan/action output. |
macro_round_delta (derived) |
Round advance per turn; expected delta_round == 1 after turn 1. |
| Action mix (derived) | Build vs upgrade distribution, tracked by phase. |
- Record run IDs and config filenames immediately after launch.
- Pull rollouts for every sampled step and parse all user/assistant turns.
- Validate macro-round invariant
delta_round == 1for user observations after turn 1. - If sample upload 500s appear, reduce tokens/observation caps or horizon pressure.
Detailed longitudinal results and run-by-run analysis: docs/RESULTS.md.