Diagnosis
simpler: The simulator (a2a3sim) does not correctly isolate tensor state across parallel loop iterations. When multiple iterations of a pl.parallel loop each assemble into, and later slice from, the same input tensor (each iteration touching only its own non-overlapping region), the simulator produces incorrect results; the same code passes on real hardware.
Description
In qwen3_32b_prefill_scope2, k_cache and v_cache are function input tensors that get updated via pl.assemble and later read via pl.slice inside a pl.parallel(0, batch, 1) loop. Each batch iteration writes to and reads from its own non-overlapping region of the cache tensors.
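In outline, the access pattern looks like the sketch below (illustrative only: the argument order of pl.assemble is assumed here, and k_proj stands in for the per-batch key projection produced earlier in the kernel):
for b in pl.parallel(0, batch, 1):
    cache_base = b * num_kv_heads * max_seq
    # each iteration writes its batch's keys into its own disjoint region of the shared cache ...
    k_cache = pl.assemble(k_cache, k_proj, [cache_base, 0])
    # ... and later reads only that same region back
    k_b = pl.slice(k_cache, [num_kv_heads * max_seq, head_dim], [cache_base, 0])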
Reproducer:
# In examples/models/qwen3/qwen3_32b_prefill_scope2.py, set:
BATCH = 2
MAX_SEQ = 64
NUM_HEADS = 8
NUM_KV_HEADS = 1
HEAD_DIM = 64
# Then run:
python examples/models/qwen3/qwen3_32b_prefill_scope2.py -p a2a3sim
Results:
BATCH=1 (single parallel iteration): PASS on both sim and hardware
BATCH=2 (two parallel iterations): FAIL on sim (~14% element mismatch), PASS on hardware
Error output:
'attn_out' FAIL shape=(2, 64, 512) dtype=torch.bfloat16
Mismatched elements: 9052/65536 rtol=0.002 atol=0.002
Workaround: Slice the shared input tensor into a per-batch local tensor at the top of each parallel iteration, so every iteration operates on an independent tensor:
for b in pl.parallel(0, batch, 1):
    cache_base = b * num_kv_heads * max_seq
    k_cache_b = pl.slice(k_cache, [num_kv_heads * max_seq, head_dim], [cache_base, 0])
    # use k_cache_b instead of k_cache for assemble/slice
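v_cache needs the same treatment. A fuller sketch of the workaround shape (hedged; once localized, the offsets used inside the body are relative to the per-batch tensor):
for b in pl.parallel(0, batch, 1):
    cache_base = b * num_kv_heads * max_seq
    k_cache_b = pl.slice(k_cache, [num_kv_heads * max_seq, head_dim], [cache_base, 0])
    v_cache_b = pl.slice(v_cache, [num_kv_heads * max_seq, head_dim], [cache_base, 0])
    # all later pl.assemble / pl.slice calls use k_cache_b / v_cache_b,
    # with offsets relative to the start of the per-batch region (i.e. starting at [0, 0])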
Environment
| Component | Version |
| --- | --- |
| pypto-lib | b6c82cf |
| pypto | a0d21d1 (branch: main) |
| simpler | bb7965f |
| ptoas | 0.17 |
| CANN | 8.5.0.alpha001 |
Host Platform
Linux (aarch64)
Additional Context
The root cause appears to be that the simulator does not properly isolate the SSA tensor versions across parallel loop iterations. When iteration 0 does k_cache = pl.assemble(k_cache, ...), the updated tensor version leaks into iteration 1's view (or vice versa), corrupting the subsequent pl.slice reads. On real hardware, each AI Core operates on physically separate memory, so no cross-iteration interference occurs.
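For reference, the semantics the simulator should reproduce are those of independent writers to disjoint regions. A plain-NumPy analogy (not pypto or simulator code) of what each iteration should observe:
import numpy as np

# Two "parallel" iterations each write to and then read from a disjoint slab of a
# shared cache; each read-back must only reflect that iteration's own writes,
# regardless of how the iterations are interleaved or ordered.
batch, num_kv_heads, max_seq, head_dim = 2, 1, 64, 64
k_cache = np.zeros((batch * num_kv_heads * max_seq, head_dim), dtype=np.float32)

for b in range(batch):                                     # stands in for pl.parallel(0, batch, 1)
    base = b * num_kv_heads * max_seq
    k_cache[base:base + num_kv_heads * max_seq] = b + 1    # assemble into this batch's region
    window = k_cache[base:base + num_kv_heads * max_seq]   # slice the same region back
    assert (window == b + 1).all()                         # no interference from the other iteration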