
[Bug] Simulator produces incorrect results for parallel loop with assemble+slice on shared input tensor #663

@wangqin1723-max


Diagnosis

Affected component: simpler. The simulator (a2a3sim) does not correctly isolate tensor state across parallel-loop iterations: when multiple iterations of pl.parallel each assemble into and then slice from the same input tensor, writing to non-overlapping regions, the simulator produces incorrect results. The same code passes on real hardware.

Description

In qwen3_32b_prefill_scope2, k_cache and v_cache are function input tensors that get updated via pl.assemble and later read via pl.slice inside a pl.parallel(0, batch, 1) loop. Each batch iteration writes to and reads from its own non-overlapping region of the cache tensors.
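
In sketch form, the failing pattern looks like this (illustrative only: the pl.assemble argument order and the k_new operand are assumed, and the pl.slice call mirrors the one in the workaround below):

for b in pl.parallel(0, batch, 1):
    cache_base = b * num_kv_heads * max_seq
    # Write this batch's new keys into its own region of the shared input
    # tensor (pl.assemble argument order assumed for illustration).
    k_cache = pl.assemble(k_cache, k_new, [cache_base, 0])
    # Later, read the same region back:
    k_b = pl.slice(k_cache, [num_kv_heads * max_seq, head_dim], [cache_base, 0])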

Reproducer:

# In examples/models/qwen3/qwen3_32b_prefill_scope2.py, set:
BATCH = 2
MAX_SEQ = 64
NUM_HEADS = 8
NUM_KV_HEADS = 1
HEAD_DIM = 64

# Then run:
python examples/models/qwen3/qwen3_32b_prefill_scope2.py -p a2a3sim

Results:

  • BATCH=1 (single parallel iteration): PASS on both sim and hardware
  • BATCH=2 (two parallel iterations): FAIL on sim (~14% element mismatch), PASS on hardware

Error output:

'attn_out' FAIL  shape=(2, 64, 512) dtype=torch.bfloat16
  Mismatched elements: 9052/65536  rtol=0.002 atol=0.002
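
The mismatch count is consistent with the summary above: attn_out with shape (2, 64, 512) has 2 * 64 * 512 = 65536 elements, and 9052 / 65536 is roughly 13.8%, i.e. the ~14% reported.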

Workaround: at the start of each parallel iteration, slice the shared input tensor into a per-batch local tensor, so every iteration operates on an independent tensor:

for b in pl.parallel(0, batch, 1):
    cache_base = b * num_kv_heads * max_seq
    # Per-batch views: each iteration gets an independent tensor to assemble
    # into and slice from.
    k_cache_b = pl.slice(k_cache, [num_kv_heads * max_seq, head_dim], [cache_base, 0])
    v_cache_b = pl.slice(v_cache, [num_kv_heads * max_seq, head_dim], [cache_base, 0])
    # use k_cache_b / v_cache_b instead of k_cache / v_cache for assemble/slice

Environment

Component   Version
pypto-lib   b6c82cf
pypto       a0d21d1 (branch: main)
simpler     bb7965f
ptoas       0.17
CANN        8.5.0.alpha001

Host Platform

Linux (aarch64)

Additional Context

The root cause appears to be that the simulator does not properly isolate the SSA tensor versions across parallel loop iterations. When iteration 0 does k_cache = pl.assemble(k_cache, ...), the updated tensor version leaks into iteration 1's view (or vice versa), corrupting the subsequent pl.slice reads. On real hardware, each AI Core operates on physically separate memory, so no cross-iteration interference occurs.
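
For intuition, here is a minimal NumPy sketch of the suspected failure mode (hypothetical, not the simulator's actual code): two "parallel iterations" each write their own row of a shared cache and read it back, once with per-iteration isolation and once against a shared live buffer.

import numpy as np

def run_parallel(cache, isolate):
    merged = cache.copy()
    reads = []
    for b in range(2):
        # isolate=True : each iteration starts from a private snapshot of the
        #                original tensor (hardware-like behavior).
        # isolate=False: iterations alias one live buffer, so iteration 1
        #                observes iteration 0's write (the suspected leak).
        view = cache.copy() if isolate else merged
        view[b, :] = b + 1           # "assemble" into this iteration's region
        reads.append(view.sum())     # "slice" read-back after the write
        merged[b, :] = view[b, :]    # merge only this iteration's own region
    return reads

cache = np.zeros((2, 4))
print(run_parallel(cache, isolate=True))   # [4.0, 8.0]  reads are independent
print(run_parallel(cache, isolate=False))  # [4.0, 12.0] iter 1 saw iter 0's write

The isolate=True path matches what the workaround enforces at the pl level: rooting each iteration's assemble/slice chain in its own tensor.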

Labels

bug (Something isn't working)