Diagnosis
simpler: The simulator (a2a3sim) does not correctly isolate tensor state across parallel loop iterations. When multiple iterations of a pl.parallel loop each assemble into, and later slice from, the same input tensor (each iteration touching only its own non-overlapping region), the simulator produces incorrect results; the same code passes on real hardware.
Description
In qwen3_32b_prefill_scope2, k_cache and v_cache are function input tensors that get updated via pl.assemble and later read via pl.slice inside a pl.parallel(0, batch, 1) loop. Each batch iteration writes to and reads from its own non-overlapping region of the cache tensors.
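In outline, the access pattern looks like the sketch below (illustrative only: the argument order of pl.assemble is assumed here, and k_proj stands in for the per-batch key projection produced earlier in the kernel):
for b in pl.parallel(0, batch, 1):
    cache_base = b * num_kv_heads * max_seq
    # each iteration writes its batch's keys into its own disjoint region of the shared cache ...
    k_cache = pl.assemble(k_cache, k_proj, [cache_base, 0])
    # ... and later reads only that same region back
    k_b = pl.slice(k_cache, [num_kv_heads * max_seq, head_dim], [cache_base, 0])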
Reproducer:
# In examples/models/qwen3/qwen3_32b_prefill_scope2.py, set:
BATCH = 2
MAX_SEQ = 64
NUM_HEADS = 8
NUM_KV_HEADS = 1
HEAD_DIM = 64
# Then run:
python examples/models/qwen3/qwen3_32b_prefill_scope2.py -p a2a3sim
Results:
BATCH=1 (single parallel iteration): PASS on both sim and hardware
BATCH=2 (two parallel iterations): FAIL on sim (~14% element mismatch), PASS on hardware
Error output:
'attn_out' FAIL shape=(2, 64, 512) dtype=torch.bfloat16
Mismatched elements: 9052/65536 rtol=0.002 atol=0.002
Workaround: Slice the shared input tensor into a per-batch local tensor at the top of each parallel iteration, so every iteration operates on an independent tensor:
for b in pl.parallel(0, batch, 1):
    cache_base = b * num_kv_heads * max_seq
    k_cache_b = pl.slice(k_cache, [num_kv_heads * max_seq, head_dim], [cache_base, 0])
    # use k_cache_b instead of k_cache for assemble/slice
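v_cache needs the same treatment. A fuller sketch of the workaround shape (hedged; once localized, the offsets used inside the body are relative to the per-batch tensor):
for b in pl.parallel(0, batch, 1):
    cache_base = b * num_kv_heads * max_seq
    k_cache_b = pl.slice(k_cache, [num_kv_heads * max_seq, head_dim], [cache_base, 0])
    v_cache_b = pl.slice(v_cache, [num_kv_heads * max_seq, head_dim], [cache_base, 0])
    # all later pl.assemble / pl.slice calls use k_cache_b / v_cache_b,
    # with offsets relative to the start of the per-batch region (i.e. starting at [0, 0])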
Environment
| Component | Version |
| --- | --- |
| pypto-lib | b6c82cf |
| pypto | a0d21d1 (branch: main) |
| simpler | bb7965f |
| ptoas | 0.17 |
| CANN | 8.5.0.alpha001 |
Host Platform
Linux (aarch64)
Additional Context
The root cause appears to be that the simulator does not properly isolate the SSA tensor versions across parallel loop iterations. When iteration 0 does k_cache = pl.assemble(k_cache, ...), the updated tensor version leaks into iteration 1's view (or vice versa), corrupting the subsequent pl.slice reads. On real hardware, each AI Core operates on physically separate memory, so no cross-iteration interference occurs.
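For reference, the semantics the simulator should reproduce are those of independent writers to disjoint regions. A plain-NumPy analogy (not pypto or simulator code) of what each iteration should observe:
import numpy as np

# Two "parallel" iterations each write to and then read from a disjoint slab of a
# shared cache; each read-back must only reflect that iteration's own writes,
# regardless of how the iterations are interleaved or ordered.
batch, num_kv_heads, max_seq, head_dim = 2, 1, 64, 64
k_cache = np.zeros((batch * num_kv_heads * max_seq, head_dim), dtype=np.float32)

for b in range(batch):                                     # stands in for pl.parallel(0, batch, 1)
    base = b * num_kv_heads * max_seq
    k_cache[base:base + num_kv_heads * max_seq] = b + 1    # assemble into this batch's region
    window = k_cache[base:base + num_kv_heads * max_seq]   # slice the same region back
    assert (window == b + 1).all()                         # no interference from the other iteration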