Add: parallel for iteration isolation in tensormap and orchestrator #551
zhusy54 wants to merge 2 commits into hw-native-sys:main
Code Review
This pull request introduces parallel-for iteration isolation for the PTO2 runtime. It adds lifecycle hooks to the runtime operations, implements RAII guards and macros for parallel scopes, and updates the PTO2TensorMap lookup logic to filter out tensor entries from previous iterations using per-ring local task IDs. These changes are applied consistently across the a2a3 and a5 runtime paths. I have no feedback to provide.
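Hardware performance test results: benchmarks were run on an Ascend NPU (device-8) to compare this PR (parallel) against the main branch (0745dee1).
Test environment
Comparison results
Per-round details (Trimmed Avg, µs)
Analysis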
a5bf2e1 to f6c2509
Introduces per-iteration context isolation for PTO2_PARALLEL_FOR loops. Each iteration now pushes/pops a scoped frame on a nested stack in the orchestrator, tensormap, and ring-buffer layers, preventing cross-iteration state leakage. Upgrades the iter-isolation machinery from a flat variable to the nested stack design and refines API ergonomics.
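The push/pop behaviour described in this commit message can be illustrated with a minimal C++ sketch. It is not the PTO2 source: `TensorMapSketch`, `Entry`, `iteration_begin`, `iteration_end`, `submit`, `lookup`, and `kNumRings` are hypothetical names, and only `iter_start_local_ids` and the -1 initial value come from the PR description. The sketch also assumes that a lookup hides every entry whose per-ring local ID is below the snapshot taken at scope entry, which includes entries created before the loop; the PR description does not say whether the real filter behaves that way.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical names throughout; this is an illustration, not the PTO2 code.
constexpr int kNumRings = 2;

struct Entry {              // one producer record in the tensor map
    int tensor_id;
    int ring;
    std::int32_t local_id;  // per-ring task ID of the producer
};

struct TensorMapSketch {
    std::vector<Entry> entries;
    std::array<std::int32_t, kNumRings> iter_start_local_ids;   // -1 means "no filter"
    std::array<std::int32_t, kNumRings> next_local_id{};        // next ID to hand out per ring
    std::vector<std::array<std::int32_t, kNumRings>> saved;     // nested stack of outer filters

    TensorMapSketch() { iter_start_local_ids.fill(-1); }

    // Scope entry: push the enclosing filter, then snapshot each ring's next ID.
    void iteration_begin() {
        saved.push_back(iter_start_local_ids);
        iter_start_local_ids = next_local_id;
    }

    // Scope exit: pop the frame and restore the enclosing filter.
    void iteration_end() {
        assert(!saved.empty());
        iter_start_local_ids = saved.back();
        saved.pop_back();
    }

    // Record a producer task for tensor_id on the given ring; returns its local ID.
    std::int32_t submit(int tensor_id, int ring) {
        std::int32_t id = next_local_id[ring]++;
        entries.push_back({tensor_id, ring, id});
        return id;
    }

    // Local ID of the newest visible producer of tensor_id, or -1 if none is visible.
    std::int32_t lookup(int tensor_id) const {
        for (auto it = entries.rbegin(); it != entries.rend(); ++it) {
            if (it->tensor_id != tensor_id) continue;
            if (it->local_id < iter_start_local_ids[it->ring]) continue;  // stale: skip
            return it->local_id;
        }
        return -1;
    }
};

// Two back-to-back iterations: the second must not see the first one's producer.
inline bool demo_isolation() {
    TensorMapSketch tm;
    tm.iteration_begin();
    tm.submit(7, 0);
    bool seen_inside = (tm.lookup(7) == 0);
    tm.iteration_end();

    tm.iteration_begin();
    bool hidden_in_next = (tm.lookup(7) == -1);
    tm.iteration_end();

    bool visible_after = (tm.lookup(7) == 0);  // filter is back to -1 outside the loop
    return seen_inside && hidden_in_next && visible_after;
}
```

Saving the previous filter on a stack, rather than in a single flat variable, is what lets a nested scope restore its parent's filter on exit.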
Replaces plain for-loops with PTO2_PARALLEL_FOR in mixed_example, paged_attention_unroll, batch_paged_attention, alternating_matmul_add, and benchmark_bgemm to use the new iteration-isolation API.
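As a rough picture of what converting a plain for-loop looks like, the sketch below wraps each iteration in an RAII guard. The real `PTO2_PARALLEL_FOR` signature is not shown in this PR description, so `SKETCH_PARALLEL_FOR`, `OpsSketch`, and `ParallelScopeGuardSketch` are hypothetical stand-ins, and the logging ops table exists only to make the begin/end ordering observable. The macro relies on the C++17 `if` statement with an initializer.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the runtime ops table; it only records call order.
struct OpsSketch {
    std::vector<std::string> log;
    void parallel_scope_begin() { log.push_back("begin"); }
    void parallel_scope_end() { log.push_back("end"); }
};

// RAII guard: begins the scope on construction and ends it on destruction,
// so the end hook also runs on early exits such as `continue` or `break`.
class ParallelScopeGuardSketch {
public:
    explicit ParallelScopeGuardSketch(OpsSketch& ops) : ops_(ops) {
        ops_.parallel_scope_begin();
    }
    ~ParallelScopeGuardSketch() { ops_.parallel_scope_end(); }
    ParallelScopeGuardSketch(const ParallelScopeGuardSketch&) = delete;
    ParallelScopeGuardSketch& operator=(const ParallelScopeGuardSketch&) = delete;

private:
    OpsSketch& ops_;
};

// Hypothetical macro: an ordinary for header plus a guard whose lifetime
// spans exactly one iteration of the loop body (C++17 if-with-initializer).
#define SKETCH_PARALLEL_FOR(ops, init, cond, step) \
    for (init; cond; step)                         \
        if (ParallelScopeGuardSketch sketch_guard_{ops}; true)

// Usage: the loop body is written exactly as it would be in a plain for-loop.
inline std::vector<std::string> demo_parallel_for() {
    OpsSketch ops;
    SKETCH_PARALLEL_FOR(ops, int i = 0, i < 2, ++i) {
        ops.log.push_back("task" + std::to_string(i));
    }
    return ops.log;
}
```

In this sketch each iteration's work is bracketed by one begin/end pair, which is the property the examples listed above depend on.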
Summary
- Add `PTO2_PARALLEL_FOR`/`PTO2_PARALLEL_SCOPE` macros and RAII guards that bracket each loop iteration with a scope-level dependency filter
- Add `iter_start_local_ids` per ring in `PTO2TensorMap` so that tensor-map lookups skip entries produced in prior iterations on the same ring, preventing false cross-iteration dependencies when independent loop iterations submit tasks concurrently
- Wire `parallel_for_begin/end` and `parallel_scope_begin/end` ops into the `PTO2RuntimeOps` vtable
Changes
- `pto_orchestration_api.h`: new `parallel_for_begin/end` and `parallel_scope_begin/end` ops, inline wrappers, RAII guards, `PTO2_PARALLEL_FOR`/`PTO2_PARALLEL_SCOPE` macros (a2a3 + a5)
- `pto_orchestrator.h/.cpp`: implement `pto2_parallel_for/scope_begin/end` using existing scope stack + iter_start filter bookkeeping
- `pto_tensormap.h/.cpp`: add `iter_start_local_ids[ring]` array, initialise to -1, filter stale entries during lookup
- `pto_ring_buffer.h`: expose `next_local_id()` for snapshot at scope entry
- `pto_runtime2.h/.cpp`: wire new ops into `PTO2RuntimeOps` vtable
Testing