Refactor: align a5 profiling collectors with a2a3 streaming buffer arch #691
doraemonmj wants to merge 1 commit into hw-native-sys:main from
Conversation
Convert a5 L2 perf, PMU, and tensor-dump collectors from one-shot
post-stream-sync rtMemcpy into streaming SPSC free-queue / ready-queue
buffer pools, mirroring a2a3. Host runs a dedicated collector thread
that polls ready queues, copies records via rtMemcpy (onboard) or
memcpy (sim) into shadow buffers, writes CSV/swimlane output, and
recycles buffers back into per-core/thread free queues.
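The free-queue / ready-queue handoff described above can be sketched as a minimal SPSC ring. The names (`SpscQueue`, `CAPACITY`) and the power-of-two masking are illustrative assumptions, not the PR's actual implementation:

```cpp
#include <cassert>
#include <cstdint>

// Minimal SPSC ring: one producer advances tail, one consumer advances head.
// A power-of-two capacity lets indices be masked instead of taken modulo.
constexpr uint32_t CAPACITY = 8; // illustrative; must be a power of two

struct SpscQueue {
    volatile uint64_t slots[CAPACITY] = {}; // buffer pointers in flight
    volatile uint32_t head = 0;             // consumer read position
    volatile uint32_t tail = 0;             // producer write position

    bool push(uint64_t ptr) {               // producer side only
        if (tail - head == CAPACITY) return false; // full
        slots[tail & (CAPACITY - 1)] = ptr;
        // On shared device/host memory a write barrier (wmb) belongs here
        // so the slot is visible before the new tail is published.
        tail = tail + 1;
        return true;
    }
    bool pop(uint64_t *out) {               // consumer side only
        if (head == tail) return false;     // empty
        *out = slots[head & (CAPACITY - 1)];
        head = head + 1;
        return true;
    }
};
```

In this scheme the host pre-fills one ring (the free pool); the device pops a free buffer, fills it, and pushes it onto a second ring of the same shape (the ready queue).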
- platform/include/common: introduce L2PerfDataHeader / PmuDataHeader /
DumpMetaBuffer + per-core/thread BufferState with embedded SPSC
free queues; replace single-buffer Setup headers
- platform/{aicpu,host}: implement device-side buffer-switch on full
buffer + host-side poll/collect/recycle loop; rename PmuCollector to
PmuCollectorHost and add export_swimlane_json / poll_and_collect
hooks on DeviceRunner (onboard + sim)
- platform_config.h: add PROF/PMU/DUMP SLOT_COUNT, BUFFERS_PER_CORE,
BUFFERS_PER_THREAD, READYQUEUE_SIZE, TIMEOUT_SECONDS knobs;
PROF_BUFFER_SIZE 10000 -> 1000 (per-buffer), PMU_RECORDS_PER_BUFFER
4096 -> 512
- runtime/scheduler: add CoreExecState.dispatch_count, switch buffer
in dispatch_subtask_to_core when full, flush l2_perf / phase / pmu
buffers at end of resolve_and_dispatch and on_orchestration_done
- runtime/host_build_graph/aicpu_executor: flush l2_perf / pmu /
dump_tensor buffers before AICore shutdown in run()
- docs/pmu-profiling: rewrite to describe streaming architecture and
document a5-specific shadow-buffer + rtMemcpy transport (no
halHostRegister on DAV_3510)
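The host-side poll/collect/recycle loop from the change list can be sketched roughly as follows. `Buffer`, the `std::deque` stand-ins, and the `poll_and_collect` signature are all assumptions; `memcpy` stands in for `rtMemcpy` as on the sim path:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <deque>

// Stand-in for a pooled record buffer; the real pools hold L2 perf, PMU,
// or tensor-dump records and live in device-visible memory.
struct Buffer {
    uint8_t data[1024];
    uint32_t bytes_used = 0;
};

// std::deque stands in for the shared-memory SPSC rings.
std::deque<Buffer*> ready_queue; // device -> host: buffers ready to drain
std::deque<Buffer*> free_queue;  // host -> device: recycled buffers

// One pass of the host collector thread: drain every ready buffer into a
// shadow copy (memcpy here, rtMemcpy on board), then recycle the buffer.
size_t poll_and_collect(uint8_t *shadow, size_t shadow_cap) {
    size_t written = 0;
    while (!ready_queue.empty()) {
        Buffer *buf = ready_queue.front();
        ready_queue.pop_front();
        if (written + buf->bytes_used <= shadow_cap) {
            std::memcpy(shadow + written, buf->data, buf->bytes_used);
            written += buf->bytes_used; // real code parses records to CSV here
        }
        free_queue.push_back(buf);      // return to the device's free pool
    }
    return written;
}
```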
Code Review
This pull request refactors the performance, tensor dump, and PMU profiling collectors to use a streaming buffer architecture with SPSC queues, replacing the previous memcpy-based design. This change improves data collection efficiency and decouples the device and host operations. I have identified critical issues in the struct padding and layout definitions for DumpFreeQueue, DumpBufferState, and PmuBuffer that will cause static assertion failures and potential alignment issues.
```cpp
volatile uint64_t buffer_ptrs[PLATFORM_DUMP_SLOT_COUNT];
volatile uint32_t head;  // Consumer read position (Device increments)
volatile uint32_t tail;  // Producer write position (Host increments)
uint32_t pad[13];        // Pad to 128 bytes
```
The padding for DumpFreeQueue is incorrect. With PLATFORM_DUMP_SLOT_COUNT as 4, the size of buffer_ptrs is 32 bytes. With head and tail (4 bytes each), the size before padding is 40 bytes. pad[13] adds 52 bytes, making the total size 92 bytes. This will fail the static_assert on line 145 which requires a size of 128 bytes. To fix this, the padding should be uint32_t pad[22] (88 bytes), which makes the total size 40 + 88 = 128 bytes.
```cpp
uint32_t pad[22]; // Pad to 128 bytes
```
References
- Use static_assert to prevent regressions in hardware-related struct layouts and alignment.
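The reviewer's arithmetic can be locked in at compile time. A minimal sketch, assuming `PLATFORM_DUMP_SLOT_COUNT` is 4 and natural alignment, as stated in the comment:

```cpp
#include <cassert>
#include <cstdint>

constexpr int PLATFORM_DUMP_SLOT_COUNT = 4; // value assumed from the review

struct DumpFreeQueue {
    volatile uint64_t buffer_ptrs[PLATFORM_DUMP_SLOT_COUNT]; // 4 * 8 = 32 bytes
    volatile uint32_t head;  // consumer read position, 4 bytes
    volatile uint32_t tail;  // producer write position, 4 bytes
    uint32_t pad[22];        // 88 bytes of padding: 40 + 88 = 128
};

// pad[13] would yield sizeof == 92 and trip this check at compile time.
static_assert(sizeof(DumpFreeQueue) == 128, "DumpFreeQueue must be 128 bytes");
```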
```cpp
volatile uint64_t arena_size;           // Arena size in bytes
volatile uint64_t arena_write_offset;   // Monotonic write cursor (host computes % arena_size)
volatile uint32_t dropped_record_count; // Records dropped before host export
uint8_t pad1[28];                       // Pad to 256 bytes
```
The padding for DumpBufferState is incorrect. Assuming DumpFreeQueue is fixed to be 128 bytes, the size of the members before pad1 is 172 bytes (128 + 8 + 4 + 4 + 8 + 8 + 8 + 4). pad1[28] adds 28 bytes, making the total size 200 bytes. This will fail the static_assert on line 174 which requires a size of 256 bytes. To fix this, the padding should be uint8_t pad1[84], as 172 + 84 = 256.
```cpp
uint8_t pad1[84]; // Pad to 256 bytes
```
References
- Use static_assert to prevent regressions in hardware-related struct layouts and alignment.
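One way to keep such padding from going stale is to derive it from the target size instead of hand-counting. A sketch using a placeholder 172-byte payload in place of `DumpBufferState`'s real leading members:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Placeholder for DumpBufferState's leading members (DumpFreeQueue plus the
// arena fields), which the review totals to 172 bytes.
struct Payload {
    uint8_t members[172];
};

constexpr size_t TARGET_SIZE = 256;

// Deriving the pad from the target size keeps it correct even when a
// member is later added or removed.
struct PaddedState {
    Payload body;
    uint8_t pad1[TARGET_SIZE - sizeof(Payload)]; // 256 - 172 = 84 bytes
};

static_assert(sizeof(PaddedState) == TARGET_SIZE, "must stay 256 bytes");
```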
```cpp
struct PmuBuffer {
    // Header (first 64 bytes) — host copies this alone first to learn count.
    volatile uint32_t count;  // Number of valid records
    uint32_t pad[15];         // Pad header to 64 bytes
    PmuRecord records[PLATFORM_PMU_RECORDS_PER_BUFFER];

    // Dual-issue staging slots — AICore writes a PmuRecord here after
    // each task, then AICPU copies it into records[count] and fills
    // func_id / core_type on FIN. Index = task_id & 1.
    // a5-specific: AICore writes here, AICPU commits to records[]
    PmuRecord dual_issue_slots[2];
} __attribute__((aligned(64)));
```
The layout of PmuBuffer is inconsistent with L2PerfBuffer and seems problematic. The dual_issue_slots are placed at the end of the struct, after the large records array. In L2PerfBuffer, the equivalent wip slots are at the beginning. Placing the dual_issue_slots, which are written to by AICore, at a large, non-fixed offset makes the design fragile and difficult to maintain. For consistency with L2PerfBuffer and for a more robust design, dual_issue_slots should be moved to the beginning of the struct.
```cpp
struct PmuBuffer {
    // a5-specific: AICore writes here, AICPU commits to records[]
    PmuRecord dual_issue_slots[2];
    PmuRecord records[PLATFORM_PMU_RECORDS_PER_BUFFER];
    volatile uint32_t count;
} __attribute__((aligned(64)));
```
References
- Maintain 64-byte alignment and robust layout for hardware-accessed buffers to ensure platform compatibility and prevent regressions.
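The fixed-offset property the suggestion relies on can be pinned down with `offsetof`. In the sketch below, `PmuRecord`'s shape and the record count are stand-ins, not the PR's real definitions:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-ins: the real PmuRecord layout and record count differ.
struct PmuRecord { uint64_t counters[10]; };
constexpr int PLATFORM_PMU_RECORDS_PER_BUFFER = 512;

struct PmuBuffer {
    // a5-specific: AICore writes here, AICPU commits to records[]
    PmuRecord dual_issue_slots[2];
    PmuRecord records[PLATFORM_PMU_RECORDS_PER_BUFFER];
    volatile uint32_t count;
} __attribute__((aligned(64)));

// With the staging slots first, AICore's write target sits at offset 0 no
// matter how PLATFORM_PMU_RECORDS_PER_BUFFER is tuned between builds.
static_assert(offsetof(PmuBuffer, dual_issue_slots) == 0, "fixed offset");
static_assert(sizeof(PmuBuffer) % 64 == 0, "size padded to 64-byte multiple");
```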
```cpp
constexpr uint32_t AICORE_COREID_MASK = 0x0FFF;

// DAV_3510 hardware counter count.
constexpr int PMU_COUNTER_COUNT_A5 = 10;
```
This definition should be moved to src/a5/platform/include/common/pmu_profiling.h, since it is not a configurable knob that needs to be exposed.
```cpp
 * commits into records[] on COND FIN. Parity task_id & 1 picks the slot
 * so adjacent dual-issue dispatches never collide.
 */
struct PmuBuffer {
```
```cpp
    wmb();
}

void l2_perf_aicpu_flush_buffers(Runtime *runtime, int thread_idx, const int *cur_thread_cores, int core_num) {
```
The rebase rolled back earlier changes, and this adds an unused `runtime` parameter.