Refactor: align a5 profiling collectors with a2a3 streaming buffer arch #691
doraemonmj wants to merge 1 commit into hw-native-sys:main from
Conversation
Convert a5 L2 perf, PMU, and tensor-dump collectors from one-shot
post-stream-sync rtMemcpy into streaming SPSC free-queue / ready-queue
buffer pools, mirroring a2a3. Host runs a dedicated collector thread
that polls ready queues, copies records via rtMemcpy (onboard) or
memcpy (sim) into shadow buffers, writes CSV/swimlane output, and
recycles buffers back into per-core/thread free queues.
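The free-queue / ready-queue handoff described above can be sketched as a minimal SPSC ring. The names (`SpscQueue`, `CAPACITY`) and the power-of-two masking are illustrative assumptions, not the PR's actual implementation:

```cpp
#include <cassert>
#include <cstdint>

// Minimal SPSC ring: one producer advances tail, one consumer advances head.
// A power-of-two capacity lets indices be masked instead of taken modulo.
constexpr uint32_t CAPACITY = 8; // illustrative; must be a power of two

struct SpscQueue {
    volatile uint64_t slots[CAPACITY] = {}; // buffer pointers in flight
    volatile uint32_t head = 0;             // consumer read position
    volatile uint32_t tail = 0;             // producer write position

    bool push(uint64_t ptr) {               // producer side only
        if (tail - head == CAPACITY) return false; // full
        slots[tail & (CAPACITY - 1)] = ptr;
        // On shared device/host memory a write barrier (wmb) belongs here
        // so the slot is visible before the new tail is published.
        tail = tail + 1;
        return true;
    }
    bool pop(uint64_t *out) {               // consumer side only
        if (head == tail) return false;     // empty
        *out = slots[head & (CAPACITY - 1)];
        head = head + 1;
        return true;
    }
};
```

In this scheme the host pre-fills one ring (the free pool); the device pops a free buffer, fills it, and pushes it onto a second ring of the same shape (the ready queue).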
- platform/include/common: introduce L2PerfDataHeader / PmuDataHeader /
DumpMetaBuffer + per-core/thread BufferState with embedded SPSC
free queues; replace single-buffer Setup headers
- platform/{aicpu,host}: implement device-side buffer-switch on full
buffer + host-side poll/collect/recycle loop; rename PmuCollector to
PmuCollectorHost and add export_swimlane_json / poll_and_collect
hooks on DeviceRunner (onboard + sim)
- platform_config.h: add PROF/PMU/DUMP SLOT_COUNT, BUFFERS_PER_CORE,
BUFFERS_PER_THREAD, READYQUEUE_SIZE, TIMEOUT_SECONDS knobs;
PROF_BUFFER_SIZE 10000 -> 1000 (per-buffer), PMU_RECORDS_PER_BUFFER
4096 -> 512
- runtime/scheduler: add CoreExecState.dispatch_count, switch buffer
in dispatch_subtask_to_core when full, flush l2_perf / phase / pmu
buffers at end of resolve_and_dispatch and on_orchestration_done
- runtime/host_build_graph/aicpu_executor: flush l2_perf / pmu /
dump_tensor buffers before AICore shutdown in run()
- docs/pmu-profiling: rewrite to describe streaming architecture and
document a5-specific shadow-buffer + rtMemcpy transport (no
halHostRegister on DAV_3510)
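The host-side poll/collect/recycle loop from the change list can be sketched roughly as follows. `Buffer`, the `std::deque` stand-ins, and the `poll_and_collect` signature are all assumptions; `memcpy` stands in for `rtMemcpy` as on the sim path:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <deque>

// Stand-in for a pooled record buffer; the real pools hold L2 perf, PMU,
// or tensor-dump records and live in device-visible memory.
struct Buffer {
    uint8_t data[1024];
    uint32_t bytes_used = 0;
};

// std::deque stands in for the shared-memory SPSC rings.
std::deque<Buffer*> ready_queue; // device -> host: buffers ready to drain
std::deque<Buffer*> free_queue;  // host -> device: recycled buffers

// One pass of the host collector thread: drain every ready buffer into a
// shadow copy (memcpy here, rtMemcpy on board), then recycle the buffer.
size_t poll_and_collect(uint8_t *shadow, size_t shadow_cap) {
    size_t written = 0;
    while (!ready_queue.empty()) {
        Buffer *buf = ready_queue.front();
        ready_queue.pop_front();
        if (written + buf->bytes_used <= shadow_cap) {
            std::memcpy(shadow + written, buf->data, buf->bytes_used);
            written += buf->bytes_used; // real code parses records to CSV here
        }
        free_queue.push_back(buf);      // return to the device's free pool
    }
    return written;
}
```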
Code Review
This pull request refactors the performance, tensor dump, and PMU profiling collectors to use a streaming buffer architecture with SPSC queues, replacing the previous memcpy-based design. This change improves data collection efficiency and decouples the device and host operations. I have identified critical issues in the struct padding and layout definitions for DumpFreeQueue, DumpBufferState, and PmuBuffer that will cause static assertion failures and potential alignment issues.
```cpp
volatile uint64_t buffer_ptrs[PLATFORM_DUMP_SLOT_COUNT];
volatile uint32_t head;  // Consumer read position (Device increments)
volatile uint32_t tail;  // Producer write position (Host increments)
uint32_t pad[13];        // Pad to 128 bytes
```
The padding for DumpFreeQueue is incorrect. With PLATFORM_DUMP_SLOT_COUNT as 4, the size of buffer_ptrs is 32 bytes. With head and tail (4 bytes each), the size before padding is 40 bytes. pad[13] adds 52 bytes, making the total size 92 bytes. This will fail the static_assert on line 145 which requires a size of 128 bytes. To fix this, the padding should be uint32_t pad[22] (88 bytes), which makes the total size 40 + 88 = 128 bytes.
```cpp
uint32_t pad[22]; // Pad to 128 bytes
```
References
- Use static_assert to prevent regressions in hardware-related struct layouts and alignment.
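The reviewer's arithmetic can be locked in at compile time. A minimal sketch, assuming `PLATFORM_DUMP_SLOT_COUNT` is 4 and natural alignment, as stated in the comment:

```cpp
#include <cassert>
#include <cstdint>

constexpr int PLATFORM_DUMP_SLOT_COUNT = 4; // value assumed from the review

struct DumpFreeQueue {
    volatile uint64_t buffer_ptrs[PLATFORM_DUMP_SLOT_COUNT]; // 4 * 8 = 32 bytes
    volatile uint32_t head;  // consumer read position, 4 bytes
    volatile uint32_t tail;  // producer write position, 4 bytes
    uint32_t pad[22];        // 88 bytes of padding: 40 + 88 = 128
};

// pad[13] would yield sizeof == 92 and trip this check at compile time.
static_assert(sizeof(DumpFreeQueue) == 128, "DumpFreeQueue must be 128 bytes");
```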
```cpp
volatile uint64_t arena_size;           // Arena size in bytes
volatile uint64_t arena_write_offset;   // Monotonic write cursor (host computes % arena_size)
volatile uint32_t dropped_record_count; // Records dropped before host export
uint8_t pad1[28];                       // Pad to 256 bytes
```
The padding for DumpBufferState is incorrect. Assuming DumpFreeQueue is fixed to be 128 bytes, the size of the members before pad1 is 172 bytes (128 + 8 + 4 + 4 + 8 + 8 + 8 + 4). pad1[28] adds 28 bytes, making the total size 200 bytes. This will fail the static_assert on line 174 which requires a size of 256 bytes. To fix this, the padding should be uint8_t pad1[84], as 172 + 84 = 256.
```cpp
uint8_t pad1[84]; // Pad to 256 bytes
```
References
- Use static_assert to prevent regressions in hardware-related struct layouts and alignment.
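One way to keep such padding from going stale is to derive it from the target size instead of hand-counting. A sketch using a placeholder 172-byte payload in place of `DumpBufferState`'s real leading members:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Placeholder for DumpBufferState's leading members (DumpFreeQueue plus the
// arena fields), which the review totals to 172 bytes.
struct Payload {
    uint8_t members[172];
};

constexpr size_t TARGET_SIZE = 256;

// Deriving the pad from the target size keeps it correct even when a
// member is later added or removed.
struct PaddedState {
    Payload body;
    uint8_t pad1[TARGET_SIZE - sizeof(Payload)]; // 256 - 172 = 84 bytes
};

static_assert(sizeof(PaddedState) == TARGET_SIZE, "must stay 256 bytes");
```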
```cpp
struct PmuBuffer {
    // Header (first 64 bytes) — host copies this alone first to learn count.
    volatile uint32_t count;  // Number of valid records
    uint32_t pad[15];         // Pad header to 64 bytes
    PmuRecord records[PLATFORM_PMU_RECORDS_PER_BUFFER];

    // Dual-issue staging slots — AICore writes a PmuRecord here after
    // each task, then AICPU copies it into records[count] and fills
    // func_id / core_type on FIN. Index = task_id & 1.
    // a5-specific: AICore writes here, AICPU commits to records[]
    PmuRecord dual_issue_slots[2];
} __attribute__((aligned(64)));
```
The layout of PmuBuffer is inconsistent with L2PerfBuffer and seems problematic. The dual_issue_slots are placed at the end of the struct, after the large records array. In L2PerfBuffer, the equivalent wip slots are at the beginning. Placing the dual_issue_slots, which are written to by AICore, at a large, non-fixed offset makes the design fragile and difficult to maintain. For consistency with L2PerfBuffer and for a more robust design, dual_issue_slots should be moved to the beginning of the struct.
```cpp
struct PmuBuffer {
    // a5-specific: AICore writes here, AICPU commits to records[]
    PmuRecord dual_issue_slots[2];
    PmuRecord records[PLATFORM_PMU_RECORDS_PER_BUFFER];
    volatile uint32_t count;
} __attribute__((aligned(64)));
```
References
- Maintain 64-byte alignment and robust layout for hardware-accessed buffers to ensure platform compatibility and prevent regressions.
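The fixed-offset property the suggestion relies on can be pinned down with `offsetof`. In the sketch below, `PmuRecord`'s shape and the record count are stand-ins, not the PR's real definitions:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-ins: the real PmuRecord layout and record count differ.
struct PmuRecord { uint64_t counters[10]; };
constexpr int PLATFORM_PMU_RECORDS_PER_BUFFER = 512;

struct PmuBuffer {
    // a5-specific: AICore writes here, AICPU commits to records[]
    PmuRecord dual_issue_slots[2];
    PmuRecord records[PLATFORM_PMU_RECORDS_PER_BUFFER];
    volatile uint32_t count;
} __attribute__((aligned(64)));

// With the staging slots first, AICore's write target sits at offset 0 no
// matter how PLATFORM_PMU_RECORDS_PER_BUFFER is tuned between builds.
static_assert(offsetof(PmuBuffer, dual_issue_slots) == 0, "fixed offset");
static_assert(sizeof(PmuBuffer) % 64 == 0, "size padded to 64-byte multiple");
```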
```cpp
constexpr uint32_t AICORE_COREID_MASK = 0x0FFF;

// DAV_3510 hardware counter count.
constexpr int PMU_COUNTER_COUNT_A5 = 10;
```
This definition should be moved to src/a5/platform/include/common/pmu_profiling.h, since it is not a configurable knob that needs to be exposed.
```cpp
 * commits into records[] on COND FIN. Parity task_id & 1 picks the slot
 * so adjacent dual-issue dispatches never collide.
 */
struct PmuBuffer {
```
```cpp
    wmb();
}

void l2_perf_aicpu_flush_buffers(Runtime *runtime, int thread_idx, const int *cur_thread_cores, int core_num) {
```
The rebase rolled back earlier changes, and this adds an unused `runtime` parameter.