
Refactor: align a5 profiling collectors with a2a3 streaming buffer arch #691

Open

doraemonmj wants to merge 1 commit into hw-native-sys:main from doraemonmj:a5memory

Conversation

@doraemonmj
Contributor

Convert a5 L2 perf, PMU, and tensor-dump collectors from one-shot post-stream-sync rtMemcpy into streaming SPSC free-queue / ready-queue buffer pools, mirroring a2a3. Host runs a dedicated collector thread that polls ready queues, copies records via rtMemcpy (onboard) or memcpy (sim) into shadow buffers, writes CSV/swimlane output, and recycles buffers back into per-core/thread free queues.

  • platform/include/common: introduce L2PerfDataHeader / PmuDataHeader / DumpMetaBuffer + per-core/thread BufferState with embedded SPSC free queues; replace single-buffer Setup headers
  • platform/{aicpu,host}: implement device-side buffer-switch on full buffer + host-side poll/collect/recycle loop; rename PmuCollector to PmuCollectorHost and add export_swimlane_json / poll_and_collect hooks on DeviceRunner (onboard + sim)
  • platform_config.h: add PROF/PMU/DUMP SLOT_COUNT, BUFFERS_PER_CORE, BUFFERS_PER_THREAD, READYQUEUE_SIZE, TIMEOUT_SECONDS knobs; PROF_BUFFER_SIZE 10000 -> 1000 (per-buffer), PMU_RECORDS_PER_BUFFER 4096 -> 512
  • runtime/scheduler: add CoreExecState.dispatch_count, switch buffer in dispatch_subtask_to_core when full, flush l2_perf / phase / pmu buffers at end of resolve_and_dispatch and on_orchestration_done
  • runtime/host_build_graph/aicpu_executor: flush l2_perf / pmu / dump_tensor buffers before AICore shutdown in run()
  • docs/pmu-profiling: rewrite to describe streaming architecture and document a5-specific shadow-buffer + rtMemcpy transport (no halHostRegister on DAV_3510)
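For orientation, here is a minimal sketch of the host-side poll/collect/recycle loop described above. Everything in it is illustrative rather than the PR's actual code: the queue type, copy_from_device, write_records, and the constants are stand-ins; the real transport is rtMemcpy on board and plain memcpy in the simulator, and the real free/ready queues are the volatile per-core/thread structures this PR introduces in device-visible memory.

#include <atomic>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <thread>

// Illustrative sizes; the real values are the platform_config.h knobs listed above.
constexpr uint32_t kReadyQueueSize = 8;    // power of two so monotonic indices wrap safely
constexpr uint32_t kBufferBytes    = 1000;

// Minimal SPSC ring of buffer pointers. std::atomic is used only to keep this
// host-only sketch self-contained.
struct SpscQueue {
    uint64_t slots[kReadyQueueSize] = {};
    std::atomic<uint32_t> head{0};   // consumer index
    std::atomic<uint32_t> tail{0};   // producer index

    bool push(uint64_t v) {
        uint32_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == kReadyQueueSize) return false;  // full
        slots[t % kReadyQueueSize] = v;
        tail.store(t + 1, std::memory_order_release);
        return true;
    }
    bool pop(uint64_t &v) {
        uint32_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire)) return false;  // empty
        v = slots[h % kReadyQueueSize];
        head.store(h + 1, std::memory_order_release);
        return true;
    }
};

// Stand-in for the transport: rtMemcpy device-to-host onboard, memcpy in sim.
void copy_from_device(void *dst, uint64_t src, size_t bytes) {
    std::memcpy(dst, reinterpret_cast<const void *>(static_cast<uintptr_t>(src)), bytes);
}

// Stand-in for the CSV / swimlane writer.
void write_records(const uint8_t * /*shadow*/, size_t bytes) {
    std::printf("collected %zu bytes\n", bytes);
}

// Dedicated host collector: poll the ready queue, drain each full device buffer
// into a host shadow buffer, emit output, then recycle the buffer pointer back
// onto the free queue for the producing core/thread to reuse.
void collector_loop(SpscQueue &ready, SpscQueue &free_q, std::atomic<bool> &stop) {
    uint8_t shadow[kBufferBytes];
    while (!stop.load(std::memory_order_acquire)) {
        uint64_t dev_buf;
        if (!ready.pop(dev_buf)) { std::this_thread::yield(); continue; }
        copy_from_device(shadow, dev_buf, sizeof shadow);
        write_records(shadow, sizeof shadow);
        free_q.push(dev_buf);
    }
}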


@gemini-code-assist (bot) left a comment


Code Review

This pull request refactors the performance, tensor dump, and PMU profiling collectors to use a streaming buffer architecture with SPSC queues, replacing the previous memcpy-based design. This change improves data collection efficiency and decouples the device and host operations. I have identified critical issues in the struct padding and layout definitions for DumpFreeQueue, DumpBufferState, and PmuBuffer that will cause static assertion failures and potential alignment issues.

volatile uint64_t buffer_ptrs[PLATFORM_DUMP_SLOT_COUNT];
volatile uint32_t head; // Consumer read position (Device increments)
volatile uint32_t tail; // Producer write position (Host increments)
uint32_t pad[13]; // Pad to 128 bytes

critical

The padding for DumpFreeQueue is incorrect. With PLATFORM_DUMP_SLOT_COUNT as 4, the size of buffer_ptrs is 32 bytes. With head and tail (4 bytes each), the size before padding is 40 bytes. pad[13] adds 52 bytes, making the total size 92 bytes. This will fail the static_assert on line 145 which requires a size of 128 bytes. To fix this, the padding should be uint32_t pad[22] (88 bytes), which makes the total size 40 + 88 = 128 bytes.

    uint32_t pad[22];        // Pad to 128 bytes
References
  1. Use static_assert to prevent regressions in hardware-related struct layouts and alignment.
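For concreteness, a compile-time check of the corrected layout, assuming PLATFORM_DUMP_SLOT_COUNT is 4 as stated in the comment (field names follow the snippet above; this is a sketch, not the PR's file):

#include <cstdint>

constexpr int PLATFORM_DUMP_SLOT_COUNT = 4;  // value assumed from the review comment

struct DumpFreeQueue {
    volatile uint64_t buffer_ptrs[PLATFORM_DUMP_SLOT_COUNT];  // 4 * 8 = 32 bytes
    volatile uint32_t head;                                    //  4 bytes (device increments)
    volatile uint32_t tail;                                    //  4 bytes (host increments)
    uint32_t pad[22];                                          // 88 bytes -> 40 + 88 = 128
};

// The assert the reviewer refers to; with pad[13] this fails (92 != 128).
static_assert(sizeof(DumpFreeQueue) == 128, "DumpFreeQueue must be exactly 128 bytes");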

volatile uint64_t arena_size; // Arena size in bytes
volatile uint64_t arena_write_offset; // Monotonic write cursor (host computes % arena_size)
volatile uint32_t dropped_record_count; // Records dropped before host export
uint8_t pad1[28]; // Pad to 256 bytes

critical

The padding for DumpBufferState is incorrect. Assuming DumpFreeQueue is fixed to be 128 bytes, the size of the members before pad1 is 172 bytes (128 + 8 + 4 + 4 + 8 + 8 + 8 + 4). pad1[28] adds 28 bytes, making the total size 200 bytes. This will fail the static_assert on line 174 which requires a size of 256 bytes. To fix this, the padding should be uint8_t pad1[84], as 172 + 84 = 256.

    uint8_t pad1[84];                        // Pad to 256 bytes
References
  1. Use static_assert to prevent regressions in hardware-related struct layouts and alignment.
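The same arithmetic can be pinned with offset checks. In the sketch below only arena_size, arena_write_offset, and dropped_record_count come from the snippet above; the remaining member names, and the 128-byte DumpFreeQueue stub, are placeholders chosen so the sizes match the reviewer's 128 + 8 + 4 + 4 + 8 + 8 + 8 + 4 breakdown:

#include <cstddef>
#include <cstdint>

struct DumpFreeQueue { uint8_t raw[128]; };    // size stand-in for the fixed free queue

struct DumpBufferState {
    DumpFreeQueue free_queue;                  // 128
    volatile uint64_t arena_base;              //   8  (placeholder name)
    volatile uint32_t buffer_count;            //   4  (placeholder name)
    volatile uint32_t flags;                   //   4  (placeholder name)
    volatile uint64_t arena_size;              //   8
    volatile uint64_t arena_write_offset;      //   8
    volatile uint64_t reserved;                //   8  (placeholder name)
    volatile uint32_t dropped_record_count;    //   4  -> 172 bytes so far
    uint8_t pad1[84];                          //  84  -> 172 + 84 = 256
};

static_assert(offsetof(DumpBufferState, pad1) == 172, "members before pad1 total 172 bytes");
static_assert(sizeof(DumpBufferState) == 256, "DumpBufferState must be 256 bytes");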

Comment on lines 167 to +174
  struct PmuBuffer {
+     // Header (first 64 bytes) — host copies this alone first to learn count.
+     volatile uint32_t count;   // Number of valid records
+     uint32_t pad[15];          // Pad header to 64 bytes
      PmuRecord records[PLATFORM_PMU_RECORDS_PER_BUFFER];
-     volatile uint32_t count;
-     uint32_t pad[15];          // Pad count to 64 bytes

+     // Dual-issue staging slots — AICore writes a PmuRecord here after
+     // each task, then AICPU copies it into records[count] and fills
+     // func_id / core_type on FIN. Index = task_id & 1.
-     // a5-specific: AICore writes here, AICPU commits to records[]
      PmuRecord dual_issue_slots[2];
  } __attribute__((aligned(64)));

high

The layout of PmuBuffer is inconsistent with L2PerfBuffer and seems problematic. The dual_issue_slots are placed at the end of the struct, after the large records array. In L2PerfBuffer, the equivalent wip slots are at the beginning. Placing the dual_issue_slots, which are written to by AICore, at a large, non-fixed offset makes the design fragile and difficult to maintain. For consistency with L2PerfBuffer and for a more robust design, dual_issue_slots should be moved to the beginning of the struct.

struct PmuBuffer {
    // a5-specific: AICore writes here, AICPU commits to records[]
    PmuRecord dual_issue_slots[2];

    PmuRecord records[PLATFORM_PMU_RECORDS_PER_BUFFER];
    volatile uint32_t count;
} __attribute__((aligned(64)));
References
  1. Maintain 64-byte alignment and robust layout for hardware-accessed buffers to ensure platform compatibility and prevent regressions.
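One way to make the "fixed offset" expectation explicit is a compile-time pin on the slots' position. The sketch below follows the suggested slots-first layout; PmuRecord is a stand-in sized to the 10 DAV_3510 counters since its real fields are not shown in this thread:

#include <cstddef>
#include <cstdint>

struct PmuRecord { uint64_t counters[10]; };   // stand-in; real record layout not shown here
constexpr int PLATFORM_PMU_RECORDS_PER_BUFFER = 512;

struct PmuBuffer {
    PmuRecord dual_issue_slots[2];   // AICore-written staging slots, kept at offset 0
    PmuRecord records[PLATFORM_PMU_RECORDS_PER_BUFFER];
    volatile uint32_t count;
} __attribute__((aligned(64)));

// A future reorder then fails to compile instead of silently moving the
// slots the AICore path writes to.
static_assert(offsetof(PmuBuffer, dual_issue_slots) == 0,
              "dual_issue_slots must stay at the start of PmuBuffer");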

constexpr uint32_t AICORE_COREID_MASK = 0x0FFF;

// DAV_3510 hardware counter count.
constexpr int PMU_COUNTER_COUNT_A5 = 10;
Collaborator

This definition should be moved to src/a5/platform/include/common/pmu_profiling.h, since it is not a configurable option that needs to be exposed.

* commits into records[] on COND FIN. Parity task_id & 1 picks the slot
* so adjacent dual-issue dispatches never collide.
*/
struct PmuBuffer {
Collaborator

Why was the original struct member order changed?

wmb();
}

void l2_perf_aicpu_flush_buffers(Runtime *runtime, int thread_idx, const int *cur_thread_cores, int core_num) {
Collaborator

The rebase reverted earlier modifications, and it adds an unused runtime parameter.
