131 changes: 72 additions & 59 deletions src/a5/docs/pmu-profiling.md
@@ -17,14 +17,17 @@ exercise the PMU export flow, but does not provide real hardware counters.
### Layered Responsibilities

- **Host** owns user entry, event-type selection, PMU session setup, and
final CSV export
final CSV export. Runs a dedicated collector thread that polls
per-thread ready queues via `rtMemcpy`, writes records to CSV, and
recycles buffers back into SPSC free queues.
- **AICPU** owns PMU init/finalize (event selectors, `PMU_CTRL_0/1`
start, CTRL restore), publishes per-core `PmuBuffer` and PMU MMIO base
into each `Handshake`, and on each task FIN copies AICore's
`dual_issue_slots[]` snapshot into `PmuBuffer::records[]` while
filling `func_id` / `core_type`. AICPU also stamps each core's
`PmuBufferState::owning_thread_id` (the AICPU scheduler thread that
drives that core) so host can emit a per-record `thread_id` column.
filling `func_id` / `core_type`. When a buffer is full, AICPU
switches to a new buffer via the SPSC free queue / ready queue
protocol (identical to a2a3). At shutdown, AICPU flushes any
partially-filled buffers via `pmu_aicpu_flush_buffers()`.
- **AICore** gates the counting window around the kernel body via CTRL
SPR bit 0, reads the 10 PMU counters + `PMU_CNT_TOTAL` via the
`ld_dev` MMIO load intrinsic after each task, and writes the snapshot
@@ -50,57 +50,65 @@ logical id encodes more than 32 bits (e.g. PTO2's
values differ — slot match must use the register token, otherwise the
slot will never validate and every commit is silently dropped.

### Streaming Buffer Architecture (mirrors a2a3)

The a5 PMU collector uses the same SPSC streaming buffer architecture
as a2a3, with identical data structures and flow:

- **PmuFreeQueue**: SPSC queue per core. Host pushes recycled/new
PmuBuffers; AICPU pops when switching after a buffer fills.
- **PmuReadyQueue**: Per-thread ready queue in PmuDataHeader. AICPU
enqueues full buffers; host collector thread dequeues them.
- **PmuBufferState**: Per-core state tracking current active buffer,
sequence number, dropped/total record counts.
- **Buffer lifecycle**: Host pre-allocates `BUFFERS_PER_CORE` buffers
per core, pushes them into free queues. AICPU pops one at init.
When full, AICPU enqueues to ready queue and pops a fresh one.
Host collector drains ready queue, writes CSV, recycles buffers
back into free queues.
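
To make the ownership rules concrete, below is a minimal sketch of the
SPSC ring protocol the free/ready queues follow. It is illustrative
only: the struct name, field names, and capacity are assumptions, not
the actual `PmuFreeQueue` / `PmuReadyQueue` layouts, and on a5 the host
side only ever sees `head`/`tail` through `rtMemcpy` snapshots rather
than shared atomics (the atomics model the a2a3 shared-memory case).

```cpp
#include <atomic>
#include <cstdint>

// Single-producer / single-consumer ring of device buffer addresses.
// Exactly one side writes `tail` (producer) and exactly one side writes
// `head` (consumer), so no locks are required.
struct SpscQueue {
    static constexpr uint32_t kCapacity = 8;  // assumed, not the real depth
    uint64_t slots[kCapacity];                // device addresses of PmuBuffers
    std::atomic<uint32_t> head{0};            // consumer-owned index
    std::atomic<uint32_t> tail{0};            // producer-owned index

    bool push(uint64_t buf) {  // producer: host for free queues, AICPU for ready queues
        uint32_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == kCapacity) {
            return false;  // full
        }
        slots[t % kCapacity] = buf;
        tail.store(t + 1, std::memory_order_release);
        return true;
    }

    bool pop(uint64_t *buf) {  // consumer: AICPU for free queues, host for ready queues
        uint32_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire)) {
            return false;  // empty
        }
        *buf = slots[h % kCapacity];
        head.store(h + 1, std::memory_order_release);
        return true;
    }
};
```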

The only difference from a2a3 is the memory transport layer:

- **a2a3**: `halHostRegister` maps device memory into host address
space — host and device share the same physical memory.
- **a5**: No `halHostRegister` on DAV_3510. Host maintains separate
shadow buffers and synchronizes via `rtMemcpy` (onboard) or
`memcpy` (sim). The platform copy hooks
`pmu_platform_copy_to/from_device` abstract this.
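
A sketch of what the copy hooks could look like; the real signatures
live in the a5 platform layer and may differ, and the build flag below
is hypothetical (`rtMemcpy` / `RT_MEMCPY_DEVICE_TO_HOST` are the
standard runtime names assumed here):

```cpp
#include <cstddef>
#include <cstring>  // memcpy
// #include "runtime/rt.h"  // rtMemcpy, on the onboard build

static int pmu_platform_copy_from_device(void *host_dst, const void *dev_src, size_t bytes) {
#ifdef PMU_SIM_BUILD  // hypothetical build flag
    std::memcpy(host_dst, dev_src, bytes);  // sim: one address space
    return 0;
#else
    // onboard: device memory is not host-mapped, so an explicit DMA copy
    return rtMemcpy(host_dst, bytes, dev_src, bytes, RT_MEMCPY_DEVICE_TO_HOST);
#endif
}
```

`pmu_platform_copy_to_device` is the mirror image with
`RT_MEMCPY_HOST_TO_DEVICE`.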

### Device Memory Layout

```text
[ PmuSetupHeader ] ← num_cores, event_type, buffer_ptrs[N]
[ PmuBufferState[num_cores] ] ← owning_thread_id, dropped_record_count,
total_record_count (per core)
[ PmuDataHeader ] ← ready queues, queue_heads/tails,
num_cores, event_type
[ PmuBufferState[num_cores] ] ← free_queue, current_buf_ptr,
dropped/total counts, owning_thread_id
```

This single shared region (`calc_pmu_setup_size`) is allocated once at
init and pulled back via one `rtMemcpy` at finalize. The per-core
`PmuBuffer`s themselves are separate device allocations (one per core),
copied back individually on demand because their `count`-sized payloads
vary.

### Device → Host Transfer

halHostRegister is not supported on DAV_3510, so the PMU collector uses
the same two-step rtMemcpy pattern already used by the performance and
tensor-dump collectors:

1. At init: host allocates the `[PmuSetupHeader][PmuBufferState[]]`
region plus one `PmuBuffer` per core. The setup region's device
address is published into `kernel_args.pmu_data_base`. AICPU then
publishes each core's `PmuBuffer` address and PMU MMIO base into
the matching `Handshake` (`pmu_buffer_addr`, `pmu_reg_base`) so
AICore can do its own MMIO read — parallels `perf_records_addr`.
2. During execution: AICore, after each kernel completes, reads the
10 PMU counters via `ld_dev(base, offset)` and writes them into
`PmuBuffer::dual_issue_slots[reg_task_id & 1]`. AICPU, on observing
COND FIN, validates that slot's `task_id` against the register
token (`pmu_aicpu_complete_record`), copies register state into
`PmuBuffer::records[count]`, fills `func_id` / `core_type`, stamps
`PmuBufferState::owning_thread_id`, and advances `count`.
3. After stream sync: host pulls back the entire setup region (all
`PmuBufferState`s in one shot) plus each core's `PmuBuffer` payload
(header first to learn `count`, then `count * sizeof(PmuRecord)`
records). The host writes a CSV under `outputs/` and emits a
cross-check log line:

```text
PMU collector: record counts match (collected=N, dropped=K, device_total=N+K)
```

If `collected + dropped != device_total`, the difference is silent
slot-mismatch loss (AICore had not yet published the slot when
AICPU tried to commit) and the line becomes a `record count
mismatch (... diff=M silent slot-mismatch losses)` warning.

This design fits naturally into the existing a5 `PerformanceCollector` /
`TensorDumpCollector` lifecycle — `initialize / collect_all / export /
finalize` is called from `DeviceRunner::run`.
This single shared region (`calc_pmu_data_size`) is allocated once at
init. Host maintains a shadow copy and syncs via copy hooks.

Per-core `PmuBuffer`s are separate device allocations, each with a
paired host shadow buffer in `buf_pool_`.
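
As a rough illustration of the layout above, the region size could be
computed like this. All struct members and `kQueueCapacity` /
`kMaxSchedThreads` are assumptions standing in for the real
definitions:

```cpp
#include <cstddef>
#include <cstdint>

constexpr uint32_t kQueueCapacity = 8;  // assumed ring depth
constexpr int kMaxSchedThreads = 4;     // assumed thread bound

// Plain-data view of the SPSC ring from the sketch earlier; on the host
// side these fields arrive as a memcpy'd snapshot.
struct PmuReadyQueue {
    uint64_t slots[kQueueCapacity];
    uint32_t head;
    uint32_t tail;
};

struct PmuDataHeader {  // first box in the diagram
    uint32_t num_cores;
    uint32_t event_type;
    PmuReadyQueue ready_queues[kMaxSchedThreads];  // one per AICPU thread
};

struct PmuBufferState {  // second box, one entry per core
    PmuReadyQueue free_queue;  // host -> AICPU buffer recycling
    uint64_t current_buf_ptr;
    uint64_t dropped_record_count;
    uint64_t total_record_count;
    uint32_t owning_thread_id;
};

// One contiguous device allocation, sized once at init.
size_t calc_pmu_data_size(int num_cores) {
    return sizeof(PmuDataHeader) + size_t(num_cores) * sizeof(PmuBufferState);
}
```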

### Host Collector Thread

The host runs `poll_and_collect()` on a dedicated thread (launched by
`DeviceRunner::run()` before kernel launch, joined after stream sync).
This thread:

1. Polls ready queue tails via `pmu_platform_copy_from_device`
2. For each ready entry, copies the full `PmuBuffer` from device to
its host shadow
3. Writes records to CSV
4. Recycles the buffer back into the core's free queue (zeroes count,
updates free queue tail on device)
5. Exits when `signal_execution_complete()` is called

After the collector thread exits, `drain_remaining_buffers()` does a
final pass: syncs the entire shared memory region, drains any remaining
ready queue entries, and scans `current_buf_ptr` for partially-filled
buffers that AICPU flushed but couldn't enqueue.
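
Sketched below is how the polling loop could be organized, reusing the
illustrative types from the layout sketch above. Everything except the
documented copy hook and the `signal_execution_complete()` semantics is
an assumption:

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

// Forward declarations: the copy hook is documented above; the other
// two helpers are hypothetical stand-ins for CSV write + recycle logic.
int pmu_platform_copy_from_device(void *dst, const void *src, size_t bytes);
void drain_one_buffer(uint64_t dev_buf);                 // hypothetical
void publish_ready_head(int thread_idx, uint32_t head);  // hypothetical

struct CollectorCtx {  // hypothetical host-side state
    std::atomic<bool> execution_complete{false};  // set by signal_execution_complete()
    PmuDataHeader shadow_header{};                // host shadow of the device header
    uint64_t dev_header_addr = 0;
    int num_threads = 0;
};

void poll_and_collect(CollectorCtx &ctx) {
    while (!ctx.execution_complete.load(std::memory_order_acquire)) {
        // 1. Snapshot the shared header so new ready-queue tails are visible.
        pmu_platform_copy_from_device(&ctx.shadow_header,
                                      (const void *)ctx.dev_header_addr,
                                      sizeof(ctx.shadow_header));
        for (int t = 0; t < ctx.num_threads; ++t) {
            PmuReadyQueue &q = ctx.shadow_header.ready_queues[t];
            while (q.head != q.tail) {
                uint64_t dev_buf = q.slots[q.head % kQueueCapacity];
                // 2. Copy the full PmuBuffer into its host shadow,
                // 3. append its records to the CSV, and
                // 4. zero the count and push the buffer back into the owning
                //    core's free queue (free-queue tail written to device).
                drain_one_buffer(dev_buf);
                ++q.head;
            }
            publish_ready_head(t, q.head);  // write consumed head back to device
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
}
```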

## Usage

@@ -166,7 +177,7 @@ Columns (in order) — matches a2a3 host PMU CSV for tooling parity:

| Column | Meaning |
| ------ | ------- |
| `thread_id` | AICPU scheduler thread that drives this core (read from `PmuBufferState::owning_thread_id`) |
| `thread_id` | AICPU scheduler thread that drives this core |
| `core_id` | Logical AICore id in the runtime |
| `task_id` | Runtime task id, printed as hex |
| `func_id` | Kernel function id |
@@ -179,8 +190,9 @@ For the default `PIPE_UTILIZATION` event type (`2`), the counter columns
on a5 are (from pypto `tilefwk_pmu_to_csv.py` table_pmu_header_3510):

```text
pmu_idc_aic_vec_busy_o,cube_instr_busy,scalar_instr_busy,mte1_instr_busy,
mte2_instr_busy,mte3_instr_busy,icache_req,icache_miss,pmu_fix_instr_busy
pmu_idc_aic_vec_busy_o,cube_instr_busy,scalar_instr_busy,
mte1_instr_busy,mte2_instr_busy,mte3_instr_busy,
icache_req,icache_miss,pmu_fix_instr_busy
```

The number of counter columns varies by event type — each DAV_3510 event
@@ -196,12 +208,13 @@ discover which columns are present.
always 0. The AICPU still programs the PMU event selectors and the CSV
still carries one row per task with a zero counter tuple — useful for
verifying the end-to-end data flow but not for performance analysis.
- The per-core on-device `PmuBuffer` is a fixed size
(`PLATFORM_PMU_RECORDS_PER_BUFFER = 4096`). Tasks past that count are
dropped and accounted in `PmuBufferState::dropped_record_count`; the
host surfaces the total in the finalize log line. Increase the
constant in [platform_config.h](../platform/include/common/platform_config.h)
if your workload executes more tasks per core.
- The per-core on-device `PmuBuffer` capacity is controlled by
`PLATFORM_PMU_RECORDS_PER_BUFFER` (default 512). When full, AICPU
switches to a new buffer via the free queue. If no free buffer is
available, records are dropped. Increase `PLATFORM_PMU_BUFFERS_PER_CORE`
(default 4) in
[platform_config.h](../platform/include/common/platform_config.h)
if your workload produces bursts that exhaust the buffer pool.
- A non-zero `diff` in the host's `record count mismatch` warning means
AICPU attempted to commit `diff` records whose dual-issue slot still
carried an older `task_id`. Under the current design on DAV_3510
52 changes: 41 additions & 11 deletions src/a5/platform/include/aicpu/l2_perf_collector_aicpu.h
@@ -13,10 +13,7 @@
* @brief AICPU performance data collection interface
*
* Provides performance profiling management interface for AICPU side.
* Handles buffer initialization and per-record completion. In the memcpy-based
* collection design, Host pre-allocates one L2PerfBuffer per core and one
* PhaseBuffer per thread; AICPU writes directly into them until full, after
* which further records are silently dropped.
* Handles buffer initialization, switching, and flushing.
*/

#ifndef PLATFORM_AICPU_L2_PERF_COLLECTOR_AICPU_H_
@@ -56,12 +53,10 @@ void l2_perf_aicpu_init_profiling(Runtime *runtime);
* Complete a L2PerfRecord with AICPU-side metadata after AICore task completion
*
* Reads l2_perf_buf->count, validates task_id match against the latest record,
* and fills all AICPU-side fields. Returns -1 and silently drops the record
* when the buffer is full (count >= PLATFORM_PROF_BUFFER_SIZE). Callers must
* pre-extract fanout into a plain uint64_t array (platform layer cannot depend
* on runtime linked-list types).
* and fills all AICPU-side fields. Callers must pre-extract fanout into a
* plain uint64_t array (platform layer cannot depend on runtime linked-list types).
*
* @param l2_perf_buf L2PerfBuffer pointer (from handshake l2_perf_records_addr)
* @param expected_reg_task_id Register dispatch token (low 32 bits) to validate
* @param task_id Task identifier to write (PTO2 encoding or plain id)
* @param func_id Kernel function identifier
@@ -76,6 +71,30 @@
uint64_t dispatch_time, uint64_t finish_time, const uint64_t *fanout, int32_t fanout_count
);

/**
* Switch performance buffer when current buffer is full
*
* Enqueues the full buffer to ReadyQueue, pops a new buffer from FreeQueue,
* and updates the handshake l2_perf_records_addr. If FreeQueue is empty,
* overwrites the current buffer (lossy fallback).
*
* @param runtime Runtime instance pointer
* @param core_id Core ID
* @param thread_idx Thread index
*/
void l2_perf_aicpu_switch_buffer(Runtime *runtime, int core_id, int thread_idx);
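
/*
 * Sketch of the intended sequence (illustrative only; the helper names
 * are hypothetical and the real implementation lives in the .cpp):
 *
 *   L2PerfBuffer *full = current_buffer(core_id);
 *   ready_queue_enqueue(thread_idx, full);       // host collector drains it
 *   L2PerfBuffer *next;
 *   if (!free_queue_pop(core_id, &next))
 *       next = full;                             // lossy fallback: overwrite
 *   handshake_of(core_id)->l2_perf_records_addr = (uint64_t)next;
 */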

/**
* Flush remaining performance data
*
* Marks non-empty buffers as ready and enqueues them for host collection.
*
* @param thread_idx Thread index
* @param cur_thread_cores Array of core IDs managed by this thread
* @param core_num Number of cores managed by this thread
*/
void l2_perf_aicpu_flush_buffers(int thread_idx, const int *cur_thread_cores, int core_num);

/**
* Update total task count in performance header
*
Expand All @@ -92,15 +111,16 @@ void l2_perf_aicpu_update_total_tasks(uint32_t total_tasks);
* Sets up AicpuPhaseHeader and clears per-thread phase record buffers.
* Must be called once from thread 0 after l2_perf_aicpu_init_profiling().
*
* @param runtime Runtime instance pointer
* @param num_sched_threads Number of scheduler threads
*/
void l2_perf_aicpu_init_phase_profiling(int num_sched_threads);
void l2_perf_aicpu_init_phase_profiling(Runtime *runtime, int num_sched_threads);
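
/*
 * Example call order on thread 0 (assumed, per the note above):
 *
 *   l2_perf_aicpu_init_profiling(runtime);
 *   l2_perf_aicpu_init_phase_profiling(runtime, num_sched_threads);
 */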

/**
* Record a single scheduler phase
*
* Appends an AicpuPhaseRecord to the specified thread's buffer.
* Silently drops records when the buffer is full.
* When the buffer is full, switches to a new buffer via FreeQueue.
*
* @param thread_idx Scheduler thread index
* @param phase_id Phase identifier
@@ -163,4 +183,14 @@ void l2_perf_aicpu_record_orch_phase(
void l2_perf_aicpu_init_core_assignments(int total_cores);
void l2_perf_aicpu_write_core_assignments_for_thread(int thread_idx, const int *core_ids, int core_num);

/**
* Flush remaining phase records for a thread
*
* Marks the current WRITING phase buffer as READY and enqueues it
* for host collection. Called at thread exit (analogous to l2_perf_aicpu_flush_buffers).
*
* @param thread_idx Thread index (scheduler thread or orchestrator)
*/
void l2_perf_aicpu_flush_phase_buffers(int thread_idx);

#endif // PLATFORM_AICPU_L2_PERF_COLLECTOR_AICPU_H_
71 changes: 29 additions & 42 deletions src/a5/platform/include/aicpu/pmu_collector_aicpu.h
@@ -13,31 +13,21 @@
* @file pmu_collector_aicpu.h
* @brief AICPU-side PMU collection interface (a5)
*
* Split of duties:
* - AICPU owns init (event selectors, PMU_CTRL_0/1 start) and finalize
* (CTRL restore). It also publishes per-core pmu_buffer_addr /
* pmu_reg_base into Handshake at init time so AICore can do the
* MMIO read itself.
* - AICore reads PMU counters + PMU_CNT_TOTAL via MMIO after each task
* (pmu_aicore_record_task), writing into PmuBuffer::dual_issue_slots.
* - AICPU, on COND FIN, validates the slot and commits a full PmuRecord
* into PmuBuffer::records[] (pmu_aicpu_complete_record).
*
* Lifecycle (called from aicpu_executor.cpp):
* pmu_aicpu_init() — resolve per-core PMU MMIO bases + buffer
* pointers, program events, start counters,
* pop initial PmuBuffers from free_queues,
* publish (pmu_buffer_addr, pmu_reg_base)
* to each Handshake.
* [task loop]
*   pmu_aicpu_complete_record() — copy the dual-issue slot AICore wrote
* into PmuBuffer::records[count], filling
* func_id + core_type. Drops the record
* silently if the buffer is full.
* func_id + core_type. Switches buffer
* when full.
* pmu_aicpu_flush_buffers() — per-thread: flush each of this thread's
* non-empty PmuBuffers to the ready_queue
* (mirrors a2a3 pmu_aicpu_flush_buffers)
* pmu_aicpu_finalize() — per-thread: restore CTRL registers.
*
* a5 uses a single pre-allocated PmuBuffer per core; the host drains it via
* rtMemcpy after stream sync (see src/a5/platform/src/host/pmu_collector.cpp).
* There is no SPSC queue and no per-thread flush step.
*/

#ifndef PLATFORM_AICPU_PMU_COLLECTOR_AICPU_H_
@@ -62,6 +52,7 @@ extern "C" bool is_pmu_enabled();
* PMU reg-addr table.
* - Program event selectors (PMU_CNT0_IDX..CNT9_IDX).
* - Start counters (set PMU_CTRL_0 and PMU_CTRL_1).
* - Pop an initial PmuBuffer from the per-core free_queue.
* - Publish (pmu_buffer_addr, pmu_reg_base) into handshakes[i] so the
* matching AICore can read PMU MMIO and write the dual-issue slot.
*
@@ -73,45 +64,41 @@
* Must be called after the host has published pmu_data_base (via
* set_platform_pmu_base) and after every active core has reported its
* physical_core_id via handshake. Must be called BEFORE the caller
* sets aicpu_regs_ready=1 on each handshake, so AICore observes the
* new fields via the same release/acquire boundary.
* sets aicpu_regs_ready=1 on each handshake.
*
* @param handshakes Handshake array (one per core). This function
* writes pmu_buffer_addr and pmu_reg_base into
* handshakes[0..num_cores). Caller owns lifetime.
* @param physical_core_ids Array of hardware physical core ids, indexed by
* logical core_id. Caller owns the memory; this
* function does not retain the pointer.
* @param num_cores Number of active cores (logical core_id range is [0, num_cores))
* @param handshakes Handshake array (one per core)
* @param physical_core_ids Array of hardware physical core ids
* @param num_cores Number of active cores
*/
void pmu_aicpu_init(Handshake *handshakes, const uint32_t *physical_core_ids, int num_cores);

/**
* Commit one PmuRecord from the dual-issue staging slot that AICore wrote
* into PmuBuffer::dual_issue_slots[task_id & 1]. Copies register state
* (pmu_counters + pmu_total_cycles) and fills AICPU-owned metadata
* (task_id, func_id, core_type). When the buffer is full the record is
* dropped and the core's PmuBufferState::dropped_record_count is incremented.
* Every call bumps PmuBufferState::total_record_count so host can cross-check
* collected + dropped against the AICPU's attempted-commit count.
* No-op if PMU is not enabled or the core has no PMU buffer bound.
* Commit one PmuRecord from the dual-issue staging slot.
* Switches buffer via SPSC free_queue/ready_queue when full.
*
* @param core_id Logical core index
* @param thread_idx AICPU thread index (reserved; not used on a5 memcpy path)
* @param reg_task_id Register dispatch token (DATA_MAIN_BASE value). AICore
* wrote this 32-bit value into dual_issue_slots[...].task_id,
* so AICPU uses it to locate the slot and validate its
* freshness. Callers should pass the same register token
* they observed on COND / wrote via DATA_MAIN_BASE.
* @param task_id Full task_id to store in the PmuRecord (e.g. PTO2's
* (ring_id<<32)|local_id). May differ from reg_task_id.
* @param thread_idx AICPU thread index (selects ready_queue)
* @param reg_task_id Register dispatch token (slot match key)
* @param task_id Full task_id to store in the PmuRecord
* @param func_id kernel_id from the completed task slot
* @param core_type AIC or AIV
*/
void pmu_aicpu_complete_record(
int core_id, int thread_idx, uint32_t reg_task_id, uint64_t task_id, uint32_t func_id, CoreType core_type
);

/**
* Per-thread PMU buffer flush. Mirrors a2a3 pmu_aicpu_flush_buffers().
*
* For each core in cur_thread_cores, enqueue its non-empty PmuBuffer into the
* thread's ready_queue so the host collector can pick it up.
*
* @param thread_idx AICPU thread index (selects ready_queue)
* @param cur_thread_cores Array of logical core ids owned by this thread
* @param core_num Entries in cur_thread_cores
*/
void pmu_aicpu_flush_buffers(int thread_idx, const int *cur_thread_cores, int core_num);
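
/*
 * Assumed per-thread shutdown order (the finalize signature below is
 * abbreviated; see its declaration further down):
 *
 *   pmu_aicpu_flush_buffers(thread_idx, cur_thread_cores, core_num);
 *   pmu_aicpu_finalize(...);  // restore CTRL registers only after the flush
 */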

/**
* Per-thread PMU finalize: restore CTRL registers for this thread's cores.
*