131 changes: 72 additions & 59 deletions src/a5/docs/pmu-profiling.md
@@ -17,14 +17,17 @@ exercise the PMU export flow, but does not provide real hardware counters.
### Layered Responsibilities

- **Host** owns user entry, event-type selection, PMU session setup, and
final CSV export
final CSV export. Runs a dedicated collector thread that polls
per-thread ready queues via `rtMemcpy`, writes records to CSV, and
recycles buffers back into SPSC free queues.
- **AICPU** owns PMU init/finalize (event selectors, `PMU_CTRL_0/1`
start, CTRL restore), publishes per-core `PmuBuffer` and PMU MMIO base
into each `Handshake`, and on each task FIN copies AICore's
`dual_issue_slots[]` snapshot into `PmuBuffer::records[]` while
filling `func_id` / `core_type`. AICPU also stamps each core's
`PmuBufferState::owning_thread_id` (the AICPU scheduler thread that
drives that core) so host can emit a per-record `thread_id` column.
filling `func_id` / `core_type`. When a buffer is full, AICPU
switches to a new buffer via the SPSC free queue / ready queue
protocol (identical to a2a3). At shutdown, AICPU flushes any
partially-filled buffers via `pmu_aicpu_flush_buffers()`.
- **AICore** gates the counting window around the kernel body via CTRL
SPR bit 0, reads the 10 PMU counters + `PMU_CNT_TOTAL` via the
`ld_dev` MMIO load intrinsic after each task, and writes the snapshot
@@ -50,57 +50,65 @@ logical id encodes more than 32 bits (e.g. PTO2's
values differ — slot match must use the register token, otherwise the
slot will never validate and every commit is silently dropped.

### Streaming Buffer Architecture (mirrors a2a3)

The a5 PMU collector uses the same SPSC streaming buffer architecture
as a2a3, with identical data structures and flow:

- **PmuFreeQueue**: SPSC queue per core. Host pushes recycled/new
PmuBuffers; AICPU pops when switching after a buffer fills.
- **PmuReadyQueue**: Per-thread ready queue in PmuDataHeader. AICPU
enqueues full buffers; host collector thread dequeues them.
- **PmuBufferState**: Per-core state tracking current active buffer,
sequence number, dropped/total record counts.
- **Buffer lifecycle**: Host pre-allocates `BUFFERS_PER_CORE` buffers
per core, pushes them into free queues. AICPU pops one at init.
When full, AICPU enqueues to ready queue and pops a fresh one.
Host collector drains ready queue, writes CSV, recycles buffers
back into free queues.
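
To make the ownership rules concrete, below is a minimal sketch of the
SPSC ring protocol the free/ready queues follow. It is illustrative
only: the struct name, field names, and capacity are assumptions, not
the actual `PmuFreeQueue` / `PmuReadyQueue` layouts, and on a5 the host
side only ever sees `head`/`tail` through `rtMemcpy` snapshots rather
than shared atomics (the atomics model the a2a3 shared-memory case).

```cpp
#include <atomic>
#include <cstdint>

// Single-producer / single-consumer ring of device buffer addresses.
// Exactly one side writes `tail` (producer) and exactly one side writes
// `head` (consumer), so no locks are required.
struct SpscQueue {
    static constexpr uint32_t kCapacity = 8;  // assumed, not the real depth
    uint64_t slots[kCapacity];                // device addresses of PmuBuffers
    std::atomic<uint32_t> head{0};            // consumer-owned index
    std::atomic<uint32_t> tail{0};            // producer-owned index

    bool push(uint64_t buf) {  // producer: host for free queues, AICPU for ready queues
        uint32_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == kCapacity) {
            return false;  // full
        }
        slots[t % kCapacity] = buf;
        tail.store(t + 1, std::memory_order_release);
        return true;
    }

    bool pop(uint64_t *buf) {  // consumer: AICPU for free queues, host for ready queues
        uint32_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire)) {
            return false;  // empty
        }
        *buf = slots[h % kCapacity];
        head.store(h + 1, std::memory_order_release);
        return true;
    }
};
```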

The only difference from a2a3 is the memory transport layer:

- **a2a3**: `halHostRegister` maps device memory into host address
space — host and device share the same physical memory.
- **a5**: No `halHostRegister` on DAV_3510. Host maintains separate
shadow buffers and synchronizes via `rtMemcpy` (onboard) or
`memcpy` (sim). The platform copy hooks
`pmu_platform_copy_to/from_device` abstract this.
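
A sketch of what the copy hooks could look like; the real signatures
live in the a5 platform layer and may differ, and the build flag below
is hypothetical (`rtMemcpy` / `RT_MEMCPY_DEVICE_TO_HOST` are the
standard runtime names assumed here):

```cpp
#include <cstddef>
#include <cstring>  // memcpy
// #include "runtime/rt.h"  // rtMemcpy, on the onboard build

static int pmu_platform_copy_from_device(void *host_dst, const void *dev_src, size_t bytes) {
#ifdef PMU_SIM_BUILD  // hypothetical build flag
    std::memcpy(host_dst, dev_src, bytes);  // sim: one address space
    return 0;
#else
    // onboard: device memory is not host-mapped, so an explicit DMA copy
    return rtMemcpy(host_dst, bytes, dev_src, bytes, RT_MEMCPY_DEVICE_TO_HOST);
#endif
}
```

`pmu_platform_copy_to_device` is the mirror image with
`RT_MEMCPY_HOST_TO_DEVICE`.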

### Device Memory Layout

```text
[ PmuSetupHeader ] ← num_cores, event_type, buffer_ptrs[N]
[ PmuBufferState[num_cores] ] ← owning_thread_id, dropped_record_count,
total_record_count (per core)
[ PmuDataHeader ] ← ready queues, queue_heads/tails,
num_cores, event_type
[ PmuBufferState[num_cores] ] ← free_queue, current_buf_ptr,
dropped/total counts, owning_thread_id
```

This single shared region (`calc_pmu_setup_size`) is allocated once at
init and pulled back via one `rtMemcpy` at finalize. The per-core
`PmuBuffer`s themselves are separate device allocations (one per core),
copied back individually on demand because their `count`-sized payloads
vary.

### Device → Host Transfer

halHostRegister is not supported on DAV_3510, so the PMU collector uses
the same two-step rtMemcpy pattern already used by the performance and
tensor-dump collectors:

1. At init: host allocates the `[PmuSetupHeader][PmuBufferState[]]`
region plus one `PmuBuffer` per core. The setup region's device
address is published into `kernel_args.pmu_data_base`. AICPU then
publishes each core's `PmuBuffer` address and PMU MMIO base into
the matching `Handshake` (`pmu_buffer_addr`, `pmu_reg_base`) so
AICore can do its own MMIO read — parallels `perf_records_addr`.
2. During execution: AICore, after each kernel completes, reads the
10 PMU counters via `ld_dev(base, offset)` and writes them into
`PmuBuffer::dual_issue_slots[reg_task_id & 1]`. AICPU, on observing
COND FIN, validates that slot's `task_id` against the register
token (`pmu_aicpu_complete_record`), copies register state into
`PmuBuffer::records[count]`, fills `func_id` / `core_type`, stamps
`PmuBufferState::owning_thread_id`, and advances `count`.
3. After stream sync: host pulls back the entire setup region (all
`PmuBufferState`s in one shot) plus each core's `PmuBuffer` payload
(header first to learn `count`, then `count * sizeof(PmuRecord)`
records). The host writes a CSV under `outputs/` and emits a
cross-check log line:

```text
PMU collector: record counts match (collected=N, dropped=K, device_total=N+K)
```

If `collected + dropped != device_total`, the difference is silent
slot-mismatch loss (AICore had not yet published the slot when
AICPU tried to commit) and the line becomes a `record count
mismatch (... diff=M silent slot-mismatch losses)` warning.

This design fits naturally into the existing a5 `PerformanceCollector` /
`TensorDumpCollector` lifecycle — `initialize / collect_all / export /
finalize` is called from `DeviceRunner::run`.
This single shared region (`calc_pmu_data_size`) is allocated once at
init. Host maintains a shadow copy and syncs via copy hooks.

Per-core `PmuBuffer`s are separate device allocations, each with a
paired host shadow buffer in `buf_pool_`.
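
As a rough illustration of the layout above, the region size could be
computed like this. All struct members and `kQueueCapacity` /
`kMaxSchedThreads` are assumptions standing in for the real
definitions:

```cpp
#include <cstddef>
#include <cstdint>

constexpr uint32_t kQueueCapacity = 8;  // assumed ring depth
constexpr int kMaxSchedThreads = 4;     // assumed thread bound

// Plain-data view of the SPSC ring from the sketch earlier; on the host
// side these fields arrive as a memcpy'd snapshot.
struct PmuReadyQueue {
    uint64_t slots[kQueueCapacity];
    uint32_t head;
    uint32_t tail;
};

struct PmuDataHeader {  // first box in the diagram
    uint32_t num_cores;
    uint32_t event_type;
    PmuReadyQueue ready_queues[kMaxSchedThreads];  // one per AICPU thread
};

struct PmuBufferState {  // second box, one entry per core
    PmuReadyQueue free_queue;  // host -> AICPU buffer recycling
    uint64_t current_buf_ptr;
    uint64_t dropped_record_count;
    uint64_t total_record_count;
    uint32_t owning_thread_id;
};

// One contiguous device allocation, sized once at init.
size_t calc_pmu_data_size(int num_cores) {
    return sizeof(PmuDataHeader) + size_t(num_cores) * sizeof(PmuBufferState);
}
```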

### Host Collector Thread

The host runs `poll_and_collect()` on a dedicated thread (launched by
`DeviceRunner::run()` before kernel launch, joined after stream sync).
This thread:

1. Polls ready queue tails via `pmu_platform_copy_from_device`
2. For each ready entry, copies the full `PmuBuffer` from device to
its host shadow
3. Writes records to CSV
4. Recycles the buffer back into the core's free queue (zeroes count,
updates free queue tail on device)
5. Exits when `signal_execution_complete()` is called

After the collector thread exits, `drain_remaining_buffers()` does a
final pass: syncs the entire shared memory region, drains any remaining
ready queue entries, and scans `current_buf_ptr` for partially-filled
buffers that AICPU flushed but couldn't enqueue.
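
Sketched below is how the polling loop could be organized, reusing the
illustrative types from the layout sketch above. Everything except the
documented copy hook and the `signal_execution_complete()` semantics is
an assumption:

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

// Forward declarations: the copy hook is documented above; the other
// two helpers are hypothetical stand-ins for CSV write + recycle logic.
int pmu_platform_copy_from_device(void *dst, const void *src, size_t bytes);
void drain_one_buffer(uint64_t dev_buf);                 // hypothetical
void publish_ready_head(int thread_idx, uint32_t head);  // hypothetical

struct CollectorCtx {  // hypothetical host-side state
    std::atomic<bool> execution_complete{false};  // set by signal_execution_complete()
    PmuDataHeader shadow_header{};                // host shadow of the device header
    uint64_t dev_header_addr = 0;
    int num_threads = 0;
};

void poll_and_collect(CollectorCtx &ctx) {
    while (!ctx.execution_complete.load(std::memory_order_acquire)) {
        // 1. Snapshot the shared header so new ready-queue tails are visible.
        pmu_platform_copy_from_device(&ctx.shadow_header,
                                      (const void *)ctx.dev_header_addr,
                                      sizeof(ctx.shadow_header));
        for (int t = 0; t < ctx.num_threads; ++t) {
            PmuReadyQueue &q = ctx.shadow_header.ready_queues[t];
            while (q.head != q.tail) {
                uint64_t dev_buf = q.slots[q.head % kQueueCapacity];
                // 2. Copy the full PmuBuffer into its host shadow,
                // 3. append its records to the CSV, and
                // 4. zero the count and push the buffer back into the owning
                //    core's free queue (free-queue tail written to device).
                drain_one_buffer(dev_buf);
                ++q.head;
            }
            publish_ready_head(t, q.head);  // write consumed head back to device
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
}
```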

## Usage

@@ -166,7 +177,7 @@ Columns (in order) — matches a2a3 host PMU CSV for tooling parity:

| Column | Meaning |
| ------ | ------- |
| `thread_id` | AICPU scheduler thread that drives this core (read from `PmuBufferState::owning_thread_id`) |
| `thread_id` | AICPU scheduler thread that drives this core |
| `core_id` | Logical AICore id in the runtime |
| `task_id` | Runtime task id, printed as hex |
| `func_id` | Kernel function id |
@@ -179,8 +190,9 @@ For the default `PIPE_UTILIZATION` event type (`2`), the counter columns
on a5 are (from pypto `tilefwk_pmu_to_csv.py` table_pmu_header_3510):

```text
pmu_idc_aic_vec_busy_o,cube_instr_busy,scalar_instr_busy,mte1_instr_busy,
mte2_instr_busy,mte3_instr_busy,icache_req,icache_miss,pmu_fix_instr_busy
pmu_idc_aic_vec_busy_o,cube_instr_busy,scalar_instr_busy,
mte1_instr_busy,mte2_instr_busy,mte3_instr_busy,
icache_req,icache_miss,pmu_fix_instr_busy
```

The number of counter columns varies by event type — each DAV_3510 event
@@ -196,12 +208,13 @@ discover which columns are present.
always 0. The AICPU still programs the PMU event selectors and the CSV
still carries one row per task with a zero counter tuple — useful for
verifying the end-to-end data flow but not for performance analysis.
- The per-core on-device `PmuBuffer` is a fixed size
(`PLATFORM_PMU_RECORDS_PER_BUFFER = 4096`). Tasks past that count are
dropped and accounted in `PmuBufferState::dropped_record_count`; the
host surfaces the total in the finalize log line. Increase the
constant in [platform_config.h](../platform/include/common/platform_config.h)
if your workload executes more tasks per core.
- The per-core on-device `PmuBuffer` capacity is controlled by
`PLATFORM_PMU_RECORDS_PER_BUFFER` (default 512). When full, AICPU
switches to a new buffer via the free queue. If no free buffer is
available, records are dropped. Increase `PLATFORM_PMU_BUFFERS_PER_CORE`
(default 4) in
[platform_config.h](../platform/include/common/platform_config.h)
if your workload produces bursts that exhaust the buffer pool.
- A non-zero `diff` in the host's `record count mismatch` warning means
AICPU attempted to commit `diff` records whose dual-issue slot still
carried an older `task_id`. Under the current design on DAV_3510
52 changes: 41 additions & 11 deletions src/a5/platform/include/aicpu/l2_perf_collector_aicpu.h
@@ -13,10 +13,7 @@
* @brief AICPU performance data collection interface
*
* Provides performance profiling management interface for AICPU side.
* Handles buffer initialization and per-record completion. In the memcpy-based
* collection design, Host pre-allocates one L2PerfBuffer per core and one
* PhaseBuffer per thread; AICPU writes directly into them until full, after
* which further records are silently dropped.
* Handles buffer initialization, switching, and flushing.
*/

#ifndef PLATFORM_AICPU_L2_PERF_COLLECTOR_AICPU_H_
@@ -56,12 +53,10 @@ void l2_perf_aicpu_init_profiling(Runtime *runtime);
* Complete a L2PerfRecord with AICPU-side metadata after AICore task completion
*
* Reads l2_perf_buf->count, validates task_id match against the latest record,
* and fills all AICPU-side fields. Returns -1 and silently drops the record
* when the buffer is full (count >= PLATFORM_PROF_BUFFER_SIZE). Callers must
* pre-extract fanout into a plain uint64_t array (platform layer cannot depend
* on runtime linked-list types).
* and fills all AICPU-side fields. Callers must pre-extract fanout into a
* plain uint64_t array (platform layer cannot depend on runtime linked-list types).
*
* @param l2_perf_buf L2PerfBuffer pointer (from handshake l2_perf_records_addr)
* @param expected_reg_task_id Register dispatch token (low 32 bits) to validate
* @param task_id Task identifier to write (PTO2 encoding or plain id)
* @param func_id Kernel function identifier
@@ -76,6 +71,30 @@
uint64_t dispatch_time, uint64_t finish_time, const uint64_t *fanout, int32_t fanout_count
);

/**
* Switch performance buffer when current buffer is full
*
* Enqueues the full buffer to ReadyQueue, pops a new buffer from FreeQueue,
* and updates the handshake l2_perf_records_addr. If FreeQueue is empty,
* overwrites the current buffer (lossy fallback).
*
* @param runtime Runtime instance pointer
* @param core_id Core ID
* @param thread_idx Thread index
*/
void l2_perf_aicpu_switch_buffer(Runtime *runtime, int core_id, int thread_idx);
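
/*
 * Sketch of the intended sequence (illustrative only; the helper names
 * are hypothetical and the real implementation lives in the .cpp):
 *
 *   L2PerfBuffer *full = current_buffer(core_id);
 *   ready_queue_enqueue(thread_idx, full);       // host collector drains it
 *   L2PerfBuffer *next;
 *   if (!free_queue_pop(core_id, &next))
 *       next = full;                             // lossy fallback: overwrite
 *   handshake_of(core_id)->l2_perf_records_addr = (uint64_t)next;
 */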

/**
* Flush remaining performance data
*
* Marks non-empty buffers as ready and enqueues them for host collection.
*
* @param thread_idx Thread index
* @param cur_thread_cores Array of core IDs managed by this thread
* @param core_num Number of cores managed by this thread
*/
void l2_perf_aicpu_flush_buffers(int thread_idx, const int *cur_thread_cores, int core_num);

/**
* Update total task count in performance header
*
Expand All @@ -92,15 +111,16 @@ void l2_perf_aicpu_update_total_tasks(uint32_t total_tasks);
* Sets up AicpuPhaseHeader and clears per-thread phase record buffers.
* Must be called once from thread 0 after l2_perf_aicpu_init_profiling().
*
* @param runtime Runtime instance pointer
* @param num_sched_threads Number of scheduler threads
*/
void l2_perf_aicpu_init_phase_profiling(int num_sched_threads);
void l2_perf_aicpu_init_phase_profiling(Runtime *runtime, int num_sched_threads);
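
/*
 * Example call order on thread 0 (assumed, per the note above):
 *
 *   l2_perf_aicpu_init_profiling(runtime);
 *   l2_perf_aicpu_init_phase_profiling(runtime, num_sched_threads);
 */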

/**
* Record a single scheduler phase
*
* Appends an AicpuPhaseRecord to the specified thread's buffer.
* Silently drops records when the buffer is full.
* When the buffer is full, switches to a new buffer via FreeQueue.
*
* @param thread_idx Scheduler thread index
* @param phase_id Phase identifier
@@ -163,4 +183,14 @@ void l2_perf_aicpu_record_orch_phase(
void l2_perf_aicpu_init_core_assignments(int total_cores);
void l2_perf_aicpu_write_core_assignments_for_thread(int thread_idx, const int *core_ids, int core_num);

/**
* Flush remaining phase records for a thread
*
* Marks the current WRITING phase buffer as READY and enqueues it
* for host collection. Called at thread exit (analogous to l2_perf_aicpu_flush_buffers).
*
* @param thread_idx Thread index (scheduler thread or orchestrator)
*/
void l2_perf_aicpu_flush_phase_buffers(int thread_idx);

#endif // PLATFORM_AICPU_L2_PERF_COLLECTOR_AICPU_H_
71 changes: 29 additions & 42 deletions src/a5/platform/include/aicpu/pmu_collector_aicpu.h
@@ -13,31 +13,21 @@
* @file pmu_collector_aicpu.h
* @brief AICPU-side PMU collection interface (a5)
*
* Split of duties:
* - AICPU owns init (event selectors, PMU_CTRL_0/1 start) and finalize
* (CTRL restore). It also publishes per-core pmu_buffer_addr /
* pmu_reg_base into Handshake at init time so AICore can do the
* MMIO read itself.
* - AICore reads PMU counters + PMU_CNT_TOTAL via MMIO after each task
* (pmu_aicore_record_task), writing into PmuBuffer::dual_issue_slots.
* - AICPU, on COND FIN, validates the slot and commits a full PmuRecord
* into PmuBuffer::records[] (pmu_aicpu_complete_record).
*
* Lifecycle (called from aicpu_executor.cpp):
* pmu_aicpu_init() — resolve per-core PMU MMIO bases + buffer
* pointers, program events, start counters,
* pop initial PmuBuffers from free_queues,
* publish (pmu_buffer_addr, pmu_reg_base)
* to each Handshake.
* [task loop]
*   pmu_aicpu_complete_record() — copy the dual-issue slot AICore wrote
* into PmuBuffer::records[count], filling
* func_id + core_type. Drops the record
* silently if the buffer is full.
* func_id + core_type. Switches buffer
* when full.
* pmu_aicpu_flush_buffers() — per-thread: flush each of this thread's
* non-empty PmuBuffers to the ready_queue
* (mirrors a2a3 pmu_aicpu_flush_buffers)
* pmu_aicpu_finalize() — per-thread: restore CTRL registers.
*
* a5 uses a single pre-allocated PmuBuffer per core; the host drains it via
* rtMemcpy after stream sync (see src/a5/platform/src/host/pmu_collector.cpp).
* There is no SPSC queue and no per-thread flush step.
*/

#ifndef PLATFORM_AICPU_PMU_COLLECTOR_AICPU_H_
@@ -62,6 +52,7 @@ extern "C" bool is_pmu_enabled();
* PMU reg-addr table.
* - Program event selectors (PMU_CNT0_IDX..CNT9_IDX).
* - Start counters (set PMU_CTRL_0 and PMU_CTRL_1).
* - Pop an initial PmuBuffer from the per-core free_queue.
* - Publish (pmu_buffer_addr, pmu_reg_base) into handshakes[i] so the
* matching AICore can read PMU MMIO and write the dual-issue slot.
*
@@ -73,45 +64,41 @@
* Must be called after the host has published pmu_data_base (via
* set_platform_pmu_base) and after every active core has reported its
* physical_core_id via handshake. Must be called BEFORE the caller
* sets aicpu_regs_ready=1 on each handshake, so AICore observes the
* new fields via the same release/acquire boundary.
* sets aicpu_regs_ready=1 on each handshake.
*
* @param handshakes Handshake array (one per core). This function
* writes pmu_buffer_addr and pmu_reg_base into
* handshakes[0..num_cores). Caller owns lifetime.
* @param physical_core_ids Array of hardware physical core ids, indexed by
* logical core_id. Caller owns the memory; this
* function does not retain the pointer.
* @param num_cores Number of active cores (logical core_id range is [0, num_cores))
* @param handshakes Handshake array (one per core)
* @param physical_core_ids Array of hardware physical core ids
* @param num_cores Number of active cores
*/
void pmu_aicpu_init(Handshake *handshakes, const uint32_t *physical_core_ids, int num_cores);

/**
* Commit one PmuRecord from the dual-issue staging slot that AICore wrote
* into PmuBuffer::dual_issue_slots[task_id & 1]. Copies register state
* (pmu_counters + pmu_total_cycles) and fills AICPU-owned metadata
* (task_id, func_id, core_type). When the buffer is full the record is
* dropped and the core's PmuBufferState::dropped_record_count is incremented.
* Every call bumps PmuBufferState::total_record_count so host can cross-check
* collected + dropped against the AICPU's attempted-commit count.
* No-op if PMU is not enabled or the core has no PMU buffer bound.
* Commit one PmuRecord from the dual-issue staging slot.
* Switches buffer via SPSC free_queue/ready_queue when full.
*
* @param core_id Logical core index
* @param thread_idx AICPU thread index (reserved; not used on a5 memcpy path)
* @param reg_task_id Register dispatch token (DATA_MAIN_BASE value). AICore
* wrote this 32-bit value into dual_issue_slots[...].task_id,
* so AICPU uses it to locate the slot and validate its
* freshness. Callers should pass the same register token
* they observed on COND / wrote via DATA_MAIN_BASE.
* @param task_id Full task_id to store in the PmuRecord (e.g. PTO2's
* (ring_id<<32)|local_id). May differ from reg_task_id.
* @param thread_idx AICPU thread index (selects ready_queue)
* @param reg_task_id Register dispatch token (slot match key)
* @param task_id Full task_id to store in the PmuRecord
* @param func_id kernel_id from the completed task slot
* @param core_type AIC or AIV
*/
void pmu_aicpu_complete_record(
int core_id, int thread_idx, uint32_t reg_task_id, uint64_t task_id, uint32_t func_id, CoreType core_type
);

/**
* Per-thread PMU buffer flush. Mirrors a2a3 pmu_aicpu_flush_buffers().
*
* For each core in cur_thread_cores, enqueue its non-empty PmuBuffer into the
* thread's ready_queue so the host collector can pick it up.
*
* @param thread_idx AICPU thread index (selects ready_queue)
* @param cur_thread_cores Array of logical core ids owned by this thread
* @param core_num Entries in cur_thread_cores
*/
void pmu_aicpu_flush_buffers(int thread_idx, const int *cur_thread_cores, int core_num);
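
/*
 * Assumed per-thread shutdown order (the finalize signature below is
 * abbreviated; see its declaration further down):
 *
 *   pmu_aicpu_flush_buffers(thread_idx, cur_thread_cores, core_num);
 *   pmu_aicpu_finalize(...);  // restore CTRL registers only after the flush
 */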

/**
* Per-thread PMU finalize: restore CTRL registers for this thread's cores.
*