Refactor: unify PMU/L2Perf/TensorDump collectors on shared profiling framework #705

Open
ChaoZheng109 wants to merge 1 commit into hw-native-sys:main from ChaoZheng109:a2a3/profiling

@ChaoZheng109
Collaborator

Summary

  • Introduce src/a2a3/platform/include/host/profiling_common/ with ProfilerBase<Derived, Module> (CRTP base that handles management and collector-thread orchestration) and BufferPoolManager (pre-registered device buffer pool with dev↔host pointer mapping).
  • Rewrite PmuCollector, L2PerfCollector, and TensorDumpCollector on top of the shared framework, collapsing three near-identical control flows into one and dropping ~2000 lines of duplicated .cpp code.
  • Reorganize profiling docs: move pmu-profiling.md from src/{a2a3,a5}/docs/ to top-level docs/, add profiling-framework.md and l2-swimlane-profiling.md, refresh tensor-dump.md, and update profiling-name-map.md / runtimes.md / testing.md to point at the new locations and the per-case output_prefix layout.
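
To illustrate the CRTP idea behind ProfilerBase<Derived, Module>, here is a minimal sketch of how a shared control flow can be written once in a base class and dispatched statically to a concrete collector. Everything in it is an assumption made for illustration: the names ProfilerBaseSketch, PmuCollectorSketch, has_pending, drain_once, and finalize do not come from this PR, the Module parameter is omitted, and the collector thread is replaced by a synchronous loop to keep the example deterministic.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical CRTP base: the drain loop is written once here and calls
// hooks on the derived collector without virtual dispatch.
template <typename Derived>
class ProfilerBaseSketch {
public:
    // Shared control flow: drain until the derived collector reports no
    // pending data, then let it finalize. Returns the record count.
    std::size_t run_to_completion() {
        std::size_t records = 0;
        while (self().has_pending()) {
            records += self().drain_once();
        }
        self().finalize();
        return records;
    }

protected:
    ProfilerBaseSketch() = default;

private:
    Derived& self() { return static_cast<Derived&>(*this); }
};

// Hypothetical concrete collector: it only supplies the three hooks.
class PmuCollectorSketch : public ProfilerBaseSketch<PmuCollectorSketch> {
public:
    explicit PmuCollectorSketch(std::vector<int> samples)
        : pending_(std::move(samples)) {}

    bool has_pending() const { return !pending_.empty(); }

    // Move one sample from the pending list to the collected list.
    std::size_t drain_once() {
        collected_.push_back(pending_.back());
        pending_.pop_back();
        return 1;
    }

    void finalize() { finalized_ = true; }

    bool finalized() const { return finalized_; }
    const std::vector<int>& collected() const { return collected_; }

private:
    std::vector<int> pending_;
    std::vector<int> collected_;
    bool finalized_ = false;
};
```

In this shape, a second or third collector would only need to supply its own hooks, which is one way three near-identical control flows could collapse into one.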

Testing

  • Simulation tests pass
  • Hardware tests pass

@gemini-code-assist (bot) left a comment

Code Review

This pull request unifies the host-side infrastructure for PMU, L2 Swimlane, and Tensor Dump profiling into a shared framework, significantly reducing code duplication and improving maintainability across the a2a3 and a5 architectures. Key enhancements include a new three-bucket counter accounting model for better loss diagnostics, improved memory management via a centralized buffer pool, and the addition of completion barriers to ensure data consistency in tensor dumps. Feedback from the review suggests optimizing memory barriers in the SPSC queue logic to avoid redundancy and improve performance on weak-ordering architectures, as well as increasing the frequency of progress updates during the final data export phase.
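
The barrier feedback can be pictured with a generic single-producer/single-consumer ring buffer. The sketch below is not the queue from this PR: it assumes a host-only queue built on std::atomic, and the name SpscQueueSketch is hypothetical. It shows the usual pattern in which one release store paired with one acquire load per index publishes a slot, so no additional standalone fences are required.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical SPSC ring buffer. One slot is kept empty to tell "full"
// from "empty", so the usable capacity is N - 1.
template <typename T, std::size_t N>
class SpscQueueSketch {
public:
    // Producer side only.
    bool push(const T& value) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % N;
        // Acquire pairs with the consumer's release store to head_.
        if (next == head_.load(std::memory_order_acquire)) {
            return false;  // full
        }
        slots_[tail] = value;
        // Release publishes the slot write together with the new tail.
        tail_.store(next, std::memory_order_release);
        return true;
    }

    // Consumer side only.
    bool pop(T& out) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        // Acquire pairs with the producer's release store to tail_.
        if (head == tail_.load(std::memory_order_acquire)) {
            return false;  // empty
        }
        out = slots_[head];
        // Release tells the producer the slot may be reused.
        head_.store((head + 1) % N, std::memory_order_release);
        return true;
    }

private:
    std::array<T, N> slots_{};
    std::atomic<std::size_t> head_{0};
    std::atomic<std::size_t> tail_{0};
};
```

On weakly ordered CPUs each explicit full fence has a real cost, so relying on the acquire/release pair alone, rather than stacking a fence on top of it, is the kind of redundancy the review comment appears to target.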

Comment thread src/a2a3/platform/include/host/profiling_common/profiler_base.h Outdated
Comment thread src/a2a3/platform/include/host/profiling_common/profiler_base.h Outdated
Comment thread src/a2a3/platform/src/host/tensor_dump_collector.cpp Outdated
@ChaoZheng109 force-pushed the a2a3/profiling branch 3 times, most recently from 9db7c83 to 67fee48 on April 30, 2026 03:43
…framework

Introduce src/a2a3/platform/include/host/profiling_common/ with
ProfilerBase<Derived, Module> (CRTP-based mgmt + collector thread
orchestration) and BufferPoolManager (pre-registered device buffer pool,
dev<->host pointer mapping). Rewrite PmuCollector, L2PerfCollector, and
TensorDumpCollector on top of it, collapsing three near-identical
control flows into one and shedding ~2000 lines of duplication across
the .cpp files.
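
As a rough picture of what a pre-registered buffer pool with dev<->host pointer mapping might involve, the sketch below keeps a table of registered buffers and translates a device address, including an address inside a buffer, to the matching host pointer. It is an assumption-laden illustration: BufferPoolSketch, register_buffer, acquire, release, and to_host are hypothetical names, device addresses are modeled as plain integers, and nothing here reflects the actual BufferPoolManager interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical pool: buffers are registered up front, then handed out and
// returned by device address.
class BufferPoolSketch {
public:
    // Register one buffer: its device address, host shadow, and byte size.
    void register_buffer(std::uint64_t dev_addr, void* host_ptr,
                         std::size_t size) {
        by_dev_[dev_addr] = Entry{host_ptr, size};
        free_.push_back(dev_addr);
    }

    // Hand out a free buffer's device address, or 0 when none is left.
    std::uint64_t acquire() {
        if (free_.empty()) return 0;
        const std::uint64_t dev_addr = free_.back();
        free_.pop_back();
        return dev_addr;
    }

    // Return a buffer to the free list.
    void release(std::uint64_t dev_addr) { free_.push_back(dev_addr); }

    // Translate a device address to a host pointer. Returns nullptr when
    // the address is outside every registered buffer.
    void* to_host(std::uint64_t dev_addr) const {
        auto it = by_dev_.upper_bound(dev_addr);
        if (it == by_dev_.begin()) return nullptr;
        --it;  // last buffer whose base address is <= dev_addr
        const std::uint64_t offset = dev_addr - it->first;
        if (offset >= it->second.size) return nullptr;
        return static_cast<char*>(it->second.host_ptr) + offset;
    }

private:
    struct Entry {
        void* host_ptr;
        std::size_t size;
    };
    std::map<std::uint64_t, Entry> by_dev_;  // keyed by device base address
    std::vector<std::uint64_t> free_;        // device addresses available
};
```

Keeping the table ordered by device base address makes the interior-pointer lookup a single upper_bound call, which is one plausible reason to centralize the mapping instead of duplicating it in each collector.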

Reorganize profiling docs to match the now-shared framework: move
pmu-profiling.md out of src/{a2a3,a5}/docs/ to top-level docs/, add
profiling-framework.md and l2-swimlane-profiling.md, refresh
tensor-dump.md, and update profiling-name-map.md / runtimes.md /
testing.md to point at the new locations and the per-case
output_prefix layout.