Refactor: unify PMU/L2Perf/TensorDump collectors on shared profiling framework #705
ChaoZheng109 wants to merge 1 commit into hw-native-sys:main from
Code Review
This pull request unifies the host-side infrastructure for PMU, L2 Swimlane, and Tensor Dump profiling into a shared framework, significantly reducing code duplication and improving maintainability across the a2a3 and a5 architectures. Key enhancements include a new three-bucket counter accounting model for better loss diagnostics, improved memory management via a centralized buffer pool, and the addition of completion barriers to ensure data consistency in tensor dumps. Feedback from the review suggests optimizing memory barriers in the SPSC queue logic to avoid redundancy and improve performance on weak-ordering architectures, as well as increasing the frequency of progress updates during the final data export phase.
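The barrier feedback can be illustrated with a generic single-producer/single-consumer ring. This is a minimal sketch, not the queue from this PR: the class name `SpscRing` and its methods are hypothetical, and it assumes both ends are host threads using `std::atomic`. A queue shared with the device may need platform-specific barriers instead. It shows that one release store paired with one acquire load per direction already orders the slot write before the index update, so extra standalone fences around them would be redundant.

```cpp
#include <atomic>
#include <cstddef>
#include <optional>

// Hypothetical SPSC ring for illustration only; not the PR's actual queue.
template <typename T, std::size_t Capacity>
class SpscRing {
    static_assert((Capacity & (Capacity - 1)) == 0, "Capacity must be a power of two");

public:
    // Called by the single producer thread only.
    bool try_push(const T& value) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);  // producer owns tail_
        const std::size_t head = head_.load(std::memory_order_acquire);  // see the consumer's frees
        if (tail - head == Capacity) {
            return false;  // full
        }
        slots_[tail & (Capacity - 1)] = value;
        // Release: the slot write above becomes visible before the new tail does.
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }

    // Called by the single consumer thread only.
    std::optional<T> try_pop() {
        const std::size_t head = head_.load(std::memory_order_relaxed);  // consumer owns head_
        const std::size_t tail = tail_.load(std::memory_order_acquire);  // pairs with the release store
        if (head == tail) {
            return std::nullopt;  // empty
        }
        T value = slots_[head & (Capacity - 1)];
        // Release: the slot read above completes before the slot is handed back.
        head_.store(head + 1, std::memory_order_release);
        return value;
    }

private:
    T slots_[Capacity]{};
    std::atomic<std::size_t> head_{0};
    std::atomic<std::size_t> tail_{0};
};
```

On strongly ordered hosts the acquire and release operations are typically plain loads and stores, while on weakly ordered ones they compile to the minimum required barrier, which is the performance point the review raises.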
9db7c83 to 67fee48
…framework
Introduce src/a2a3/platform/include/host/profiling_common/ with
ProfilerBase<Derived, Module> (CRTP-based mgmt + collector thread
orchestration) and BufferPoolManager (pre-registered device buffer pool,
dev<->host pointer mapping). Rewrite PmuCollector, L2PerfCollector, and
TensorDumpCollector on top of it, collapsing three near-identical
control flows into one and shedding ~2000 lines of duplication across
the .cpp files.
Reorganize profiling docs to match the now-shared framework: move
pmu-profiling.md out of src/{a2a3,a5}/docs/ to top-level docs/, add
profiling-framework.md and l2-swimlane-profiling.md, refresh
tensor-dump.md, and update profiling-name-map.md / runtimes.md /
testing.md to point at the new locations and the per-case
output_prefix layout.
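
The CRTP arrangement described in the commit message can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the PR's `ProfilerBase<Derived, Module>`: the names `CollectorBase`, `drain_once`, and `CountingCollector` are hypothetical, the second template parameter and the management thread are omitted, and the base only owns one collector thread. The idea it shows is that the base holds the shared start/stop control flow while each collector supplies only its own drain step, resolved at compile time without virtual calls.

```cpp
#include <atomic>
#include <thread>

// Hypothetical CRTP base: owns the thread lifecycle shared by all collectors.
// Callers must invoke stop() before the derived object is destroyed.
template <typename Derived>
class CollectorBase {
public:
    void start() {
        running_.store(true, std::memory_order_release);
        worker_ = std::thread([this] { run(); });
    }

    void stop() {
        running_.store(false, std::memory_order_release);
        if (worker_.joinable()) {
            worker_.join();
        }
        // Final drain after the worker has exited, so nothing is left behind.
        self().drain_once();
    }

private:
    Derived& self() { return static_cast<Derived&>(*this); }

    void run() {
        while (running_.load(std::memory_order_acquire)) {
            self().drain_once();  // collector-specific work, statically dispatched
            std::this_thread::yield();
        }
    }

    std::atomic<bool> running_{false};
    std::thread worker_;
};

// Hypothetical collector: only the drain step differs between collectors.
struct CountingCollector : CollectorBase<CountingCollector> {
    std::atomic<int> drains{0};
    void drain_once() { drains.fetch_add(1, std::memory_order_relaxed); }
};
```

A caller would construct `CountingCollector`, call `start()`, run the workload, then call `stop()`. A second collector would reuse the same base and define a different `drain_once()`.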
67fee48 to 35ef482
Summary
- Introduce src/a2a3/platform/include/host/profiling_common/ with ProfilerBase<Derived, Module> (CRTP-based mgmt + collector thread orchestration) and BufferPoolManager (pre-registered device buffer pool, dev↔host pointer mapping).
- Rewrite PmuCollector, L2PerfCollector, and TensorDumpCollector on top of the shared framework, collapsing three near-identical control flows into one and dropping ~2000 lines of duplicated .cpp code.
- Reorganize profiling docs: move pmu-profiling.md from src/{a2a3,a5}/docs/ to top-level docs/, add profiling-framework.md and l2-swimlane-profiling.md, refresh tensor-dump.md, and update profiling-name-map.md / runtimes.md / testing.md to point at the new locations and the per-case output_prefix layout.
Testing