Skip to content

[ADR] VF fusion PIPE_V barrier coarsening design#605

Open
TaoTao-real wants to merge 1 commit intohw-native-sys:mainfrom
TaoTao-real:codex/adr-vf-pipev-barrier
Open

[ADR] VF fusion PIPE_V barrier coarsening design#605
TaoTao-real wants to merge 1 commit intohw-native-sys:mainfrom
TaoTao-real:codex/adr-vf-pipev-barrier

Conversation

@TaoTao-real
Copy link
Copy Markdown
Contributor

Summary

Add an architecture-level ADR for reducing redundant pipe_barrier(PIPE_V) insertion in VF-fusible vector chains.

Why

Current auto sync tends to insert PIPE_V barriers conservatively for same-pipe dependencies, which can over-serialize vector chains (e.g. FA/softmax style sequences) and leave performance on the table.

What is included

  • New design doc:
    • docs/designs/ptoas-vf-fusion-pipev-barrier-optimization-adr.md
  • Proposed architecture:
    • VF segment modeling
    • risk-matrix based keep/drop decisions
    • unknown-safe conservative fallback
    • phased rollout and regression gates

Scope

  • Docs-only ADR PR
  • No code/behavior change in this PR

Follow-up

Implementation will be tracked in follow-up PRs with strict regression gates (issue428/454 + FA representative cases).

@reedhecre
Copy link
Copy Markdown

Codex Review

该评论由 review 机器人自动更新。

  • PR: [ADR] VF fusion PIPE_V barrier coarsening design #605 [ADR] VF fusion PIPE_V barrier coarsening design
  • Author: TaoTao-real
  • Base/Head: main / codex/adr-vf-pipev-barrier
  • Head SHA: 0a02c82cc99a
  • Trigger: 检测到新的 open PR
  • Generated At: 2026-04-29T09:40:54Z
  • Status: completed

Summary

The ADR's pairwise coarsening model can remove required same-pipe barriers, and its validation criteria are not aligned with the existing A5 PIPE_V barrier contract.

Findings

  1. P1 Adjacent-pair VF segmentation can delete barriers that protect non-adjacent hazards docs/designs/ptoas-vf-fusion-pipev-barrier-optimization-adr.md:112

The proposal reasons about PIPE_V barriers only through adjacent A -> B pairs and segment boundaries. Current InsertSync is not local in that way: InsertSeqSync scans all earlier ops for each now op, and a same-pipe barrier before now records the original producer (lib/PTO/Transforms/InsertSync/InsertSyncAnalysis.cpp, around lines 153-176 and 350-356). A barrier before C can therefore represent an A -> C hazard even when the immediate predecessor B is unrelated. In a sequence like A: write ub0; B: write ub1; C: read ub0, both adjacent pairs look safe, but the barrier before C is still required. A segmenter that only decides on adjacent pairs would absorb C into the same segment and drop a required barrier, producing wrong results on same-pipe RAW/WAW chains. This also contradicts the current same-pipe contract exercised by test/samples/Sync/test_inject_sync_intra_pipe_barrier.pto.

  1. P2 The acceptance criteria ignore that A5 already strips `PIPE_V` intra-pipe barriers docs/designs/ptoas-vf-fusion-pipev-barrier-optimization-adr.md:183

The ADR treats dense PIPE_V barriers as a general runtime problem and uses a drop in pipeVBarrierAfter as a success criterion. That is not true for A5 today: SyncCodegen::CreateBarrierOp already returns early for PIPE_V barriers on A5, and LoweringSyncToPipe erases any remaining PIPE_V barriers there. So InsertSync-side coarsening does not change emitted A5 kernels, while on A2/A3 the same barrier is still the correctness mechanism for same-pipe hazards. Without scoping the target architecture and whether the metric is counted in syncIR versus generated code/runtime, this ADR can appear successful without affecting the kernels it is trying to speed up, or encourage applying an A5-specific assumption to architectures where the barrier is still required.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request proposes an ADR to optimize PIPE_V barriers in VF fusion scenarios, reducing redundant synchronization in vector chains like FA and softmax through segment-level modeling and risk-based operator classification. Reviewer feedback suggests enhancing the design by addressing interactions with existing Gather/Scatter logic, providing hardware constraints for stateful operations, specifying analysis tools for alias detection, and defining heuristics for the proposed optimization levels.

集成策略:

1. `InsertSyncAnalysis` 仍先建全量候选同步边(保持正确性基线)。
2. 在 `RemoveRedundantSync` 之前增加“VF 段精简”步骤:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The existing PTOInsertSync pass (line 127) disables redundancy elimination for kernels with Gather/Scatter operations due to correctness risks. Since the ADR proposes a new optimization step for PIPE_V barriers, it should explicitly address whether this new analysis will also be gated by hasGatherScatterLikeOps or how it intends to safely handle these high-risk operations where the current logic defaults to conservative behavior.


问题不是“完全移除 `PIPE_V` barrier”,而是区分:

1. 必要 barrier:防止真实读写冲突、跨阶段可见性问题、特殊算子内部临时缓冲风险。
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The ADR mentions "cross-stage visibility issues" as a reason for necessary barriers. Defining the specific hardware constraints (e.g., internal buffer synchronization or pipeline hazards) would help in creating a more precise VF_STATEFUL_OR_UNKNOWN classification.

1. `A/B` 在同一控制域、同一 VF 候选段内。
2. 分类组合在“可连续执行矩阵”中允许。
3. 依赖为 SSA 前向可见,不要求额外内存可见性屏障。
4. alias/slice 可证明无额外冲突。
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Clarify whether the "alias/slice" proof will leverage the existing MemoryDependentAnalyzer or if a specialized analysis for vector instruction offsets and strides is required to detect intra-buffer overlaps.

建议新增 pass 选项(先默认关闭,灰度):

1. `--enable-vf-barrier-coarsening`
2. `--vf-barrier-coarsening-level=[safe|balanced|aggressive]`
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The ADR should define the heuristics for balanced and aggressive levels. For example, will aggressive allow merging segments across certain control flow boundaries or relax the safety checks for REDUCTION operations?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants