Skip to content

feat: [ENG-2485] defer summary cascade to dream#579

Open
RyanNg1403 wants to merge 3 commits intoproj/curation-enhancementfrom
feat/ENG-2485
Open

feat: [ENG-2485] defer summary cascade to dream#579
RyanNg1403 wants to merge 3 commits intoproj/curation-enhancementfrom
feat/ENG-2485

Conversation

@RyanNg1403
Copy link
Copy Markdown
Collaborator

Summary

  • Problem: Every brv curate runs propagateStaleness inline at the end of the hot path, making 1–3 LLM calls per curate to regenerate parent _index.md summaries up the directory tree.
  • Why it matters: Across multi-curate sessions, this is ~5–15% of the per-curate token bill — pure latency/cost on the user-facing path. The work is purely cosmetic for search (parent summaries lag by up to 12h is acceptable; matched leaf content is unaffected).
  • What changed: Curate now enqueues changed paths to a stale-summary queue persisted in .brv/dream-state.json. The next brv dream cycle drains the queue, unions with its own snapshot diff (A ∪ B), and runs propagateStaleness once per unique path. On propagation failure, the catch re-enqueues the drained snapshot so atomic-drain doesn't lose work.
  • What did NOT change (scope boundary): _index.md files still exist (frontmatter + body still BM25-indexed); propagateStaleness semantics unchanged; manifest rebuild stays inline (pure filesystem, no LLM); dream's 12h cadence unchanged; in-curate writes to leaf topic files unchanged.

Type of change

  • New feature

Scope (select all touched areas)

  • Server / Daemon

Linked issues

  • Closes ENG-2485
  • Related ENG-932 (token-cost benchmark methodology), ENG-2100 (parentTaskId threading — preserved through the new runStaleSummaryPropagation seam)

Root cause (bug fixes only, otherwise write N/A)

  • Root cause: N/A
  • Why this was not caught earlier: N/A

Test plan

  • Coverage added:
    • Unit test
    • Integration test
    • Manual verification (real-LLM E2E)
  • Test file(s):
    • test/unit/infra/dream/dream-state-schema.test.ts (+54)
    • test/unit/infra/dream/dream-state-service.test.ts (+159, new tests for enqueueStaleSummaryPaths, drainStaleSummaryPaths, dedup, concurrency)
    • test/unit/infra/executor/dream-executor.test.ts (+126, drain + re-enqueue + A∪B merge)
    • test/unit/infra/executor/curate-executor.test.ts (replaced obsolete inline-propagateStaleness ENG-2100 test with ENG-2485 deferral test)
  • Key scenarios covered:
    • Dedup: same path enqueued twice → one entry, oldest enqueuedAt preserved
    • Concurrency: Promise.all over 3 enqueues → 3 unique entries
    • Atomic drain: snapshot-and-clear semantics
    • Re-enqueue on propagation failure (integration test with real FileContextTreeSnapshotService)
    • A ∪ B merge in step 5: paths from queue ∪ paths from dream's own diff

User-visible changes

None directly. Side effect: parent _index.md summaries may lag by up to 12h after a curate (until the next dream). BM25 search and leaf-file lookups are unaffected.

Evidence

  • Full unit suite: 6981 passing, 16 pending, 0 failing
  • node validate.mjs (BM25 invariants under stale _index.md): 15/15 passed across 4 staleness scenarios
  • Real-LLM E2E on ByteRover paid-tier (gemini-3-flash-preview) — 8 curates + 2 dreams covering:
    • Multi-curate accumulation: 6 paths queued across 3 distinct curates
    • Dedup live: same JWT topic re-curated, queue stayed at 3 paths with original enqueuedAt
    • Drain: brv dream --force cleared 6-path queue, regenerated 5 _index.md files at all depths
    • Curate-side fail-open: chmod 000 dream-state.json → curate exit 0
    • Dream-side re-enqueue: pre-created _manifest.json as a directory → buildManifest threw EISDIR → drained snapshot re-enqueued with new enqueuedAt
  • AC5 timing comparison (BEFORE proj/curation-enhancement vs AFTER feat/ENG-2485):
    • 6 inline _index.md LLM regenerations → 0
    • Per-curate latency: ~10s saved on hot-daemon curates (~30%)

Move the LLM-bound parent-summary cascade off curate's hot path. Curate
now enqueues changed paths to a stale-summary queue persisted in
.brv/dream-state.json; the next dream cycle drains the queue, merges
with its own snapshot diff (A ∪ B), and runs propagateStaleness once
per dirty directory.

- DreamState schema gains staleSummaryPaths[]; new service methods
  enqueueStaleSummaryPaths (RMW with path-level dedup, oldest enqueuedAt
  preserved) and drainStaleSummaryPaths (atomic snapshot-and-clear).
- curate-executor swaps the inline propagateStaleness call for an
  enqueueStaleSummaryPaths call. Manifest rebuild stays inline (pure
  filesystem, no LLM).
- dream-executor's step 5 drains the queue, unions with the diff-set,
  and runs propagateStaleness once per unique path. On propagation
  failure the catch re-enqueues the drained snapshot so atomic drain
  doesn't lose work.
- runStaleSummaryPropagation extracted as a protected seam; preserves
  the ENG-2100 parentTaskId threading so summary regenerations still
  share one billing session with the parent dream task.

Validation:
  - validate.mjs: 15/15 BM25 invariants hold under stale _index.md
  - Real-LLM E2E: 8 curates / 2 dreams on byterover paid-tier
    - dedup, multi-curate accumulation, drain, re-enqueue on failure
  - Full unit suite: 6981 passing
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 29, 2026

Claude finished @RyanNg1403's task in 2m 33s —— View job


PR Review: feat: [ENG-2485] defer summary cascade to dream

Todo

  • Gather context and understand the PR
  • Review dream-state-schema.ts changes
  • Review dream-state-service.ts changes
  • Review curate-executor.ts changes
  • Review dream-executor.ts changes
  • Review test files
  • Post comprehensive review with score

Overall Assessment: ⭐⭐⭐⭐ 4/5

Solid implementation. The atomic-drain pattern, A ∪ B merge, and re-enqueue-on-failure are all correctly reasoned. The test coverage is excellent — the concurrent enqueue tests with a shared file-backed mutex, the post-drain same-path race repro, and the propagation-failure re-enqueue test all pin the correctness invariants precisely. Four issues noted below, none blocking.


Issues Found

1. Bug: within-batch duplicates not deduplicated in enqueueStaleSummaryPaths

file: src/server/infra/dream/dream-state-service.tsenqueueStaleSummaryPaths

bug: The dedup set is built from state.staleSummaryPaths (already-queued paths), but if paths itself contains duplicates (e.g. ['a.md', 'a.md']), both entries pass the filter and both are appended.

const existing = new Set(state.staleSummaryPaths.map((e) => e.path))
const additions = paths
  .filter((p) => !existing.has(p))  // only checks against queue, not within `paths`
  .map((p) => ({enqueuedAt, path: p}))

In practice diffStates() returns unique paths, so this is unlikely to fire. But the contract doc says "deduping by path" without scoping that to the cross-queue check — a future caller passing a non-unique array would silently create duplicates.

suggestion: Add new Set(paths) before the filter, or filter against a combined set:

const incoming = [...new Set(paths)]
const additions = incoming
  .filter((p) => !existing.has(p))
  .map((p) => ({enqueuedAt, path: p}))

Fix this →


2. Defect: orphaned JSDoc comment before runStaleSummaryPropagation

file: src/server/infra/executor/dream-executor.ts — lines 336–347

The diff removed the JSDoc from runOperations but accidentally placed it immediately before runStaleSummaryPropagation, resulting in two consecutive JSDoc blocks — neither annotating runOperations:

  /**
   * Runs the three dream operations sequentially, pushing results into `out` after   <-- belongs to runOperations above
   * each step. ...
   */
  /**
   * Regenerate parent `_index.md` files for the given paths and rebuild the          <-- belongs here
   * manifest. ...
   */
  protected async runStaleSummaryPropagation(...)

runOperations at line 282 now has no JSDoc. TypeScript tooling attaches only the immediately preceding comment to the method, so the old runOperations doc is orphaned and invisible to IDE hover.

suggestion: Move the "Runs the three dream operations sequentially..." JSDoc back to just above runOperations, or remove it (the method signature is self-explanatory).

Fix this →


3. Suggestion: manifest rebuild coupled to enqueue in the same try block

file: src/server/infra/executor/curate-executor.ts — lines 144–157

The comment says the manifest rebuild is intentionally inline ("no LLM, keeps newly-curated leaf files immediately discoverable"), yet it shares a try-catch with enqueueStaleSummaryPaths:

try {
  const postState = await snapshotService.getCurrentState(baseDir)
  const changedPaths = diffStates(preState, postState)
  if (changedPaths.length > 0) {
    await dreamStateService.enqueueStaleSummaryPaths(changedPaths)  // if this throws...
    const manifestService = new FileContextTreeManifestService({baseDirectory: baseDir})
    await manifestService.buildManifest(baseDir)                    // ...this is skipped
  }
} catch {
  // Fail-open
}

A transient disk error on the dream-state write would prevent the manifest from being rebuilt, even though buildManifest is a pure FS scan with no dependency on the queue. The stated goal ("keeps newly-curated leaf files immediately discoverable") would silently not be met.

suggestion: (non-blocking) Separate the two try-catch blocks or move buildManifest to its own fail-open block. Not a correctness bug (both fail-open in the outer catch), but the current structure contradicts the comment's intent.


4. Nitpick: StaleSummaryEntrySchema accepts empty-string paths

file: src/server/infra/dream/dream-state-schema.ts

path: z.string() permits "". A z.string().min(1) would make the schema self-documenting and prevent garbage entries from a bug at the call site.


Test coverage observations

The test suite is excellent overall. One gap worth noting:

  • No test for within-batch duplicate paths in enqueueStaleSummaryPaths (the bug in issue 1). A test like enqueueStaleSummaryPaths(['a.md', 'a.md']) → length 1 would pin the intended contract and would currently fail, revealing the bug.

  • The setTimeout(5) in the dedup timestamp test (dedups entries by path) is deterministic enough for CI but worth a comment explaining why 5ms is sufficient — the value 5 isn't obviously load-safe on a busy CI runner. Consider using a monotonic clock or Date.now() + 1 mock instead.


What's done well

  • Atomic drain design (drainStaleSummaryPaths clear-in-RMW) is the right call — the comment explaining why "snapshot + clear-later" fails the same-path race is clear and directly led to the test that reproduces it.
  • Re-enqueue on propagation failure is correctly scoped to drainedSnapshot only (not changedPaths, which dream can recompute via log scan).
  • Module-level mutex registry comment explains the growth-is-bounded property so future readers don't incorrectly add cleanup.
  • EMPTY_DREAM_STATE spread in read() error path (removing the explicit pendingMerges: []) is a clean simplification — the constant already carries all defaults.
  • parentTaskId threading through runStaleSummaryPropagation is preserved correctly per ENG-2100.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0cf9ae52b6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/server/infra/executor/curate-executor.ts
- Bypass dream activity gate when the stale-summary queue has work
  (codex P1): low-activity projects with 1-2 curates would otherwise
  strand queued paths indefinitely. dream-trigger.ts now treats a
  populated queue as enough reason to drain.
- Dedup within-batch input in enqueueStaleSummaryPaths: a caller passing
  a non-unique array no longer produces duplicate queue entries.
- Split the curate-executor post-processing try block: snapshot diff,
  enqueue, and manifest rebuild now have independent fail-open scopes
  so a transient queue write error cannot skip the (no-LLM) manifest
  scan that keeps newly-curated leaves immediately discoverable.
- Restore the runOperations JSDoc that the cherry-pick orphaned in
  front of runStaleSummaryPropagation.
- Tighten StaleSummaryEntrySchema to z.string().min(1) so empty paths
  cannot persist into dream-state.json.

Tests:
  - dream-trigger: bypass-when-queue-populated + still-block-when-empty
  - dream-state-service: within-batch duplicate dedup
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 29, 2026

Claude finished @RyanNg1403's task in 3m 59s —— View job


PR Review: feat: [ENG-2485] defer summary cascade to dream

Todo

  • Gather context and understand the PR
  • Review dream-state-schema.ts changes
  • Review dream-state-service.ts changes
  • Review dream-trigger.ts changes
  • Review curate-executor.ts changes
  • Review dream-executor.ts changes
  • Review test files
  • Post comprehensive review with score

Overall Assessment: ⭐⭐⭐⭐⭐ 5/5

The refactor commit (0e5e1b3) cleanly resolves all four issues from the first review and the Codex P1 gate-2 concern. The implementation is correct, the error handling is principled, and the tests are thorough.


What the Refactor Commit Fixed

Issue Status
Within-batch duplicates in enqueueStaleSummaryPaths ✅ Fixed — [...new Set(paths)] dedup on line 87
Orphaned JSDoc before runStaleSummaryPropagation ✅ Fixed — runOperations JSDoc restored at line 282
Manifest rebuild coupled with enqueue in same try ✅ Fixed — two independent fail-open blocks in curate-executor.ts:162–175
StaleSummaryEntrySchema accepting empty paths ✅ Fixed — z.string().min(1)
Codex P1: Gate 2 blocks low-activity queue drain ✅ Fixed — dream-trigger.ts:87 bypasses activity gate when staleSummaryPaths.length > 0

Remaining Observations (all non-blocking)

1. Misleading comment in Step 1 snapshot-failure catch (dream-executor.ts:132–137)

The phrase "atomic drain already removed entries before any propagation could re-enqueue them" reads as if drain has already run — but the entire point of the if (preState) guard is that drain is never called here. The intent is the opposite: skipping drain is safer than "drain-then-fail" because a drain that ran but couldn't re-enqueue on failure would silently lose work. The inline comment I posted suggests clearer wording.

2. Re-enqueue on propagation failure resets enqueuedAt (dream-executor.ts:187–188)

enqueueStaleSummaryPaths always stamps entries with Date.now(). When the catch block re-enqueues drainedSnapshot, the original wait-time timestamps are lost. No consumer reads enqueuedAt today so this is cosmetic — but a future latency dashboard would see re-tried paths as "freshly enqueued." The inline comment flags this trade-off.

3. checkEligibility missing the queue-bypass test coverage

tryStartDream has a dedicated test for the stale-summary queue bypass (line 82–96 in dream-trigger.test.ts), but checkEligibility does not — even though both methods delegate to checkGates1to3. A future refactor of the shared path would only be caught by the tryStartDream test. Suggested test in the inline comment above.


What's Done Well (still holds)

  • Atomic drain semantics are the correct call — the clear-upfront RMW means any concurrent curate enqueueing after the drain always appends a fresh entry that persists regardless of propagation outcome.
  • A ∪ B merge before propagateStaleness — dedup via new Set([...changedPaths, ...drainedSnapshot]) is clean and correct; the MergeTestExecutor test pins this invariant precisely.
  • Re-enqueue on failure is correctly scoped to drainedSnapshot only — changedPaths from dream's own diff are naturally recovered via findChangedFilesSinceLastDream in the next cycle.
  • Gate 2 bypass is correctly narrowed: curationsSinceDream < minCurations && staleSummaryPaths.length === 0 — the gate still fires when both activity AND queue are empty, so the guard doesn't open dream unnecessarily.
  • Test suite is excellent — the concurrent enqueue, post-drain same-path race, propagation-failure re-enqueue, and A ∪ B merge tests all pin the load-bearing invariants directly.

@RyanNg1403
Copy link
Copy Markdown
Collaborator Author

Pushed 0e5e1b3 addressing the Claude review-agent feedback:

Fixed:

  • Issue 1 (within-batch dedup in enqueueStaleSummaryPaths) — Input is now deduped via new Set(paths) before checking against the queue. Test added in dream-state-service.test.ts (dedups within-batch duplicates).
  • Issue 2 (orphaned JSDoc) — Restored the runOperations JSDoc to its method; runStaleSummaryPropagation keeps its own.
  • Issue 3 (manifest coupled to enqueue) — Split into independent fail-open try blocks. A queue write error no longer skips the (no-LLM) manifest scan.
  • Issue 4 (empty-path schema) — Tightened StaleSummaryEntrySchema.path to z.string().min(1).

Plus the codex P1 (dream-trigger Gate 2 bypass when queue has work), addressed in the inline thread.

Test coverage gap noted in the review (no test for within-batch duplicates) is now closed by the new dedup test.

// this dream cycle and accumulates until the next successful run.
// Drain-skip is preferable to drain-and-lose because atomic drain
// already removed entries before any propagation could re-enqueue them.
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: The phrase "atomic drain already removed entries before any propagation could re-enqueue them" reads as if drain has already run in this code path — but drain is intentionally skipped when preState is undefined. The intended meaning is closer to:

Leaving the queue intact is safer than draining-then-failing, because the atomic drain design clears entries immediately on drain; there would be no catch to re-enqueue them if we drained up here and then threw.

No behavior change needed, just a future-reader confusion risk.

Fix this →

// already removed them, so without this they would be lost.
if (drainedSnapshot.length > 0) {
await this.deps.dreamStateService.enqueueStaleSummaryPaths(drainedSnapshot).catch(() => {
// If the re-enqueue itself fails, there is nothing more to do here.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: enqueueStaleSummaryPaths stamps each entry with Date.now(), so original enqueuedAt timestamps from the drained snapshot are silently reset on retry. No consumer reads this field today, but a future latency dashboard would see re-tried paths as "freshly enqueued." Worth a comment here so a future telemetry author isn't surprised by the reset semantics.

Fix this →

// already removed them, so without this they would be lost.
if (drainedSnapshot.length > 0) {
await this.deps.dreamStateService.enqueueStaleSummaryPaths(drainedSnapshot).catch(() => {
// If the re-enqueue itself fails, there is nothing more to do here.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: enqueueStaleSummaryPaths stamps re-enqueued entries with Date.now(), so the original enqueuedAt timestamps from the drained snapshot are lost on retry. The telemetry property ("oldest waiting path") is reset as though the retry is a fresh enqueue.

No consumer reads enqueuedAt today, so this is purely cosmetic. But if you ever add latency dashboards, the first propagation failure silently resets the clock. Consider passing the original timestamps through a lower-level API — or just a comment that explains this trade-off so a future telemetry author isn't surprised.

Suggested change
// If the re-enqueue itself fails, there is nothing more to do here.
// Note: original enqueuedAt timestamps are not preserved here; re-enqueued
// paths receive Date.now() so latency telemetry restarts from this point.
await this.deps.dreamStateService.enqueueStaleSummaryPaths(drainedSnapshot).catch(() => {

- Reword the Step 1 snapshot-failure catch in dream-executor: the old
  comment claimed "atomic drain already removed entries", but drain
  doesn't run in this code path. Clarify that skipping drain here is
  safer than draining-and-failing because the atomic-drain RMW clears
  entries synchronously.
- Add a checkEligibility test for the stale-summary queue bypass so
  both tryStartDream and checkEligibility pin the invariant. They
  delegate to the same checkGates1to3 path; without this, a future
  refactor of the shared path could regress only one call site.
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 29, 2026

Claude finished @RyanNg1403's task in 1m 33s —— View job


PR Review: feat: [ENG-2485] defer summary cascade to dream

Todo

  • Gather context and understand the PR (3rd pass)
  • Review latest commit changes vs base branch
  • Check if second-pass observations were addressed
  • Review dream-executor.ts changes (misleading comment + enqueuedAt observation)
  • Review dream-trigger.ts / test coverage for checkEligibility
  • Post comprehensive review with score

Overall Assessment: ⭐⭐⭐⭐⭐ 5/5

The third commit (7838bf4) cleanly resolves two of the three second-pass observations. The one remaining observation is a pure cosmetic nitpick with no consumer impact today — acceptable to defer.


What the Third Commit Fixed

Observation Status
Misleading comment in Step 1 snapshot-failure catch ✅ Fixed — comment reworded correctly
checkEligibility missing queue-bypass test coverage ✅ Fixed — new test added in dream-trigger.test.ts
Re-enqueue resets enqueuedAt (no comment about timestamp loss) ⬜ Not addressed (cosmetic, no current consumer)

Review of the Fixes

Fix 1: Step 1 snapshot-failure catch comment (dream-executor.ts:132–137)

The rewrite is correct and meaningfully clearer. The old comment ("atomic drain already removed entries before any propagation could re-enqueue them") falsely implied drain had run. The new comment explains the actual invariant:

// Fail-open: leaving preState undefined skips the entire step 5 block
// (queue drain + propagation), so the stale-summary queue is left
// intact for the next successful dream cycle. Skipping drain here is
// safer than drain-then-fail: the atomic-drain design clears entries
// synchronously inside the RMW, so if we drained and then threw
// before reaching the catch's re-enqueue, the snapshot would be lost.

This is the correct reasoning — a future reader landing on this catch can now understand why the code is written this way, not just what happens.

Fix 2: checkEligibility queue-bypass test (dream-trigger.test.ts:299–313)

The symmetry comment is a good call:

// Symmetry with the tryStartDream bypass test — both methods delegate
// to checkGates1to3, so a future refactor of the shared path must keep
// this invariant on both call sites.

Pinning the contract on both call sites means a refactor that breaks the gate bypass would fail in two distinct test contexts rather than one, making the regression harder to miss.


Remaining Observation (non-blocking, no change needed)

Re-enqueue resets enqueuedAt (dream-executor.ts:188)

The inline suggestion from the second pass (adding a comment about timestamp loss on retry) was not applied. This is a reasonable call — enqueuedAt has no consumer today and the observation was purely cosmetic. The trade-off is documented in the PR comments; a future telemetry author can find it there. No change needed.


What's Done Well (cumulative across all three passes)

  • Atomic drain designdrainStaleSummaryPaths clear-in-RMW is the right pattern; the comment block at step 5 makes the invariant legible.
  • A ∪ B mergenew Set([...changedPaths, ...drainedSnapshot]) is correct and clean.
  • Re-enqueue on failure — correctly scoped to drainedSnapshot only; changedPaths is naturally recovered by findChangedFilesSinceLastDream in the next cycle.
  • Gate 2 bypass — narrowed correctly: staleSummaryPaths.length > 0 bypasses the activity minimum, but both conditions empty still blocks. The negative case test (should still fail activity gate when both curations AND queue are empty) pins this precisely.
  • Test suite — concurrent enqueue, post-drain same-path race, propagation-failure re-enqueue, A ∪ B merge, and now both tryStartDream and checkEligibility bypass tests all cover the load-bearing invariants.

This PR is ready to merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants