Skip to content

Add multi-node failure, lost transactions, and source-failover guidance to CNF doc.#394

Open
ibrarahmad wants to merge 2 commits intomainfrom
DOC-ACE
Open

Add multi-node failure, lost transactions, and source-failover guidance to CNF doc.#394
ibrarahmad wants to merge 2 commits intomainfrom
DOC-ACE

Conversation

@ibrarahmad
Copy link
Contributor

Add a new scenario for two-of-three-nodes-fail recovery via ZODAN rebuild, a section on the risk of unreplicated transactions due to asynchronous logical replication, and stronger troubleshooting steps when the source node fails mid-replay.

…ce to CNF doc.

Add a new scenario for two-of-three-nodes-fail recovery via ZODAN rebuild,
a section on the risk of unreplicated transactions due to asynchronous
logical replication, and stronger troubleshooting steps when the source
node fails mid-replay.
@ibrarahmad ibrarahmad requested a review from mason-sharp March 17, 2026 15:55
@coderabbitai
Copy link

coderabbitai bot commented Mar 17, 2026

📝 Walkthrough

Walkthrough

Adds a new “Two of Three Nodes Fail” recovery scenario, expands per-origin and per-table recovery workflows, introduces a “Risk of Lost Transactions” section with mitigations, renames/reorders phases (e.g., “Spock Cleanup”), and updates commands and warnings including --preserve-origin usage.

Changes

Cohort / File(s) Summary
Disaster Recovery Documentation
docs/recovery/catastrophic_node_failure.md
Adds a third scenario (“Two of Three Nodes Fail”) and full recovery workflow; inserts “Risk of Lost Transactions” and mitigation guidance; renames/reorders phases (Phase 2 → “Spock Cleanup”, updates sequencing); adds per-origin and multi-origin recovery steps, per-table diffs/repairs, stronger warnings, updated example commands (consistent --preserve-origin), mermaid diagram, and improved cross-references and editorial consistency.

Poem

🐰 I hopped through nodes both near and far,

Two fell silent — we studied each scar.
With diffs and spock-clean sweeps so clever,
--preserve-origin binds roots together.
Recovery carrots, safe and bright — hop on, forever!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the three main additions: multi-node failure recovery, lost transactions guidance, and source-failover troubleshooting.
Description check ✅ Passed The description is directly related to the changeset, covering the new two-of-three-nodes recovery scenario, unreplicated transactions risk, and source-failover troubleshooting steps.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch DOC-ACE
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
docs/recovery/catastrophic_node_failure.md (1)

282-312: Add a one-line origin-semantics note in the data-loss section

Since this section discusses origin-filtered recovery, consider explicitly clarifying that local writes are origin 0, while unavailable origin metadata is NULL/unknown—they are not the same state.

Based on learnings: origin = 0 means local change; NULL/unknown means origin metadata unavailable and must not be conflated.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/recovery/catastrophic_node_failure.md` around lines 282 - 312, Add a
one-line clarification to the "Risk of Lost Transactions" / origin-filtered
recovery discussion stating the origin semantics: explicitly note that origin =
0 denotes a local write, whereas unavailable origin metadata is represented as
NULL (unknown) and these two states must not be conflated; reference the symbols
"origin = 0" and "NULL" in that sentence for clarity.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/recovery/catastrophic_node_failure.md`:
- Around line 25-31: The intro now lists three scenarios but nearby text still
uses the phrase "both cases"; update any occurrences of "both cases" in
docs/recovery/catastrophic_node_failure.md to a plural-consistent phrase such as
"these cases" or "any of these scenarios" so the wording matches the three-item
list, and scan surrounding sentences for pronouns that assume only two cases and
adjust them likewise.
- Around line 852-857: Update the failover retry step that instructs rerunning
table-diff and table-repair (mentions table-diff and table-repair and the
example --source-of-truth n4) to require reusing the original recovery flags:
add explicit text requiring --preserve-origin, --recovery-mode and the same
--until cutoff(s) in the command examples and prose so operators know to include
them alongside --source-of-truth to ensure equivalent repairs on retry.

---

Nitpick comments:
In `@docs/recovery/catastrophic_node_failure.md`:
- Around line 282-312: Add a one-line clarification to the "Risk of Lost
Transactions" / origin-filtered recovery discussion stating the origin
semantics: explicitly note that origin = 0 denotes a local write, whereas
unavailable origin metadata is represented as NULL (unknown) and these two
states must not be conflated; reference the symbols "origin = 0" and "NULL" in
that sentence for clarity.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7d89fc73-0b46-4e16-8ba5-07e6c2b1eaf1

📥 Commits

Reviewing files that changed from the base of the PR and between 0ae4205 and e5ccf42.

📒 Files selected for processing (1)
  • docs/recovery/catastrophic_node_failure.md

Comment on lines +25 to +31
This document covers three cases:

- **single node failure** (one node fails and another is lagging)
- **single node failure** (one node fails and another is lagging)
- **two of three nodes fail** (only one survivor remains; rebuild the failed
nodes from it)
- **multiple node failure** (two or more nodes fail and one or more survivors
are behind).
are behind).
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Intro now lists three cases; avoid “both cases” wording nearby

Line 25–31 correctly introduces three scenarios, but nearby narrative still refers to “both cases,” which reads inconsistent after this change. Tightening that phrasing will reduce operator confusion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/recovery/catastrophic_node_failure.md` around lines 25 - 31, The intro
now lists three scenarios but nearby text still uses the phrase "both cases";
update any occurrences of "both cases" in
docs/recovery/catastrophic_node_failure.md to a plural-consistent phrase such as
"these cases" or "any of these scenarios" so the wording matches the three-item
list, and scan surrounding sentences for pronouns that assume only two cases and
adjust them likewise.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/recovery/catastrophic_node_failure.md`:
- Around line 654-655: Typo fix: add a missing space between the inline code
token and the following word so "`--preserve-origin`clause" becomes
"`--preserve-origin` clause" in the docs; update the text near the occurrence of
`--preserve-origin` to insert the space for correct readability.
- Around line 122-123: The document contains duplicate section headings "###
Implementing a Recovery" which create identical anchors; locate both occurrences
of the heading text "### Implementing a Recovery" and make them unique by
appending a scenario-specific suffix (for example "### Implementing a Recovery —
Catastrophic Node Failure" and "### Implementing a Recovery — <other scenario
name>"), and then update any internal links/anchors that reference those
headings to use the new unique anchor names; ensure the two headings are
distinct and descriptive so anchors are no longer duplicated.
- Around line 127-136: The numbered list in the recovery doc skips from "2." to
"4."; update the markdown so the sequence is continuous by changing the "4."
line (the step starting "When the initial sync completes for each node, validate
with `table-diff`...") to "3." (or insert a proper step 3 if additional content
is needed) so the list reads 1, 2, 3; modify the stanza containing "Clean up
Spock..." / "Rebuild n2 and n3..." / "When the initial sync completes..."
accordingly.
- Line 839: Fix the typo in the markdown heading "Origin Netadata Missing for
Some Rows" by changing "Netadata" to "Metadata" so the heading reads "Origin
Metadata Missing for Some Rows"; update that exact heading string in
docs/recovery/catastrophic_node_failure.md.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 358af2bc-3dfc-4a38-a545-d50a5d6f42ec

📥 Commits

Reviewing files that changed from the base of the PR and between e5ccf42 and 8bf4b33.

📒 Files selected for processing (1)
  • docs/recovery/catastrophic_node_failure.md

Comment on lines +122 to +123
### Implementing a Recovery

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Duplicate ### Implementing a Recovery headings reduce navigability.

Lines 122 and 179 reuse the same heading text, which produces duplicate section names and awkward anchors in long docs. Please make these headings unique per scenario.

Also applies to: 179-180

🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 122-122: Multiple headings with the same content

(MD024, no-duplicate-heading)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/recovery/catastrophic_node_failure.md` around lines 122 - 123, The
document contains duplicate section headings "### Implementing a Recovery" which
create identical anchors; locate both occurrences of the heading text "###
Implementing a Recovery" and make them unique by appending a scenario-specific
suffix (for example "### Implementing a Recovery — Catastrophic Node Failure"
and "### Implementing a Recovery — <other scenario name>"), and then update any
internal links/anchors that reference those headings to use the new unique
anchor names; ensure the two headings are distinct and descriptive so anchors
are no longer duplicated.

Comment on lines +127 to +136
1. Clean up Spock by dropping subscriptions to n2 and n3 on n1, then
dropping the failed nodes from the cluster.
2. Rebuild n2 and n3 from n1; because the failed nodes are unavailable, you
cannot run ACE `table-diff` against them; instead, use ZODAN's
`add_node` procedure to add n2 and n3 back as fresh nodes, which
performs a full initial sync from n1; see the
[ZODAN documentation](../modify/zodan/index.md) for step-by-step
instructions.
4. When the initial sync completes for each node, validate with
`table-diff` across all three nodes to confirm they match.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Numbered steps skip from 2 to 4.

Line 135 starts at 4. with no 3.. Even if Markdown auto-numbers, this is confusing in raw/edit views.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/recovery/catastrophic_node_failure.md` around lines 127 - 136, The
numbered list in the recovery doc skips from "2." to "4."; update the markdown
so the sequence is continuous by changing the "4." line (the step starting "When
the initial sync completes for each node, validate with `table-diff`...") to
"3." (or insert a proper step 3 if additional content is needed) so the list
reads 1, 2, 3; modify the stanza containing "Clean up Spock..." / "Rebuild n2
and n3..." / "When the initial sync completes..." accordingly.

Comment on lines +654 to +655
of truth (n3) and including the `--preserve-origin`clause every time. ACE
writes a new filename for each table-diff run; use the file
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Minor typo: missing space before “clause”.

Line 654 has `--preserve-origin`clause; add a space for readability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/recovery/catastrophic_node_failure.md` around lines 654 - 655, Typo fix:
add a missing space between the inline code token and the following word so
"`--preserve-origin`clause" becomes "`--preserve-origin` clause" in the docs;
update the text near the occurrence of `--preserve-origin` to insert the space
for correct readability.

the complete data.

### Origin metadata missing for some rows
### Origin Netadata Missing for Some Rows
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Typo in troubleshooting heading.

Line 839: “Netadata” should be “Metadata”.

🧰 Tools
🪛 LanguageTool

[grammar] ~839-~839: Ensure spelling is correct
Context: ...know has the complete data. ### Origin Netadata Missing for Some Rows ACE may log a wa...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/recovery/catastrophic_node_failure.md` at line 839, Fix the typo in the
markdown heading "Origin Netadata Missing for Some Rows" by changing "Netadata"
to "Metadata" so the heading reads "Origin Metadata Missing for Some Rows";
update that exact heading string in docs/recovery/catastrophic_node_failure.md.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants