Add multi-node failure, lost transactions, and source-failover guidance to CNF doc. by ibrarahmad · Pull Request #394 · pgEdge/spock

ibrarahmad · 2026-03-17T15:55:08Z

Add a new scenario for two-of-three-nodes-fail recovery via ZODAN rebuild, a section on the risk of unreplicated transactions due to asynchronous logical replication, and stronger troubleshooting steps when the source node fails mid-replay.

…ce to CNF doc. Add a new scenario for two-of-three-nodes-fail recovery via ZODAN rebuild, a section on the risk of unreplicated transactions due to asynchronous logical replication, and stronger troubleshooting steps when the source node fails mid-replay.

coderabbitai · 2026-03-17T15:55:26Z

📝 Walkthrough

Walkthrough

Adds a new “Two of Three Nodes Fail” recovery scenario, expands per-origin and per-table recovery workflows, introduces a “Risk of Lost Transactions” section with mitigations, renames/reorders phases (e.g., “Spock Cleanup”), and updates commands and warnings including --preserve-origin usage.

Changes

Cohort / File(s)	Summary
Disaster Recovery Documentation `docs/recovery/catastrophic_node_failure.md`	Adds a third scenario (“Two of Three Nodes Fail”) and full recovery workflow; inserts “Risk of Lost Transactions” and mitigation guidance; renames/reorders phases (Phase 2 → “Spock Cleanup”, updates sequencing); adds per-origin and multi-origin recovery steps, per-table diffs/repairs, stronger warnings, updated example commands (consistent `--preserve-origin`), mermaid diagram, and improved cross-references and editorial consistency.

Poem

🐰 I hopped through nodes both near and far,

Two fell silent — we studied each scar.
With diffs and spock-clean sweeps so clever,
--preserve-origin binds roots together.
Recovery carrots, safe and bright — hop on, forever!

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the three main additions: multi-node failure recovery, lost transactions guidance, and source-failover troubleshooting.
Description check	✅ Passed	The description is directly related to the changeset, covering the new two-of-three-nodes recovery scenario, unreplicated transactions risk, and source-failover troubleshooting steps.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch DOC-ACE

📝 Coding Plan

Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

docs/recovery/catastrophic_node_failure.md (1)
282-312: Add a one-line origin-semantics note in the data-loss section

Since this section discusses origin-filtered recovery, consider explicitly clarifying that local writes are origin 0, while unavailable origin metadata is NULL/unknown—they are not the same state.

Based on learnings: origin = 0 means local change; NULL/unknown means origin metadata unavailable and must not be conflated.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/recovery/catastrophic_node_failure.md` around lines 282 - 312, Add a
one-line clarification to the "Risk of Lost Transactions" / origin-filtered
recovery discussion stating the origin semantics: explicitly note that origin =
0 denotes a local write, whereas unavailable origin metadata is represented as
NULL (unknown) and these two states must not be conflated; reference the symbols
"origin = 0" and "NULL" in that sentence for clarity.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/recovery/catastrophic_node_failure.md`:
- Around line 25-31: The intro now lists three scenarios but nearby text still
uses the phrase "both cases"; update any occurrences of "both cases" in
docs/recovery/catastrophic_node_failure.md to a plural-consistent phrase such as
"these cases" or "any of these scenarios" so the wording matches the three-item
list, and scan surrounding sentences for pronouns that assume only two cases and
adjust them likewise.
- Around line 852-857: Update the failover retry step that instructs rerunning
table-diff and table-repair (mentions table-diff and table-repair and the
example --source-of-truth n4) to require reusing the original recovery flags:
add explicit text requiring --preserve-origin, --recovery-mode and the same
--until cutoff(s) in the command examples and prose so operators know to include
them alongside --source-of-truth to ensure equivalent repairs on retry.

---

Nitpick comments:
In `@docs/recovery/catastrophic_node_failure.md`:
- Around line 282-312: Add a one-line clarification to the "Risk of Lost
Transactions" / origin-filtered recovery discussion stating the origin
semantics: explicitly note that origin = 0 denotes a local write, whereas
unavailable origin metadata is represented as NULL (unknown) and these two
states must not be conflated; reference the symbols "origin = 0" and "NULL" in
that sentence for clarity.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7d89fc73-0b46-4e16-8ba5-07e6c2b1eaf1

📥 Commits

Reviewing files that changed from the base of the PR and between 0ae4205 and e5ccf42.

📒 Files selected for processing (1)

docs/recovery/catastrophic_node_failure.md

coderabbitai · 2026-03-17T15:58:25Z

docs/recovery/catastrophic_node_failure.md

+This document covers three cases:

- **single node failure** (one node fails and another is lagging) 
+- **single node failure** (one node fails and another is lagging)
+- **two of three nodes fail** (only one survivor remains; rebuild the failed
+  nodes from it)
 - **multiple node failure** (two or more nodes fail and one or more survivors
-  are behind). 
+  are behind).


⚠️ Potential issue | 🟡 Minor

Intro now lists three cases; avoid “both cases” wording nearby

Line 25–31 correctly introduces three scenarios, but nearby narrative still refers to “both cases,” which reads inconsistent after this change. Tightening that phrasing will reduce operator confusion.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/recovery/catastrophic_node_failure.md` around lines 25 - 31, The intro now lists three scenarios but nearby text still uses the phrase "both cases"; update any occurrences of "both cases" in docs/recovery/catastrophic_node_failure.md to a plural-consistent phrase such as "these cases" or "any of these scenarios" so the wording matches the three-item list, and scan surrounding sentences for pronouns that assume only two cases and adjust them likewise.

docs/recovery/catastrophic_node_failure.md

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/recovery/catastrophic_node_failure.md`:
- Around line 654-655: Typo fix: add a missing space between the inline code
token and the following word so "`--preserve-origin`clause" becomes
"`--preserve-origin` clause" in the docs; update the text near the occurrence of
`--preserve-origin` to insert the space for correct readability.
- Around line 122-123: The document contains duplicate section headings "###
Implementing a Recovery" which create identical anchors; locate both occurrences
of the heading text "### Implementing a Recovery" and make them unique by
appending a scenario-specific suffix (for example "### Implementing a Recovery —
Catastrophic Node Failure" and "### Implementing a Recovery — <other scenario
name>"), and then update any internal links/anchors that reference those
headings to use the new unique anchor names; ensure the two headings are
distinct and descriptive so anchors are no longer duplicated.
- Around line 127-136: The numbered list in the recovery doc skips from "2." to
"4."; update the markdown so the sequence is continuous by changing the "4."
line (the step starting "When the initial sync completes for each node, validate
with `table-diff`...") to "3." (or insert a proper step 3 if additional content
is needed) so the list reads 1, 2, 3; modify the stanza containing "Clean up
Spock..." / "Rebuild n2 and n3..." / "When the initial sync completes..."
accordingly.
- Line 839: Fix the typo in the markdown heading "Origin Netadata Missing for
Some Rows" by changing "Netadata" to "Metadata" so the heading reads "Origin
Metadata Missing for Some Rows"; update that exact heading string in
docs/recovery/catastrophic_node_failure.md.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 358af2bc-3dfc-4a38-a545-d50a5d6f42ec

📥 Commits

Reviewing files that changed from the base of the PR and between e5ccf42 and 8bf4b33.

📒 Files selected for processing (1)

docs/recovery/catastrophic_node_failure.md

coderabbitai · 2026-03-18T12:44:22Z

docs/recovery/catastrophic_node_failure.md

+### Implementing a Recovery
+


⚠️ Potential issue | 🟡 Minor

Duplicate ### Implementing a Recovery headings reduce navigability.

Lines 122 and 179 reuse the same heading text, which produces duplicate section names and awkward anchors in long docs. Please make these headings unique per scenario.

Also applies to: 179-180

🧰 Tools

🪛 markdownlint-cli2 (0.21.0)

[warning] 122-122: Multiple headings with the same content

(MD024, no-duplicate-heading)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/recovery/catastrophic_node_failure.md` around lines 122 - 123, The document contains duplicate section headings "### Implementing a Recovery" which create identical anchors; locate both occurrences of the heading text "### Implementing a Recovery" and make them unique by appending a scenario-specific suffix (for example "### Implementing a Recovery — Catastrophic Node Failure" and "### Implementing a Recovery — <other scenario name>"), and then update any internal links/anchors that reference those headings to use the new unique anchor names; ensure the two headings are distinct and descriptive so anchors are no longer duplicated.

coderabbitai · 2026-03-18T12:44:22Z

docs/recovery/catastrophic_node_failure.md

+1. Clean up Spock by dropping subscriptions to n2 and n3 on n1, then
+   dropping the failed nodes from the cluster.
+2. Rebuild n2 and n3 from n1; because the failed nodes are unavailable, you
+   cannot run ACE `table-diff` against them; instead, use ZODAN's
+   `add_node` procedure to add n2 and n3 back as fresh nodes, which
+   performs a full initial sync from n1; see the
+   [ZODAN documentation](../modify/zodan/index.md) for step-by-step
+   instructions.
+4. When the initial sync completes for each node, validate with
+   `table-diff` across all three nodes to confirm they match.


⚠️ Potential issue | 🟡 Minor

Numbered steps skip from 2 to 4.

Line 135 starts at 4. with no 3.. Even if Markdown auto-numbers, this is confusing in raw/edit views.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/recovery/catastrophic_node_failure.md` around lines 127 - 136, The numbered list in the recovery doc skips from "2." to "4."; update the markdown so the sequence is continuous by changing the "4." line (the step starting "When the initial sync completes for each node, validate with `table-diff`...") to "3." (or insert a proper step 3 if additional content is needed) so the list reads 1, 2, 3; modify the stanza containing "Clean up Spock..." / "Rebuild n2 and n3..." / "When the initial sync completes..." accordingly.

coderabbitai · 2026-03-18T12:44:22Z

docs/recovery/catastrophic_node_failure.md

+of truth (n3) and including the `--preserve-origin`clause every time. ACE
+writes a new filename for each table-diff run; use the file


⚠️ Potential issue | 🟡 Minor

Minor typo: missing space before “clause”.

Line 654 has `--preserve-origin`clause; add a space for readability.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/recovery/catastrophic_node_failure.md` around lines 654 - 655, Typo fix: add a missing space between the inline code token and the following word so "`--preserve-origin`clause" becomes "`--preserve-origin` clause" in the docs; update the text near the occurrence of `--preserve-origin` to insert the space for correct readability.

coderabbitai · 2026-03-18T12:44:22Z

docs/recovery/catastrophic_node_failure.md

 the complete data.

-### Origin metadata missing for some rows
+### Origin Netadata Missing for Some Rows


⚠️ Potential issue | 🟡 Minor

Typo in troubleshooting heading.

Line 839: “Netadata” should be “Metadata”.

🧰 Tools

🪛 LanguageTool

[grammar] ~839-~839: Ensure spelling is correct
Context: ...know has the complete data. ### Origin Netadata Missing for Some Rows ACE may log a wa...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/recovery/catastrophic_node_failure.md` at line 839, Fix the typo in the markdown heading "Origin Netadata Missing for Some Rows" by changing "Netadata" to "Metadata" so the heading reads "Origin Metadata Missing for Some Rows"; update that exact heading string in docs/recovery/catastrophic_node_failure.md.

ibrarahmad requested a review from mason-sharp March 17, 2026 15:55

coderabbitai bot reviewed Mar 17, 2026

View reviewed changes

Update catastrophic_node_failure.md

8bf4b33

coderabbitai bot reviewed Mar 18, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add multi-node failure, lost transactions, and source-failover guidance to CNF doc.#394

Add multi-node failure, lost transactions, and source-failover guidance to CNF doc.#394
ibrarahmad wants to merge 2 commits intomainfrom
DOC-ACE

ibrarahmad commented Mar 17, 2026

Uh oh!

coderabbitai bot commented Mar 17, 2026 •

edited

Loading

Walkthrough

Changes

Poem

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Mar 17, 2026

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Mar 18, 2026

Uh oh!

coderabbitai bot Mar 18, 2026

Uh oh!

coderabbitai bot Mar 18, 2026

Uh oh!

coderabbitai bot Mar 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		of truth (n3) and including the `--preserve-origin`clause every time. ACE
		writes a new filename for each table-diff run; use the file

Conversation

ibrarahmad commented Mar 17, 2026

Uh oh!

coderabbitai bot commented Mar 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai bot commented Mar 17, 2026 •

edited

Loading