Add multi-node failure, lost transactions, and source-failover guidance to CNF doc.#394
Add multi-node failure, lost transactions, and source-failover guidance to CNF doc.#394ibrarahmad wants to merge 2 commits intomainfrom
Conversation
…ce to CNF doc. Add a new scenario for two-of-three-nodes-fail recovery via ZODAN rebuild, a section on the risk of unreplicated transactions due to asynchronous logical replication, and stronger troubleshooting steps when the source node fails mid-replay.
📝 WalkthroughWalkthroughAdds a new “Two of Three Nodes Fail” recovery scenario, expands per-origin and per-table recovery workflows, introduces a “Risk of Lost Transactions” section with mitigations, renames/reorders phases (e.g., “Spock Cleanup”), and updates commands and warnings including Changes
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
docs/recovery/catastrophic_node_failure.md (1)
282-312: Add a one-line origin-semantics note in the data-loss sectionSince this section discusses origin-filtered recovery, consider explicitly clarifying that local writes are origin
0, while unavailable origin metadata isNULL/unknown—they are not the same state.Based on learnings: origin = 0 means local change; NULL/unknown means origin metadata unavailable and must not be conflated.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/recovery/catastrophic_node_failure.md` around lines 282 - 312, Add a one-line clarification to the "Risk of Lost Transactions" / origin-filtered recovery discussion stating the origin semantics: explicitly note that origin = 0 denotes a local write, whereas unavailable origin metadata is represented as NULL (unknown) and these two states must not be conflated; reference the symbols "origin = 0" and "NULL" in that sentence for clarity.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/recovery/catastrophic_node_failure.md`:
- Around line 25-31: The intro now lists three scenarios but nearby text still
uses the phrase "both cases"; update any occurrences of "both cases" in
docs/recovery/catastrophic_node_failure.md to a plural-consistent phrase such as
"these cases" or "any of these scenarios" so the wording matches the three-item
list, and scan surrounding sentences for pronouns that assume only two cases and
adjust them likewise.
- Around line 852-857: Update the failover retry step that instructs rerunning
table-diff and table-repair (mentions table-diff and table-repair and the
example --source-of-truth n4) to require reusing the original recovery flags:
add explicit text requiring --preserve-origin, --recovery-mode and the same
--until cutoff(s) in the command examples and prose so operators know to include
them alongside --source-of-truth to ensure equivalent repairs on retry.
---
Nitpick comments:
In `@docs/recovery/catastrophic_node_failure.md`:
- Around line 282-312: Add a one-line clarification to the "Risk of Lost
Transactions" / origin-filtered recovery discussion stating the origin
semantics: explicitly note that origin = 0 denotes a local write, whereas
unavailable origin metadata is represented as NULL (unknown) and these two
states must not be conflated; reference the symbols "origin = 0" and "NULL" in
that sentence for clarity.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 7d89fc73-0b46-4e16-8ba5-07e6c2b1eaf1
📒 Files selected for processing (1)
docs/recovery/catastrophic_node_failure.md
| This document covers three cases: | ||
|
|
||
| - **single node failure** (one node fails and another is lagging) | ||
| - **single node failure** (one node fails and another is lagging) | ||
| - **two of three nodes fail** (only one survivor remains; rebuild the failed | ||
| nodes from it) | ||
| - **multiple node failure** (two or more nodes fail and one or more survivors | ||
| are behind). | ||
| are behind). |
There was a problem hiding this comment.
Intro now lists three cases; avoid “both cases” wording nearby
Line 25–31 correctly introduces three scenarios, but nearby narrative still refers to “both cases,” which reads inconsistent after this change. Tightening that phrasing will reduce operator confusion.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/recovery/catastrophic_node_failure.md` around lines 25 - 31, The intro
now lists three scenarios but nearby text still uses the phrase "both cases";
update any occurrences of "both cases" in
docs/recovery/catastrophic_node_failure.md to a plural-consistent phrase such as
"these cases" or "any of these scenarios" so the wording matches the three-item
list, and scan surrounding sentences for pronouns that assume only two cases and
adjust them likewise.
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/recovery/catastrophic_node_failure.md`:
- Around line 654-655: Typo fix: add a missing space between the inline code
token and the following word so "`--preserve-origin`clause" becomes
"`--preserve-origin` clause" in the docs; update the text near the occurrence of
`--preserve-origin` to insert the space for correct readability.
- Around line 122-123: The document contains duplicate section headings "###
Implementing a Recovery" which create identical anchors; locate both occurrences
of the heading text "### Implementing a Recovery" and make them unique by
appending a scenario-specific suffix (for example "### Implementing a Recovery —
Catastrophic Node Failure" and "### Implementing a Recovery — <other scenario
name>"), and then update any internal links/anchors that reference those
headings to use the new unique anchor names; ensure the two headings are
distinct and descriptive so anchors are no longer duplicated.
- Around line 127-136: The numbered list in the recovery doc skips from "2." to
"4."; update the markdown so the sequence is continuous by changing the "4."
line (the step starting "When the initial sync completes for each node, validate
with `table-diff`...") to "3." (or insert a proper step 3 if additional content
is needed) so the list reads 1, 2, 3; modify the stanza containing "Clean up
Spock..." / "Rebuild n2 and n3..." / "When the initial sync completes..."
accordingly.
- Line 839: Fix the typo in the markdown heading "Origin Netadata Missing for
Some Rows" by changing "Netadata" to "Metadata" so the heading reads "Origin
Metadata Missing for Some Rows"; update that exact heading string in
docs/recovery/catastrophic_node_failure.md.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 358af2bc-3dfc-4a38-a545-d50a5d6f42ec
📒 Files selected for processing (1)
docs/recovery/catastrophic_node_failure.md
| ### Implementing a Recovery | ||
|
|
There was a problem hiding this comment.
Duplicate ### Implementing a Recovery headings reduce navigability.
Lines 122 and 179 reuse the same heading text, which produces duplicate section names and awkward anchors in long docs. Please make these headings unique per scenario.
Also applies to: 179-180
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)
[warning] 122-122: Multiple headings with the same content
(MD024, no-duplicate-heading)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/recovery/catastrophic_node_failure.md` around lines 122 - 123, The
document contains duplicate section headings "### Implementing a Recovery" which
create identical anchors; locate both occurrences of the heading text "###
Implementing a Recovery" and make them unique by appending a scenario-specific
suffix (for example "### Implementing a Recovery — Catastrophic Node Failure"
and "### Implementing a Recovery — <other scenario name>"), and then update any
internal links/anchors that reference those headings to use the new unique
anchor names; ensure the two headings are distinct and descriptive so anchors
are no longer duplicated.
| 1. Clean up Spock by dropping subscriptions to n2 and n3 on n1, then | ||
| dropping the failed nodes from the cluster. | ||
| 2. Rebuild n2 and n3 from n1; because the failed nodes are unavailable, you | ||
| cannot run ACE `table-diff` against them; instead, use ZODAN's | ||
| `add_node` procedure to add n2 and n3 back as fresh nodes, which | ||
| performs a full initial sync from n1; see the | ||
| [ZODAN documentation](../modify/zodan/index.md) for step-by-step | ||
| instructions. | ||
| 4. When the initial sync completes for each node, validate with | ||
| `table-diff` across all three nodes to confirm they match. |
There was a problem hiding this comment.
Numbered steps skip from 2 to 4.
Line 135 starts at 4. with no 3.. Even if Markdown auto-numbers, this is confusing in raw/edit views.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/recovery/catastrophic_node_failure.md` around lines 127 - 136, The
numbered list in the recovery doc skips from "2." to "4."; update the markdown
so the sequence is continuous by changing the "4." line (the step starting "When
the initial sync completes for each node, validate with `table-diff`...") to
"3." (or insert a proper step 3 if additional content is needed) so the list
reads 1, 2, 3; modify the stanza containing "Clean up Spock..." / "Rebuild n2
and n3..." / "When the initial sync completes..." accordingly.
| of truth (n3) and including the `--preserve-origin`clause every time. ACE | ||
| writes a new filename for each table-diff run; use the file |
There was a problem hiding this comment.
Minor typo: missing space before “clause”.
Line 654 has `--preserve-origin`clause; add a space for readability.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/recovery/catastrophic_node_failure.md` around lines 654 - 655, Typo fix:
add a missing space between the inline code token and the following word so
"`--preserve-origin`clause" becomes "`--preserve-origin` clause" in the docs;
update the text near the occurrence of `--preserve-origin` to insert the space
for correct readability.
| the complete data. | ||
|
|
||
| ### Origin metadata missing for some rows | ||
| ### Origin Netadata Missing for Some Rows |
There was a problem hiding this comment.
Typo in troubleshooting heading.
Line 839: “Netadata” should be “Metadata”.
🧰 Tools
🪛 LanguageTool
[grammar] ~839-~839: Ensure spelling is correct
Context: ...know has the complete data. ### Origin Netadata Missing for Some Rows ACE may log a wa...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/recovery/catastrophic_node_failure.md` at line 839, Fix the typo in the
markdown heading "Origin Netadata Missing for Some Rows" by changing "Netadata"
to "Metadata" so the heading reads "Origin Metadata Missing for Some Rows";
update that exact heading string in docs/recovery/catastrophic_node_failure.md.
Add a new scenario for two-of-three-nodes-fail recovery via ZODAN rebuild, a section on the risk of unreplicated transactions due to asynchronous logical replication, and stronger troubleshooting steps when the source node fails mid-replay.