SPOC-311: Fix data loss when adding a node under write load. #397
Draft
ibrarahmad wants to merge 4 commits into v5_STABLE from
Backport the ZODAN add_node protocol fixes from main to v5.

The root cause of the data loss was that disabled subscriptions
(e.g. sub_n2_n3) created by add_node started replicating from LSN 0/0
because the replication origin was never advanced. When the apply
worker received WAL from the slot's consistent point, any UPDATE on a
row not yet present on the new node caused repeated "row not found"
errors, which disabled the subscription and ended in a 20-minute
timeout.

Fix this by advancing both the replication slot and the replication
origin to resume_lsn (the last commit LSN from spock.progress) in
Phase 7 of the add_node protocol. Also add a source-node commit
catch-up wait in Phase 3 to close the [resume_lsn, L_slot) gap.
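
To illustrate why the origin position matters, here is a minimal Python sketch. It is a toy model, not Spock's code: the commit history, the `replayed` helper, and the rule that a commit is replayed only when its LSN is greater than the origin position are all assumptions made for illustration.

```python
def parse_lsn(text: str) -> int:
    """Convert a PostgreSQL LSN such as '0/16B3748' to a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replayed(commits, origin_lsn):
    """Commits a subscriber would replay when its origin sits at origin_lsn.

    Toy rule (assumption): a commit is replayed only if its LSN is
    strictly greater than the origin position.
    """
    start = parse_lsn(origin_lsn)
    return [name for lsn, name in commits if parse_lsn(lsn) > start]


# Hypothetical history on the source node: (commit LSN, description).
COMMITS = [
    ("0/1000", "INSERT row 1"),
    ("0/2000", "UPDATE row 1"),
    ("0/3000", "UPDATE row 2"),
]
RESUME_LSN = "0/2000"  # hypothetical value standing in for spock.progress

# Origin never advanced: every commit is replayed from the start.
print(replayed(COMMITS, "0/0"))
# Origin advanced to resume_lsn: only later commits are replayed.
print(replayed(COMMITS, RESUME_LSN))
```

In this model, leaving the origin at 0/0 replays changes the new node either already has or cannot yet apply, while advancing it to resume_lsn limits replay to commits made after that point.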

Additional hardening of the ZODAN protocol:
- Check wait_for_sync_event() return value instead of discarding it
  with PERFORM; raise on timeout.
- Pass wait_if_disabled=true to wait_for_sync_event() so the
  procedure tolerates not-yet-enabled subscriptions.
- Reduce all wait timeouts from 1200s to 180s and use
  clock_timestamp()-based bounds instead of iteration counters.
- Detect terminal subscription states (disabled/down) and fail fast
  with RAISE EXCEPTION instead of silently continuing.
- Replace RAISE NOTICE + CONTINUE with RAISE EXCEPTION throughout,
  so errors propagate instead of being swallowed.
- Add wait_for_replication_catchup_with_dblink() for bounded
  lag-tracker polling.
- Add bounded SQL loop replacing C-level sub_wait_for_sync() in
  Phase 7 to avoid rare hangs.
- Drop stale replication origins before creating disabled
  subscriptions to prevent stale-LSN data loss.
- Advance replication origin on new node alongside slot advancement.
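
The wait-related items in the list above share one pattern: poll against a wall-clock deadline, fail fast on a terminal state, and raise rather than continue. The Python sketch below models that pattern only. It is not the PL/pgSQL in this PR; `wait_for_status`, `SubscriptionFailed`, and the "replicating" target status are hypothetical names chosen for illustration.

```python
import time

# Terminal states named in the PR description.
TERMINAL_STATES = {"disabled", "down"}


class SubscriptionFailed(RuntimeError):
    """Raised when the subscription reaches a state it cannot recover from."""


def wait_for_status(get_status, wanted="replicating", timeout_s=180.0,
                    poll_s=1.0, clock=time.monotonic, sleep=time.sleep):
    """Poll get_status() until it returns `wanted`, bounded by a clock deadline.

    The deadline is computed once from the clock, so slow polls cannot
    stretch the total wait the way a fixed iteration count can.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status == wanted:
            return status
        if status in TERMINAL_STATES:
            # Fail fast: more waiting cannot fix a terminal state.
            raise SubscriptionFailed(f"subscription is {status}")
        if clock() >= deadline:
            # Propagate the timeout instead of logging and continuing.
            raise TimeoutError(f"still {status} after {timeout_s}s")
        sleep(poll_s)
```

A caller would pass a function that reads the subscription status and let either exception propagate, which mirrors the move from RAISE NOTICE + CONTINUE to RAISE EXCEPTION.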

Extend the 011_zodan_sync_third TAP test with a remove/re-add cycle to
exercise the full add_node protocol under pgbench write load, and
enable 012_zodan_basics in the TAP schedule. Increase the spockbench
CI TAP test timeout to 45 minutes.
Add a dedicated GitHub Actions workflow that runs the long-running 011_zodan_sync_third TAP test on a weekly schedule and on manual dispatch. The test exercises the full add_node protocol under pgbench write load, with 10 iterations per PG version. Add a run-spock-tap.sh helper script that runs selected TAP tests iteratively inside the Docker container, and wire it into the Dockerfile so the workflow can invoke it.