
docs: document backfill mode tradeoffs and silent delete loss #2812

Open
jwhartley wants to merge 7 commits into master from docs/backfill-modes-delete-caveat

Conversation

@jwhartley
Contributor

Summary

  • Document pros/cons for Normal and Precise backfill modes
  • Add warning that Precise mode can silently lose deletes during incremental backfills
  • Document the default Automatic behavior and how it selects between modes
  • Add comparison table for choosing between modes
  • Replace PostgreSQL-specific "WAL" terminology with generic "replication log"
  • Add binlog_row_metadata=FULL recommendation for MySQL/MariaDB

Context

Discovered during investigation of a customer issue where an ALTER TABLE triggered an automatic precise backfill, and deletes during the backfill window were silently dropped. The existing docs did not mention this tradeoff.

…n Precise mode

Add pros/cons for Normal and Precise backfill modes, document the default
Automatic behavior, and add a comparison table for choosing between modes.
Key addition: Precise mode can silently lose deletes during incremental
backfills when rows are deleted before the scanner reaches them.
@github-actions

github-actions bot commented Mar 27, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://estuary.github.io/flow/pr-preview/pr-2812/

Built to branch gh-pages at 2026-03-27 03:35 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@jwhartley jwhartley requested a review from willdonnelly March 31, 2026 00:13
Member

@willdonnelly willdonnelly left a comment


I have reservations about some of the edits made in this document, mostly that they allude to various edge cases and error scenarios without actually describing them fully or the situations under which they occur, and that can be just as misleading as not mentioning them at all.

I don't know if typical users are actually going to read this document closely enough to draw those sorts of inferences so maybe vague mentions are sufficient, but I'm especially concerned that this could contaminate the AI answers system and cause it to give less helpful responses rather than more. For example if the user asks something about schema migrations and the most salient snippet the RAG pipeline turns up is where this document mentions "backfills triggered by schema changes", it's liable to hallucinate a whole bunch of incorrect conclusions from that.

In Precise mode, the connector fetches key-ordered chunks of the table for the backfill while performing reads of the WAL.
Any WAL changes for portions of the table that have already been backfilled are emitted. In contrast to Normal mode, however, WAL changes are suppressed if they relate to a part of the table that hasn't been backfilled yet.
**Cons:**
Duplicate events are possible during backfill. With standard (non-delta) materializations, duplicates are deduplicated by the runtime. With delta updates enabled, duplicates may result in duplicate records. Event ordering per key is also not guaranteed (e.g. you may see an update captured before the corresponding insert).
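The filtering rule described above can be sketched as a toy model (illustrative only; function and state names here are invented for the example, not taken from the connector's code). The scanner advances a key-ordered high-water mark, and replication events are emitted only for keys at or below it until the backfill completes:

```python
# Toy model of Precise-mode replication filtering: events are emitted only
# for key ranges the backfill scanner has already covered.

def make_precise_filter():
    state = {"scanned_through": None, "done": False}

    def advance(key):
        # The scanner finished another key-ordered chunk ending at `key`.
        state["scanned_through"] = key

    def finish():
        # Backfill complete: from here on, every event is emitted.
        state["done"] = True

    def should_emit(key):
        if state["done"]:
            return True
        if state["scanned_through"] is None:
            # Nothing backfilled yet, so all replication events are suppressed.
            return False
        # Emit only events for already-backfilled portions of the table.
        return key <= state["scanned_through"]

    return advance, finish, should_emit
```

For instance, after `advance("k100")`, an event for key `"k050"` is emitted while an event for `"k500"` is suppressed until the backfill finishes.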
Member


The bit about event ordering is technically accurate but feels like it might give users a mistaken impression that event ordering is less reliable than it actually is. The most recent replication update will always come last and any ordering imprecision is solely about where the backfill insert will land relative to any other changes to a given row.

Contributor Author


Good call — reworded to clarify that the imprecision is about where the backfill insert lands relative to later replication events for the same row, not that ordering is generally unreliable. The most recent replication event always arrives last.

Produces a logically consistent sequence of changes per key — no duplicates, correct ordering.

**Cons:**
**Deletes can be silently lost during incremental backfills** (where existing records are already present in the collection and destination). If a row is deleted while the backfill scanner has not yet reached it, the DELETE event is filtered out. When the scanner reaches that key range, the row no longer exists in the source table and is never seen — the old version remains in the destination without a delete marker. This does not affect full data flow resets, where the destination is rebuilt from scratch. Only rows deleted *during* the backfill are affected; once the backfill completes, all subsequent deletes are captured normally.
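The failure sequence above can be simulated with a toy model (hypothetical names, not connector code): the destination already holds old row versions from a prior backfill, a row is hard-deleted mid-scan, the suppressed DELETE never reaches the destination, and the scanner later finds nothing to re-emit:

```python
# Toy simulation of the delete-loss window during an incremental
# Precise-mode backfill. `delete_key` is hard-deleted from the source
# just as the scanner processes `delete_before_key`.

def run_incremental_backfill(source_rows, destination, delete_key, delete_before_key):
    # Walk the table in key order, as the backfill scanner does.
    for key in sorted(source_rows):
        if key == delete_before_key:
            # A replication DELETE for `delete_key` arrives now. The scanner
            # has not reached `delete_key` yet, so Precise mode suppresses
            # the event; the row also vanishes from the source table.
            source_rows.pop(delete_key, None)
        if key in source_rows:
            destination[key] = source_rows[key]  # backfill upsert
    return destination
```

Running this with `source = {"a": 1, "b": 2, "c": 3}`, a destination pre-populated with old versions of all three keys, and `"c"` deleted while the scanner is at `"b"` leaves the stale `"c"` row in the destination with no delete marker.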
Member


Listing this as a con of the precise backfill mode specifically feels misleading, since incremental backfills in normal mode have the exact same caveat. More accurate would be to note that both modes have this issue when doing an incremental backfill, but for normal/unfiltered backfills the window of possibly-missing-deletes ends when the backfill starts and for precise backfills that window ends when the backfill finishes.

Contributor Author


Agreed — reframed this so it's not a Precise-only con. The updated text notes that both modes have a delete loss window during incremental backfills, but the window ends when the backfill starts for Normal vs. when it finishes for Precise.

* **Only Changes:** skips backfilling the table entirely and jumps directly to replication streaming for the entire dataset.

No backfill of the table content is performed at all. Only WAL changes are emitted.
No backfill of the table content is performed at all. Only replication log changes are emitted. Use this mode when you only need new changes going forward and don't need historical data, or when you want to avoid the overhead of scanning a large table.
Member


That "when you want to avoid the overhead" seems like it might be a bit confusing to users. Like yes, you can use this to avoid the overhead, but only subject to the clause immediately prior to it. If you need the historical data then whether "you want to avoid the overhead" means precisely squat as it concerns your options.

Contributor Author


Removed that clause — if you need historical data, wanting to avoid overhead is moot.

* **Without Primary Key:** can be used to capture tables without any form of unique primary key.

The connector uses an alternative physical row identifier (such as a Postgres `ctid`) to scan backfill chunks, rather than walking the table in key order.
The connector uses an alternative physical row identifier (such as a Postgres `ctid`) to scan backfill chunks, rather than walking the table in key order. Use this mode when a table lacks a usable primary key or unique index.
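For PostgreSQL specifically, one backfill chunk in this mode might look roughly like the following (a sketch only; `public.my_table` and the chunk boundaries are placeholders, and the connector's actual queries may differ). Note that range comparisons on `ctid`, which enable efficient TID range scans, require PostgreSQL 14+:

```sql
-- Illustrative keyless backfill chunk: rows fetched by physical location.
-- PostgreSQL 14+ supports range comparisons on ctid (TID range scans).
SELECT ctid, *
FROM public.my_table
WHERE ctid > '(4096,0)'::tid
ORDER BY ctid
LIMIT 50000;
```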
Member


Not sure if this is worth pointing out since this is a generic document (although it's in a kind of weird place since this whole discussion of backfill modes only applies to SQL CDC connectors and I don't feel like that's called out quite strongly enough upfront), but there's one additional reason one might want to use a keyless backfill in PostgreSQL specifically: when the table's primary key is uncorrelated with table insert order (such as a UUIDv4 or other random token), a keyless backfill may run significantly faster because it avoids a bunch of read amplification from random page fetches in key order.

Contributor Author


Added a note to the PostgreSQL-specific section covering this — UUIDv4/random PKs cause random page fetches in key order, so Without Primary Key mode (ctid scan) can be significantly faster in those cases.

- **Normal** is selected for tables where key ordering is unpredictable (e.g. certain character encodings or collations).
- **Without Primary Key** is selected for tables that lack a usable primary key or unique index.

For most SQL captures, Automatic will select **Precise**.
Member


I think "most" is an overstatement here but it really depends a lot on how the user keys their tables. I'd say "many" at the strongest.

Contributor Author


Changed to "many".


If your workload includes hard deletes and you want to ensure no deletes are lost during incremental backfills (e.g. backfills triggered by schema changes), consider setting the backfill mode to **Normal** on affected bindings. The tradeoff is possible duplicate events during the backfill, which are deduplicated automatically unless you are using delta updates.

For MySQL and MariaDB captures, setting `binlog_row_metadata=FULL` can prevent many unnecessary backfills from being triggered by schema changes, reducing the window in which this issue can occur regardless of backfill mode.
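As a concrete example of applying that setting (verify version support and required privileges for your deployment; `binlog_row_metadata` exists on MySQL 8.0+ and MariaDB 10.5+):

```sql
-- Requires MySQL 8.0+ or MariaDB 10.5+.
SET GLOBAL binlog_row_metadata = 'FULL';

-- On MySQL 8.0+, SET PERSIST additionally survives server restarts.
SET PERSIST binlog_row_metadata = 'FULL';
```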
Member


The mention of "triggered by schema changes" as a thing which can happen feels like it might mislead users and/or cause the Kapa AI to give wrong answers going forward, and so if we're going to bring it up at all we ought to actually discuss the conditions in which they might occur. Those conditions are:

  • In general schema changes do not trigger backfills, with the following exceptions.
  • In SQL Server CDC, schema changes can trigger backfills when the "Automatic Capture Instance Management" setting is enabled.
  • In MySQL, schema changes can trigger backfills if they are executed with binlog writes deliberately turned off (either manually or by a schema migration tool which does so), or if the DDL query falls into an edge case which we currently cannot parse. If a user simply connects to the database and issues an `ALTER TABLE foo ADD COLUMN bar VARCHAR(64)` or whatever, no backfill will occur.

Contributor Author


Expanded this with the specific conditions you outlined: SQL Server with Automatic Capture Instance Management, and MySQL/MariaDB where binlog writes are disabled during migration or the DDL is unparseable. Also removed the vague '(e.g. backfills triggered by schema changes)' from the recommendation text.

@jwhartley
Contributor Author

jwhartley commented Mar 31, 2026

Hey Will, I appreciate the detail in this review. I'd like to err on the side of more (accurate) information in docs. LLMs will only get better, and it's more defensible to have the extra detail there than not, even if we have to expect some amount of hallucination (from Kapa and customers' LLMs) for now.

All 6 comments addressed — replied inline to each.

@jwhartley jwhartley requested a review from willdonnelly April 1, 2026 00:11