
docs: document backfill mode tradeoffs and silent delete loss #2812

Open
jwhartley wants to merge 7 commits into master from docs/backfill-modes-delete-caveat

Conversation

@jwhartley
Contributor

Summary

  • Document pros/cons for Normal and Precise backfill modes
  • Add warning that Precise mode can silently lose deletes during incremental backfills
  • Document the default Automatic behavior and how it selects between modes
  • Add comparison table for choosing between modes
  • Replace PostgreSQL-specific "WAL" terminology with generic "replication log"
  • Add binlog_row_metadata=FULL recommendation for MySQL/MariaDB

Context

Discovered during investigation of a customer issue where an ALTER TABLE triggered an automatic precise backfill, and deletes during the backfill window were silently dropped. The existing docs did not mention this tradeoff.

…n Precise mode

Add pros/cons for Normal and Precise backfill modes, document the default
Automatic behavior, and add a comparison table for choosing between modes.
Key addition: Precise mode can silently lose deletes during incremental
backfills when rows are deleted before the scanner reaches them.
@github-actions

github-actions bot commented Mar 27, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://estuary.github.io/flow/pr-preview/pr-2812/

Built to branch gh-pages at 2026-03-27 03:35 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@jwhartley jwhartley requested a review from willdonnelly March 31, 2026 00:13
Member

@willdonnelly willdonnelly left a comment


I have reservations about some of the edits made in this document, mostly that they allude to various edge cases and error scenarios without actually describing them fully or the situations under which they occur, and that can be just as misleading as not mentioning them at all.

I don't know if typical users are actually going to read this document closely enough to draw those sorts of inferences so maybe vague mentions are sufficient, but I'm especially concerned that this could contaminate the AI answers system and cause it to give less helpful responses rather than more. For example if the user asks something about schema migrations and the most salient snippet the RAG pipeline turns up is where this document mentions "backfills triggered by schema changes", it's liable to hallucinate a whole bunch of incorrect conclusions from that.

In Precise mode, the connector fetches key-ordered chunks of the table for the backfill while performing reads of the WAL.
Any WAL changes for portions of the table that have already been backfilled are emitted. In contrast to Normal mode, however, WAL changes are suppressed if they relate to a part of the table that hasn't been backfilled yet.
**Cons:**
Duplicate events are possible during backfill. With standard (non-delta) materializations, duplicates are deduplicated by the runtime. With delta updates enabled, duplicates may result in duplicate records. Event ordering per key is also not guaranteed (e.g. you may see an update captured before the corresponding insert).
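The filtering rule described above can be sketched as a toy model (illustrative only; function and state names here are invented for the example, not taken from the connector's code). The scanner advances a key-ordered high-water mark, and replication events are emitted only for keys at or below it until the backfill completes:

```python
# Toy model of Precise-mode replication filtering: events are emitted only
# for key ranges the backfill scanner has already covered.

def make_precise_filter():
    state = {"scanned_through": None, "done": False}

    def advance(key):
        # The scanner finished another key-ordered chunk ending at `key`.
        state["scanned_through"] = key

    def finish():
        # Backfill complete: from here on, every event is emitted.
        state["done"] = True

    def should_emit(key):
        if state["done"]:
            return True
        if state["scanned_through"] is None:
            # Nothing backfilled yet, so all replication events are suppressed.
            return False
        # Emit only events for already-backfilled portions of the table.
        return key <= state["scanned_through"]

    return advance, finish, should_emit
```

For instance, after `advance("k100")`, an event for key `"k050"` is emitted while an event for `"k500"` is suppressed until the backfill finishes.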
Member


The bit about event ordering is technically accurate but feels like it might give users a mistaken impression that event ordering is less reliable than it actually is. The most recent replication update will always come last and any ordering imprecision is solely about where the backfill insert will land relative to any other changes to a given row.

Contributor Author


Good call — reworded to clarify that the imprecision is about where the backfill insert lands relative to later replication events for the same row, not that ordering is generally unreliable. The most recent replication event always arrives last.

Produces a logically consistent sequence of changes per key — no duplicates, correct ordering.

**Cons:**
**Deletes can be silently lost during incremental backfills** (where existing records are already present in the collection and destination). If a row is deleted while the backfill scanner has not yet reached it, the DELETE event is filtered out. When the scanner reaches that key range, the row no longer exists in the source table and is never seen — the old version remains in the destination without a delete marker. This does not affect full data flow resets, where the destination is rebuilt from scratch. Only rows deleted *during* the backfill are affected; once the backfill completes, all subsequent deletes are captured normally.
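The failure sequence above can be simulated with a toy model (hypothetical names, not connector code): the destination already holds old row versions from a prior backfill, a row is hard-deleted mid-scan, the suppressed DELETE never reaches the destination, and the scanner later finds nothing to re-emit:

```python
# Toy simulation of the delete-loss window during an incremental
# Precise-mode backfill. `delete_key` is hard-deleted from the source
# just as the scanner processes `delete_before_key`.

def run_incremental_backfill(source_rows, destination, delete_key, delete_before_key):
    # Walk the table in key order, as the backfill scanner does.
    for key in sorted(source_rows):
        if key == delete_before_key:
            # A replication DELETE for `delete_key` arrives now. The scanner
            # has not reached `delete_key` yet, so Precise mode suppresses
            # the event; the row also vanishes from the source table.
            source_rows.pop(delete_key, None)
        if key in source_rows:
            destination[key] = source_rows[key]  # backfill upsert
    return destination
```

Running this with `source = {"a": 1, "b": 2, "c": 3}`, a destination pre-populated with old versions of all three keys, and `"c"` deleted while the scanner is at `"b"` leaves the stale `"c"` row in the destination with no delete marker.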
Member


Listing this as a con of the precise backfill mode specifically feels misleading, since incremental backfills in normal mode have the exact same caveat. More accurate would be to note that both modes have this issue when doing an incremental backfill, but for normal/unfiltered backfills the window of possibly-missing-deletes ends when the backfill starts and for precise backfills that window ends when the backfill finishes.

Contributor Author


Agreed — reframed this so it's not a Precise-only con. The updated text notes that both modes have a delete loss window during incremental backfills, but the window ends when the backfill starts for Normal vs. when it finishes for Precise.

* **Only Changes:** skips backfilling the table entirely and jumps directly to replication streaming for the entire dataset.

No backfill of the table content is performed at all. Only WAL changes are emitted.
No backfill of the table content is performed at all. Only replication log changes are emitted. Use this mode when you only need new changes going forward and don't need historical data, or when you want to avoid the overhead of scanning a large table.
Member


That "when you want to avoid the overhead" seems like it might be a bit confusing to users. Like yes, you can use this to avoid the overhead, but only subject to the clause immediately prior to it. If you need the historical data then whether "you want to avoid the overhead" means precisely squat as it concerns your options.

Contributor Author


Removed that clause — if you need historical data, wanting to avoid overhead is moot.

* **Without Primary Key:** can be used to capture tables without any form of unique primary key.

The connector uses an alternative physical row identifier (such as a Postgres `ctid`) to scan backfill chunks, rather than walking the table in key order.
The connector uses an alternative physical row identifier (such as a Postgres `ctid`) to scan backfill chunks, rather than walking the table in key order. Use this mode when a table lacks a usable primary key or unique index.
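For PostgreSQL specifically, one backfill chunk in this mode might look roughly like the following (a sketch only; `public.my_table` and the chunk boundaries are placeholders, and the connector's actual queries may differ). Note that range comparisons on `ctid`, which enable efficient TID range scans, require PostgreSQL 14+:

```sql
-- Illustrative keyless backfill chunk: rows fetched by physical location.
-- PostgreSQL 14+ supports range comparisons on ctid (TID range scans).
SELECT ctid, *
FROM public.my_table
WHERE ctid > '(4096,0)'::tid
ORDER BY ctid
LIMIT 50000;
```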
Member


Not sure if this is worth pointing out since this is a generic document (although it's in a kind of weird place since this whole discussion of backfill modes only applies to SQL CDC connectors and I don't feel like that's called out quite strongly enough upfront), but there's one additional reason one might want to use a keyless backfill in PostgreSQL specifically: when the table's primary key is uncorrelated with table insert order (such as a UUIDv4 or other random token), a keyless backfill may run significantly faster because it avoids a bunch of read amplification from random page fetches in key order.

Contributor Author


Added a note to the PostgreSQL-specific section covering this — UUIDv4/random PKs cause random page fetches in key order, so Without Primary Key mode (ctid scan) can be significantly faster in those cases.

- **Normal** is selected for tables where key ordering is unpredictable (e.g. certain character encodings or collations).
- **Without Primary Key** is selected for tables that lack a usable primary key or unique index.

For most SQL captures, Automatic will select **Precise**.
Member


I think "most" is an overstatement here but it really depends a lot on how the user keys their tables. I'd say "many" at the strongest.

Contributor Author


Changed to "many".


If your workload includes hard deletes and you want to ensure no deletes are lost during incremental backfills (e.g. backfills triggered by schema changes), consider setting the backfill mode to **Normal** on affected bindings. The tradeoff is possible duplicate events during the backfill, which are deduplicated automatically unless you are using delta updates.

For MySQL and MariaDB captures, setting `binlog_row_metadata=FULL` can prevent many unnecessary backfills from being triggered by schema changes, reducing the window in which this issue can occur regardless of backfill mode.
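As a concrete example of applying that setting (verify version support and required privileges for your deployment; `binlog_row_metadata` exists on MySQL 8.0+ and MariaDB 10.5+):

```sql
-- Requires MySQL 8.0+ or MariaDB 10.5+.
SET GLOBAL binlog_row_metadata = 'FULL';

-- On MySQL 8.0+, SET PERSIST additionally survives server restarts.
SET PERSIST binlog_row_metadata = 'FULL';
```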
Member


The mention of "triggered by schema changes" as a thing which can happen feels like it might mislead users and/or cause the Kapa AI to give wrong answers going forward, and so if we're going to bring it up at all we ought to actually discuss the conditions in which they might occur. Those conditions are:

  • In general schema changes do not trigger backfills, with the following exceptions.
  • In SQL Server CDC, schema changes can trigger backfills when the "Automatic Capture Instance Management" setting is enabled.
  • In MySQL, schema changes can trigger backfills if they are executed with binlog writes deliberately turned off (either manually or by a schema migration tool which does so), or if the DDL query falls into an edge case which we currently cannot parse. If a user simply connects to the database and issues an `ALTER TABLE foo ADD COLUMN bar VARCHAR(64)` or whatever, no backfill will occur.

Contributor Author


Expanded this with the specific conditions you outlined: SQL Server with Automatic Capture Instance Management, and MySQL/MariaDB where binlog writes are disabled during migration or the DDL is unparseable. Also removed the vague '(e.g. backfills triggered by schema changes)' from the recommendation text.

@jwhartley
Contributor Author

jwhartley commented Mar 31, 2026

Hey Will, I appreciate the detail in this review. I'd like to err on the side of more (accurate) information in docs. LLMs will only get better, and it's more defensible to have the extra detail there than not, even if we have to expect some amount of hallucination (from Kapa and customers' LLMs) for now.

All 6 comments addressed — replied inline to each.

@jwhartley jwhartley requested a review from willdonnelly April 1, 2026 00:11