
fix(firo): make Spark anon-set sync resumable and crash-safe#1298

Open
reubenyap wants to merge 2 commits into cypherstack:staging from reubenyap:fix/spark-anon-set-middle-path

Conversation

@reubenyap

Problem

runFetchAndUpdateSparkAnonSetCacheForGroupId accumulates every sector of a group's anonymity set in an in-memory Dart List and only writes to SQLite after the full delta has been fetched. A network drop, app kill, or power loss mid-download discards the in-memory buffer and leaves no persisted progress, so the next sync attempt recomputes the same meta.size - prevSize delta and re-downloads everything. For a large first sync this means re-downloading the entire set from scratch on every attempt.

Additionally, a related design — persisting partial sectors directly to SparkSet with a complete flag but not filtering reader queries on it — has a second failure mode: a spend initiated while a sync is running can observe a half-populated anonymity set and hand libspark a setHash-mismatched set, failing the membership proof.

Fix

Per-sector commits against an in-progress SparkSet row that stays invisible to readers (complete = 0) until a strict integrity check passes, at which point a single atomic UPDATE flips it to complete = 1. Partial data is never observable by readers, and the next sync resumes from the count of already-linked coins.

Design

  • Schema: adds SparkSet.complete (integrity gate) and SparkSetCoins.orderKey (server-side delta index, for ordering preservation), plus a UNIQUE INDEX on SparkSetCoins(setId, coinId) so per-sector INSERT OR IGNORE dedupes correctly under crash-recovery replay.
  • Migration: idempotent, runs on every open, uses explicit PRAGMA table_info / sqlite_master presence checks rather than try/catch around ALTER (so unrelated SQLite errors don't get silently swallowed). Defensive dedup of any legacy duplicate (setId, coinId) rows before creating the UNIQUE index. Existing SparkSet rows default to complete = 1 (the pre-fix writer was all-or-nothing, so they represent finalized syncs); existing SparkSetCoins.orderKey defaults to 0 and the reader's ssc.id ASC tiebreaker reproduces the pre-fix layout byte-for-byte.
  • Writer: three focused functions — per-sector append, mark-complete (gated on COUNT(SparkSetCoins) == expectedDelta), and delete-incomplete (for blockHash shift / corruption reset). Each runs in its own SQLite transaction.
  • Reader: all anon-set queries gain WHERE ss.complete = 1. Coin ordering reconstructed via ORDER BY ss.id ASC, ssc.orderKey DESC, ssc.id ASC — matches pre-fix end-to-end behavior for both pre-migration rows (orderKey tied at 0, PK tiebreaker) and post-migration rows (orderKey meaningful).
  • Coordinator: resumable loop reads any in-progress row, verifies its blockHash and setHash match the current server meta, resumes at COUNT(SparkSetCoins) if they do, discards and restarts if they don't. Sector-size cross-check refuses to persist a partial/over-full server response. Reorg-shrink and blockHash-advanced-without-new-coins cases are handled explicitly to avoid corrupting finalized state.
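As a concrete sketch of the schema and the per-sector write path, here is a minimal Python/sqlite3 replay. The table and column names (SparkSet.complete, SparkSetCoins.orderKey, the unique index) come from this PR; the cut-down DDL and the `append_sector` helper are illustrative stand-ins for the actual Dart writer.

```python
import sqlite3

# Cut-down schema mirroring the PR's description (real DDL may differ).
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE SparkSet (
    id        INTEGER PRIMARY KEY,
    groupId   INTEGER NOT NULL,
    blockHash TEXT NOT NULL,
    setHash   TEXT NOT NULL,
    complete  INTEGER NOT NULL DEFAULT 0   -- 0 = in-progress, 1 = finalized
);
CREATE TABLE SparkSetCoins (
    id       INTEGER PRIMARY KEY,
    setId    INTEGER NOT NULL,
    coinId   INTEGER NOT NULL,
    orderKey INTEGER NOT NULL DEFAULT 0    -- server-side delta index
);
CREATE UNIQUE INDEX idx_sparksetcoins_set_coin ON SparkSetCoins(setId, coinId);
""")

def append_sector(set_id, coins, start_index):
    """Per-sector commit: one atomic transaction, idempotent under replay."""
    with db:  # sqlite3's context manager wraps this in a transaction
        db.executemany(
            "INSERT OR IGNORE INTO SparkSetCoins (setId, coinId, orderKey) "
            "VALUES (?, ?, ?)",
            [(set_id, c, start_index + i) for i, c in enumerate(coins)],
        )

db.execute("INSERT INTO SparkSet (id, groupId, blockHash, setHash, complete) "
           "VALUES (1, 1, 'h1', 's1', 0)")
append_sector(1, [101, 102], 0)
append_sector(1, [101, 102], 0)  # crash-recovery replay: OR IGNORE dedupes
print(db.execute("SELECT COUNT(*) FROM SparkSetCoins").fetchone()[0])  # 2
```

The unique index is what makes the replayed sector a no-op rather than a duplicate insert.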

Integrity guarantees

| Scenario | Behavior |
| --- | --- |
| Crash mid-sector | SQLite transaction atomicity keeps the cache consistent with the last committed sector. Resume cursor derived from COUNT(SparkSetCoins). |
| Crash mid-finalize | Either complete=1 flipped or it did not; there is no intermediate state. Either way, the next sync resumes correctly. |
| BlockHash shift between attempts | In-progress row discarded; finalized rows untouched. Partial work re-downloaded at the new blockHash. |
| Server returns wrong sector size | Abort before persisting. Cursor unchanged. Retry on next sync. |
| Server reports smaller size than cached (reorg) | Skip sync, preserve cached state. |
| Server reports same blockHash with different size (anomaly) | Skip sync, refuse to corrupt the already-finalized row. |
| Finalize integrity check mismatches | Roll back; row stays complete=0. Next sync observes the over-linked row and resets. |
| Spend initiated during sync | Reader's complete=1 filter keeps in-progress data invisible. The spend sees the last finalized state. |
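The crash-mid-finalize and finalize-mismatch guarantees both reduce to finalize being a single gated transaction. A minimal Python/sqlite3 sketch, where `mark_complete` mirrors the described _markSparkAnonSetComplete behavior and the two-column schema is illustrative:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE SparkSet (id INTEGER PRIMARY KEY, complete INTEGER NOT NULL DEFAULT 0);
CREATE TABLE SparkSetCoins (id INTEGER PRIMARY KEY, setId INTEGER, coinId INTEGER);
""")
db.execute("INSERT INTO SparkSet (id) VALUES (1)")
db.executemany("INSERT INTO SparkSetCoins (setId, coinId) VALUES (1, ?)",
               [(1,), (2,)])

def mark_complete(set_id, expected_delta):
    with db:  # atomic: either the flag flips or nothing changes
        n = db.execute("SELECT COUNT(*) FROM SparkSetCoins WHERE setId = ?",
                       (set_id,)).fetchone()[0]
        if n != expected_delta:
            return False  # row stays complete=0, invisible to readers
        db.execute("UPDATE SparkSet SET complete = 1 WHERE id = ?", (set_id,))
        return True

print(mark_complete(1, 3))  # False: under-count, row stays hidden
print(mark_complete(1, 2))  # True: single gated UPDATE flips visibility
```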

Upgrade path

Fully automatic — no manual cache clear, no wallet rescan, no user action. The migration runs on first open of the new version:

  1. ALTER TABLE SparkSet ADD COLUMN complete INTEGER NOT NULL DEFAULT 1 (legacy rows are finalized).
  2. ALTER TABLE SparkSetCoins ADD COLUMN orderKey INTEGER NOT NULL DEFAULT 0.
  3. Defensive dedup, then CREATE UNIQUE INDEX.
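The three migration steps can be replayed against a throwaway sqlite3 database. In this Python sketch the presence check, the ALTER statements, and the MIN(id) dedup follow the PR's description; the base DDL, sample rows, and the `has_column` helper are illustrative:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE SparkSet (id INTEGER PRIMARY KEY);
CREATE TABLE SparkSetCoins (id INTEGER PRIMARY KEY, setId INTEGER, coinId INTEGER);
-- legacy duplicate (setId, coinId) pair to exercise the defensive dedup
INSERT INTO SparkSetCoins (setId, coinId) VALUES (1, 7), (1, 7), (1, 8);
""")

def has_column(table, col):
    # explicit presence check instead of try/catch around ALTER
    return any(r[1] == col for r in db.execute(f"PRAGMA table_info({table})"))

with db:
    if not has_column("SparkSet", "complete"):
        db.execute("ALTER TABLE SparkSet "
                   "ADD COLUMN complete INTEGER NOT NULL DEFAULT 1")
    if not has_column("SparkSetCoins", "orderKey"):
        db.execute("ALTER TABLE SparkSetCoins "
                   "ADD COLUMN orderKey INTEGER NOT NULL DEFAULT 0")
    # dedup keeps MIN(id) per (setId, coinId) group, then the unique index
    db.execute("DELETE FROM SparkSetCoins WHERE id NOT IN "
               "(SELECT MIN(id) FROM SparkSetCoins GROUP BY setId, coinId)")
    db.execute("CREATE UNIQUE INDEX IF NOT EXISTS idx_sparksetcoins_set_coin "
               "ON SparkSetCoins(setId, coinId)")

print(db.execute("SELECT COUNT(*) FROM SparkSetCoins").fetchone()[0])  # 2
```

Because every step is guarded by a presence check (or IF NOT EXISTS), re-running the block on an already-migrated database is a no-op, matching the runs-on-every-open design.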

Verified against a realistic pre-migration DB (two finalized syncs, five link rows): reader output is byte-for-byte identical before and after migration; a subsequent post-migration incremental sync produces the expected hybrid layout.

Review strategy

The change is split into two commits with clear separation of concerns, so a reviewer can verify them independently:

1. 1d543d8 — schema + migration (+126 −5, 2 files)

Purely additive. Existing reader/writer/coordinator are untouched, so behavior for anyone running only this commit is unchanged. Verify the schema design and migration correctness in isolation before looking at any logic.

2. 90cd308 — behavioral fix (+424 −77, 4 files)

Writer, worker, reader filter/ordering, and coordinator rewrite. Public FiroCacheCoordinator signatures are preserved, so no callers in firo_wallet.dart, spark_interface.dart, or UI views need changes.

Test plan

The development environment didn't have the Flutter/Dart toolchain available, so I verified by replaying the writer's exact SQL against an in-memory sqlite3 CLI instance across 18 scenarios:

  • First sync from scratch, single sector
  • Incremental sync (finalized + delta)
  • Crash mid-download, resume from committed cursor
  • BlockHash shift between resume attempts → discard in-progress, restart
  • Integrity gate rejects under-count finalize → row stays hidden
  • Idempotent per-sector replay → no duplicates
  • Cross-set coin reuse → reader filters correctly, no duplicates
  • End-to-end coin ordering matches pre-fix byte-for-byte
  • Pre-migration rows (orderKey=0 tied) read correctly via PK tiebreaker
  • Migration is idempotent (runs safely on every open)
  • Duplicate (setId, coinId) dedup keeps MIN(id) per group before CREATE UNIQUE INDEX
  • Same-blockHash/different-size anomaly skipped
  • Writer refuses to append to already-finalized row
  • Empty-delta (blockHash advanced without new coins) skipped without writing same-size row
  • Reorg-style shrink skipped
  • External callers (firo_wallet.dart, spark_interface.dart, UI) use unchanged public signatures
  • Needs reviewer to run: flutter analyze / dart analyze pass
  • Needs reviewer to run: existing test suites still pass

🤖 Generated with Claude Code

reubenyap and others added 2 commits April 20, 2026 16:21
Adds the storage substrate that the resumable-sync fix will use in the
next commit. This commit on its own is purely additive — existing
reader queries and the existing writer are untouched, so behavior for
anyone running only this commit is unchanged.

Schema additions (both in the fresh-DB CREATE TABLE and in the
idempotent `_migrateSparkSetCacheDb` migration that runs on every
open):

  1. `SparkSet.complete INTEGER NOT NULL`: will gate reader visibility
     once the next commit wires it up. 0 = in-progress, 1 = finalized.
     Pre-migration rows default to 1 (legacy writer was all-or-nothing,
     so any existing row represents a finalized sync); fresh-DB new
     rows default to 0 and the writer will explicitly set complete=0
     until finalize.

  2. `SparkSetCoins.orderKey INTEGER NOT NULL DEFAULT 0`: will hold the
     server-side delta index of each coin so the reader can reconstruct
     server newest-first ordering end-to-end. Pre-migration rows
     default to 0; the reader's `ssc.id ASC` tiebreaker (added next
     commit) then sorts them in PK order, which is exactly the layout
     the pre-fix writer produced — preserving ordering byte-for-byte.

  3. `UNIQUE INDEX idx_sparksetcoins_set_coin ON SparkSetCoins(setId,
     coinId)`: required for INSERT OR IGNORE on the link table during
     resumable per-sector writes (added next commit). Before creating
     the index, any pre-existing duplicate (setId, coinId) rows are
     removed — keeping MIN(id) per group — so a legacy DB with
     unexpected duplicates can still upgrade. The pre-fix writer
     shouldn't have produced any (its INSERT was not OR IGNORE and
     would have thrown on collision), but scrubbing once is cheaper
     than failing to open the DB.

Migration uses explicit `PRAGMA table_info` / `sqlite_master` presence
checks rather than `try/catch` around the ALTER statements, so unrelated
SQLite errors don't get silently swallowed.

New reader helpers used by the resume logic in the next commit:

  * `_getIncompleteSetForGroupId(groupId)`: returns the newest
    complete=0 row for a group, if any.
  * `_countSetCoins(setId)`: count of links attached to a SparkSet.

Verified via sqlite3 CLI simulation: a realistic pre-migration DB (two
finalized syncs, 5 link rows) is byte-for-byte round-tripped through
the migration with no change in reader output. Dedup correctly
collapses a constructed 5-row table with duplicates to 3 unique rows
keeping MIN(id).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Problem

runFetchAndUpdateSparkAnonSetCacheForGroupId previously accumulated
every sector of a group's anonymity set in an in-memory Dart list and
only wrote to SQLite after the full delta had been fetched. A network
drop, app kill, or power loss mid-download discarded the in-memory
buffer and left no persisted progress, so the next sync attempt
recomputed the same meta.size - prevSize delta and re-downloaded
everything. For a large first-sync this burned the whole set
repeatedly.

Fix

The previous commit added the storage substrate: `SparkSet.complete`,
`SparkSetCoins.orderKey`, a UNIQUE index on SparkSetCoins(setId,
coinId), and the helper reader queries. This commit wires all of them
together.

Writer (firo_cache_writer.dart, firo_cache_worker.dart):

 * Removes _updateSparkAnonSetCoinsWith.
 * Adds _insertSparkAnonSetCoinsIncremental — per-sector write. Each
   call is an atomic SQLite transaction: INSERT OR IGNORE SparkSet
   with complete=0, INSERT OR IGNORE each coin and SparkSetCoins link
   with orderKey = startIndex + i. Readers don't see the row until
   finalize flips complete to 1.
 * Adds _markSparkAnonSetComplete — gated on a strict integrity check
   (COUNT(SparkSetCoins WHERE setId=?) == expectedLinkedCount). If the
   count disagrees, rolls back and leaves complete=0 so the next sync
   observes and resets the state. A partial or over-full cache never
   becomes the current set.
 * Adds _deleteIncompleteSparkSetsForGroup — discards in-progress rows
   (and their SparkSetCoins links) for a group. Used on blockHash
   shift or corruption detection. Finalized rows are not touched.
 * Defensive: _insertSparkAnonSetCoinsIncremental refuses to append to
   a row with complete=1 (catches a pathological case where the server
   reports the same blockHash/setHash as an already-finalized set).

Reader (firo_cache_reader.dart):

 * All anon-set-visible reads gain `WHERE ss.complete = 1`, so
   in-progress rows never leak partial coins to callers. This is
   load-bearing for libspark membership-proof construction: a spend
   initiated mid-sync must not see half-populated SparkSetCoins.
 * Coin ordering is preserved end-to-end via
   `ORDER BY ss.id ASC, ssc.orderKey DESC, ssc.id ASC`. The orderKey
   DESC sort yields server newest-first after the coordinator's
   existing Dart `.reversed`, matching the pre-fix behavior. The
   `ssc.id ASC` tiebreaker covers pre-migration rows whose orderKey
   defaults to 0: their PK order is the exact layout the old writer
   produced (it inserted coins in globally-reversed RPC order).
 * _getLatestSetInfoForGroupId gains a deterministic `ss.id DESC`
   tiebreaker for the same-size edge case.
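The filter and ordering clause can be sketched against a hand-built hybrid database: set 1 simulates pre-migration rows (orderKey tied at 0, PK order meaningful), set 2 post-migration rows, and set 3 an in-progress row that must stay invisible. Column names follow the PR; the sample rows are invented.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE SparkSet (id INTEGER PRIMARY KEY, complete INTEGER NOT NULL);
CREATE TABLE SparkSetCoins (id INTEGER PRIMARY KEY, setId INTEGER,
                            coinId INTEGER, orderKey INTEGER NOT NULL DEFAULT 0);
INSERT INTO SparkSet VALUES (1, 1), (2, 1), (3, 0);  -- set 3 is in-progress
INSERT INTO SparkSetCoins (setId, coinId, orderKey) VALUES
    (1, 10, 0), (1, 11, 0),   -- pre-migration: orderKey tied, PK tiebreaker
    (2, 20, 0), (2, 21, 1),   -- post-migration: orderKey meaningful
    (3, 99, 0);               -- must never leak to readers
""")
rows = db.execute("""
    SELECT ssc.coinId FROM SparkSet ss
    JOIN SparkSetCoins ssc ON ssc.setId = ss.id
    WHERE ss.complete = 1
    ORDER BY ss.id ASC, ssc.orderKey DESC, ssc.id ASC
""").fetchall()
print([r[0] for r in rows])  # [10, 11, 21, 20] -- coin 99 filtered out
```

Set 1 comes out in PK order (orderKey tied at 0), set 2 in orderKey-descending order, and the incomplete set's coin never appears.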

Coordinator (firo_cache_coordinator.dart):

 * Resumable sync loop: reads the in-progress row if any, verifies its
   blockHash and setHash match the current meta, then resumes at the
   count of already-linked coins. If blockHash/setHash disagree,
   discards the in-progress row and starts fresh. If the linked count
   exceeds the expected delta, treats as corrupt and resets.
 * Per-sector server response size is cross-checked against the
   request range; a mismatch aborts before persisting, leaving the
   cursor (and thus the resume point) unchanged.
 * Reorg defense: skip sync if the server reports a smaller size than
   the last finalized state.
 * Same-blockHash-different-size anomaly: skip sync rather than let
   INSERT OR IGNORE append unverified coins to the already-finalized
   row and leak them past its committed setHash.
 * Empty-delta case (blockHash advanced without new coins): do not
   write a new SparkSet row. A same-size row would create a tiebreaker
   ambiguity in _getLatestSetInfoForGroupId; any stray in-progress row
   from a prior attempt is cleared.
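The coordinator's resume decision is pure logic and can be sketched independently of the database. The `resume_plan` function and its dict shape below are hypothetical, not the actual Dart API; the branches mirror the cases described above.

```python
def resume_plan(in_progress, meta, expected_delta):
    """Decide how to start a sync given any in-progress row.

    in_progress: dict with blockHash/setHash/linkedCount, or None.
    Returns (action, cursor) where cursor is the resume offset.
    """
    if in_progress is None:
        return ("start", 0)
    if (in_progress["blockHash"] != meta["blockHash"]
            or in_progress["setHash"] != meta["setHash"]):
        return ("discard_and_restart", 0)       # blockHash/setHash shift
    if in_progress["linkedCount"] > expected_delta:
        return ("discard_and_restart", 0)       # over-linked: treat as corrupt
    return ("resume", in_progress["linkedCount"])  # cursor = COUNT(SparkSetCoins)

meta = {"blockHash": "h2", "setHash": "s2"}
print(resume_plan({"blockHash": "h2", "setHash": "s2", "linkedCount": 1200},
                  meta, 1500))  # ('resume', 1200)
print(resume_plan({"blockHash": "h1", "setHash": "s1", "linkedCount": 40},
                  meta, 1500))  # ('discard_and_restart', 0)
```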

No changes required to external callers — FiroCacheCoordinator's
public signatures are preserved.

Verification

 * sqlite3 CLI simulation of 14 scenarios: first sync, incremental
   sync, integrity-gate rejection, resume from committed cursor,
   blockhash-shift reset, idempotent sector replay, cross-set no
   duplicates, end-to-end ordering equivalence with pre-fix,
   pre-migration data read unchanged, and five focused safety-check
   tests. All pass.
 * Upgrade-path simulation against a pre-migration DB (two finalized
   syncs, five coin link-rows): reader output is byte-for-byte
   identical before and after migration; a subsequent post-migration
   incremental sync produces the correct hybrid layout.
 * External-caller audit (firo_wallet.dart, spark_interface.dart,
   settings/UI views): all callers use unchanged public signatures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@reubenyap
Author

Self-review — findings after cold adversarial read

Ran a second pass with a skeptical Dart/SQLite reviewer agent to catch things I missed. Below are all findings from both passes, triaged by actionability. TL;DR: no bugs found that this PR needs to fix before merge. Details:


False positives (analyzed and dismissed)

Claim: INSERT OR IGNORE on SparkSetCoins silently drops duplicates, wedging the sync.

The scenario requires the server to return the same coin at two different indices for a given blockHash. The Firo daemon's GetCoinsForRecovery in src/spark/state.cpp:1617-1674 walks blocks backward from the target blockHash and iterates block->sparkMintedCoins[id] (a vector of coins minted in that block). A counter variable is incremented once per coin, and a coin can only appear in one block's sparkMintedCoins (a mint has exactly one confirmation block). So within any single sector request, and across sectors of the same sync (which all use the same pinned blockHash), each coin has exactly one counter value — no duplicates.

For cross-sync consistency: the blockHash argument pins the server's index mapping (the walk starts by scrolling back from coinGroup.lastBlock to blockHash, then indexes from there). A reorg that invalidates our blockHash causes the server to throw "Incorrect blockHash provided", which our coordinator surfaces as a sync failure; on next sync the blockHash mismatch is detected and the in-progress row is discarded. No silent data mixing.

The INSERT OR IGNORE on SparkSetCoins is therefore belt-and-suspenders; switching to plain INSERT would fail loudly in the impossible case, but would also add risk (new failure modes from server hiccups that currently dedupe silently). Not changing.

Claim: Redundant coinGroupID assignment.

spark_interface.dart:608 and :614 both set setData["coinGroupID"] = i. Pre-existing, not touched by this PR. Worth cleaning up separately.

Claim: Reader ordering is only newest-first within a set, not across sets.

Traced end-to-end with the delta-storage model. For sets 1 (size 100 at H1) and 2 (delta 50 at H2=150):

  • Set 1 orderKeys 0..99 = H1-indices 0..99 = at H2 these map to indices 50..149.
  • Set 2 orderKeys 0..49 = H2-indices 0..49.
  • Reader emits [set1 orderKey 99..0, set2 orderKey 49..0] → H2-indices [149..50, 49..0].
  • Dart .reversed → H2-indices [0..49, 50..149] = server's newest-first at H2.

End-to-end order is correct across sets. The comment in the reader accurately describes the behavior.
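That trace can be checked numerically. A small Python replay of the index arithmetic above, using the same sizes (set 1: 100 coins at H1; set 2: 50-coin delta at H2 = 150):

```python
# Set 1 orderKeys 0..99 map to H2-indices 50..149; set 2 orderKeys 0..49
# map to H2-indices 0..49. The reader emits oldest set first, orderKey DESC.
set1_orderkeys = list(range(100))
set2_orderkeys = list(range(50))
reader_out = ([50 + k for k in reversed(set1_orderkeys)]   # H2-indices 149..50
              + [k for k in reversed(set2_orderkeys)])     # H2-indices 49..0
final = list(reversed(reader_out))  # coordinator's existing Dart `.reversed`
print(final == list(range(150)))    # True: server newest-first order at H2
```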


Pre-existing issues not introduced by this PR

Concurrency: main-isolate reads vs worker-isolate writes on the same SQLite file.

The sqlite3 package opens in rollback-journal mode by default; two handles on the same file can contend on the database-level lock. In my sync flow, main-isolate reads happen before the first worker task is dispatched, so the sync itself doesn't contend. But UI-initiated reads (getSparkCacheSize, getSetCoinsForGroupId from a spend) can race with a worker write and hit SQLITE_BUSY. This is pre-existing and my per-sector writes increase the exposure window (more, shorter transactions rather than one long one). A follow-up that enables PRAGMA journal_mode=WAL on open would resolve this broadly — worth considering in a separate PR.
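If that follow-up lands, it is roughly a one-line change at open time. A Python/sqlite3 sketch (the cache file path is illustrative); in WAL mode readers get snapshot views instead of contending on the rollback-journal database lock:

```python
import os
import sqlite3
import tempfile

# WAL requires a file-backed database, not :memory:.
path = os.path.join(tempfile.mkdtemp(), "spark_cache.sqlite3")
db = sqlite3.connect(path)
mode = db.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)  # 'wal' -- persists in the file; subsequent opens inherit it
```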

TOCTOU between getSetCoinsForGroupId and getLatestSetInfoForGroupId in spark_interface.dart.

spark_interface.dart:589 then :597 reads coins, then reads meta. Between the two awaits a concurrent sync could finalize a new SparkSet row. The read coins would be from the older set but info.setHash would be from the newer set, producing a hash mismatch in libspark. Pre-existing race. Worth fixing in a follow-up (either guard with the set lock or re-read consistency).

PRAGMA foreign_keys is not explicitly enabled.

The FK declarations on SparkSetCoins are decorative in rollback-journal mode with default pragmas. Pre-existing. _deleteIncompleteSparkSetsForGroup and _deleteAllCache both delete in the correct order (children before parents), so this doesn't bite today, but a future writer that deletes SparkSet without first clearing SparkSetCoins would leave dangling references silently.


Notes worth documenting but not changing

Progress-bar callback semantic inconsistency.

Initial call passes (prevSize, meta.size) — absolute ratio. Loop calls pass (cursor, numberOfCoinsToFetch) — delta-relative ratio. For a resumed sync with prevSize=0 and in-progress linked=1200 of 1500, the UI will display 0% initially and then jump to 100% after the first remaining sector. Pre-existing inconsistency in the callback shape; my PR preserves the shape to avoid changing caller expectations. A cleaner fix is (prevSize + linkedSoFar, meta.size) on the initial call, but that changes semantics.

DEFAULT 1 asymmetry between migration and fresh-DB CREATE TABLE.

Migrated DBs get complete INTEGER NOT NULL DEFAULT 1 (legacy rows are finalized). Fresh DBs get complete INTEGER NOT NULL DEFAULT 0. Any future writer that omits the complete column on INSERT will behave differently between the two paths. My writer always specifies complete = 0 explicitly, so this isn't a live hazard — but a footgun for future code.

Orphan SparkCoin rows on blockHash shift.

_deleteIncompleteSparkSetsForGroup cleans SparkSet and SparkSetCoins but leaves the underlying SparkCoin rows. On a subsequent sync at a new blockHash those coins are either re-linked (if still in the delta window, which they always are for normal sync patterns) or remain orphaned. Bounded disk-space impact. Could be swept by a GC pass.

_getIncompleteSetForGroupId LIMIT 1 masks multiple-incomplete-rows.

In pathological scenarios (multiple aborted syncs at different blockHashes) several incomplete rows could coexist. My query returns ORDER BY id DESC LIMIT 1 (newest) deterministically. The coordinator discards all mismatching incomplete rows via _deleteIncompleteSparkSetsForGroup when it sees a mismatch, so orphans are cleaned up on the next mismatched sync. Not an indefinite leak.


Testing asks for the reviewer

Three things this environment couldn't run:

  1. flutter analyze / dart analyze — to catch any Dart-level issues I missed.
  2. Existing test suite in test/ — none touch firo_cache_*, but the wallet-level tests may exercise the sync path.
  3. End-to-end regression test on a real testnet: sync, kill mid-download, restart, confirm resume. I simulated this in sqlite3 but a live run would exercise the isolate + electrumx + libspark integration.

Happy to address any of the above findings (e.g., enabling WAL, tightening the spark_interface race) in this PR or as follow-ups — let me know the preference.

🤖 Self-review generated with Claude Code

