
fix(firo): make Spark anon-set sync resumable and crash-safe#1298

Open
reubenyap wants to merge 2 commits into cypherstack:staging from reubenyap:fix/spark-anon-set-middle-path

Conversation

@reubenyap

Problem

runFetchAndUpdateSparkAnonSetCacheForGroupId accumulates every sector of a group's anonymity set in an in-memory Dart List and only writes to SQLite after the full delta has been fetched. A network drop, app kill, or power loss mid-download discards the in-memory buffer and leaves no persisted progress, so the next sync attempt recomputes the same meta.size - prevSize delta and re-downloads everything. For a large first sync this means re-downloading the entire set from scratch on every attempt.

Additionally, a related design — persisting partial sectors directly to SparkSet with a complete flag but not filtering reader queries on it — has a second failure mode: a spend initiated while a sync is running can observe a half-populated anonymity set and hand libspark a setHash-mismatched set, failing the membership proof.

Fix

Per-sector commits against an in-progress SparkSet row that stays invisible to readers (complete = 0) until a strict integrity check passes, at which point a single atomic UPDATE flips it to complete = 1. Partial data is never observable by readers, and the next sync resumes from the count of already-linked coins.

Design

  • Schema: adds SparkSet.complete (integrity gate) and SparkSetCoins.orderKey (server-side delta index, for ordering preservation), plus a UNIQUE INDEX on SparkSetCoins(setId, coinId) so per-sector INSERT OR IGNORE dedupes correctly under crash-recovery replay.
  • Migration: idempotent, runs on every open, uses explicit PRAGMA table_info / sqlite_master presence checks rather than try/catch around ALTER (so unrelated SQLite errors don't get silently swallowed). Defensive dedup of any legacy duplicate (setId, coinId) rows before creating the UNIQUE index. Existing SparkSet rows default to complete = 1 (the pre-fix writer was all-or-nothing, so they represent finalized syncs); existing SparkSetCoins.orderKey defaults to 0 and the reader's ssc.id ASC tiebreaker reproduces the pre-fix layout byte-for-byte.
  • Writer: three focused functions — per-sector append, mark-complete (gated on COUNT(SparkSetCoins) == expectedDelta), and delete-incomplete (for blockHash shift / corruption reset). Each runs in its own SQLite transaction.
  • Reader: all anon-set queries gain WHERE ss.complete = 1. Coin ordering reconstructed via ORDER BY ss.id ASC, ssc.orderKey DESC, ssc.id ASC — matches pre-fix end-to-end behavior for both pre-migration rows (orderKey tied at 0, PK tiebreaker) and post-migration rows (orderKey meaningful).
  • Coordinator: resumable loop reads any in-progress row, verifies its blockHash and setHash match the current server meta, resumes at COUNT(SparkSetCoins) if they do, discards and restarts if they don't. Sector-size cross-check refuses to persist a partial/over-full server response. Reorg-shrink and blockHash-advanced-without-new-coins cases are handled explicitly to avoid corrupting finalized state.
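As a concrete sketch of the schema and the per-sector write path, here is a minimal Python/sqlite3 replay. The table and column names (SparkSet.complete, SparkSetCoins.orderKey, the unique index) come from this PR; the cut-down DDL and the `append_sector` helper are illustrative stand-ins for the actual Dart writer.

```python
import sqlite3

# Cut-down schema mirroring the PR's description (real DDL may differ).
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE SparkSet (
    id        INTEGER PRIMARY KEY,
    groupId   INTEGER NOT NULL,
    blockHash TEXT NOT NULL,
    setHash   TEXT NOT NULL,
    complete  INTEGER NOT NULL DEFAULT 0   -- 0 = in-progress, 1 = finalized
);
CREATE TABLE SparkSetCoins (
    id       INTEGER PRIMARY KEY,
    setId    INTEGER NOT NULL,
    coinId   INTEGER NOT NULL,
    orderKey INTEGER NOT NULL DEFAULT 0    -- server-side delta index
);
CREATE UNIQUE INDEX idx_sparksetcoins_set_coin ON SparkSetCoins(setId, coinId);
""")

def append_sector(set_id, coins, start_index):
    """Per-sector commit: one atomic transaction, idempotent under replay."""
    with db:  # sqlite3's context manager wraps this in a transaction
        db.executemany(
            "INSERT OR IGNORE INTO SparkSetCoins (setId, coinId, orderKey) "
            "VALUES (?, ?, ?)",
            [(set_id, c, start_index + i) for i, c in enumerate(coins)],
        )

db.execute("INSERT INTO SparkSet (id, groupId, blockHash, setHash, complete) "
           "VALUES (1, 1, 'h1', 's1', 0)")
append_sector(1, [101, 102], 0)
append_sector(1, [101, 102], 0)  # crash-recovery replay: OR IGNORE dedupes
print(db.execute("SELECT COUNT(*) FROM SparkSetCoins").fetchone()[0])  # 2
```

The unique index is what makes the replayed sector a no-op rather than a duplicate insert.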

Integrity guarantees

| Scenario | Behavior |
| --- | --- |
| Crash mid-sector | SQLite transaction atomicity keeps the cache consistent with the last committed sector. Resume cursor derived from COUNT(SparkSetCoins). |
| Crash mid-finalize | Either complete=1 flipped or it did not; there is no intermediate state. Either way, the next sync resumes correctly. |
| BlockHash shift between attempts | In-progress row discarded; finalized rows untouched. Partial work re-downloaded at the new blockHash. |
| Server returns wrong sector size | Abort before persisting. Cursor unchanged. Retry on next sync. |
| Server reports smaller size than cached (reorg) | Skip sync, preserve cached state. |
| Server reports same blockHash with different size (anomaly) | Skip sync, refuse to corrupt the already-finalized row. |
| Finalize integrity check mismatches | Roll back; row stays complete=0. Next sync observes the over-linked row and resets. |
| Spend initiated during sync | Reader's complete=1 filter keeps in-progress data invisible. The spend sees the last finalized state. |
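The crash-mid-finalize and finalize-mismatch guarantees both reduce to finalize being a single gated transaction. A minimal Python/sqlite3 sketch, where `mark_complete` mirrors the described _markSparkAnonSetComplete behavior and the two-column schema is illustrative:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE SparkSet (id INTEGER PRIMARY KEY, complete INTEGER NOT NULL DEFAULT 0);
CREATE TABLE SparkSetCoins (id INTEGER PRIMARY KEY, setId INTEGER, coinId INTEGER);
""")
db.execute("INSERT INTO SparkSet (id) VALUES (1)")
db.executemany("INSERT INTO SparkSetCoins (setId, coinId) VALUES (1, ?)",
               [(1,), (2,)])

def mark_complete(set_id, expected_delta):
    with db:  # atomic: either the flag flips or nothing changes
        n = db.execute("SELECT COUNT(*) FROM SparkSetCoins WHERE setId = ?",
                       (set_id,)).fetchone()[0]
        if n != expected_delta:
            return False  # row stays complete=0, invisible to readers
        db.execute("UPDATE SparkSet SET complete = 1 WHERE id = ?", (set_id,))
        return True

print(mark_complete(1, 3))  # False: under-count, row stays hidden
print(mark_complete(1, 2))  # True: single gated UPDATE flips visibility
```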

Upgrade path

Fully automatic — no manual cache clear, no wallet rescan, no user action. The migration runs on first open of the new version:

  1. ALTER TABLE SparkSet ADD COLUMN complete INTEGER NOT NULL DEFAULT 1 (legacy rows are finalized).
  2. ALTER TABLE SparkSetCoins ADD COLUMN orderKey INTEGER NOT NULL DEFAULT 0.
  3. Defensive dedup, then CREATE UNIQUE INDEX.
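The three migration steps can be replayed against a throwaway sqlite3 database. In this Python sketch the presence check, the ALTER statements, and the MIN(id) dedup follow the PR's description; the base DDL, sample rows, and the `has_column` helper are illustrative:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE SparkSet (id INTEGER PRIMARY KEY);
CREATE TABLE SparkSetCoins (id INTEGER PRIMARY KEY, setId INTEGER, coinId INTEGER);
-- legacy duplicate (setId, coinId) pair to exercise the defensive dedup
INSERT INTO SparkSetCoins (setId, coinId) VALUES (1, 7), (1, 7), (1, 8);
""")

def has_column(table, col):
    # explicit presence check instead of try/catch around ALTER
    return any(r[1] == col for r in db.execute(f"PRAGMA table_info({table})"))

with db:
    if not has_column("SparkSet", "complete"):
        db.execute("ALTER TABLE SparkSet "
                   "ADD COLUMN complete INTEGER NOT NULL DEFAULT 1")
    if not has_column("SparkSetCoins", "orderKey"):
        db.execute("ALTER TABLE SparkSetCoins "
                   "ADD COLUMN orderKey INTEGER NOT NULL DEFAULT 0")
    # dedup keeps MIN(id) per (setId, coinId) group, then the unique index
    db.execute("DELETE FROM SparkSetCoins WHERE id NOT IN "
               "(SELECT MIN(id) FROM SparkSetCoins GROUP BY setId, coinId)")
    db.execute("CREATE UNIQUE INDEX IF NOT EXISTS idx_sparksetcoins_set_coin "
               "ON SparkSetCoins(setId, coinId)")

print(db.execute("SELECT COUNT(*) FROM SparkSetCoins").fetchone()[0])  # 2
```

Because every step is guarded by a presence check (or IF NOT EXISTS), re-running the block on an already-migrated database is a no-op, matching the runs-on-every-open design.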

Verified against a realistic pre-migration DB (two finalized syncs, five link rows): reader output is byte-for-byte identical before and after migration; a subsequent post-migration incremental sync produces the expected hybrid layout.

Review strategy

The change is split into two commits with clear separation of concerns, so a reviewer can verify them independently:

1. 1d543d8 — schema + migration (+126 −5, 2 files)

Purely additive. Existing reader/writer/coordinator are untouched, so behavior for anyone running only this commit is unchanged. Verify the schema design and migration correctness in isolation before looking at any logic.

2. 90cd308 — behavioral fix (+424 −77, 4 files)

Writer, worker, reader filter/ordering, and coordinator rewrite. Public FiroCacheCoordinator signatures are preserved, so no callers in firo_wallet.dart, spark_interface.dart, or UI views need changes.

Test plan

The development environment didn't have the Flutter/Dart toolchain available, so I verified by replaying the writer's exact SQL against an in-memory sqlite3 CLI instance across 18 scenarios:

  • First sync from scratch, single sector
  • Incremental sync (finalized + delta)
  • Crash mid-download, resume from committed cursor
  • BlockHash shift between resume attempts → discard in-progress, restart
  • Integrity gate rejects under-count finalize → row stays hidden
  • Idempotent per-sector replay → no duplicates
  • Cross-set coin reuse → reader filters correctly, no duplicates
  • End-to-end coin ordering matches pre-fix byte-for-byte
  • Pre-migration rows (orderKey=0 tied) read correctly via PK tiebreaker
  • Migration is idempotent (runs safely on every open)
  • Duplicate (setId, coinId) dedup keeps MIN(id) per group before CREATE UNIQUE INDEX
  • Same-blockHash/different-size anomaly skipped
  • Writer refuses to append to already-finalized row
  • Empty-delta (blockHash advanced without new coins) skipped without writing same-size row
  • Reorg-style shrink skipped
  • External callers (firo_wallet.dart, spark_interface.dart, UI) use unchanged public signatures
  • Needs reviewer to run: flutter analyze / dart analyze pass
  • Needs reviewer to run: existing test suites still pass

🤖 Generated with Claude Code

reubenyap and others added 2 commits April 20, 2026 16:21
Adds the storage substrate that the resumable-sync fix will use in the
next commit. This commit on its own is purely additive — existing
reader queries and the existing writer are untouched, so behavior for
anyone running only this commit is unchanged.

Schema additions (both in the fresh-DB CREATE TABLE and in the
idempotent `_migrateSparkSetCacheDb` migration that runs on every
open):

  1. `SparkSet.complete INTEGER NOT NULL`: will gate reader visibility
     once the next commit wires it up. 0 = in-progress, 1 = finalized.
     Pre-migration rows default to 1 (legacy writer was all-or-nothing,
     so any existing row represents a finalized sync); fresh-DB new
     rows default to 0 and the writer will explicitly set complete=0
     until finalize.

  2. `SparkSetCoins.orderKey INTEGER NOT NULL DEFAULT 0`: will hold the
     server-side delta index of each coin so the reader can reconstruct
     server newest-first ordering end-to-end. Pre-migration rows
     default to 0; the reader's `ssc.id ASC` tiebreaker (added next
     commit) then sorts them in PK order, which is exactly the layout
     the pre-fix writer produced — preserving ordering byte-for-byte.

  3. `UNIQUE INDEX idx_sparksetcoins_set_coin ON SparkSetCoins(setId,
     coinId)`: required for INSERT OR IGNORE on the link table during
     resumable per-sector writes (added next commit). Before creating
     the index, any pre-existing duplicate (setId, coinId) rows are
     removed — keeping MIN(id) per group — so a legacy DB with
     unexpected duplicates can still upgrade. The pre-fix writer
     shouldn't have produced any (its INSERT was not OR IGNORE and
     would have thrown on collision), but scrubbing once is cheaper
     than failing to open the DB.

Migration uses explicit `PRAGMA table_info` / `sqlite_master` presence
checks rather than `try/catch` around the ALTER statements, so unrelated
SQLite errors don't get silently swallowed.

New reader helpers used by the resume logic in the next commit:

  * `_getIncompleteSetForGroupId(groupId)`: returns the newest
    complete=0 row for a group, if any.
  * `_countSetCoins(setId)`: count of links attached to a SparkSet.

Verified via sqlite3 CLI simulation: a realistic pre-migration DB (two
finalized syncs, 5 link rows) is byte-for-byte round-tripped through
the migration with no change in reader output. Dedup correctly
collapses a constructed 5-row table with duplicates to 3 unique rows
keeping MIN(id).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Problem

runFetchAndUpdateSparkAnonSetCacheForGroupId previously accumulated
every sector of a group's anonymity set in an in-memory Dart list and
only wrote to SQLite after the full delta had been fetched. A network
drop, app kill, or power loss mid-download discarded the in-memory
buffer and left no persisted progress, so the next sync attempt
recomputed the same meta.size - prevSize delta and re-downloaded
everything. For a large first-sync this burned the whole set
repeatedly.

Fix

The previous commit added the storage substrate: `SparkSet.complete`,
`SparkSetCoins.orderKey`, a UNIQUE index on SparkSetCoins(setId,
coinId), and the helper reader queries. This commit wires all of them
together.

Writer (firo_cache_writer.dart, firo_cache_worker.dart):

 * Removes _updateSparkAnonSetCoinsWith.
 * Adds _insertSparkAnonSetCoinsIncremental — per-sector write. Each
   call is an atomic SQLite transaction: INSERT OR IGNORE SparkSet
   with complete=0, INSERT OR IGNORE each coin and SparkSetCoins link
   with orderKey = startIndex + i. Readers don't see the row until
   finalize flips complete to 1.
 * Adds _markSparkAnonSetComplete — gated on a strict integrity check
   (COUNT(SparkSetCoins WHERE setId=?) == expectedLinkedCount). If the
   count disagrees, rolls back and leaves complete=0 so the next sync
   observes and resets the state. A partial or over-full cache never
   becomes the current set.
 * Adds _deleteIncompleteSparkSetsForGroup — discards in-progress rows
   (and their SparkSetCoins links) for a group. Used on blockHash
   shift or corruption detection. Finalized rows are not touched.
 * Defensive: _insertSparkAnonSetCoinsIncremental refuses to append to
   a row with complete=1 (catches a pathological case where the server
   reports the same blockHash/setHash as an already-finalized set).

Reader (firo_cache_reader.dart):

 * All anon-set-visible reads gain `WHERE ss.complete = 1`, so
   in-progress rows never leak partial coins to callers. This is
   load-bearing for libspark membership-proof construction: a spend
   initiated mid-sync must not see half-populated SparkSetCoins.
 * Coin ordering is preserved end-to-end via
   `ORDER BY ss.id ASC, ssc.orderKey DESC, ssc.id ASC`. The orderKey
   DESC sort yields server newest-first after the coordinator's
   existing Dart `.reversed`, matching the pre-fix behavior. The
   `ssc.id ASC` tiebreaker covers pre-migration rows whose orderKey
   defaults to 0: their PK order is the exact layout the old writer
   produced (it inserted coins in globally-reversed RPC order).
 * _getLatestSetInfoForGroupId gains a deterministic `ss.id DESC`
   tiebreaker for the same-size edge case.
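The filter and ordering clause can be sketched against a hand-built hybrid database: set 1 simulates pre-migration rows (orderKey tied at 0, PK order meaningful), set 2 post-migration rows, and set 3 an in-progress row that must stay invisible. Column names follow the PR; the sample rows are invented.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE SparkSet (id INTEGER PRIMARY KEY, complete INTEGER NOT NULL);
CREATE TABLE SparkSetCoins (id INTEGER PRIMARY KEY, setId INTEGER,
                            coinId INTEGER, orderKey INTEGER NOT NULL DEFAULT 0);
INSERT INTO SparkSet VALUES (1, 1), (2, 1), (3, 0);  -- set 3 is in-progress
INSERT INTO SparkSetCoins (setId, coinId, orderKey) VALUES
    (1, 10, 0), (1, 11, 0),   -- pre-migration: orderKey tied, PK tiebreaker
    (2, 20, 0), (2, 21, 1),   -- post-migration: orderKey meaningful
    (3, 99, 0);               -- must never leak to readers
""")
rows = db.execute("""
    SELECT ssc.coinId FROM SparkSet ss
    JOIN SparkSetCoins ssc ON ssc.setId = ss.id
    WHERE ss.complete = 1
    ORDER BY ss.id ASC, ssc.orderKey DESC, ssc.id ASC
""").fetchall()
print([r[0] for r in rows])  # [10, 11, 21, 20] -- coin 99 filtered out
```

Set 1 comes out in PK order (orderKey tied at 0), set 2 in orderKey-descending order, and the incomplete set's coin never appears.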

Coordinator (firo_cache_coordinator.dart):

 * Resumable sync loop: reads the in-progress row if any, verifies its
   blockHash and setHash match the current meta, then resumes at the
   count of already-linked coins. If blockHash/setHash disagree,
   discards the in-progress row and starts fresh. If the linked count
   exceeds the expected delta, treats as corrupt and resets.
 * Per-sector server response size is cross-checked against the
   request range; a mismatch aborts before persisting, leaving the
   cursor (and thus the resume point) unchanged.
 * Reorg defense: skip sync if the server reports a smaller size than
   the last finalized state.
 * Same-blockHash-different-size anomaly: skip sync rather than let
   INSERT OR IGNORE append unverified coins to the already-finalized
   row and leak them past its committed setHash.
 * Empty-delta case (blockHash advanced without new coins): do not
   write a new SparkSet row. A same-size row would create a tiebreaker
   ambiguity in _getLatestSetInfoForGroupId; any stray in-progress row
   from a prior attempt is cleared.
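The coordinator's resume decision is pure logic and can be sketched independently of the database. The `resume_plan` function and its dict shape below are hypothetical, not the actual Dart API; the branches mirror the cases described above.

```python
def resume_plan(in_progress, meta, expected_delta):
    """Decide how to start a sync given any in-progress row.

    in_progress: dict with blockHash/setHash/linkedCount, or None.
    Returns (action, cursor) where cursor is the resume offset.
    """
    if in_progress is None:
        return ("start", 0)
    if (in_progress["blockHash"] != meta["blockHash"]
            or in_progress["setHash"] != meta["setHash"]):
        return ("discard_and_restart", 0)       # blockHash/setHash shift
    if in_progress["linkedCount"] > expected_delta:
        return ("discard_and_restart", 0)       # over-linked: treat as corrupt
    return ("resume", in_progress["linkedCount"])  # cursor = COUNT(SparkSetCoins)

meta = {"blockHash": "h2", "setHash": "s2"}
print(resume_plan({"blockHash": "h2", "setHash": "s2", "linkedCount": 1200},
                  meta, 1500))  # ('resume', 1200)
print(resume_plan({"blockHash": "h1", "setHash": "s1", "linkedCount": 40},
                  meta, 1500))  # ('discard_and_restart', 0)
```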

No changes required to external callers — FiroCacheCoordinator's
public signatures are preserved.

Verification

 * sqlite3 CLI simulation of 14 scenarios: first sync, incremental
   sync, integrity-gate rejection, resume from committed cursor,
   blockhash-shift reset, idempotent sector replay, cross-set no
   duplicates, end-to-end ordering equivalence with pre-fix,
   pre-migration data read unchanged, and five focused safety-check
   tests. All pass.
 * Upgrade-path simulation against a pre-migration DB (two finalized
   syncs, five coin link-rows): reader output is byte-for-byte
   identical before and after migration; a subsequent post-migration
   incremental sync produces the correct hybrid layout.
 * External-caller audit (firo_wallet.dart, spark_interface.dart,
   settings/UI views): all callers use unchanged public signatures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@reubenyap
Author

Self-review — findings after cold adversarial read

Ran a second pass with a skeptical Dart/SQLite reviewer agent to catch things I missed. Below are all findings from both passes, triaged by actionability. TL;DR: no bugs found that this PR needs to fix before merge. Details:


False positives (analyzed and dismissed)

Claim: INSERT OR IGNORE on SparkSetCoins silently drops duplicates, wedging the sync.

The scenario requires the server to return the same coin at two different indices for a given blockHash. The Firo daemon's GetCoinsForRecovery in src/spark/state.cpp:1617-1674 walks blocks backward from the target blockHash and iterates block->sparkMintedCoins[id] (a vector of coins minted in that block). A counter variable is incremented once per coin, and a coin can only appear in one block's sparkMintedCoins (a mint has exactly one confirmation block). So within any single sector request, and across sectors of the same sync (which all use the same pinned blockHash), each coin has exactly one counter value — no duplicates.

For cross-sync consistency: the blockHash argument pins the server's index mapping (the walk starts by scrolling back from coinGroup.lastBlock to blockHash, then indexes from there). A reorg that invalidates our blockHash causes the server to throw "Incorrect blockHash provided", which our coordinator surfaces as a sync failure; on next sync the blockHash mismatch is detected and the in-progress row is discarded. No silent data mixing.

The INSERT OR IGNORE on SparkSetCoins is therefore belt-and-suspenders; switching to plain INSERT would fail loudly in the impossible case, but would also add risk (new failure modes from server hiccups that currently dedupe silently). Not changing.

Claim: Redundant coinGroupID assignment.

spark_interface.dart:608 and :614 both set setData["coinGroupID"] = i. Pre-existing, not touched by this PR. Worth cleaning up separately.

Claim: Reader ordering is only newest-first within a set, not across sets.

Traced end-to-end with the delta-storage model. For sets 1 (size 100 at H1) and 2 (delta 50 at H2=150):

  • Set 1 orderKeys 0..99 = H1-indices 0..99 = at H2 these map to indices 50..149.
  • Set 2 orderKeys 0..49 = H2-indices 0..49.
  • Reader emits [set1 orderKey 99..0, set2 orderKey 49..0] → H2-indices [149..50, 49..0].
  • Dart .reversed → H2-indices [0..49, 50..149] = server's newest-first at H2.

End-to-end order is correct across sets. The comment in the reader accurately describes the behavior.
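That trace can be checked numerically. A small Python replay of the index arithmetic above, using the same sizes (set 1: 100 coins at H1; set 2: 50-coin delta at H2 = 150):

```python
# Set 1 orderKeys 0..99 map to H2-indices 50..149; set 2 orderKeys 0..49
# map to H2-indices 0..49. The reader emits oldest set first, orderKey DESC.
set1_orderkeys = list(range(100))
set2_orderkeys = list(range(50))
reader_out = ([50 + k for k in reversed(set1_orderkeys)]   # H2-indices 149..50
              + [k for k in reversed(set2_orderkeys)])     # H2-indices 49..0
final = list(reversed(reader_out))  # coordinator's existing Dart `.reversed`
print(final == list(range(150)))    # True: server newest-first order at H2
```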


Pre-existing issues not introduced by this PR

Concurrency: main-isolate reads vs worker-isolate writes on the same SQLite file.

The sqlite3 package opens in rollback-journal mode by default; two handles on the same file can contend on the database-level lock. In my sync flow, main-isolate reads happen before the first worker task is dispatched, so the sync itself doesn't contend. But UI-initiated reads (getSparkCacheSize, getSetCoinsForGroupId from a spend) can race with a worker write and hit SQLITE_BUSY. This is pre-existing and my per-sector writes increase the exposure window (more, shorter transactions rather than one long one). A follow-up that enables PRAGMA journal_mode=WAL on open would resolve this broadly — worth considering in a separate PR.
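If that follow-up lands, it is roughly a one-line change at open time. A Python/sqlite3 sketch (the cache file path is illustrative); in WAL mode readers get snapshot views instead of contending on the rollback-journal database lock:

```python
import os
import sqlite3
import tempfile

# WAL requires a file-backed database, not :memory:.
path = os.path.join(tempfile.mkdtemp(), "spark_cache.sqlite3")
db = sqlite3.connect(path)
mode = db.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)  # 'wal' -- persists in the file; subsequent opens inherit it
```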

TOCTOU between getSetCoinsForGroupId and getLatestSetInfoForGroupId in spark_interface.dart.

spark_interface.dart:589 then :597 reads coins, then reads meta. Between the two awaits a concurrent sync could finalize a new SparkSet row. The read coins would be from the older set but info.setHash would be from the newer set, producing a hash mismatch in libspark. Pre-existing race. Worth fixing in a follow-up (either guard with the set lock or re-read consistency).

PRAGMA foreign_keys is not explicitly enabled.

The FK declarations on SparkSetCoins are decorative in rollback-journal mode with default pragmas. Pre-existing. _deleteIncompleteSparkSetsForGroup and _deleteAllCache both delete in the correct order (children before parents), so this doesn't bite today, but a future writer that deletes SparkSet without first clearing SparkSetCoins would leave dangling references silently.


Notes worth documenting but not changing

Progress-bar callback semantic inconsistency.

Initial call passes (prevSize, meta.size) — absolute ratio. Loop calls pass (cursor, numberOfCoinsToFetch) — delta-relative ratio. For a resumed sync with prevSize=0 and in-progress linked=1200 of 1500, the UI will display 0% initially and then jump to 100% after the first remaining sector. Pre-existing inconsistency in the callback shape; my PR preserves the shape to avoid changing caller expectations. A cleaner fix is (prevSize + linkedSoFar, meta.size) on the initial call, but that changes semantics.

DEFAULT 1 asymmetry between migration and fresh-DB CREATE TABLE.

Migrated DBs get complete INTEGER NOT NULL DEFAULT 1 (legacy rows are finalized). Fresh DBs get complete INTEGER NOT NULL DEFAULT 0. Any future writer that omits the complete column on INSERT will behave differently between the two paths. My writer always specifies complete = 0 explicitly, so this isn't a live hazard — but a footgun for future code.

Orphan SparkCoin rows on blockHash shift.

_deleteIncompleteSparkSetsForGroup cleans SparkSet and SparkSetCoins but leaves the underlying SparkCoin rows. On a subsequent sync at a new blockHash those coins are either re-linked (if still in the delta window, which they always are for normal sync patterns) or remain orphaned. Bounded disk-space impact. Could be swept by a GC pass.

_getIncompleteSetForGroupId LIMIT 1 masks multiple-incomplete-rows.

In pathological scenarios (multiple aborted syncs at different blockHashes) several incomplete rows could coexist. My query returns ORDER BY id DESC LIMIT 1 (newest) deterministically. The coordinator discards all mismatching incomplete rows via _deleteIncompleteSparkSetsForGroup when it sees a mismatch, so orphans are cleaned up on the next mismatched sync. Not an indefinite leak.


Testing asks for the reviewer

Three things this environment couldn't run:

  1. flutter analyze / dart analyze — to catch any Dart-level issues I missed.
  2. Existing test suite in test/ — none touch firo_cache_*, but the wallet-level tests may exercise the sync path.
  3. End-to-end regression test on a real testnet: sync, kill mid-download, restart, confirm resume. I simulated this in sqlite3 but a live run would exercise the isolate + electrumx + libspark integration.

Happy to address any of the above findings (e.g., enabling WAL, tightening the spark_interface race) in this PR or as follow-ups — let me know the preference.

🤖 Self-review generated with Claude Code

