Skip to content

fix: recover stalled wallet syncs via idle watchdog and progress reorder#1296

Open
reubenyap wants to merge 7 commits intocypherstack:stagingfrom
reubenyap:claude/fix-firo-sync-stall-dP7vn
Open

fix: recover stalled wallet syncs via idle watchdog and progress reorder#1296
reubenyap wants to merge 7 commits intocypherstack:stagingfrom
reubenyap:claude/fix-firo-sync-stall-dP7vn

Conversation

@reubenyap
Copy link
Copy Markdown

@reubenyap reubenyap commented Apr 7, 2026

Summary

The base Wallet<T>._refresh() is shared across many wallet types. When any awaited sub-operation hangs indefinitely (unresponsive server, OS-suspended socket, wedged native-lib callback, etc.), refreshMutex is never released — the catch/finally blocks only run when the future completes or throws, and a permanently pending future does neither. Every subsequent periodic sync then bails out at the if (refreshMutex.isLocked) return check, leaving the wallet unable to sync until the app is force-closed.

The most visible manifestation is Firo getting stuck at ~65% (which is what prompted this fix), but the underlying cause is generic and affects all wallets routing through the shared _refresh() path.

Changes

1. Idle watchdog on _refresh()lib/wallets/wallet/wallet.dart

The refresh body is extracted to a new _doRefreshWork(viewOnly) helper. _refresh() races it against a Timer.periodic watchdog that trips if no refresh activity has been observed for 5 minutes (_refreshIdleThreshold). On trip, the watchdog throws TimeoutException → existing catchcompleter.completeError()finally releases refreshMutex. The next periodic sync (every 150s) gets a fresh attempt.

Activity is fed from two sources:

  • _fireRefreshPercentChange ticks — coarse phase checkpoints, plus per-sector progress during Spark anon-set downloads.
  • Successful electrum RPCs, via a new onRequestComplete hook on ElectrumXClient (see change Change name in README. #2).

This matters because updateTransactions() on a wallet with large history does hundreds of sequential getTransaction RPCs with no _fireRefreshPercentChange calls between 0.65 and 0.70. Without the electrum hook, a slow server could legitimately trigger a false timeout on an initial restore of a big wallet. With the hook, the watchdog stays fed as long as electrum is answering.

Caveats:

  • The watchdog does not cancel in-flight work; it only unblocks the outer future so the mutex can be released. The abandoned _doRefreshWork future keeps running. If it eventually throws, the error is swallowed via a no-op catchError attached in the finally block so it does not surface as an uncaught async error. Per-call hang detection remains the job of the underlying adapters (e.g. electrum's connectionTimeout / aliveTimerDuration at 60s each).

2. onRequestComplete hook on ElectrumXClientlib/electrumx_rpc/electrumx_client.dart

New nullable void Function()? onRequestComplete field, called after every successful RPC in both request() and batchRequest(). Invocations go through _fireOnRequestComplete(), which wraps the call in try/catch so a faulty hook cannot be misattributed as an RPC failure by the surrounding catch block (which would otherwise increment currentFailoverIndex and trigger a failover retry). _refresh() wires it to reset the watchdog timestamp while a refresh is active, then clears it in finally. Covers every current and future electrum-routed call automatically without instrumenting each wallet's updateTransactions.

3. Fix misleading progress reporting — lib/wallets/wallet/wallet.dart

Before: 0.6 → 0.65 → [await UTXOs] → 0.70 → [await txns]
After: 0.6 → [await UTXOs] → 0.65 → [await txns] → 0.70

Progress now reflects actual completion rather than jumping ahead before the awaited work finishes. Applies to all non-Namecoin wallets (the else branch).

Scope

Affected (wallet types routing through the base _refresh()):
BTC, LTC, BCH, Dash, Doge, Ecash, Ethereum, Firo, Cardano, Solana, Stellar, Tezos, Xelis, Namecoin, Bitcoin Frost, Fact0rn, Particl, Peercoin, Nano, Banano.

Unaffected (override refresh() entirely and bypass _refresh()):

  • Monero / Wownero / Salvium (each has its own refresh() override in lib_monero_wallet.dart, lib_wownero_wallet.dart, lib_salvium_wallet.dart)
  • MimbleWimbleCoin / Epic Cash
  • ETH tokens
  • SOL tokens

The same watchdog pattern could be applied to the Monero-family overrides in a follow-up if they exhibit similar mutex-stuck symptoms.

Commits

For easier review each concern is isolated to its own commit:

  1. refactor: extract _refresh() body into _doRefreshWork helper — pure mechanical extraction, no behavior change.
  2. fix: fire refresh progress updates after awaited work completes — progress reorder.
  3. fix: add idle watchdog to _refresh() so hung syncs release the mutex — core watchdog + onRequestComplete hook.
  4. fix: isolate onRequestComplete hook from RPC error handling — defensive try/catch around the hook invocation.
  5. fix: suppress uncaught errors from detached refresh work after watchdog trip — no-op catchError on the abandoned work future.
  6. refactor: promote refresh watchdog intervals to named constants_refreshIdleThreshold / _refreshWatchdogTick.

Test plan

  • Sync a Firo wallet and verify it completes past 65%
  • Verify progress bar advances smoothly without premature jumps
  • Kill the ElectrumX server mid-sync and verify the wallet recovers on next periodic refresh (mutex released)
  • Verify long legitimate initial syncs complete without the watchdog firing prematurely:
    • Firo: full Spark anon-set download on a constrained network (Spark per-sector progress keeps watchdog alive)
    • BTC/LTC/Doge/Dash: restore of a wallet with thousands of historical transactions (electrum onRequestComplete hook keeps watchdog alive)
  • Sanity check other coin types (ETH, Solana, Cardano) still sync normally

@reubenyap reubenyap force-pushed the claude/fix-firo-sync-stall-dP7vn branch 5 times, most recently from 3d943be to 9248798 Compare April 7, 2026 21:21
@reubenyap reubenyap force-pushed the claude/fix-firo-sync-stall-dP7vn branch from 9248798 to 835b5ab Compare April 20, 2026 07:00
@reubenyap reubenyap changed the title fix: prevent Firo sync stall at ~65% by adding timeouts and retry logic fix: prevent Firo sync stall at ~65% with refresh timeout and progress reorder Apr 20, 2026
@reubenyap reubenyap force-pushed the claude/fix-firo-sync-stall-dP7vn branch from 835b5ab to 712763c Compare April 20, 2026 07:13
@reubenyap reubenyap marked this pull request as ready for review April 20, 2026 07:15
@reubenyap reubenyap force-pushed the claude/fix-firo-sync-stall-dP7vn branch from 712763c to d96fffc Compare April 20, 2026 07:23
@reubenyap reubenyap changed the title fix: prevent Firo sync stall at ~65% with refresh timeout and progress reorder fix: recover stalled wallet syncs via refresh timeout and progress reorder Apr 20, 2026
@reubenyap reubenyap force-pushed the claude/fix-firo-sync-stall-dP7vn branch from d96fffc to 4ed8a8d Compare April 20, 2026 07:39
@reubenyap reubenyap changed the title fix: recover stalled wallet syncs via refresh timeout and progress reorder fix: recover stalled wallet syncs via idle watchdog and progress reorder Apr 20, 2026
@reubenyap reubenyap force-pushed the claude/fix-firo-sync-stall-dP7vn branch from 4ed8a8d to 4e43243 Compare April 20, 2026 07:54
reubenyap and others added 3 commits April 20, 2026 16:01
Pure refactor — no behaviour change. _refresh() is now focused on
mutex acquisition, event firing, and error/completion handling;
the sequenced work lives in _doRefreshWork(viewOnly).

This lets subsequent changes (e.g. wrapping the work in a watchdog)
touch only the helper call site instead of forcing a large
indentation shift of the entire body.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Before: 0.6 → 0.65 → [await UTXOs] → 0.70 → [await txns]
After:  0.6 → [await UTXOs] → 0.65 → [await txns] → 0.70

Previously the 0.65 and 0.70 updates fired immediately after kicking
off updateUTXOs and updateTransactions rather than after awaiting
them, making the progress bar appear stuck at those values while the
real work was still running. The Firo sync "stuck at 65%" reports
were the most visible manifestation.

Namecoin branch is unaffected (it already awaits before firing).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously, if any awaited sub-operation inside _refresh() hung
indefinitely (unresponsive server, OS-suspended socket, wedged
native-lib callback, etc.), refreshMutex was never released: the
catch/finally blocks only run when the future completes or throws,
and a permanently pending future does neither. Every subsequent
periodic sync then bailed out at the isLocked check, leaving the
wallet unable to sync until the app was force-closed. The Firo
"stuck at ~65%" reports were the most visible manifestation.

_refresh() now races _doRefreshWork(viewOnly) against a
Timer.periodic watchdog. The watchdog trips if no refresh activity
has been observed for 5 minutes, throwing TimeoutException through
the existing catch/finally so the mutex is released and the next
periodic sync can proceed.

"Activity" is fed from two sources:
  - _fireRefreshPercentChange calls (coarse phase checkpoints plus
    Spark per-sector progress during anon-set downloads), via
    _lastRefreshProgress.
  - Successful electrum RPCs, via a new onRequestComplete hook on
    ElectrumXClient. This covers the case where updateTransactions()
    on a big wallet does hundreds of sequential getTransaction RPCs
    with no _fireRefreshPercentChange calls between 0.65 and 0.70 —
    on a slow server that would otherwise look idle for >5 min and
    trip the watchdog falsely.

The hook is wired only when the wallet implements ElectrumXInterface,
cleared in _refresh()'s inner finally, and covers all current and
future electrum-routed calls automatically (request + batchRequest).

Per-call hang detection remains the responsibility of the underlying
adapters (e.g. electrum's connectionTimeout / aliveTimerDuration at
60s each). The watchdog only catches what slips through those layers.

Note: the watchdog does not cancel in-flight work; it only unblocks
the outer future so the mutex can be released. The orphaned work
eventually resolves or errors on its own.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@reubenyap reubenyap force-pushed the claude/fix-firo-sync-stall-dP7vn branch from 4e43243 to b91464a Compare April 20, 2026 08:04
Reuben Yap added 4 commits April 22, 2026 22:48
Wrap the hook call in _fireOnRequestComplete() which swallows (and
logs) exceptions. Previously a throw from the hook would land in the
surrounding catch block and be misattributed as an RPC failure,
incrementing currentFailoverIndex and triggering a failover retry
against the next server despite the RPC having already succeeded.
…og trip

When the watchdog wins the Future.any race, _doRefreshWork is abandoned
but keeps running (Dart has no cooperative cancellation). If it later
throws, the error lands on a detached future and surfaces as an
uncaught async error. Attach a no-op catchError to the work future in
the finally block so late failures are silently dropped.
Pull the 5-minute idle threshold and 30-second watchdog tick out of
the function body and onto the class as _refreshIdleThreshold and
_refreshWatchdogTick. Makes both values greppable and trivially
tunable.
…ot percent ticks

spark_interface.dart's _triggerEventHelper fires progress events directly
(refreshingPercent + GlobalEventBus) and does not call
_fireRefreshPercentChange, so it does not update _lastRefreshProgress.
Spark anon-set downloads are covered by the electrumXClient.onRequestComplete
hook because they use electrumXClient internally. Correct the comment so
future readers understand which path actually covers Spark.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant