fix: recover stalled wallet syncs via idle watchdog and progress reorder#1296
Open
reubenyap wants to merge 7 commits intocypherstack:stagingfrom
Open
fix: recover stalled wallet syncs via idle watchdog and progress reorder#1296reubenyap wants to merge 7 commits intocypherstack:stagingfrom
reubenyap wants to merge 7 commits intocypherstack:stagingfrom
Conversation
3d943be to
9248798
Compare
9248798 to
835b5ab
Compare
835b5ab to
712763c
Compare
712763c to
d96fffc
Compare
d96fffc to
4ed8a8d
Compare
4ed8a8d to
4e43243
Compare
Pure refactor — no behaviour change. _refresh() is now focused on mutex acquisition, event firing, and error/completion handling; the sequenced work lives in _doRefreshWork(viewOnly). This lets subsequent changes (e.g. wrapping the work in a watchdog) touch only the helper call site instead of forcing a large indentation shift of the entire body. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Before: 0.6 → 0.65 → [await UTXOs] → 0.70 → [await txns] After: 0.6 → [await UTXOs] → 0.65 → [await txns] → 0.70 Previously the 0.65 and 0.70 updates fired immediately after kicking off updateUTXOs and updateTransactions rather than after awaiting them, making the progress bar appear stuck at those values while the real work was still running. The Firo sync "stuck at 65%" reports were the most visible manifestation. Namecoin branch is unaffected (it already awaits before firing). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously, if any awaited sub-operation inside _refresh() hung
indefinitely (unresponsive server, OS-suspended socket, wedged
native-lib callback, etc.), refreshMutex was never released: the
catch/finally blocks only run when the future completes or throws,
and a permanently pending future does neither. Every subsequent
periodic sync then bailed out at the isLocked check, leaving the
wallet unable to sync until the app was force-closed. The Firo
"stuck at ~65%" reports were the most visible manifestation.
_refresh() now races _doRefreshWork(viewOnly) against a
Timer.periodic watchdog. The watchdog trips if no refresh activity
has been observed for 5 minutes, throwing TimeoutException through
the existing catch/finally so the mutex is released and the next
periodic sync can proceed.
"Activity" is fed from two sources:
- _fireRefreshPercentChange calls (coarse phase checkpoints plus
Spark per-sector progress during anon-set downloads), via
_lastRefreshProgress.
- Successful electrum RPCs, via a new onRequestComplete hook on
ElectrumXClient. This covers the case where updateTransactions()
on a big wallet does hundreds of sequential getTransaction RPCs
with no _fireRefreshPercentChange calls between 0.65 and 0.70 —
on a slow server that would otherwise look idle for >5 min and
trip the watchdog falsely.
The hook is wired only when the wallet implements ElectrumXInterface,
cleared in _refresh()'s inner finally, and covers all current and
future electrum-routed calls automatically (request + batchRequest).
Per-call hang detection remains the responsibility of the underlying
adapters (e.g. electrum's connectionTimeout / aliveTimerDuration at
60s each). The watchdog only catches what slips through those layers.
Note: the watchdog does not cancel in-flight work; it only unblocks
the outer future so the mutex can be released. The orphaned work
eventually resolves or errors on its own.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4e43243 to
b91464a
Compare
added 4 commits
April 22, 2026 22:48
Wrap the hook call in _fireOnRequestComplete() which swallows (and logs) exceptions. Previously a throw from the hook would land in the surrounding catch block and be misattributed as an RPC failure, incrementing currentFailoverIndex and triggering a failover retry against the next server despite the RPC having already succeeded.
…og trip When the watchdog wins the Future.any race, _doRefreshWork is abandoned but keeps running (Dart has no cooperative cancellation). If it later throws, the error lands on a detached future and surfaces as an uncaught async error. Attach a no-op catchError to the work future in the finally block so late failures are silently dropped.
Pull the 5-minute idle threshold and 30-second watchdog tick out of the function body and onto the class as _refreshIdleThreshold and _refreshWatchdogTick. Makes both values greppable and trivially tunable.
…ot percent ticks spark_interface.dart's _triggerEventHelper fires progress events directly (refreshingPercent + GlobalEventBus) and does not call _fireRefreshPercentChange, so it does not update _lastRefreshProgress. Spark anon-set downloads are covered by the electrumXClient.onRequestComplete hook because they use electrumXClient internally. Correct the comment so future readers understand which path actually covers Spark.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The base
Wallet<T>._refresh()is shared across many wallet types. When any awaited sub-operation hangs indefinitely (unresponsive server, OS-suspended socket, wedged native-lib callback, etc.),refreshMutexis never released — thecatch/finallyblocks only run when the future completes or throws, and a permanently pending future does neither. Every subsequent periodic sync then bails out at theif (refreshMutex.isLocked) returncheck, leaving the wallet unable to sync until the app is force-closed.The most visible manifestation is Firo getting stuck at ~65% (which is what prompted this fix), but the underlying cause is generic and affects all wallets routing through the shared
_refresh()path.Changes
1. Idle watchdog on
_refresh()—lib/wallets/wallet/wallet.dartThe refresh body is extracted to a new
_doRefreshWork(viewOnly)helper._refresh()races it against aTimer.periodicwatchdog that trips if no refresh activity has been observed for 5 minutes (_refreshIdleThreshold). On trip, the watchdog throwsTimeoutException→ existingcatch→completer.completeError()→finallyreleasesrefreshMutex. The next periodic sync (every 150s) gets a fresh attempt.Activity is fed from two sources:
_fireRefreshPercentChangeticks — coarse phase checkpoints, plus per-sector progress during Spark anon-set downloads.onRequestCompletehook onElectrumXClient(see change Change name in README. #2).This matters because
updateTransactions()on a wallet with large history does hundreds of sequentialgetTransactionRPCs with no_fireRefreshPercentChangecalls between 0.65 and 0.70. Without the electrum hook, a slow server could legitimately trigger a false timeout on an initial restore of a big wallet. With the hook, the watchdog stays fed as long as electrum is answering.Caveats:
_doRefreshWorkfuture keeps running. If it eventually throws, the error is swallowed via a no-opcatchErrorattached in thefinallyblock so it does not surface as an uncaught async error. Per-call hang detection remains the job of the underlying adapters (e.g. electrum'sconnectionTimeout/aliveTimerDurationat 60s each).2.
onRequestCompletehook onElectrumXClient—lib/electrumx_rpc/electrumx_client.dartNew nullable
void Function()? onRequestCompletefield, called after every successful RPC in bothrequest()andbatchRequest(). Invocations go through_fireOnRequestComplete(), which wraps the call in try/catch so a faulty hook cannot be misattributed as an RPC failure by the surrounding catch block (which would otherwise incrementcurrentFailoverIndexand trigger a failover retry)._refresh()wires it to reset the watchdog timestamp while a refresh is active, then clears it infinally. Covers every current and future electrum-routed call automatically without instrumenting each wallet'supdateTransactions.3. Fix misleading progress reporting —
lib/wallets/wallet/wallet.dartBefore:
0.6 → 0.65 → [await UTXOs] → 0.70 → [await txns]After:
0.6 → [await UTXOs] → 0.65 → [await txns] → 0.70Progress now reflects actual completion rather than jumping ahead before the awaited work finishes. Applies to all non-Namecoin wallets (the
elsebranch).Scope
Affected (wallet types routing through the base
_refresh()):BTC, LTC, BCH, Dash, Doge, Ecash, Ethereum, Firo, Cardano, Solana, Stellar, Tezos, Xelis, Namecoin, Bitcoin Frost, Fact0rn, Particl, Peercoin, Nano, Banano.
Unaffected (override
refresh()entirely and bypass_refresh()):refresh()override inlib_monero_wallet.dart,lib_wownero_wallet.dart,lib_salvium_wallet.dart)The same watchdog pattern could be applied to the Monero-family overrides in a follow-up if they exhibit similar mutex-stuck symptoms.
Commits
For easier review each concern is isolated to its own commit:
refactor: extract _refresh() body into _doRefreshWork helper— pure mechanical extraction, no behavior change.fix: fire refresh progress updates after awaited work completes— progress reorder.fix: add idle watchdog to _refresh() so hung syncs release the mutex— core watchdog +onRequestCompletehook.fix: isolate onRequestComplete hook from RPC error handling— defensive try/catch around the hook invocation.fix: suppress uncaught errors from detached refresh work after watchdog trip— no-opcatchErroron the abandoned work future.refactor: promote refresh watchdog intervals to named constants—_refreshIdleThreshold/_refreshWatchdogTick.Test plan
onRequestCompletehook keeps watchdog alive)