feat: centralized peer registry, transport abstraction, and LEGACYSYNCING removal #565
Draft
oskarszoon wants to merge 42 commits into bsv-blockchain:main from
Conversation
Benchmark Comparison Report
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-03-30 18:57 UTC
…CING removal

Introduce a centralized peer registry in the blockchain service that tracks peers across both HTTP (P2P/DataHub) and wire protocol (legacy Bitcoin) transports. This replaces the fragmented per-service peer tracking with a single source of truth for peer information, reputation scoring, and transport-aware catchup orchestration.

Key changes:

Centralized Peer Registry (blockchain service):
- Thread-safe in-memory registry with reputation scoring algorithm
- gRPC API: RegisterPeer, UpdatePeerMetrics, RemovePeer, ListPeers, GetPeer
- JSON file persistence with atomic writes and TTL-based cleanup
- PeerRegistryClientI interface with connection ownership tracking

Transport Abstraction (blockvalidation):
- CatchupTransport interface abstracting HTTP and wire protocol fetching
- HTTPTransport: extracts existing DataHub HTTP fetch logic
- WireTransport: delegates to Legacy service via gRPC for wire protocol
- Central registry polling for autonomous catchup orchestration

Legacy Service Integration:
- FetchHeadersFromPeer/FetchBlockFromPeer gRPC endpoints for wire protocol
- Dual-write to central registry on peer connect/disconnect/metrics
- One-shot request pattern with LoadOrStore for concurrency safety

FSM Simplification:
- Remove LEGACYSYNCING state; consolidate into CATCHINGBLOCKS
- Simplify P2P SyncCoordinator (remove Kafka catchup publishing)
- Legacy SyncManager now delegates catchup to BlockValidation
- All FSM transitions and references updated across codebase

Proto Changes:
- Add PeerRegistryService to blockchain_api.proto
- Add FetchHeadersFromPeer/FetchBlockFromPeer to legacy peer_api.proto
- Reserve removed LEGACYSYNCING/LEGACYSYNC enum values
- Regenerate all protobuf Go code
- Remove all LEGACYSYNCING/LEGACYSYNC references from miner guides, CLI docs, sync tutorials, dashboard docs, and protobuf docs
- Update mermaid sync state flow diagram for 3-state FSM
- Remove dead LEGACYSYNC event handling from dashboard UI
- Remove legacy sync button, state colors, and API functions from dashboard
- Fix pre-existing markdown lint issues (MD025, MD036, MD051)
- Delete obsolete fsm_legacy_sync PlantUML diagram and SVG
…s, sync coordinator

- Transition FSM to CATCHINGBLOCKS immediately when catchup starts (Step 1.5) so subtree validation stops processing network messages during sync
- Central registry poller: fast 3s initial interval for 10 attempts, then 30s
- Central registry poller: skip poll when catchup already in progress
- Central registry poller: prefer full nodes over pruned for catchup
- Propagate storage mode (full/pruned) to central registry via updateStorage
- SyncCoordinator: stop rotating peers when behind (defer to central poller)
- SyncCoordinator: don't clear sync peer in handleRunningState if already set
- Raise max accumulated headers to reach next checkpoint for quick validation
- Pass maxHeadersOverride through catchupGetBlockHeaders without mutating settings
- Add ban scoring with decay, threshold, auto-unban to CentralizedPeerRegistry
- Add BanConfig with defaults matching existing P2P BanManager
- Add gRPC RPCs: AddBanScore, IsPeerBanned, ListBannedPeers, ClearBannedPeers
- Add ReconsiderBadPeers for reputation recovery after cooldown
- Add StartBanDecay background goroutine for score decay
- Update PeerRegistryClientI with ban methods
- Update all mock implementations
- Track per-peer cooldowns after failed catchup attempts
- Exponential backoff: 30s, 60s, 120s, up to 5min max per peer
- Skip peers on cooldown, try next best peer instead
- Clear all cooldowns on successful catchup
- Remove peerRegistry, syncCoordinator, banManager, peerSelector from Server
- centralRegistry is now REQUIRED (checked in Init)
- All peer/ban/metrics ops go through centralRegistry gRPC
- Rewire all test files to use central registry mocks
- Add //go:build ignore to SyncCoordinator test files (code being removed)
- Adapt Server_test.go, server_handler_test.go, report_invalid_block_test.go
BlockValidation now reports catchup metrics (success, failure, malicious, attempt) directly to the central registry via UpdatePeerMetrics instead of routing through P2P service RPCs. p2pClient is retained for non-metric operations (GetPeersForCatchup, RecordBytesDownloaded, parallel fetch).
Call peerRegistry.StartBanDecay(ctx) during blockchain service startup so ban scores automatically decay over time (1 point/minute).
… PeerSelector

Remove files replaced by the centralized registry in blockchain service:
- peer_registry.go, peer_registry_cache.go and their tests
- sync_coordinator.go and all related tests
- BanManager.go, BanManager_test.go
- peer_selector.go, peer_selector_test.go
- peer_registry_reputation_test.go

Fix remaining references: remove MockPeerBanManager and BanReason refs, use string reason constants, skip tests needing a local registry rewrite.
Ban management tests (18 tests in peer_registry_ban_test.go):
- AddBanScore: scoring, threshold, decay, peer info sync, config lookup
- IsBannedPeer: not banned, banned, auto-unban on expiry
- ListBannedPeers: empty, returns only banned
- ClearBannedPeers: clears all, resets peer info
- ReconsiderBadPeers: old failures reset, recent failures kept, count
- decayBanScores: decay over time, zero-score cleanup, banned entry kept
- StartBanDecay: context cancellation

Catchup poller tests (20 tests in central_registry_poller_test.go):
- nextCooldownForPeer: exponential backoff 30s-5min, per-peer tracking
- selectBestPeersFromCentralRegistry: height filter, full>pruned sort, wire protocol
- pollCentralRegistry: no peers, isCatchingUp skip, cooldown skip, nil hash, error handling, expired cooldown retry
…s wiring

- Fix TestCatchup_FSMStateManagement: setFSMCatchingBlocks now calls GetFSMCurrentState to check if already in CATCHINGBLOCKS before transitioning. Update mock to expect RUNNING state first.
- Fix TestServerInit* tests: Init() now requires centralRegistry to be set. Add newPermissiveMockRegistry() helper and SetCentralPeerRegistry calls.
- Wire P2P ban settings (BanThreshold, BanDuration) from settings.conf to the central registry's BanConfig instead of using the 24h default. Fixes TestPeerIDBanExpirationE2E smoketest.
…eer ID encoding

- Add GetFSMCurrentState mock to setupTestCatchupServer (both instances) for early Step 1.5 FSM transition in catchup
- Add centralRegistry to all 47 Server struct literals in P2P tests
- Fix TestServer_GetPeer: use peerID.String() for mock expectations (peer.ID("non-existent").String() != "non-existent" due to base58 encoding)
- Fix mockPeerRegistryClient ban methods to actually call m.Called() instead of returning hardcoded values (IsPeerBanned, AddBanScore, etc.)
- Fix TestIsBannedChecksBothBanSystems: remove extra context arg from IsPeerBanned mock expectation
- Fix TestCatchup_FSMStateManagement: filter permissive GetFSMCurrentState before setting ordered .Once() expectations
- Fix TestCatchup/Empty_Catchup_Headers: add FSM mocks to standalone Server setup (CatchUpBlocks, GetFSMCurrentState, Run)
- Fix gci formatting in central_registry_test.go
- Add FSM mocks (CatchUpBlocks, GetFSMCurrentState, Run) to the createServerWithEnhancedCatchup helper in catchup_test.go for TestCatchupIntegrationScenarios/Context_Cancellation_During_Catchup
Must-fix:
- Bounded worker pool (4 workers, chan size 256) replaces unbounded fire-and-forget goroutines for central registry updates in P2P
- List() now checks ban expiry via banScores instead of the stale IsBanned field
- TransportType only updated when TransportTypeSet=true (fixes wire peer reset)
- blockHashToBytes returns a defensive copy to avoid slice aliasing

Should-fix:
- Throttle updatePeerLastMessageTime (30s cooldown per peer)
- Remove duplicate addConnectedPeer (identical to addPeer)
- Simplify nextCooldownForPeer with bit shift math
- Document single-goroutine invariant on cooldown maps
- Add TODO for getPeerIDFromDataHubURL efficiency
- Fix waitForLegacyMockCalls race condition with atomic counter
- Remove dead shouldSkipDuringSync and its test

Nice-to-have:
- Remove redundant nil checks in handle_catchup_metrics.go
- Improve WireTransport error messages
- Rename baseURL to peerEndpoint in CatchupTransport interface
- Improve stub RPC logging (Debug -> Info for no-ops)
100 new test cases across 6 test files:

P2P handle_catchup_metrics_test.go (30 tests):
- All 12 RPC handlers tested with success, nil registry, error propagation

Blockchain peer_registry_grpc_test.go (27 tests):
- All 9 gRPC handlers, proto conversion round-trips, hash aliasing

Blockchain peer_registry_client_test.go (6 tests):
- ownsConn pattern: close with/without ownership, idempotent

Blockchain peer_registry_persistence_additional_test.go (8 tests):
- Full field round-trip, atomic rename, file permissions, TTL expiry

BlockValidation http_transport_additional_test.go (15 tests):
- FetchBlock/FetchBlocks/FetchSubtree/FetchSubtreeData coverage

BlockValidation wire_transport_additional_test.go (14 tests):
- FetchBlock/FetchHeaders, maxHeaders truncation, unsupported methods
peer_metrics_helpers_test.go (35 tests):
- All 8 helper functions: reportCatchup*, isPeerMalicious, isPeerBad, reportValidBlockForPeers with nil/empty/error coverage

peer_registry_client_additional_test.go (22 tests):
- Full gRPC client-server integration via real TCP listener
- All 9 client methods, full lifecycle, context cancellation

server_registry_test.go (8 tests):
- Stop saves registry, savePeerRegistryPeriodically, reload round-trip
…trict

Two issues:
- Central registry Register() never set LastMessageTime on peers
- Dashboard filtered peers with last_message_time > 1min ago, but the field was always 0, filtering out every peer

Fixes:
- Set LastMessageTime=now on both new and updated peers in Register()
- Relax dashboard filter: include peers with last_message_time=0 (recently registered), extend timeout to 5 minutes
…ng 100k

The checkpoint-aware header cap only raised the limit but never lowered it. With the default of 100k and the first testnet checkpoint at height 546, catchup always downloaded 100k headers before starting block validation.

Now the cap is SET to the checkpoint-based value (not just raised), so:
- Height 0, checkpoint at 546: cap = 10,546 (~1 iteration, blocks start in ~9s)
- Height 546, checkpoint at 100k: cap = 109,454 (~11 iterations)
- Height 100k, checkpoint at 200k: cap = 110,000 (~11 iterations)

If the checkpoint is further than 100k away, the cap is raised as before.
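The three examples above all follow one formula: cap = (next checkpoint - current height) + buffer. This sketch reproduces those numbers; the function name is illustrative, and the +10,000 buffer matches this commit (a later commit in this PR reduces it to +100).

```go
package main

import "fmt"

// buffer is the headroom added past the next checkpoint (this commit's value).
const buffer = 10_000

// headerCap SETS the accumulated-header cap from the next checkpoint,
// rather than only ever raising it above the 100k default.
func headerCap(currentHeight, nextCheckpoint uint32) uint32 {
	return nextCheckpoint - currentHeight + buffer
}

func main() {
	fmt.Println(headerCap(0, 546))           // 10546
	fmt.Println(headerCap(546, 100_000))     // 109454
	fmt.Println(headerCap(100_000, 200_000)) // 110000
}
```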
…s during catchup

Header optimization:
- Request only as many headers as needed to reach the accumulated cap, instead of always requesting maxBlockHeadersPerRequest (10,000). For a checkpoint at height 546, the first request asks for 546 headers instead of 10,000: a single iteration instead of two.

Quick validation fix:
- Skip checkOldBlockIDs during catchup mode (IsCatchupMode=true). Blocks verified by the checkpoint chain are guaranteed to be on the correct chain. The check was failing because quickValidateBlockAsync partially commits blocks (AddBlock) before UTXO processing, making the block ID chain temporarily inconsistent when quick validation falls back to normal validation.
…ng catchup

Header cap: change checkpoint buffer from +10,000 to +100. For a checkpoint at height 546 from height 0, the cap is now 646 (single request) instead of 10,546 (two requests with 10k first).

Pruned peer handling: BLOCK_INCOMPLETE errors (pruned node can't provide full block data) now put the peer on a 24h cooldown instead of 30s. A pruned node will never have genesis block data, so retrying is pointless. The poller immediately tries the next peer on its 3s tick.
Don't toggle FSM CATCHING→RUNNING→CATCHING between consecutive catchup rounds when the node is still significantly behind the target. This avoids unnecessary FSM state change notifications to all subscribers and prevents subtree validation from briefly activating between rounds. Only restore to RUNNING when within 100 blocks of the target.
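The round-boundary decision above reduces to one comparison. This is a sketch under stated assumptions: the function name and the exact comparison shape are illustrative, with only the 100-block window taken from the commit.

```go
package main

import "fmt"

// nearTargetWindow: only restore RUNNING within this many blocks of target.
const nearTargetWindow = 100

// shouldRestoreRunning reports whether the FSM may return to RUNNING
// between catchup rounds, or should stay in CATCHINGBLOCKS to avoid
// notifying subscribers and briefly reactivating subtree validation.
func shouldRestoreRunning(localHeight, targetHeight uint32) bool {
	if targetHeight <= localHeight {
		return true // caught up (or ahead): safe to return to RUNNING
	}
	return targetHeight-localHeight <= nearTargetWindow
}

func main() {
	fmt.Println(shouldRestoreRunning(950, 1000)) // true: within 100 blocks
	fmt.Println(shouldRestoreRunning(0, 1000))   // false: stay in CATCHINGBLOCKS
}
```

The guard for targetHeight <= localHeight also keeps the unsigned subtraction from underflowing.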
When a peer returns incomplete blocks (pruned node), instead of failing the entire catchup round and re-downloading headers from scratch, try alternative peers from the central registry using the same validated headers. This avoids wasting the header fetch time (~8s per checkpoint) and keeps the node in CATCHINGBLOCKS without FSM toggling. Flow: headers fetched once → try best peer for blocks → if pruned, try next peer → continue until a full node is found or all peers fail.
highestCheckpointHeight was set from ALL configured checkpoints (up to 900k+), causing every block to be quick-validated regardless of which checkpoints were actually verified in the current header batch. Now tracks the highest checkpoint actually verified within the fetched header range. For the first catchup round (headers 1-646, checkpoint at 546), only blocks 1-546 get quick validation. Blocks 547+ get normal validation until the next checkpoint is verified.
The poller switched to 30s interval after the first triggered catchup, causing a 30s gap between successive catchup rounds even when the node was still far behind. Now the poller checks if we're still behind after each completed round and resets to the fast 3s interval if so. The 30s interval is only used for the steady-state monitoring once we've caught up with all peers.
When a catchup round targets a known checkpoint, fetch headers by walking
backwards from the checkpoint hash using /headers/{hash}?n=1000 instead
of the expensive headers_from_common_ancestor endpoint.
The /headers endpoint walks the parent chain directly (~0.4s per 1000
headers) while headers_from_common_ancestor computes a common ancestor
first (0.4-16s unpredictably). For checkpoint 546 from genesis, this
is a single 0.4s request vs 8-16s.
For blocks beyond the last checkpoint, falls back to the original
headers_from_common_ancestor method since there's no known hash
to walk backwards from.
Headers are fetched in reverse (newest→oldest) then reversed to
produce the forward-order chain needed for validation.
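The final reversal step is a standard in-place slice reverse. This sketch simulates the newest-to-oldest collection order and the flip to forward order; the `header` type is a simplified placeholder.

```go
package main

import "fmt"

// header is a simplified placeholder for a block header.
type header struct {
	height uint32
}

// reverseInPlace flips a newest-first header slice into the oldest-first
// order that chain validation expects.
func reverseInPlace(hs []header) {
	for i, j := 0, len(hs)-1; i < j; i, j = i+1, j-1 {
		hs[i], hs[j] = hs[j], hs[i]
	}
}

func main() {
	// Simulate walking backwards from a checkpoint at height 5 to height 1.
	var collected []header
	for h := uint32(5); h >= 1; h-- {
		collected = append(collected, header{height: h})
	}
	reverseInPlace(collected)
	fmt.Println(collected[0].height, collected[len(collected)-1].height) // 1 5
}
```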
…fetching"

This reverts commit c97a654.
Catchup rounds take 30-60 seconds (headers + block validation). Reporting this as the peer's response time gives a 36s avg which triggers the 0.6x speed penalty in the reputation algorithm, dropping a 100% success peer from ~80 to ~54 reputation. Pass 0 for responseTimeMs so catchup duration doesn't skew the speed-based reputation multiplier.
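A minimal sketch of why reporting 0 works: zero samples are skipped when folding into the running response-time average, so a 30-60s catchup round never drags the average past a slow-peer threshold. The 30s threshold and the running-average shape here are assumptions for illustration; only the 0.6x penalty figure comes from the commit above.

```go
package main

import "fmt"

// speedMultiplier applies the reputation speed penalty for slow peers.
func speedMultiplier(avgResponseMs float64) float64 {
	if avgResponseMs > 30_000 { // assumed slow-response threshold
		return 0.6 // penalty figure quoted in the commit
	}
	return 1.0
}

// updateAvg folds a new sample into a running average, skipping zero
// samples so catchup-round durations don't skew the speed multiplier.
func updateAvg(avgMs float64, n int, sampleMs float64) (float64, int) {
	if sampleMs == 0 {
		return avgMs, n
	}
	return (avgMs*float64(n) + sampleMs) / float64(n+1), n + 1
}

func main() {
	avg, n := 400.0, 10          // healthy 400ms average over 10 samples
	avg, n = updateAvg(avg, n, 0) // catchup round reports 0: no change
	fmt.Println(avg, n, speedMultiplier(avg)) // 400 10 1
}
```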
- Use UTXO store height as CurrentHeight fallback (was showing 0 during header fetch because catchupCtx.currentHeight is only set after findCommonAncestor)
- Add Phase field to catchup status: "downloading_headers", "validating_blocks", or "finalizing"
- Dashboard shows "Downloading headers..." during header fetch phase instead of "0 / 0 blocks"
- Rename "Starting Height" to "Current Height" in dashboard
…try wiring

- Asset GetPeers now queries blockchain PeerRegistryService directly instead of routing through the P2P service. Works with legacy-only mode.
- Fix legacy SetCentralPeerRegistry: store the client on the outer Server struct and apply it to the inner server during Start() (it was nil because the inner server doesn't exist when the daemon calls SetCentralPeerRegistry before Start)
- Legacy height updates now record success interactions for reputation
- Add debug logging for legacy peer registration success/failure
…ming

Instead of BlockValidation driving wire-protocol sync block-by-block through FetchHeaders/FetchBlock RPCs (WireTransport), catchup is now delegated to Legacy's existing sync pipeline via a new DelegateCatchup streaming RPC. Legacy runs its normal headers-first sync and reports progress back to BlockValidation, which manages FSM transitions, the catchup lock, and dashboard progress display. If Legacy is already syncing when the delegation request arrives, it attaches to the running sync.

Also fixes: peer reputation for legacy peers (records interaction success on each accepted block), peer name prefix (legacy: prefix in central registry), and BlockHash population for wire-protocol peers.


Summary
- Wire-protocol catchup is delegated to Legacy via a new `DelegateCatchup` streaming gRPC. Legacy runs its existing headers-first sync pipeline and streams progress back. If Legacy is already syncing when the request arrives, it attaches to the running sync rather than starting a duplicate.
- Progress (`blocksValidated`/`blocksFetched`), current height, and phase are reported to the dashboard during delegated catchup.
- Legacy peers are prefixed with `legacy:` in the peer registry for easy identification.
- Legacy peer reputation: the `onBlockAccepted` callback drives reputation above the 50.0 baseline.
- `FetchHeadersFromPeer`/`FetchBlockFromPeer` gRPC endpoints for ad-hoc fetches, `DelegateCatchup` streaming RPC for full sync delegation, dual-write to the central registry on connect/disconnect/metrics.
- `LEGACYSYNCING` state removed, consolidated into `CATCHINGBLOCKS`; simplified P2P SyncCoordinator and Legacy SyncManager delegation.
- Wire-protocol peers route to `catchupViaLegacy`, HTTP peers to the existing `catchup` pipeline.

Key Design Decisions
- Rather than driving sync block-by-block through RPCs (`WireTransport`), catchup is delegated to Legacy's existing sync pipeline. This avoids reimplementing headers-first sync in BlockValidation and lets Legacy use its battle-tested wire protocol handling.
- Legacy's `startSync` often fires before the BlockValidation poller, so `RunDelegatedCatchup` attaches to the running sync by setting `delegated.active` and waiting for completion; no restart needed.
- Legacy's `startSync`, `handleCheckSyncPeer`, and `blockHandler` no longer drive FSM transitions (RUN events), since BlockValidation owns FSM state.

Architecture
Test Plan
- `make test` passes (unit tests including peer registry, transport, catchup, legacy registry)
- `make smoketest` passes
- `LEGACYSYNCING` fully removed: `grep -r LEGACYSYNCING` returns no Go/proto hits
- Legacy peers appear as `legacy:`/`Bitcoin SV:x.x.x/` with increasing reputation