Draft: Fix deadlock between DataNode createDataRegion and ConfigNode PipeTaskCoordinatorLock by delegating consensus pipe lifecycle to ConfigNode #17233

Open
Pengzna wants to merge 8 commits into apache:master from Pengzna:latest-it

Pengzna commented Feb 28, 2026

NOTE: This PR was generated by Claude. I proposed the full plan and prompts and have carefully reviewed this PR.

Problem

When using IoTConsensusV2, cluster initialization may deadlock during DataRegion creation (e.g., for root.__audit), causing "After 30 times retry, the cluster can't work!".

Root cause: Circular dependency between DataNode and ConfigNode:

  1. ConfigNode holds PipeTaskCoordinatorLock → **pushSinglePipeMeta** to DataNode (blocking)
    DataNodeInternalRPCServiceImpl.pushSinglePipeMeta
    → PipeTaskAgent.handleSinglePipeMetaChanges
    → PipeTaskAgent.createPipe
    → PipeDataNodeTaskBuilder.build
    → IoTDBDataRegionSource.customize
    → IoTDBDataRegionSource.login
    → SessionManager.login
    → DataNodeAuthUtils.recordPasswordHistory
    → Coordinator.executeForTreeModel (INSERT audit data)
    → ClusterPartitionFetcher.getOrCreateDataPartition
    → ConfigNodeClient.getOrCreateDataPartitionTable (sync RPC call ConfigNode)
    → SocketTimeoutException: Read timed out
  2. DataNode's **createDataRegion** handler → PipeConsensusServerImpl constructor → synchronous createPipe RPC back to ConfigNode → blocked by the same lock → timeout
    ConfigNode CreateRegionGroupsProcedure → DataNodeRegionManager.createDataRegion (line 157)
    → PipeConsensus.createLocalPeer
    → PipeConsensusServerImpl constructor (line 124)
    → createConsensusPipes
    → ConsensusPipeDataNodeDispatcher.createPipe (line 66)
    → ConfigNodeClient.createPipe (sync RPC call ConfigNode)
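
The circular wait described above can be reproduced in miniature. The following is only a sketch under simplifying assumptions: a plain `ReentrantLock` stands in for PipeTaskCoordinatorLock, a single-thread executor stands in for the DataNode RPC handler, and a timed `tryLock` stands in for the RPC read timeout. The class and method names are hypothetical and do not exist in IoTDB.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class PipeLockCycleSketch {
  // Stand-in for ConfigNode's PipeTaskCoordinatorLock (illustrative model only).
  static final ReentrantLock COORDINATOR_LOCK = new ReentrantLock();

  // Models a DataNode-side RPC back to ConfigNode that needs the coordinator lock.
  // Returning false plays the role of the "Read timed out" failure.
  static boolean callBackIntoConfigNode(long timeoutMs) throws InterruptedException {
    if (!COORDINATOR_LOCK.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
      return false;
    }
    try {
      return true;
    } finally {
      COORDINATOR_LOCK.unlock();
    }
  }

  // Models ConfigNode holding the lock while it waits synchronously on the DataNode.
  static boolean pushWhileHoldingLock(ExecutorService dataNode, long timeoutMs)
      throws Exception {
    COORDINATOR_LOCK.lock();
    try {
      // The DataNode handler runs on another thread and calls back for the same lock.
      return dataNode.submit(() -> callBackIntoConfigNode(timeoutMs)).get();
    } finally {
      COORDINATOR_LOCK.unlock();
    }
  }

  public static void main(String[] args) throws Exception {
    ExecutorService dataNode = Executors.newSingleThreadExecutor();
    try {
      System.out.println(
          "callback while lock is held: " + pushWhileHoldingLock(dataNode, 100));
      System.out.println(
          "callback with lock free: "
              + dataNode.submit(() -> callBackIntoConfigNode(100)).get());
    } finally {
      dataNode.shutdown();
    }
  }
}
```

In this model the callback can only succeed once the caller has released the lock, which is why removing the DN→CN callback (rather than lengthening the timeout) breaks the cycle.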

Solution

Delegate all consensus pipe lifecycle management to ConfigNode, eliminating DN→CN synchronous pipe RPCs entirely.

ConfigNode side:

  • AddRegionPeerProcedure: add CREATE_CONSENSUS_PIPES state — creates bidirectional pipes between new peer and existing peers before DO_ADD_REGION_PEER
  • RemoveRegionPeerProcedure: add DROP_CONSENSUS_PIPES state — drops related pipes after DELETE_OLD_REGION_PEER
  • RegionMaintainHandler: add helper methods to build TCreatePipeReq and invoke ProcedureManager.createConsensusPipe/dropConsensusPipe
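
As a rough illustration of what "bidirectional pipes between new peer and existing peers" means, the sketch below enumerates the directed pipes a CREATE_CONSENSUS_PIPES step would need to create (and a DROP_CONSENSUS_PIPES step would need to drop) for one peer. `DirectedPipe` and `pipesForPeer` are hypothetical names for illustration; the actual PR builds TCreatePipeReq objects in RegionMaintainHandler, whose exact shape is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ConsensusPipePlanSketch {
  // One directed consensus pipe between two DataNodes (hypothetical model type).
  static final class DirectedPipe {
    final int senderId;
    final int receiverId;

    DirectedPipe(int senderId, int receiverId) {
      this.senderId = senderId;
      this.receiverId = receiverId;
    }

    @Override
    public String toString() {
      return senderId + "->" + receiverId;
    }
  }

  // Lists both directions between the given peer and every other peer in the group.
  static List<DirectedPipe> pipesForPeer(int peerId, List<Integer> groupPeerIds) {
    List<DirectedPipe> pipes = new ArrayList<>();
    for (int other : groupPeerIds) {
      if (other == peerId) {
        continue; // a peer needs no pipe to itself
      }
      pipes.add(new DirectedPipe(peerId, other));
      pipes.add(new DirectedPipe(other, peerId));
    }
    return pipes;
  }

  public static void main(String[] args) {
    // New peer 3 joins a group whose existing peers are 1 and 2.
    System.out.println(pipesForPeer(3, Arrays.asList(1, 2)));
  }
}
```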

DataNode side:

  • PipeConsensusServerImpl: remove createConsensusPipes call from constructor (pipes are created by ConfigNode's procedure)
  • ConsensusPipeDataNodeDispatcher: convert to no-op (all 4 methods just return immediately)
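
A no-op dispatcher of this kind might look roughly like the sketch below. The interface here is a simplified, hypothetical stand-in: only `createPipe` is named in this description, so the other three method names and all parameter lists are assumptions rather than the real IoTDB signatures.

```java
class NoOpDispatcherSketch {
  // Hypothetical, simplified dispatcher interface; the real IoTDB signatures differ.
  interface PipeDispatcher {
    void createPipe(String pipeName) throws Exception;

    void startPipe(String pipeName) throws Exception;

    void stopPipe(String pipeName) throws Exception;

    void dropPipe(String pipeName) throws Exception;
  }

  // Every method returns immediately: no RPC to ConfigNode is issued from the DataNode,
  // because ConfigNode procedures now own the consensus pipe lifecycle.
  static final class NoOpPipeDispatcher implements PipeDispatcher {
    @Override
    public void createPipe(String pipeName) {}

    @Override
    public void startPipe(String pipeName) {}

    @Override
    public void stopPipe(String pipeName) {}

    @Override
    public void dropPipe(String pipeName) {}
  }

  public static void main(String[] args) throws Exception {
    PipeDispatcher dispatcher = new NoOpPipeDispatcher();
    dispatcher.createPipe("example_pipe"); // returns without contacting ConfigNode
    System.out.println("createPipe returned without any RPC");
  }
}
```

Keeping the dispatcher type in place while emptying its methods lets existing call sites compile unchanged, at the cost of silently ignoring any caller that still expects the DataNode to create pipes itself.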

Consistency guarantees:

  • New DataRegion creation: ConfigNode already creates pipes via CreatePipeProcedureV2
  • Region migration: ConfigNode creates pipes deterministically in procedure before coordinator starts data transfer
  • checkConsensusPipe guardian still detects inconsistencies via logging; peerManager updates are preserved through existing P2P notification flow
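
The "detect via logging" behavior of the guardian can be pictured as a set comparison that reports differences without repairing them. This is a sketch only: the pipe names are plain strings, `checkAndLog` is a hypothetical helper, and the real checkConsensusPipe logic in IoTDB may compare richer metadata.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;
import java.util.logging.Logger;

class ConsensusPipeCheckSketch {
  private static final Logger LOGGER = Logger.getLogger("ConsensusPipeCheckSketch");

  // Elements of a that are not in b, in sorted order.
  static Set<String> difference(Set<String> a, Set<String> b) {
    Set<String> result = new TreeSet<>(a);
    result.removeAll(b);
    return result;
  }

  // Returns true when both sets match; on mismatch it only logs and does not
  // create or drop anything, since ConfigNode owns the pipe lifecycle.
  static boolean checkAndLog(Set<String> expectedPipes, Set<String> actualPipes) {
    Set<String> missing = difference(expectedPipes, actualPipes);
    Set<String> unexpected = difference(actualPipes, expectedPipes);
    if (!missing.isEmpty()) {
      LOGGER.warning("Missing consensus pipes: " + missing);
    }
    if (!unexpected.isEmpty()) {
      LOGGER.warning("Unexpected consensus pipes: " + unexpected);
    }
    return missing.isEmpty() && unexpected.isEmpty();
  }

  public static void main(String[] args) {
    Set<String> expected = new TreeSet<>(Arrays.asList("1->2", "2->1"));
    Set<String> actual = new TreeSet<>(Arrays.asList("1->2"));
    System.out.println("consistent: " + checkAndLog(expected, actual));
  }
}
```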

Pengzna changed the title from "Fix deadlock between DataNode createDataRegion and ConfigNode PipeTaskCoordinatorLock by delegating consensus pipe lifecycle to ConfigNode" to "Draft: Fix deadlock between DataNode createDataRegion and ConfigNode PipeTaskCoordinatorLock by delegating consensus pipe lifecycle to ConfigNode" on Feb 28, 2026